
Add --cookies-file to preload a Netscape cookies.txt (#215) #437

Merged: xroche merged 1 commit into master from feat/cookies-file-215 on Jun 27, 2026

Conversation

xroche commented Jun 27, 2026

Owner

New --cookies-file points HTTrack at a Netscape/Mozilla cookies.txt, so a crawl can reuse a session you already logged into in a browser. Until now the engine only picked up a file literally named cookies.txt from the output dir or the working dir, with no way to point elsewhere. This is the cheap, no-new-dependency half of #215; reading an encrypted browser SQLite profile directly is better folded into the Chromium/CDP epic #302.
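
For readers who have not handled the format, here is a minimal Python sketch of what a Netscape/Mozilla cookies.txt looks like on disk: a header comment, then one cookie per line with seven tab-separated fields. The site, cookie name, value, and output filename are all hypothetical, and the sketch makes no claim about which fields HTTrack's parser actually honors.

```python
import os
import tempfile

# Netscape/Mozilla cookies.txt: one cookie per line, seven tab-separated fields:
# domain, include-subdomains flag, path, secure flag, expiry, name, value.


def cookie_line(domain, name, value, path="/", secure=False, expires=0):
    """Format one cookie as a cookies.txt line."""
    # By convention the second field is TRUE when the domain starts with a dot.
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    return "\t".join([
        domain,
        include_subdomains,
        path,
        "TRUE" if secure else "FALSE",
        str(expires),  # Unix timestamp; 0 is commonly written for session cookies
        name,
        value,
    ])


def write_cookies_txt(path, lines):
    """Write the conventional header comment followed by the cookie lines."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("# Netscape HTTP Cookie File\n")
        for line in lines:
            fh.write(line + "\n")


# Hypothetical session cookie for a hypothetical site.
jar_path = os.path.join(tempfile.mkdtemp(), "session-cookies.txt")
write_cookies_txt(jar_path, [cookie_line("example.com", "sessionid", "abc123")])
```

A file like `jar_path` is the kind of input --cookies-file is meant to point at; in practice it would come from a browser export rather than be written by hand.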

Most of the machinery was already there: cookie_load parses the format into the shared jar and the request path replays every matching cookie. The user file loads last, after the mirror/CWD defaults, so it wins on a name/domain/path conflict. opt->cookies_file is appended at the tail of httrackp, so the exported ABI is unchanged (no soname bump), same as the recent strip_query field.
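
The load-order rule can be modeled in a few lines of Python. This is only an illustration of "last load wins on a name/domain/path conflict", assuming a jar keyed on that triple; it is not how cookie_load is implemented, and the cookie names and values are made up.

```python
def load_into_jar(jar, cookies):
    """Merge cookies into jar; a later entry with the same key replaces the earlier one."""
    for cookie in cookies:
        key = (cookie["name"], cookie["domain"], cookie["path"])
        jar[key] = cookie["value"]
    return jar


# Hypothetical contents of a default cookies.txt found in the mirror dir or CWD.
default_cookies = [
    {"name": "sessionid", "domain": "example.com", "path": "/", "value": "stale"},
    {"name": "theme", "domain": "example.com", "path": "/", "value": "dark"},
]

# Hypothetical contents of the user-supplied file.
user_cookies = [
    {"name": "sessionid", "domain": "example.com", "path": "/", "value": "fresh"},
]

jar = {}
load_into_jar(jar, default_cookies)  # defaults first
load_into_jar(jar, user_cookies)     # user file last, so it overrides on conflict
```

Non-conflicting default entries (here `theme`) survive; only the colliding `sessionid` is replaced by the user-supplied value.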

One caveat: cookies are keyed on host[:port], so a bare-domain entry from a browser export matches a normal crawl of a default-port site. Only a crawl of an explicit-port URL needs the port baked into the cookie domain.
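
A rough Python model of the host[:port] keying follows. It assumes the key is the bare host when the URL carries no port and host:port when it does, and it compares by exact equality only; the engine's real domain matching is not shown here and may be looser. How an explicitly written default port (such as :80) is treated is not stated above, so the sketch simply keeps whatever port the URL spells out.

```python
from urllib.parse import urlsplit


def host_key(url):
    """Return 'host' for a URL without a port, 'host:port' for one with an explicit port."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return host if parts.port is None else f"{host}:{parts.port}"


def cookie_applies(cookie_domain, url):
    """Simplified check: the cookie domain must equal the request's host[:port] key."""
    return cookie_domain.lstrip(".").lower() == host_key(url)


# A bare-domain entry from a browser export matches a default-port crawl.
bare_ok = cookie_applies("example.com", "https://example.com/account")

# The same style of entry misses an explicit-port URL ...
bare_miss = cookie_applies("localhost", "http://localhost:8080/")

# ... until the port is baked into the cookie domain.
scoped_ok = cookie_applies("localhost:8080", "http://localhost:8080/")
```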

27_local-cookies-file.test crawls a gated page that returns a 500 unless the request carries a cookie that no page ever sets, so the page is reachable only once the file preloads that cookie. A no-cookie control run confirms the page stays gated. The local-crawl harness gains a small --cookie helper that writes a port-scoped jar.
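
The gate used by such a test boils down to a single decision on the request's Cookie header. The Python sketch below shows that decision in isolation; the cookie name and value are hypothetical, and this is not the code the test harness actually runs.

```python
from http.cookies import SimpleCookie

# Hypothetical gate cookie: no page ever sets it, so only a preloaded jar can supply it.
GATE_NAME = "gate"
GATE_VALUE = "open"


def gate_status(cookie_header):
    """Return 200 when the expected cookie is present in the Cookie header, 500 otherwise."""
    jar = SimpleCookie()
    if cookie_header:
        jar.load(cookie_header)
    morsel = jar.get(GATE_NAME)
    if morsel is not None and morsel.value == GATE_VALUE:
        return 200
    return 500


with_cookie = gate_status("other=1; gate=open")  # request replaying the preloaded cookie
control = gate_status(None)                      # no-cookie control request
```

Because nothing served by the site ever sets the cookie, a 200 can only mean the preloaded file was read and replayed.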

Closes #215

Mirroring a site behind a login meant either re-implementing the auth
flow or dropping a file literally named cookies.txt into the output or
working directory, the only two places the engine looked. This adds a
CLI option to point at an arbitrary Netscape/Mozilla cookies.txt, so a
session exported from a browser (the "Get cookies.txt" extensions write
exactly this format) is replayed on the crawl and authenticated pages
come down.

The plumbing already existed: cookie_load parses the format into the
shared jar and the request path sends every matching cookie. The new
opt->cookies_file is loaded last, after the mirror/CWD defaults, so a
user-supplied value wins on a name/domain/path conflict. The field is
appended at the tail of httrackp, so the exported ABI is unchanged.

Cookies key on host[:port], so a bare-domain file matches a normal crawl
of a default-port site; only an explicit-port URL needs the port in the
cookie domain. Covered by 27_local-cookies-file.test: a gated page that
500s without a cookie no page ever sets, reachable only once the file
preloads it (with -o0 so the absence of a 500 error page is meaningful),
plus a no-cookie control. The local-crawl harness grows a --cookie helper
that writes a port-scoped jar. The copyopt self-test also gains a String
round-trip so the exported copy_htsopt path for the new field is covered.

Closes #215

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

xroche force-pushed the feat/cookies-file-215 branch from 1da6208 to cc35193 on June 27, 2026 20:47
xroche merged commit 5be8ba4 into master on Jun 27, 2026
14 checks passed

Development

Successfully merging this pull request may close these issues.

a very good suggestion (HTTrack needs update for modern times)
