Add --cookies-file to preload a Netscape cookies.txt (#215) #437
Merged
Mirroring a site behind a login meant either re-implementing the auth flow or dropping a file literally named cookies.txt into the output or working directory, the only two places the engine looked. This adds a CLI option to point at an arbitrary Netscape/Mozilla cookies.txt, so a session exported from a browser (the "Get cookies.txt" extensions write exactly this format) is replayed on the crawl and authenticated pages come down.

The plumbing already existed: cookie_load parses the format into the shared jar and the request path sends every matching cookie. The new opt->cookies_file is loaded last, after the mirror/CWD defaults, so a user-supplied value wins on a name/domain/path conflict. The field is appended at the tail of httrackp, so the exported ABI is unchanged.

Cookies key on host[:port], so a bare-domain file matches a normal crawl of a default-port site; only an explicit-port URL needs the port in the cookie domain.

Covered by 27_local-cookies-file.test: a gated page that 500s without a cookie no page ever sets, reachable only once the file preloads it (with -o0 so the absence of a 500 error page is meaningful), plus a no-cookie control. The local-crawl harness grows a --cookie helper that writes a port-scoped jar. The copyopt self-test also gains a String round-trip so the exported copy_htsopt path for the new field is covered.

Closes #215

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
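The load order described in the commit message can be illustrated with a small model. This is a sketch in Python, not HTTrack's C implementation: the jar is modeled as a dict keyed by (domain, path, name), and `load_in_order` and the sample cookies are hypothetical names chosen for illustration. It only shows why loading the user-supplied file last makes its value win on a conflict.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical model: one cookie is (domain, path, name, value).
Cookie = Tuple[str, str, str, str]
Jar = Dict[Tuple[str, str, str], str]


def load_in_order(sources: Iterable[List[Cookie]]) -> Jar:
    """Merge cookie sources in order; a later source overwrites an
    earlier one when domain, path and name all match."""
    jar: Jar = {}
    for cookies in sources:
        for domain, path, name, value in cookies:
            jar[(domain, path, name)] = value
    return jar


# Assumed contents, for illustration only.
default_file = [
    ("example.com", "/", "session", "stale"),
    ("example.com", "/", "lang", "en"),
]
user_file = [("example.com", "/", "session", "fresh")]

# Defaults first, the user-supplied file last.
jar = load_in_order([default_file, user_file])
```

In this model the conflicting `session` cookie ends up with the user file's value, while the non-conflicting `lang` cookie from the default file is kept.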
1da6208 to cc35193
`--cookies-file` points HTTrack at a Netscape/Mozilla `cookies.txt`, so a crawl can reuse a session you already logged into in a browser. Until now the engine only picked up a file literally named `cookies.txt` from the output dir or the working dir, with no way to point elsewhere. This is the cheap, no-new-dependency half of #215; reading an encrypted browser SQLite profile directly is better folded into the Chromium/CDP epic #302.

Most of the machinery was already there: `cookie_load` parses the format into the shared jar and the request path replays every matching cookie. The user file loads last, after the mirror/CWD defaults, so it wins on a name/domain/path conflict. `opt->cookies_file` is appended at the tail of `httrackp`, so the exported ABI is unchanged (no soname bump), same as the recent `strip_query` field.

One caveat: cookies key on `host[:port]`, so a bare-domain entry from a browser export matches a normal crawl of a default-port site; only an explicit-port URL needs the port baked into the cookie domain.

`27_local-cookies-file.test` drives a gated page that 500s without a cookie no page ever sets, reachable only once the file preloads it, plus a no-cookie control that confirms it stays gated. The local-crawl harness gains a small `--cookie` helper that writes a port-scoped jar.

Closes #215
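The `host[:port]` caveat is easier to see with a concrete entry. The sketch below is Python written for illustration, not code from this PR: `cookie_host_key` and `netscape_line` are hypothetical helpers, and the key derivation is a simplified model that assumes the port is part of the key only when the URL spells one out. The line layout is the standard seven tab-separated Netscape fields (domain, include-subdomains flag, path, secure flag, expiry, name, value); an expiry of 0 is assumed here to stand for a session cookie.

```python
from urllib.parse import urlsplit


def cookie_host_key(url: str) -> str:
    """Simplified model of the host[:port] key: the bare host when the
    URL carries no port, host:port when it spells one out."""
    parts = urlsplit(url)
    if parts.port is None:
        return parts.hostname
    return f"{parts.hostname}:{parts.port}"


def netscape_line(domain: str, path: str, name: str, value: str,
                  secure: bool = False, expires: int = 0) -> str:
    """Format one cookies.txt entry as seven tab-separated fields."""
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    return "\t".join([
        domain,
        include_subdomains,
        path,
        "TRUE" if secure else "FALSE",
        str(expires),
        name,
        value,
    ])


# A default-port site is keyed by the bare host, so a browser export
# with a bare domain lines up with it.
public_key = cookie_host_key("https://example.com/private")

# A local test server on an explicit port needs the port in the
# cookie domain, which is what a port-scoped jar provides.
local_key = cookie_host_key("http://127.0.0.1:8080/gated")
local_entry = netscape_line(local_key, "/", "auth", "ok")
```

Real matching also takes subdomains and paths into account; this model covers only the port part of the key.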