Strip the #fragment from a redirect Location before fetching (#204) #441
Merged
A 302/30x Location is dereferenced, not displayed, so its #fragment is a client-side anchor that must be dropped before the target is requested. httrack kept it: the redirect followers copied r.location verbatim, so the re-request carried `GET /page.html#frag` (strict servers answer 400) and the mirror was saved under a fragment-polluted name. HTML links were already stripped at parse time; only the two Location followers were not.

Drop the fragment in a small helper called at both follow sites, covering the live and cached-redirect paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
…failure

The first cut of 29_local-redirect-fragment only checked the saved filename. Python's urlsplit() drops the fragment before routing, so a `#` leaked into the GET line still routed to the target and the crawl passed: the assertion was a proxy, not the wire behavior the fix targets.

Make the server strict (400 on any `#` in the request-target, like the real servers in #204), so a leaked fragment now logs an error and the target is never saved. Neutering the fix makes the test fail with the exact "400 Bad Request" from the issue.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
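The strictness rule described above can be sketched in C for illustration. The test server in this PR is written in Python and its code is not shown here, so the function name `strict_status_for_target` and the 200 fallback are assumptions, not the actual implementation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: the status a strict server would choose for a raw
   request-target. Any '#' means the client leaked a fragment onto the wire,
   so the request is rejected instead of being routed. */
static int strict_status_for_target(const char *request_target) {
  assert(request_target != NULL);
  /* Check the raw target, before any URL parsing could silently drop the
     fragment and hide the leak. */
  return strchr(request_target, '#') != NULL ? 400 : 200;
}
```

The point of checking the raw request-target is that a parser which discards the fragment first would route the request normally, which is what let the first version of the test pass.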
A 302/30x Location is dereferenced, not displayed, so a `#fragment` in it is a client-side anchor with no role in fetching the resource. httrack kept it: both redirect followers in htsparse.c copied `r.location` verbatim, so the re-request went out as `GET /page.html#frag` (strict servers answer 400) and the target was saved under a polluted name like `page.html#frag.html`. HTML links are already cut at the `#` during parsing; only the two Location followers were missed.

The fix drops the fragment in a small helper called right after each Location copy, covering both the live and cached-redirect paths.
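A minimal sketch of what such a helper might look like, assuming the Location value sits in a writable, NUL-terminated buffer. The name `strip_location_fragment` and its signature are hypothetical and are not taken from htsparse.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: truncate a redirect Location in place at the first
   '#', so the fragment reaches neither the request line nor the saved name.
   A value without '#' is left unchanged. */
static void strip_location_fragment(char *location) {
  assert(location != NULL);
  char *hash = strchr(location, '#');
  if (hash != NULL) {
    *hash = '\0'; /* everything from '#' onward is the fragment */
  }
}
```

Under these assumptions, calling it on a buffer holding `/page.html#frag` leaves `/page.html`. A query string before the `#` is preserved, since only the text from the first `#` onward is cut.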
Test: a new local-server route plus `29_local-redirect-fragment.test`, where a 302 whose Location carries `#section` must save `redir/target.html`, not `redir/target.html#section.html`. It fails on the unpatched binary.

Closes #204