Strip the #fragment from a redirect Location before fetching (#204) #441

Merged: xroche merged 2 commits into master from fix/redirect-fragment-204 on Jun 28, 2026

Conversation

xroche (Owner) commented Jun 28, 2026

A 302/30x Location is dereferenced, not displayed, so a #fragment in it is a client-side anchor with no role in fetching the resource. httrack kept it: both redirect followers in htsparse.c copied r.location verbatim, so the re-request went out as GET /page.html#frag (strict servers answer 400) and the target was saved under a polluted name like page.html#frag.html. HTML links are already cut at the # during parsing; only the two Location followers were missed.

The fix drops the fragment in a small helper called right after each Location copy, covering both the live and cached-redirect paths.
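
For illustration, a minimal sketch of what such a helper could look like. The name `strip_fragment` and its signature are hypothetical and not taken from the patch; the sketch only assumes the Location value has already been copied into a writable, NUL-terminated buffer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions, not the
 * actual patch): truncate a URL in place at its first '#', so the
 * fragment is neither sent on the wire nor used in the saved name. */
void strip_fragment(char *url)
{
    assert(url != NULL);               /* precondition: caller passes a buffer */
    char *hash = strchr(url, '#');     /* first '#' starts the fragment */
    if (hash != NULL)
        *hash = '\0';                  /* cut the string there */
}
```

Called right after each Location copy, a helper of this shape would leave a URL without a `#` untouched and turn `/page.html#frag` into `/page.html`.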

Test: a new local-server route plus 29_local-redirect-fragment.test, where a 302 whose Location carries #section must save redir/target.html, not redir/target.html#section.html. It fails on the unpatched binary.

Closes #204

xroche and others added 2 commits June 28, 2026 13:19
A 302/30x Location is dereferenced, not displayed, so its #fragment is a
client-side anchor that must be dropped before the target is requested.
httrack kept it: the redirect followers copied r.location verbatim, so the
re-request carried `GET /page.html#frag` (strict servers answer 400) and the
mirror was saved under a fragment-polluted name. HTML links were already
stripped at parse time; only the two Location followers were not.

Drop the fragment in a small helper called at both follow sites, covering
the live and cached-redirect paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
…failure

The first cut of 29_local-redirect-fragment only checked the saved filename.
Python's urlsplit() drops the fragment before routing, so a `#` leaked into
the GET line still routed to the target and the crawl passed: the assertion
was a proxy, not the wire behavior the fix targets. Make the server strict
(400 on any `#` in the request-target, like the real servers in #204), so a
leaked fragment now logs an error and the target is never saved. Neutering the
fix makes the test fail with the exact "400 Bad Request" from the issue.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
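
The strict-server rule described in this commit can be sketched as follows. This is not the test server from the PR (which is Python); it is an illustrative C version of the same check, and the function name `status_for_request_line` is hypothetical. It assumes a request line of the form `METHOD SP request-target SP HTTP-version`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical strict check (illustrative, not the PR's test server):
 * answer 400 when the request-target contains '#', otherwise 200. */
int status_for_request_line(const char *line)
{
    assert(line != NULL);                         /* precondition */
    const char *target = strchr(line, ' ');       /* space after the method */
    if (target == NULL)
        return 400;                               /* no request-target at all */
    target++;                                     /* first byte of the target */
    const char *end = strchr(target, ' ');        /* space before HTTP-version */
    size_t len = (end != NULL) ? (size_t)(end - target) : strlen(target);
    if (len == 0)
        return 400;                               /* empty request-target */
    /* A fragment must never appear on the wire: reject any '#'. */
    return memchr(target, '#', len) != NULL ? 400 : 200;
}
```

With a check like this, a leaked fragment such as `GET /redir/target.html#section HTTP/1.1` is rejected instead of being silently routed to the target, which is what lets the test observe the wire behavior rather than only the saved filename.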
xroche merged commit a62f93a into master on Jun 28, 2026
14 checks passed

Development

Successfully merging this pull request may close these issues.

urls that contain # in download list (or external txt file list) use server 'primary'