Split -%u URL Hacks into independent www/slash/query toggles (#271) #435
Merged
-%u (--urlhack) bundled three dedup normalizations under one switch: www.host == host, redundant // collapse, and query-argument reordering. A mirror that needed one but not another (e.g. keep www. distinct) had to turn the whole umbrella off.

Add three opt-out sub-options, defaulting to the umbrella so existing -%u/-%u0 behavior is unchanged:

  --keep-www-prefix      keep www.foo.com distinct from foo.com (-%j)
  --keep-double-slashes  keep redundant // in the path (-%o)
  --keep-query-order     keep query-argument order significant (-%y)

The split is resolved once in hash_init() into norm_host/norm_slash/norm_query and threaded through the dedup hash (htshash.c), the savename lookup key (htsname.c) and the redirect-loop diagnostic (htsparse.c) so all three stay consistent. fil_normalized() gains an internal fil_normalized_ex(do_slash, do_query) core; the public fil_normalized()/fil_normalized_filtered() keep their signatures.

Normalization (slash/query) now follows urlhack and its sub-flags uniformly, while --strip-query stays orthogonal. So with urlhack off, strip-query strips keys without sorting the remainder; the url_savename urlhack-off branch is moved to the same do_slash=0/do_query=0 normalizer the hash uses, so a URL is always looked up under the key it was stored with (a self-lookup mismatch this otherwise introduced).

http/https are always merged in the dedup key (the scheme is stripped regardless of -%u), so that part of the request needs no toggle.

The opt-outs are spelled positively (--keep-*) because httrack's generic --no<opt> prefix only appends the disabling "0" for parametered options, not "single" booleans, so --nowww-dedup would silently no-op.

opt grows three hts_boolean fields appended at the struct tail (offsets stable, no soname bump, matching the strip_query addition in #112).

Tested by a -#test=urlhack engine self-test (hash_url_equals over each flag combination) plus a -%u0 + --strip-query crawl case exercising the urlhack-off savename branch.

Closes #271
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
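
The "resolved once" rule in the commit message can be read as a small truth table: a normalization is active only when the umbrella is on and its matching opt-out is not set. The following is a minimal sketch of that reading, not the code from the patch. The struct and function names (sketch_options, sketch_norm_flags, sketch_resolve_flags) and the option field names are hypothetical; only norm_host/norm_slash/norm_query come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option set; field names are illustrative, not httrack's. */
typedef struct {
  bool urlhack;             /* -%u umbrella (false models -%u0) */
  bool keep_www_prefix;     /* --keep-www-prefix (-%j) */
  bool keep_double_slashes; /* --keep-double-slashes (-%o) */
  bool keep_query_order;    /* --keep-query-order (-%y) */
} sketch_options;

/* Effective flags, named after the commit message. */
typedef struct {
  bool norm_host;  /* treat www.host and host as the same */
  bool norm_slash; /* collapse redundant // in the path */
  bool norm_query; /* ignore query-argument order */
} sketch_norm_flags;

/* Assumed rule: each normalization follows the umbrella unless its
   opt-out is set. With the umbrella off, the opt-outs change nothing. */
sketch_norm_flags sketch_resolve_flags(const sketch_options *opt) {
  sketch_norm_flags f;

  assert(opt != NULL);
  f.norm_host = opt->urlhack && !opt->keep_www_prefix;
  f.norm_slash = opt->urlhack && !opt->keep_double_slashes;
  f.norm_query = opt->urlhack && !opt->keep_query_order;
  return f;
}
```

Under this reading, -%u alone yields all three flags set, -%u with --keep-www-prefix clears only norm_host, and -%u0 clears all three regardless of the opt-outs, which is consistent with the claim that existing -%u/-%u0 behavior is unchanged.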
-%u (--urlhack) folded three separate dedup normalizations into one switch: treating www.host and host as the same, collapsing redundant //, and ignoring query-argument order. A site that needed one but not another could only turn the whole umbrella off. The reporter (#271) wanted www. kept distinct, so that a multi-step www → non-www redirect would not be read as a loop, without also losing the rest.

This adds three opt-out sub-options that default to the umbrella, so -%u/-%u0 behave exactly as before:

- --keep-www-prefix (-%j): keep www.foo.com distinct from foo.com
- --keep-double-slashes (-%o): keep redundant // in the path
- --keep-query-order (-%y): keep query-argument order significant

The effective flags are resolved once in hash_init() and threaded through the three places that must agree: the dedup hash, the savename lookup key, and the redirect-loop diagnostic. fil_normalized() gains an internal _ex(do_slash, do_query) core; the exported fil_normalized()/fil_normalized_filtered() signatures are untouched.

Two things to look at. The opt-outs read positively (--keep-*) instead of --no-www-dedup, because httrack's generic --no<opt> prefix only appends the disabling 0 for parametered options, not "single" booleans, so --nowww-dedup would silently do nothing. And http/https are always merged in the dedup key (the scheme is stripped regardless of -%u), so the reporter's third case needs no toggle. The savename hash also normalizes query order and // on its own, so for those two the sub-flag changes the dedup/fetch decision, not the on-disk filename.

opt grows three hts_boolean fields appended at the struct tail (offsets stable, no soname bump, same shape as the strip_query field in #112). Flagging the ABI touch since htsopt.h is installed.

Covered by a -#test=urlhack engine self-test that checks each flag combination through the live hash compare, driven by 01_engine-urlhack.test.

Closes #271
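
To make the three normalizations concrete, here is a small standalone sketch of a dedup key builder with one switch per normalization. It is an illustration under stated assumptions, not httrack's hash code: sketch_dedup_key and cmp_str are hypothetical names, host matching is case-sensitive, fragments are not handled, and the scheme is always dropped to mirror the "http/https are always merged" point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of C strings. */
static int cmp_str(const void *a, const void *b) {
  return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: write a dedup key for url into out.
   Returns false if out is too small or an allocation fails. */
bool sketch_dedup_key(const char *url, bool norm_host, bool norm_slash,
                      bool norm_query, char *out, size_t outsz) {
  const char *p;
  size_t n = 0, k;

  assert(url != NULL && out != NULL);
  if (strlen(url) + 1 > outsz)
    return false; /* the key is never longer than the input */

  /* Always drop a leading scheme, so http and https share a key. */
  k = strcspn(url, ":/?");
  p = (strncmp(url + k, "://", 3) == 0) ? url + k + 3 : url;

  if (norm_host && strncmp(p, "www.", 4) == 0)
    p += 4; /* www.host and host collapse to the same key */

  /* Copy host and path, optionally collapsing runs of '/'. */
  for (; *p != '\0' && *p != '?'; p++) {
    if (norm_slash && *p == '/' && n > 0 && out[n - 1] == '/')
      continue;
    out[n++] = *p;
  }

  if (*p == '?') {
    size_t qlen, nargs = 1, i;
    char *tmp, **args;

    out[n++] = '?';
    p++;
    if (!norm_query) { /* argument order stays significant */
      strcpy(out + n, p);
      return true;
    }
    qlen = strlen(p);
    tmp = malloc(qlen + 1);
    if (tmp == NULL)
      return false;
    memcpy(tmp, p, qlen + 1);
    for (i = 0; i < qlen; i++)
      if (tmp[i] == '&')
        nargs++;
    args = malloc(nargs * sizeof *args);
    if (args == NULL) {
      free(tmp);
      return false;
    }
    /* Split on '&', sort the arguments, then join them again. */
    args[0] = tmp;
    nargs = 1;
    for (i = 0; i < qlen; i++)
      if (tmp[i] == '&') {
        tmp[i] = '\0';
        args[nargs++] = tmp + i + 1;
      }
    qsort(args, nargs, sizeof *args, cmp_str);
    for (i = 0; i < nargs; i++) {
      size_t len = strlen(args[i]);
      if (i > 0)
        out[n++] = '&';
      memcpy(out + n, args[i], len);
      n += len;
    }
    free(args);
    free(tmp);
  }
  out[n] = '\0';
  return true;
}
```

With all three switches on, http://www.foo.com//a//b?y=2&x=1 and https://foo.com/a/b?x=1&y=2 produce the same key in this sketch. Turning norm_host off keeps the www. prefix in the key while still collapsing slashes and sorting the query, which is the kind of split the reporter in #271 asked for.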