Split -%u URL Hacks into independent www/slash/query toggles (#271)#435

Merged: xroche merged 1 commit into master from fix/urlhack-split-271 on Jun 27, 2026
xroche (Owner) commented Jun 27, 2026

-%u (--urlhack) folded three separate dedup normalizations into one switch: treating www.host and host as the same, collapsing redundant //, and ignoring query-argument order. A site that needed one but not another could only turn the whole umbrella off. The reporter (#271) wanted www. kept distinct, so that a multi-step www→non-www redirect would not be read as a loop, without also losing the rest.

This adds three opt-out sub-options that default to the umbrella, so -%u/-%u0 behave exactly as before:

  • --keep-www-prefix (-%j): keep www.foo.com distinct from foo.com
  • --keep-double-slashes (-%o): keep redundant // in the path
  • --keep-query-order (-%y): keep query-argument order significant
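
The "default to the umbrella" rule can be sketched as follows. This is a minimal illustration, not the code from the patch: the `url_opts` and `url_norm` structs and the `resolve_url_norm()` helper are hypothetical, and only the `norm_host`/`norm_slash`/`norm_query` names come from the commit message.

```c
#include <stdbool.h>

/* Hypothetical option fields; the real names in htsopt.h may differ. */
typedef struct {
  bool urlhack;             /* -%u umbrella (false for -%u0) */
  bool keep_www_prefix;     /* --keep-www-prefix (-%j) */
  bool keep_double_slashes; /* --keep-double-slashes (-%o) */
  bool keep_query_order;    /* --keep-query-order (-%y) */
} url_opts;

/* Effective normalizations, resolved once and then shared. */
typedef struct {
  bool norm_host;  /* treat www.host and host as the same */
  bool norm_slash; /* collapse redundant // */
  bool norm_query; /* ignore query-argument order */
} url_norm;

/* Each normalization is on only if the umbrella is on and its
   opt-out was not requested. With the umbrella off, all are off. */
static url_norm resolve_url_norm(const url_opts *o) {
  url_norm n;
  n.norm_host = o->urlhack && !o->keep_www_prefix;
  n.norm_slash = o->urlhack && !o->keep_double_slashes;
  n.norm_query = o->urlhack && !o->keep_query_order;
  return n;
}
```

With no opt-out set, the result is all-true for -%u and all-false for -%u0, which is the unchanged legacy behavior. Setting only `keep_www_prefix` clears `norm_host` and leaves the other two normalizations active, which is the case the reporter asked for.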

The effective flags are resolved once in hash_init() and threaded through the three places that must agree: the dedup hash, the savename lookup key, and the redirect-loop diagnostic. fil_normalized() gains an internal _ex(do_slash, do_query) core; the exported fil_normalized()/fil_normalized_filtered() signatures are untouched.
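
To make the `do_slash`/`do_query` split concrete, here is a simplified stand-in for such a normalizer core. It is a sketch under stated assumptions, not httrack's `fil_normalized_ex()`: the `normalize_sketch()` name, its signature, the plain `strcmp` ordering, and the 64-argument cap are all hypothetical, and it expects the path part of a URL (no scheme or host).

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_ARGS 64 /* arbitrary cap for this sketch */

/* qsort comparator over an array of char* */
static int cmp_str(const void *a, const void *b) {
  return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Normalize the path+query string `in` into `out`.
   do_slash: collapse runs of '/' in the path part.
   do_query: sort the '&'-separated query arguments.
   Returns 0 on success, -1 if `out` is too small, there are too
   many arguments, or memory allocation fails. */
static int normalize_sketch(const char *in, char *out, size_t outsz,
                            int do_slash, int do_query) {
  size_t len = strlen(in);
  if (len + 1 > outsz)
    return -1; /* output is never longer than the input */

  const char *q = strchr(in, '?');
  size_t path_len = q != NULL ? (size_t) (q - in) : len;
  size_t o = 0;

  /* Copy the path, optionally skipping a '/' that follows a '/'. */
  for (size_t i = 0; i < path_len; i++) {
    if (do_slash && in[i] == '/' && o > 0 && out[o - 1] == '/')
      continue;
    out[o++] = in[i];
  }
  if (q == NULL) {
    out[o] = '\0';
    return 0;
  }
  out[o++] = '?';
  if (!do_query) {
    strcpy(out + o, q + 1); /* keep the query exactly as given */
    return 0;
  }

  /* Split a private copy of the query on '&', sort, and re-join. */
  char *copy = malloc(strlen(q + 1) + 1);
  if (copy == NULL)
    return -1;
  strcpy(copy, q + 1);

  char *args[SKETCH_MAX_ARGS];
  size_t n = 0;
  for (char *p = copy; p != NULL;) {
    if (n == SKETCH_MAX_ARGS) {
      free(copy);
      return -1;
    }
    args[n++] = p;
    p = strchr(p, '&');
    if (p != NULL)
      *p++ = '\0';
  }
  qsort(args, n, sizeof args[0], cmp_str);

  for (size_t i = 0; i < n; i++) {
    size_t l = strlen(args[i]);
    if (i > 0)
      out[o++] = '&';
    memcpy(out + o, args[i], l);
    o += l;
  }
  out[o] = '\0';
  free(copy);
  return 0;
}
```

Because the two flags are independent parameters, the dedup hash, the savename lookup, and the redirect-loop check can all pass the same resolved pair and agree on the resulting key.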

Three things to look at. First, the opt-outs read positively (--keep-*) instead of --no-www-dedup, because httrack's generic --no<opt> prefix only appends the disabling 0 for parametered options, not "single" booleans, so --nowww-dedup would silently do nothing. Second, http/https are always merged in the dedup key (the scheme is stripped regardless of -%u), so the reporter's third case needs no toggle. Third, the savename hash also normalizes query order and // on its own, so for those two the sub-flag changes the dedup/fetch decision, not the on-disk filename.

opt grows three hts_boolean fields appended at the struct tail (offsets stable, no soname bump, same shape as the strip_query field in #112). Flagging the ABI touch since htsopt.h is installed.
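
The "offsets stable" claim can be illustrated with a toy struct pair. This is only a sketch: `opt_before`, `opt_after`, their field names, and the `hts_boolean_sketch` typedef are hypothetical stand-ins, not the real layout in htsopt.h.

```c
#include <stddef.h>

/* Stand-in for hts_boolean; the real underlying type is an assumption. */
typedef int hts_boolean_sketch;

/* Toy layout before the change. */
struct opt_before {
  int urlhack;
  hts_boolean_sketch strip_query;
};

/* Toy layout after the change: new fields only at the tail. */
struct opt_after {
  int urlhack;
  hts_boolean_sketch strip_query;
  hts_boolean_sketch keep_www_prefix;
  hts_boolean_sketch keep_double_slashes;
  hts_boolean_sketch keep_query_order;
};

/* Existing members keep their offsets, so code compiled against the
   old header still reads them at the right place. */
_Static_assert(offsetof(struct opt_before, urlhack) ==
                   offsetof(struct opt_after, urlhack),
               "urlhack offset changed");
_Static_assert(offsetof(struct opt_before, strip_query) ==
                   offsetof(struct opt_after, strip_query),
               "strip_query offset changed");
```

The struct size does grow, which only matters to code that allocates or copies the struct by its compiled-in size rather than obtaining it from the library.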

Covered by a -#test=urlhack engine self-test that checks each flag combination through the live hash compare, driven by 01_engine-urlhack.test.

Closes #271

xroche force-pushed the fix/urlhack-split-271 branch 2 times, most recently from 44dfa40 to 866b5b8 (June 27, 2026 18:12)

-%u (--urlhack) bundled three dedup normalizations under one switch:
www.host == host, redundant // collapse, and query-argument reordering.
A mirror that needed one but not another (e.g. keep www. distinct) had to
turn the whole umbrella off. Add three opt-out sub-options, defaulting to
the umbrella so existing -%u/-%u0 behavior is unchanged:

  --keep-www-prefix      keep www.foo.com distinct from foo.com   (-%j)
  --keep-double-slashes  keep redundant // in the path            (-%o)
  --keep-query-order     keep query-argument order significant    (-%y)

The split is resolved once in hash_init() into norm_host/norm_slash/
norm_query and threaded through the dedup hash (htshash.c), the savename
lookup key (htsname.c) and the redirect-loop diagnostic (htsparse.c) so
all three stay consistent. fil_normalized() gains an internal
fil_normalized_ex(do_slash, do_query) core; the public
fil_normalized()/fil_normalized_filtered() keep their signatures.

Normalization (slash/query) now follows urlhack and its sub-flags
uniformly, while --strip-query stays orthogonal. So with urlhack off,
strip-query strips keys without sorting the remainder; the url_savename
urlhack-off branch is moved to the same do_slash=0/do_query=0 normalizer
the hash uses, so a URL is always looked up under the key it was stored
with (a self-lookup mismatch this otherwise introduced).

http/https are always merged in the dedup key (the scheme is stripped
regardless of -%u), so that part of the request needs no toggle.

The opt-outs are spelled positively (--keep-*) because httrack's generic
--no<opt> prefix only appends the disabling "0" for parametered options,
not "single" booleans, so --nowww-dedup would silently no-op.

opt grows three hts_boolean fields appended at the struct tail (offsets
stable, no soname bump, matching the strip_query addition in #112).

Tested by a -#test=urlhack engine self-test (hash_url_equals over each
flag combination) plus a -%u0 + --strip-query crawl case exercising the
urlhack-off savename branch.

Closes #271

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

xroche force-pushed the fix/urlhack-split-271 branch from 866b5b8 to 600001b (June 27, 2026 18:18)
xroche merged commit 669947c into master Jun 27, 2026
14 checks passed
Development

Successfully merging this pull request may close these issues.

Consider separating out the various parts of *URL Hacks* into separate options