Add --strip-query to drop query parameters from dedup naming #434

Merged
xroche merged 1 commit into master from feature/query-param-handling on Jun 27, 2026
xroche (Owner) commented Jun 27, 2026

--strip-query drops named query parameters when computing a URL's dedup key and saved name, so two links that differ only in a tracking or session parameter (?utm_source=x vs ?utm_source=y) collapse to one local file instead of saving a separate copy per parameter value. The fetched URL is untouched (a required sid= still goes on the wire); only the local namespace collapses. The option is repeatable and takes [host/pattern=]key1,key2,.... Patterns match the normalized host/path with the same glob as the +/- filters (last match wins), and stripping is decoupled from -%u, so it still works when URL-hack normalization is off.
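
The rule resolution described above can be sketched as follows. This is an illustration only, not HTTrack's code: `strip_rule`, `parse_rule`, `resolve_keys` and `glob_match` are hypothetical names, the tiny `*`-only matcher stands in for the real filter glob (strjoker), and "last match wins" is read here as "the last matching rule supplies the whole key list", which is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule parsed from one --strip-query value. */
typedef struct {
  char pattern[128]; /* host/path glob; unused when has_pattern == 0 */
  char keys[128];    /* comma-separated key names */
  int has_pattern;   /* 0 means the rule applies to every URL */
} strip_rule;

/* Minimal stand-in glob: '*' matches any run of characters (assumption,
   far simpler than the real +/- filter matcher). */
static int glob_match(const char *pat, const char *s) {
  if (*pat == '\0') return *s == '\0';
  if (*pat == '*') {
    for (;; s++) {
      if (glob_match(pat + 1, s)) return 1;
      if (*s == '\0') return 0;
    }
  }
  return *s != '\0' && *pat == *s && glob_match(pat + 1, s + 1);
}

/* Splits "[pattern=]key1,key2" at the first '='. Returns 0 if too long. */
static int parse_rule(const char *spec, strip_rule *out) {
  const char *eq = strchr(spec, '=');
  const char *keys = eq ? eq + 1 : spec;
  size_t plen = eq ? (size_t)(eq - spec) : 0;
  if (plen >= sizeof out->pattern || strlen(keys) >= sizeof out->keys)
    return 0;
  memcpy(out->pattern, spec, plen);
  out->pattern[plen] = '\0';
  strcpy(out->keys, keys);
  out->has_pattern = eq != NULL;
  return 1;
}

/* Walks the rules in command-line order; a later match overrides an
   earlier one. Returns NULL when no rule applies. */
static const char *resolve_keys(const strip_rule *rules, size_t n,
                                const char *host_path) {
  const char *keys = NULL;
  for (size_t i = 0; i < n; i++)
    if (!rules[i].has_pattern || glob_match(rules[i].pattern, host_path))
      keys = rules[i].keys;
  return keys;
}
```

Under these assumptions, with the rules `utm_source,utm_medium` and then `example.com/shop/*=sid`, `resolve_keys` returns `sid` for `example.com/shop/cart` and `utm_source,utm_medium` for `example.com/blog/post`.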

Everything funnels through one chokepoint, fil_normalized: an internal fil_normalized_filtered strips and then delegates, and hts_query_strip_keys resolves the per-URL key list. There is no exported ABI change (the helpers are hidden, soname stays .so.3); the strip_query field is appended at the tail of httrackp. The reordered-parameter half of #112 already worked (fil_normalized sorts query args); this adds the missing key-dropping half and covers #225.
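
Why appending at the tail keeps existing field offsets stable can be shown with a toy layout. The two structs below are hypothetical stand-ins (the real httrackp has many more fields), and the argument only holds on the assumption that callers obtain the struct from the library rather than allocating it themselves.

```c
#include <stddef.h>

/* Hypothetical "before" layout. */
typedef struct {
  int depth;
  int urlhack;
} opts_before;

/* Hypothetical "after" layout: the new field goes last. */
typedef struct {
  int depth;
  int urlhack;
  const char *strip_query; /* appended at the tail */
} opts_after;

/* Compile-time checks (C11): every pre-existing field keeps its offset,
   so code built against the old layout still reads the right bytes. */
_Static_assert(offsetof(opts_before, depth) == offsetof(opts_after, depth),
               "depth moved");
_Static_assert(offsetof(opts_before, urlhack) == offsetof(opts_after, urlhack),
               "urlhack moved");

/* Runtime restatement: the new field lies entirely after the old ones. */
static int tail_append_keeps_offsets(void) {
  return offsetof(opts_after, strip_query) >=
         offsetof(opts_before, urlhack) + sizeof(int);
}
```

Inserting the field anywhere but the end would shift the offsets of the fields after it and fail the static assertions.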

One bug came out of review and is worth flagging: the strip rebuild dropped trailing and empty query fields, which broke idempotency under the read path's second normalization and silently missed dedup on URLs like /page?&utm=x. Fixed, with a 50-case fixpoint test. Covered by a -#test=stripquery self-test (including degenerate queries like a=&b&c==) and an end-to-end dedup crawl test.
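
The fixpoint property can be sketched as below: a strip pass that copies every non-listed field verbatim, empty and trailing ones included, so that a second pass changes nothing. This is a minimal illustration under assumed names (`strip_query`, `key_listed`, `strip_is_fixpoint` are hypothetical), not the merged implementation, and it leaves out the sorting that fil_normalized performs.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if key (length klen) appears in the comma-separated list. */
static int key_listed(const char *list, const char *key, size_t klen) {
  while (*list != '\0') {
    size_t n = strcspn(list, ",");
    if (n == klen && strncmp(list, key, n) == 0) return 1;
    list += n;
    if (*list == ',') list++;
  }
  return 0;
}

/* Copies query (the text after '?') into out, dropping fields whose key
   is listed. All other fields, including empty and trailing ones, are
   kept as-is. Returns 0 if out is too small. */
static int strip_query(const char *query, const char *keys,
                       char *out, size_t cap) {
  size_t len = 0;
  int first = 1;
  const char *p = query;
  if (cap == 0) return 0;
  for (;;) {
    size_t flen = strcspn(p, "&"); /* field length up to '&' or end */
    size_t klen = 0;
    while (klen < flen && p[klen] != '=') klen++; /* key ends at '=' */
    if (!key_listed(keys, p, klen)) {
      size_t need = flen + (first ? 0 : 1);
      if (len + need + 1 > cap) return 0;
      if (!first) out[len++] = '&';
      memcpy(out + len, p, flen);
      len += flen;
      first = 0;
    }
    if (p[flen] == '\0') break;
    p += flen + 1;
  }
  out[len] = '\0';
  return 1;
}

/* Idempotency check: stripping an already stripped query is a no-op. */
static int strip_is_fixpoint(const char *query, const char *keys) {
  char once[256], twice[256];
  if (!strip_query(query, keys, once, sizeof once)) return 0;
  if (!strip_query(once, keys, twice, sizeof twice)) return 0;
  return strcmp(once, twice) == 0;
}
```

With keys `b`, the degenerate query `a=&b&c==` becomes `a=&c==`; with keys `utm`, `x=1&utm=2&` becomes `x=1&`, keeping the trailing empty field.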

Closes #112

xroche force-pushed the feature/query-param-handling branch from ef1b21c to c92534a on June 27, 2026 08:55
Two URLs that differ only in tracking or session query parameters
(?utm_source=x versus ?utm_source=y) were saved as separate files, and a
single CGI could fan out into thousands of near-duplicate pages.
fil_normalized already sorted query args, so reordered parameters dedup,
but there was no way to drop a named key.

--strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the
listed keys when computing the dedup key and the saved name. The fetched
URL is untouched, so a required sid= is still sent on the wire; only the
local namespace collapses. Patterns match the normalized host/path with
the +/- filter glob (strjoker), last match wins as in the filter list,
and stripping is decoupled from urlhack (-%u) so it never silently
no-ops with -%u0.

It all funnels through one chokepoint, fil_normalized: an internal
fil_normalized_filtered() strips then delegates, and hts_query_strip_keys
resolves the per-URL key list. The strip pass walks every query field,
including empty and trailing ones, so its output is a fixpoint under the
read path's second normalization (otherwise dedup silently misses).
Exported ABI is unchanged; the strip_query field is appended at the tail
of httrackp. Covered by a -#test=stripquery self-test (degenerate queries
like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup
crawl test.

Closes #112

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
xroche force-pushed the feature/query-param-handling branch from c92534a to 6d1b677 on June 27, 2026 09:03
xroche merged commit 40a6660 into master on Jun 27, 2026
14 checks passed
Linked issue: Feature Request: Query aware filtering (CGI, PHP, etc)