Add --strip-query to drop query parameters from dedup naming #434
Merged
Two URLs that differ only in tracking or session query parameters (?utm_source=x versus ?utm_source=y) were saved as separate files, and a single CGI could fan out into thousands of near-duplicate pages. fil_normalized already sorted query args, so reordered parameters dedup, but there was no way to drop a named key.

--strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the listed keys when computing the dedup key and the saved name. The fetched URL is untouched, so a required sid= is still sent on the wire; only the local namespace collapses. Patterns match the normalized host/path with the +/- filter glob (strjoker), last match wins as in the filter list, and stripping is decoupled from urlhack (-%u) so it never silently no-ops with -%u0.

It all funnels through one chokepoint, fil_normalized: an internal fil_normalized_filtered() strips then delegates, and hts_query_strip_keys resolves the per-URL key list. The strip pass walks every query field, including empty and trailing ones, so its output is a fixpoint under the read path's second normalization (otherwise dedup silently misses). Exported ABI is unchanged; the strip_query field is appended at the tail of httrackp.

Covered by a -#test=stripquery self-test (degenerate queries like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup crawl test.

Closes #112

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
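To make the "last match wins" rule concrete, here is a minimal sketch of how a per-URL key list might be resolved from repeated `--strip-query` values. It is not HTTrack's code: `resolve_strip_keys` is a hypothetical name, POSIX `fnmatch` stands in for the filter glob (strjoker), and the sketch assumes that a rule without a pattern applies to every URL and that a later matching rule replaces an earlier one rather than merging with it.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the key list of the LAST rule whose pattern
   matches `hostpath`, or NULL if no rule applies.
   Each rule has the form "[pattern=]key1,key2,...". */
const char *resolve_strip_keys(const char *const rules[], size_t count,
                               const char *hostpath) {
  const char *result = NULL;
  size_t i;

  assert(rules != NULL && hostpath != NULL);
  for (i = 0; i < count; i++) {
    const char *eq = strchr(rules[i], '=');
    char pattern[256];
    size_t plen;

    if (eq == NULL) {
      /* Assumption: no pattern means the rule applies everywhere. */
      result = rules[i];
      continue;
    }
    plen = (size_t) (eq - rules[i]);
    if (plen >= sizeof pattern)
      continue; /* pattern too long for this sketch; skip the rule */
    memcpy(pattern, rules[i], plen);
    pattern[plen] = '\0';

    /* fnmatch is a stand-in for the real glob; a later match overwrites
       an earlier one, which gives "last match wins". */
    if (fnmatch(pattern, hostpath, 0) == 0)
      result = eq + 1;
  }
  return result;
}
```

With the rules `utm_source,utm_medium`, `shop.example.com/*=sid` and `shop.example.com/cart*=sid,step` given in that order, this sketch resolves `shop.example.com/cart/view` to `sid,step`, `shop.example.com/item` to `sid`, and any other host to `utm_source,utm_medium`.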
`--strip-query` drops named query parameters when computing a URL's dedup key and saved name, so two links that differ only in a tracking or session parameter (`?utm_source=x` vs `?utm_source=y`) collapse to one local file instead of saving a separate copy per parameter value. The fetched URL is untouched (a required `sid=` still goes on the wire); only the local namespace collapses. The option is repeatable and takes `[host/pattern=]key1,key2,...`. Patterns match the normalized host/path with the same glob as the `+`/`-` filters (last match wins), and stripping is decoupled from `-%u`, so it still works when URL-hack normalization is off.

Everything funnels through one chokepoint, `fil_normalized`: an internal `fil_normalized_filtered` strips and then delegates, and `hts_query_strip_keys` resolves the per-URL key list. There is no exported ABI change (the helpers are hidden, soname stays `.so.3`); the `strip_query` field is appended at the tail of `httrackp`. The reordered-parameter half of #112 already worked (`fil_normalized` sorts query args); this adds the missing key-dropping half and covers #225.

One bug came out of review and is worth flagging: the strip rebuild dropped trailing and empty query fields, which broke idempotency under the read path's second normalization and silently missed dedup on URLs like `/page?&utm=x`. Fixed, with a 50-case fixpoint test. Covered by a `-#test=stripquery` self-test (including degenerate queries like `a=&b&c==`) and an end-to-end dedup crawl test.

Closes #112
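The fixpoint property can be illustrated with a small sketch of a strip pass that visits every query field and copies the surviving ones verbatim, empty and trailing fields included. This is not the code from the PR: `strip_query_keys` and `key_listed` are hypothetical names, the key list is assumed to be a plain comma-separated string, and the URL is assumed to carry no `#fragment`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if the field's key (the text before '=') equals
   one of the comma-separated names in `keys`, else 0. */
static int key_listed(const char *field, size_t field_len, const char *keys) {
  const char *eq = memchr(field, '=', field_len);
  size_t key_len = eq != NULL ? (size_t) (eq - field) : field_len;
  const char *k = keys;

  while (*k != '\0') {
    size_t n = strcspn(k, ",");
    if (n > 0 && n == key_len && memcmp(k, field, n) == 0)
      return 1;
    k += n;
    if (*k == ',')
      k++;
  }
  return 0;
}

/* Copy `url` into `out`, dropping query fields whose key is listed.
   Returns 0 on success, -1 if `out` is too small. */
int strip_query_keys(const char *url, const char *keys, char *out,
                     size_t out_size) {
  const char *q, *field;
  size_t len;
  int kept = 0;

  assert(url != NULL && keys != NULL && out != NULL);
  q = strchr(url, '?');
  if (q == NULL) { /* no query: plain copy */
    if (strlen(url) >= out_size)
      return -1;
    strcpy(out, url);
    return 0;
  }
  len = (size_t) (q - url);
  if (len >= out_size)
    return -1;
  memcpy(out, url, len);

  /* Walk every '&'-separated field, including empty ones. */
  field = q + 1;
  for (;;) {
    size_t n = strcspn(field, "&");
    if (!key_listed(field, n, keys)) {
      if (len + 1 + n >= out_size) /* separator + field + final NUL */
        return -1;
      out[len++] = kept ? '&' : '?';
      memcpy(out + len, field, n);
      len += n;
      kept = 1;
    }
    if (field[n] == '\0')
      break;
    field += n + 1;
  }
  out[len] = '\0';
  return 0;
}
```

Under these assumptions, `/page?&utm=x` with key `utm` becomes `/page?`, and stripping `/page?` again leaves it unchanged, so a second pass cannot produce a different dedup key. A rebuild that silently discarded the empty field would instead map the two inputs to different strings, which is the failure mode described above.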