
refactor: Extract split filter provider interface to improve flexibility for user customization.#46

Merged
anlowee merged 35 commits into y-scope:release-0.293-clp-connector from anlowee:xwei/refactor-metadatafilter-1
Aug 25, 2025

@anlowee anlowee commented Jul 21, 2025

Description

This PR extracts a split filter provider interface so that users can define their own config format for their own metadata database. Since the metadata filter is only used for split filtering, we rename it to split filter.

We add a new package, split.filter, and a base class, ClpSplitFilterProvider.

The base class keeps all the functions for processing scope, a concept from the previous metadata filter design (i.e., whether a filter applies to all schemas and all tables, to the tables under a certain schema, or only to a certain table). It also declares two abstract functions:

  1. getCustomSplitFilterOptionsClass, which returns the user's own implementation of the CustomSplitFilterOptions class. It also acts as a guard so that the user won't forget to implement CustomSplitFilterOptions (see below for its definition).
  2. remapSplitFilterPushDown, renamed from the old function remapFilterSql. This function rewrites the split filter push-down expression using split-filter-specific logic, so the user only needs to extend this one function to handle that logic.
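The shape of such a base class can be sketched roughly as follows. This is only an illustration: the two abstract method names come from the description above, but their signatures, the class names `SplitFilterProviderSketch` and `PassThroughProviderSketch`, the helper methods, and the `level` column are all assumptions and are not taken from the actual code in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: only the two abstract method names come from the PR
// description; every other name and all signatures are assumptions.
abstract class SplitFilterProviderSketch {
    // Maps a scope ("catalog", "catalog.schema", or "catalog.schema.table")
    // to the filter column names configured at that scope.
    private final Map<String, List<String>> columnsByScope = new LinkedHashMap<>();

    void addFilters(String scope, String... columnNames) {
        columnsByScope.computeIfAbsent(scope, k -> new ArrayList<>())
                .addAll(Arrays.asList(columnNames));
    }

    // Expands "a.b.c" into ["a", "a.b", "a.b.c"], broadest scope first.
    static List<String> scopesFor(String tableScope) {
        List<String> scopes = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (String part : tableScope.split("\\.")) {
            if (prefix.length() > 0) {
                prefix.append('.');
            }
            prefix.append(part);
            scopes.add(prefix.toString());
        }
        return scopes;
    }

    // Collects the filters that apply to a table from every enclosing scope.
    List<String> columnsFor(String tableScope) {
        List<String> columns = new ArrayList<>();
        for (String scope : scopesFor(tableScope)) {
            List<String> scoped = columnsByScope.get(scope);
            if (scoped != null) {
                columns.addAll(scoped);
            }
        }
        return columns;
    }

    // Assumed signature: the class that "customOptions" is deserialized into.
    abstract Class<?> getCustomSplitFilterOptionsClass();

    // Assumed signature: rewrites the push-down expression for one scope.
    abstract String remapSplitFilterPushDown(String scope, String pushDownExpression);
}

// Trivial provider that has no custom options and rewrites nothing.
final class PassThroughProviderSketch extends SplitFilterProviderSketch {
    @Override
    Class<?> getCustomSplitFilterOptionsClass() {
        return Object.class;
    }

    @Override
    String remapSplitFilterPushDown(String scope, String pushDownExpression) {
        return pushDownExpression;
    }

    public static void main(String[] args) {
        SplitFilterProviderSketch provider = new PassThroughProviderSketch();
        provider.addFilters("clp", "level");
        provider.addFilters("clp.default.table_1", "msg.timestamp", "file_name");
        // prints [level, msg.timestamp, file_name]
        System.out.println(provider.columnsFor("clp.default.table_1"));
    }
}
```

Keeping scope handling in the base class means a custom provider would only supply the two abstract functions, which matches the split of responsibilities described above.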

We move the current filter structure into a separate class, ClpMetadataFilter, which defines the basic structure for filters across all types of metadata databases:

  1. "columnName": the filter name.
  2. "customOptions": the split-filter-specific data structure, to be implemented by the user, which holds all split-filter-specific fields (e.g., "rangeMapping"). We provide an implementation for the MySQL metadata database in ClpMySqlSplitFilterProvider.SplitDatabaseSpecific.
  3. "required": whether the filter must be present after push down.
    As discussed, we will move this field into "customOptions" in the next PR.

We also add a new class, ClpCustomSplitFilterOptionsDeserializer, which deserializes the "customOptions" field into the class returned by getCustomSplitFilterOptionsClass.

In the config file for the MySQL metadata database, the only change is that "rangeMapping" moves into the "customOptions" field:

{
  "clp.default.table_1": [
      {
        "columnName": "msg.timestamp",
        "customOptions": {
          "rangeMapping": {
            "lowerBound": "begin_timestamp",
            "upperBound": "end_timestamp"
          }
        },
        "required": true
      },
      {
        "columnName": "file_name"
      }
  ]
}
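To make the intent of "rangeMapping" concrete, here is a minimal sketch of how a comparison on the mapped column could be rewritten against the bound columns. It assumes the usual range-overlap rule (a split is kept only if its [lowerBound, upperBound] range can contain a matching value). The class `RangeMappingSketch`, the `remap` method, and its parameters are hypothetical, and the actual rewrite performed by ClpMySqlSplitFilterProvider may differ.

```java
// Hypothetical sketch of a range-mapping rewrite; not the connector's code.
final class RangeMappingSketch {
    // Rewrites "<column> <op> <literal>" in terms of the bound columns when
    // the column is the mapped one; other columns pass through unchanged.
    static String remap(String column, String op, String literal,
            String mappedColumn, String lowerBound, String upperBound) {
        if (!column.equals(mappedColumn)) {
            return column + " " + op + " " + literal;
        }
        switch (op) {
            case ">":
            case ">=":
                // Some value above the literal exists only if the max is above it.
                return upperBound + " " + op + " " + literal;
            case "<":
            case "<=":
                // Some value below the literal exists only if the min is below it.
                return lowerBound + " " + op + " " + literal;
            case "=":
                // The literal must fall inside the split's range.
                return "(" + lowerBound + " <= " + literal
                        + " AND " + upperBound + " >= " + literal + ")";
            default:
                throw new IllegalArgumentException("Unsupported operator: " + op);
        }
    }

    public static void main(String[] args) {
        // prints end_timestamp > 0
        System.out.println(remap("msg.timestamp", ">", "0",
                "msg.timestamp", "begin_timestamp", "end_timestamp"));
        // prints (begin_timestamp <= 5 AND end_timestamp >= 5)
        System.out.println(remap("msg.timestamp", "=", "5",
                "msg.timestamp", "begin_timestamp", "end_timestamp"));
    }
}
```

Under this assumption, a filter on file_name, which has no "rangeMapping" in the config above, would be pushed down unchanged.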

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  1. E2E tests with and without metadata filters, using pregsql datasets with the timestamp configured as a required metadata filter:
    • SELECT 1 = 1 FROM default LIMIT 1 throws an exception due to the missing required filter, because even though it doesn't query any columns, it still scans the table.
    • SELECT query_id + 1 FROM default LIMIT 1 throws an exception due to the missing required filter.
    • SELECT * FROM default WHERE ps = 'startup' LIMIT 1 throws an exception due to the missing required filter.
    • SELECT COUNT(*) FROM default WHERE timestamp > FROM_UNIXTIME(0) AND timestamp < FROM_UNIXTIME(9999999999.1234); returns correctly.
  2. Passed CI.
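The behavior exercised by these queries can be sketched as a simple check: every filter marked as required must be referenced by the pushed-down expression, otherwise the query is rejected. The class and method names below are hypothetical, and the sketch takes the set of referenced columns as input instead of parsing SQL, so it only illustrates the rule, not the connector's implementation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of required-filter validation; not the connector's code.
final class RequiredSplitFilterCheckSketch {
    // Returns the required filter columns that the push down does not reference.
    static Set<String> missingRequired(Collection<String> required,
            Collection<String> referenced) {
        Set<String> missing = new TreeSet<>(required);
        missing.removeAll(new HashSet<>(referenced));
        return missing;
    }

    // Throws if any required filter column is absent from the push down.
    static void check(Collection<String> required, Collection<String> referenced) {
        Set<String> missing = missingRequired(required, referenced);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Missing required split filter columns: " + missing);
        }
    }

    public static void main(String[] args) {
        Collection<String> required = Arrays.asList("timestamp");
        // Passes: the WHERE clause constrains the required column.
        check(required, Arrays.asList("timestamp"));
        try {
            // Fails: the query filters on "ps" only.
            check(required, Arrays.asList("ps"));
        } catch (IllegalStateException e) {
            // prints Missing required split filter columns: [timestamp]
            System.out.println(e.getMessage());
        }
        try {
            // Fails: nothing is pushed down, as in SELECT 1 = 1 ... LIMIT 1.
            check(required, Collections.<String>emptyList());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```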

Summary by CodeRabbit

  • New Features
    • Introduces Split Filter support with a configurable provider (MySQL).
    • New settings: clp.split-filter-config and clp.split-filter-provider-type.
    • Enhanced split-pruning via range-mapping for numeric filters; validates required filters.
  • Refactor
    • Terminology shift from “Metadata Filter” to “Split Filter”.
    • Old clp.metadata-filter-config removed; update configurations accordingly.
    • Error identifiers updated to Split Filter terminology; adds unsupported provider error.
  • Documentation
    • Overhauled connector docs: new Split Filter configuration, examples, and provider guidance.
    • Expanded type mapping, SQL support notes, and revised configuration walkthroughs.
