HIVE-28755: Statistics Management Task #6438
DanielZhu58 wants to merge 5 commits into apache:master
Conversation
Pull request overview
Adds a new Metastore background task intended to automatically delete expired column statistics, and wires it into the metastore task thread list. The PR also introduces a benchmark hook and a new unit test for the statistics-management behavior.
Changes:
- Introduce StatisticsManagementTask (a MetastoreTaskThread) that deletes column stats older than a configured retention window.
- Add new metastore configuration knobs for stats-management frequency/retention/enablement and register the task in the task thread list.
- Add a micro-benchmark entry and a new unit test class for statistics management.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java | Adds a benchmark for StatisticsManagementTask (currently simulates/validates the wrong thing). |
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/BenchmarkTool.java | Wires the new benchmark into the benchmark suite (one suite name is inconsistent). |
| standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java | New unit tests for stats auto-deletion (currently missing required statsDesc on the stats object). |
| standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java | New background task implementation (contains multiple correctness/compilation/runtime issues). |
| standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java | Adds new conf vars and registers the task; docstrings currently don’t match how the feature is enabled. |
```java
q.setResult(
    "table.database.name, " +
    "table.tableName, " +
    "partitionName, " +
    "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"
);
```
MTableColumnStatistics/TAB_COL_STATS doesn't have a partitionName field/column; this JDOQL result projection is likely to fail at runtime. If partition-level stats need handling, query MPartitionColumnStatistics and project partition.partitionName (or otherwise join to MPartition) instead of referencing partitionName directly.
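A minimal sketch of the partition-level projection this comment suggests. The field paths follow the comment's recommendation (navigate through the `partition` field of MPartitionColumnStatistics) and are not verified against this repo's package.jdo; the property key stands in for the PR's `STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY` constant:

```java
// Hypothetical projection for an MPartitionColumnStatistics query: reach the
// partition name through the "partition" relationship instead of referencing
// a non-existent partitionName field on the table-level stats class.
String STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY = "statistics.auto.deletion.exclude"; // assumed key
String projection =
    "partition.table.database.name, " +
    "partition.table.tableName, " +
    "partition.partitionName, " +
    "partition.table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")";
System.out.println(projection);
```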
```java
DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName);
request.setEngine("hive");

// decide tableLevel based on whether this stat row is table-level or partition-level
// avoids loading table partition keys / MTable
request.setTableLevel(partName == null);
msc.deleteColumnStatistics(request);
}
```
For partition-level stats (partName != null), the request does not set part_names. On the server side, an empty part_names causes delete_column_statistics_req to delete stats for all partitions of the table. Add the specific partition name(s) to the request (and dedupe/group by table+partition to avoid repeated deletes per column-stats row).
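One way to sketch the suggested dedupe/group step, as a plain-Java simulation. The rows stand in for the JDOQL projection results; each grouped entry would then back a single DeleteColumnStatisticsRequest, with its partition names added to part_names (via the Thrift-generated adder, e.g. `addToPart_names` — name assumed):

```java
import java.util.*;

// Group expired-stats rows by table, collecting distinct partition names, so
// one delete request per table carries all of its expired partitions instead
// of issuing a delete per column-stats row.
// Rows are simulated here as {db, table, partitionName-or-null} arrays.
List<String[]> rows = Arrays.asList(
    new String[]{"db1", "t1", "p=1"},
    new String[]{"db1", "t1", "p=1"},   // duplicate row (another column's stats)
    new String[]{"db1", "t1", "p=2"},
    new String[]{"db1", "t2", null}     // table-level stats row
);
Map<String, Set<String>> partsByTable = new LinkedHashMap<>();
for (String[] r : rows) {
    String key = r[0] + "." + r[1];
    partsByTable.computeIfAbsent(key, k -> new LinkedHashSet<>());
    if (r[2] != null) {
        partsByTable.get(key).add(r[2]);
    }
}
System.out.println(partsByTable);
```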
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
```java
@Override
public void setConf(Configuration configuration) {
  // we modify conf in setupConf(), so we make a copy
  this.conf = configuration;
```
setConf says it makes a copy because the task may mutate the configuration, but it currently assigns the passed instance directly. This diverges from PartitionManagementTask (which does new Configuration(configuration)) and can cause unexpected cross-thread/shared-conf mutations; make a defensive copy here as well (or update the comment if no mutation happens).
```diff
- this.conf = configuration;
+ this.conf = new Configuration(configuration);
```
```java
ColumnStatistics cs = new ColumnStatistics();
ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl);
desc.setCatName("hive");
cs.addToStatsObj(obj);

client.updateTableColumnStatistics(cs);
}
```
ColumnStatisticsDesc desc is created but never attached to the ColumnStatistics object. updateTableColumnStatistics expects the stats descriptor (db/table/cat/lastAnalyzed/etc.) to be set (see other tests like TestHiveMetaStore); call cs.setStatsDesc(desc) before updating, otherwise the update can fail or write incomplete stats.
Acknowledged and changed.
```java
// simulate the partitions of each table which its stats has an old "lastAnalyzed"
List<Partition> partitions = client.listPartitions(dbName, tableName);
for (Partition partition : partitions) {
  Map<String, String> params = partition.getParameters();
  // to manually change the "lastAnalyzed" to an old time, ex. 400 days
  params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)));
}
client.alterPartitions(dbName, tableName, partitions);
}
```
The benchmark simulates expired stats by updating the partition parameter lastAnalyzed, but StatisticsManagementTask currently queries MTableColumnStatistics.lastAnalyzed (TAB_COL_STATS) instead of partition/table parameters. As a result, this setup (and the post-run assertion that lastAnalyzed is removed from partition parameters) is not exercising the code path being benchmarked; update the benchmark to expire the same metadata the task uses (or update the task to use the same source of truth).
```java
+ "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + ","
+ "org.apache.hadoop.hive.metastore.StatisticsManagementTask",
```
Adding org.apache.hadoop.hive.metastore.StatisticsManagementTask to METASTORE_TASK_THREADS_ALWAYS means it will start in all deployments. Given the new configs currently default to enabling deletion, this can change runtime behavior unexpectedly after upgrade; ensure this is gated behind an explicit opt-in (or keep it out of the always-start list until explicitly configured).
```diff
- + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + ","
- + "org.apache.hadoop.hive.metastore.StatisticsManagementTask",
+ + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask",
```
Acknowledged. Set to false by default.
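With the default now false, enabling the task becomes an explicit opt-in, e.g. via a metastore-site.xml fragment along these lines (the property name below is hypothetical; the actual key is whatever the new ConfVars in MetastoreConf.java define):

```xml
<!-- Hypothetical opt-in: substitute the actual property name from the new ConfVars -->
<property>
  <name>metastore.statistics.management.enabled</name>
  <value>true</value>
</property>
```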
```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
```
java.util.Map is imported but not used in this class; please remove the unused import to keep the file clean (and to satisfy builds that fail on unused imports/checkstyle).
```java
Util.TableBuilder tableBuilder = new Util.TableBuilder(dbName, tableName)
    .withType(TableType.MANAGED_TABLE)
    .withColumns(createSchema(Arrays.asList("col1:string", "col2:int")))
    .withPartitionKeys(createSchema(Collections.singletonList("part_col")))
    .withParameter("columnStatsAccurate", "true");
```
Table/partition stats accuracy is tracked under the COLUMN_STATS_ACCURATE parameter (see StatsSetupConst.COLUMN_STATS_ACCURATE), but the benchmark sets columnStatsAccurate. If this parameter is intended to make the metastore think column stats are present/accurate, the current key likely won't be recognized; use the correct constant/key.
```java
String excludeVal = (String) row[3]; // can be null

// exclude check uses projected param value
if (excludeVal != null) {
```
The exclusion check treats any non-null value of statistics.auto.deletion.exclude as excluded (including "false"). This makes it impossible to explicitly set the property to false and still have deletion run; compare the value to "true" (case-insensitive) instead of only checking for presence.
```diff
- if (excludeVal != null) {
+ if ("true".equalsIgnoreCase(excludeVal)) {
```
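The intended semantics can be checked in isolation: only an explicit "true" (any case) should exclude a table, while an absent property or "false" should leave deletion enabled.

```java
// Presence-only check (excludeVal != null) would wrongly exclude "false";
// comparing case-insensitively against "true" gives the intended behavior.
// Note String.equalsIgnoreCase is null-safe and returns false for null.
String[] values = {null, "false", "FALSE", "true", "True"};
for (String excludeVal : values) {
    boolean excluded = "true".equalsIgnoreCase(excludeVal);
    System.out.println(excludeVal + " -> excluded=" + excluded);
}
```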
```java
// Compute an old timestamp in seconds, here we use 400 days ago.
long oldSeconds = (System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)) / 1000;

// NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping
```
This test includes a TODO-style note about JDO field paths varying ("adjust filter if needed"). Since this is committed test code, it should be deterministic against the actual model mapping in this repo; please either remove the note or replace it with a concrete explanation of why this query/filter is correct here.
```diff
- // NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping
+ // In this repository's JDO model, MTableColumnStatistics points to MTable via the
+ // "table" field, and MTable exposes the table and database names as "tableName" and
+ // "database.name". This filter therefore selects exactly the statistics rows for the
+ // target table in the target database.
```
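The cutoff arithmetic in the test can be sanity-checked on its own: a lastAnalyzed timestamp 400 days old (in seconds) falls before the cutoff of any retention window shorter than 400 days; 365 days is assumed below purely as an illustration.

```java
import java.util.concurrent.TimeUnit;

// Stats carry lastAnalyzed in seconds; the task compares it against a cutoff
// of now minus the retention period. 400 days ago is older than a 365-day window.
long nowMs = System.currentTimeMillis();
long oldSeconds = (nowMs - TimeUnit.DAYS.toMillis(400)) / 1000;
long cutoffSeconds = (nowMs - TimeUnit.DAYS.toMillis(365)) / 1000;  // hypothetical retention: 365 days
boolean expired = oldSeconds < cutoffSeconds;
System.out.println("expired=" + expired);  // expired=true
```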
```java
 * If some table or partition column statistics are older than the configured retention interval
 * (MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD), they are deleted when this metastore task runs periodically.
 */
public class StatisticsManagementTask implements MetastoreTaskThread {
```
What if this class extended ObjectStore, since we need the internal PersistenceManager (pm)?
Acknowledged and discussed.
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?