Skip to content

HIVE-28755: Statistics Management Task#6438

Open
DanielZhu58 wants to merge 5 commits intoapache:masterfrom
DanielZhu58:HIVE-28755
Open

HIVE-28755: Statistics Management Task#6438
DanielZhu58 wants to merge 5 commits intoapache:masterfrom
DanielZhu58:HIVE-28755

Conversation

@DanielZhu58
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Metastore background task intended to automatically delete expired column statistics, and wires it into the metastore task thread list. The PR also introduces a benchmark hook and a new unit test for the statistics-management behavior.

Changes:

  • Introduce StatisticsManagementTask (a MetastoreTaskThread) that deletes column stats older than a configured retention window.
  • Add new metastore configuration knobs for stats-management frequency/retention/enablement and register the task in the task thread list.
  • Add a micro-benchmark entry and a new unit test class for statistics management.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java Adds a benchmark for StatisticsManagementTask (currently simulates/validates the wrong thing).
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/BenchmarkTool.java Wires the new benchmark into the benchmark suite (one suite name is inconsistent).
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java New unit tests for stats auto-deletion (currently missing required statsDesc on the stats object).
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java New background task implementation (contains multiple correctness/compilation/runtime issues).
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java Adds new conf vars and registers the task; docstrings currently don’t match how the feature is enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +100 to +105
q.setResult(
"table.database.name, " +
"table.tableName, " +
"partitionName, " +
"table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"
);
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MTableColumnStatistics/TAB_COL_STATS doesn't have a partitionName field/column; this JDOQL result projection is likely to fail at runtime. If partition-level stats need handling, query MPartitionColumnStatistics and project partition.partitionName (or otherwise join to MPartition) instead of referencing partitionName directly.

Copilot uses AI. Check for mistakes.
Comment on lines +120 to +127
DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName);
request.setEngine("hive");

// decide tableLevel based on whether this stat row is table-level or partition-level
// avoids loading table partition keys / MTable
request.setTableLevel(partName == null);
msc.deleteColumnStatistics(request);
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For partition-level stats (partName != null), the request does not set part_names. On the server side, an empty part_names causes delete_column_statistics_req to delete stats for all partitions of the table. Add the specific partition name(s) to the request (and dedupe/group by table+partition to avoid repeated deletes per column-stats row).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Override
public void setConf(Configuration configuration) {
// we modify conf in setupConf(), so we make a copy
this.conf = configuration;
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setConf says it makes a copy because the task may mutate the configuration, but it currently assigns the passed instance directly. This diverges from PartitionManagementTask (which does new Configuration(configuration)) and can cause unexpected cross-thread/shared-conf mutations; make a defensive copy here as well (or update the comment if no mutation happens).

Suggested change
this.conf = configuration;
this.conf = new Configuration(configuration);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

Comment on lines +179 to +185
ColumnStatistics cs = new ColumnStatistics();
ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl);
desc.setCatName("hive");
cs.addToStatsObj(obj);

client.updateTableColumnStatistics(cs);
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnStatisticsDesc desc is created but never attached to the ColumnStatistics object. updateTableColumnStatistics expects the stats descriptor (db/table/cat/lastAnalyzed/etc.) to be set (see other tests like TestHiveMetaStore); call cs.setStatsDesc(desc) before updating, otherwise the update can fail or write incomplete stats.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged and changed.

Comment on lines +725 to +733
// simulate the partitions of each table which its stats has an old "lastAnalyzed"
List<Partition> partitions = client.listPartitions(dbName, tableName);
for (Partition partition : partitions) {
Map<String, String> params = partition.getParameters();
// to manually change the "lastAnalyzed" to an old time, ex. 400 days
params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)));
}
client.alterPartitions(dbName, tableName, partitions);
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark simulates expired stats by updating the partition parameter lastAnalyzed, but StatisticsManagementTask currently queries MTableColumnStatistics.lastAnalyzed (TAB_COL_STATS) instead of partition/table parameters. As a result, this setup (and the post-run assertion that lastAnalyzed is removed from partition parameters) is not exercising the code path being benchmarked; update the benchmark to expire the same metadata the task uses (or update the task to use the same source of truth).

Copilot uses AI. Check for mistakes.
Comment on lines +1543 to +1544
+ "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + ","
+ "org.apache.hadoop.hive.metastore.StatisticsManagementTask",
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding org.apache.hadoop.hive.metastore.StatisticsManagementTask to METASTORE_TASK_THREADS_ALWAYS means it will start in all deployments. Given the new configs currently default to enabling deletion, this can change runtime behavior unexpectedly after upgrade; ensure this is gated behind an explicit opt-in (or keep it out of the always-start list until explicitly configured).

Suggested change
+ "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + ","
+ "org.apache.hadoop.hive.metastore.StatisticsManagementTask",
+ "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask",

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Set to false by default.

Comment on lines +21 to +23
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java.util.Map is imported but not used in this class; please remove the unused import to keep the file clean (and to satisfy builds that fail on unused imports/checkstyle).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

Comment on lines +716 to +721
Util.TableBuilder tableBuilder = new Util.TableBuilder(dbName, tableName)
.withType(TableType.MANAGED_TABLE)
.withColumns(createSchema(Arrays.asList("col1:string", "col2:int")))
.withPartitionKeys(createSchema(Collections.singletonList("part_col")))
.withParameter("columnStatsAccurate", "true");

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Table/partition stats accuracy is tracked under the COLUMN_STATS_ACCURATE parameter (see StatsSetupConst.COLUMN_STATS_ACCURATE), but the benchmark sets columnStatsAccurate. If this parameter is intended to make the metastore think column stats are present/accurate, the current key likely won't be recognized; use the correct constant/key.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

String excludeVal = (String) row[3]; // can be null

// exclude check uses projected param value
if (excludeVal != null) {
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exclusion check treats any non-null value of statistics.auto.deletion.exclude as excluded (including "false"). This makes it impossible to explicitly set the property to false and still have deletion run; compare the value to "true" (case-insensitive) instead of only checking for presence.

Suggested change
if (excludeVal != null) {
if ("true".equalsIgnoreCase(excludeVal)) {

Copilot uses AI. Check for mistakes.
// Compute an old timestamp in seconds, here we use 400 days ago.
long oldSeconds = (System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)) / 1000;

// NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test includes a TODO-style note about JDO field paths varying ("adjust filter if needed"). Since this is committed test code, it should be deterministic against the actual model mapping in this repo; please either remove the note or replace it with a concrete explanation of why this query/filter is correct here.

Suggested change
// NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping
// In this repository's JDO model, MTableColumnStatistics points to MTable via the
// "table" field, and MTable exposes the table and database names as "tableName" and
// "database.name". This filter therefore selects exactly the statistics rows for the
// target table in the target database.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

* If some table or partition column statistics are older than the configured retention interval
* (MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD), they are deleted when this metastore task runs periodically.
*/
public class StatisticsManagementTask implements MetastoreTaskThread {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how if this class extends the ObjectStore? since we need the internal pm.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged and discussed.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 6, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants