Skip to content

Conversation

@ragnard
Copy link

@ragnard ragnard commented Dec 26, 2025

Previously the pydantic @model_validator on SetStatisticsUpdate would fail because it assumed statistics was a model instance. In a "before"" validator that is not the case.

Use an "after" validator instead, where we can use instantiated and validated fields.

Before

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/.venv/lib/python3.14/site-packages/pydantic/main.py", line 716, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        obj,
        ^^^^
    ...<5 lines>...
        by_name=by_name,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/pyiceberg/table/update/__init__.py", line 191, in validate_snapshot_id
    data["snapshot_id"] = stats.snapshot_id
                          ^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'snapshot_id'

After

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
SetStatisticsUpdate(action='set-statistics', statistics=StatisticsFile(snapshot_id=1234, statistics_path='', file_size_in_bytes=0, file_footer_size_in_bytes=0, key_metadata=None, blob_metadata=[]), snapshot_id=1234)

Rationale for this change

Are these changes tested?

Yes, but only using the two-liners above.

Are there any user-facing changes?

No.

Previously the pydantic @model_validator would fail because it assumed
statistics was a model instance. In a "before"" validator that is not
the case.

Use an "after" validator instead, where we can use instantiated and
validated fields.
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I think we should just deprecate the top-level snapshot_id entirely.
For context, the "before model_validator` was added in 3b53edc#diff-769b43e1d8beaa86141f694679de2bbea3604a65f987a9acd7d9e9efca193b7eR181-R193 to maintain backwards compatibility and prep for deprecation

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops didnt mean to approve

@ragnard
Copy link
Author

ragnard commented Dec 27, 2025

@kevinjqliu Ok, do you want me to change the fix so that snapshot_id is still there, but just not automatically populated?

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the java implementation has not deprecated the top-level snapshot_id. lets proceed with this change since it improves the current validation logic.

Thanks for the PR! Lets add a test and it should be good to go!

Comment on lines +182 to +184
@model_validator(mode="after")
def validate_snapshot_id(self) -> Self:
return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@model_validator(mode="after")
def validate_snapshot_id(self) -> Self:
return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})
@model_validator(mode="after")
def validate_snapshot_id(self) -> Self:
self.snapshot_id = self.statistics.snapshot_id
return self

nit: use direct assignment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test, something like:

def test_set_statistics_update_without_snapshot_id(table_v2_with_statistics: Table) -> None:
    """Test that SetStatisticsUpdate auto-populates snapshot_id from statistics."""
    snapshot_id = table_v2_with_statistics.metadata.current_snapshot_id

    blob_metadata = BlobMetadata(
        type="apache-datasketches-theta-v1",
        snapshot_id=snapshot_id,
        sequence_number=2,
        fields=[1],
        properties={"prop-key": "prop-value"},
    )

    statistics_file = StatisticsFile(
        snapshot_id=snapshot_id,
        statistics_path="s3://bucket/warehouse/stats.puffin",
        file_size_in_bytes=124,
        file_footer_size_in_bytes=27,
        blob_metadata=[blob_metadata],
    )

    # Create update without explicitly providing snapshot_id
    update = SetStatisticsUpdate(statistics=statistics_file)

    # Validator should auto-populate snapshot_id from statistics
    assert update.snapshot_id == snapshot_id

    new_metadata = update_table_metadata(
        table_v2_with_statistics.metadata,
        (update,),
    )

    assert len(new_metadata.statistics) == 2
    updated_statistics = [stat for stat in new_metadata.statistics if stat.snapshot_id == snapshot_id]
    assert len(updated_statistics) == 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants