[WIP]Data schema generation #13

Draft
FanG-817 wants to merge 11 commits into wolfdancer:main from FanG-817:data_schema_generation

FanG-817 commented Nov 27, 2025

Contributor

This PR currently works. I am continuing to improve the validation part.

Some good suggestions from @amad-person that might help provide more error information for self-correction:

  • If a user creates a Python object directly, e.g. Column, a ValueError is raised with the specific error
  • If a user creates a JSON object, the same error is caught and passed up as a StructureError, which only shows where the JSON has an error, not why
  • The StructureError is what I think Claude is seeing. Letting it see the traceback (which will contain the ValueError) might help with self-correction
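
The wrapping behaviour described in these bullets can be sketched as follows. This is a minimal, self-contained illustration, not the Rockfish SDK code: `Column`, `StructureError`, `structure_column`, and `describe_failure` are hypothetical stand-ins defined locally, and the `[0, 1]` range check is only an example rule. It shows that when the wrapper is raised with `from`, the original `ValueError` remains reachable through `__cause__` and appears in the formatted traceback.

```python
import traceback


class StructureError(Exception):
    """Hypothetical stand-in for the SDK's wrapper exception."""


class Column:
    """Hypothetical stand-in: validates its own arguments on construction."""

    def __init__(self, name: str, spike_magnitude: float) -> None:
        if not 0 <= spike_magnitude <= 1:
            raise ValueError(f"spike_magnitude ({spike_magnitude}) must be in [0, 1]")
        self.name = name
        self.spike_magnitude = spike_magnitude


def structure_column(data: dict, path: str) -> Column:
    """Build a Column from JSON-like data, reporting only *where* it failed."""
    try:
        return Column(**data)
    except ValueError as exc:
        # "from exc" keeps the original error reachable via __cause__.
        raise StructureError(f"Failed to structure Column @ {path}") from exc


def describe_failure(data: dict, path: str) -> dict:
    """Return the wrapper message, the root cause, and the full traceback text."""
    try:
        structure_column(data, path)
    except StructureError as exc:
        return {
            "where": str(exc),
            "why": str(exc.__cause__),
            "traceback": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        }
    return {}


report = describe_failure(
    {"name": "cpu", "spike_magnitude": 5.0}, "$.entities[0].columns[1]"
)
print(report["where"])  # the wrapper message: location only
print(report["why"])    # the underlying ValueError message
```

Passing `report["traceback"]` rather than only `report["where"]` back to the model is the idea suggested above: the traceback text includes the chained `ValueError`, so the reason for the failure is visible alongside its location.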

FanG-817 marked this pull request as draft November 27, 2025 01:40
FanG-817 changed the title from "Data schema generation" to "[WIP]Data schema generation" Nov 27, 2025
commit 3196018
Author: Shane Duan <18065+wolfdancer@users.noreply.github.com>
Date:   Thu Nov 27 19:18:56 2025 -0800

    Fix requirements.txt to use --find-links for Rockfish SDK installation (wolfdancer#14)

    Changed from --extra-index-url to --find-links in requirements.txt to properly install the rockfish SDK from the custom package repository at https://packages.rockfish.ai.

    Also updated CLAUDE.md to document this requirement and added .claude/ to .gitignore.

    Fixes wolfdancer#9

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    Co-authored-by: Shane <wolfdancer@users.noreply.github.com>
    Co-authored-by: Claude <noreply@anthropic.com>

commit 2c686cc
Author: Shane Duan <18065+wolfdancer@users.noreply.github.com>
Date:   Wed Nov 26 13:32:52 2025 -0800

    Remove the query_dataset tool from the MCP server, as the execute_query tool can do the same job and works with LLMs much better. More specifically, query_dataset requires the table name to be "my_table", which LLMs are just not used to. Instead, execute_query uses the dataset id as the table name, and that works out of the box with LLMs. (wolfdancer#11)

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    Co-authored-by: Claude <noreply@anthropic.com>

commit b22e44b
Merge: 0260809 b129acf
Author: Shane Duan <18065+wolfdancer@users.noreply.github.com>
Date:   Tue Nov 25 19:34:01 2025 -0800

    Merge pull request wolfdancer#7 from FanG-817/feature/workflow_rf_tab_gan

    Add SDK synthetic data generation workflow

commit 0260809
Merge: c5ba902 81d51be
Author: Shane Duan <18065+wolfdancer@users.noreply.github.com>
Date:   Mon Nov 24 20:05:50 2025 -0800

    Merge pull request wolfdancer#6 from FanG-817/refactor/standardize-api-url-naming

    update all "base_url" to "api_url" with backward compatibility as part of the deprecation process

FanG-817 changed the base branch from main to add-dataset-query-functionality December 2, 2025 23:58
FanG-817 changed the base branch from add-dataset-query-functionality to main December 2, 2025 23:59

FanG-817 commented Dec 3, 2025

Contributor Author

I have updated the extraction and parsing of the structure error:

import re
from typing import Any, Dict


def extract_structure_error_details(exc: Exception) -> Dict[str, Any]:
    """
    Extract error message and location from StructureError exception chain.

    Simplified version that extracts only essential information:
    - error_message: The actual error text from the exception
    - location: The JSON path where the error occurred

    Args:
        exc: The exception to parse (typically StructureError)

    Returns:
        {
            "error_count": N,
            "summary": "Found N validation error(s)",
            "errors": [
                {
                    "error_message": "spike_magnitude (5.0) must be in [0, 1]",
                    "location": "$.entities[0].columns[1]"
                },
                ...
            ]
        }

    Example:
        >>> try:
        ...     rf.converter.structure(ts_dict, TimeseriesParams)
        ... except Exception as e:
        ...     details = extract_structure_error_details(e)
        ...     for err in details['errors']:
        ...         print(f"{err['location']}: {err['error_message']}")
    """
    errors = []

    def collect_errors(current_exc, location="$"):
        """Recursively collect error messages and locations."""
        # Check for sub-exceptions (IterableValidationError)
        if hasattr(current_exc, "exceptions") and current_exc.exceptions:
            # Extract location info from wrapper message
            msg = str(current_exc)
            index_match = re.search(r"@ index (\d+)", msg)
            if index_match:
                index = index_match.group(1)
                if "list[Column]" in msg:
                    new_location = f"{location}.columns[{index}]"
                elif "list[Entity]" in msg:
                    new_location = f"{location}.entities[{index}]"
                else:
                    new_location = f"{location}[{index}]"
            else:
                new_location = location
            for sub_exc in current_exc.exceptions:
                collect_errors(sub_exc, new_location)
        # Check for __cause__ chain
        elif hasattr(current_exc, "__cause__") and current_exc.__cause__:
            # Extract location from StructureError if present
            if type(current_exc).__name__ == "StructureError":
                msg = str(current_exc)
                loc_match = re.search(r"@ (\$\.[^\s]+)", msg)
                if loc_match:
                    location = loc_match.group(1)
            collect_errors(current_exc.__cause__, location)
        # Leaf error - add to list
        else:
            # Skip wrapper exceptions
            if type(current_exc).__name__ not in (
                "StructureError",
                "ClassValidationError",
                "IterableValidationError",
                "ExceptionGroup",
            ):
                errors.append({"error_message": str(current_exc), "location": location})

    # Extract top-level location from StructureError
    top_location = "$"
    if type(exc).__name__ == "StructureError":
        top_msg = str(exc)
        loc_match = re.search(r"@ (\$\.[^\s]+)", top_msg)
        if loc_match:
            top_location = loc_match.group(1)

    # Collect all errors
    collect_errors(exc, top_location)

    # If no errors found, use original exception
    if not errors:
        errors = [{"error_message": str(exc), "location": top_location}]

    # Return consistent format
    error_word = "error" if len(errors) == 1 else "errors"
    return {
        "error_count": len(errors),
        "summary": f"Found {len(errors)} validation {error_word}",
        "errors": errors,
    }

Exposing the exact error message and its location helps a lot with self-correction.
See screenshot:
image
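
One way the structured result could be fed back for self-correction is to render it as short plain-text feedback. The sketch below relies only on the return format documented in the function's docstring; `format_errors_for_feedback` is a hypothetical helper, not part of this PR or the Rockfish SDK, and the sample input is illustrative.

```python
from typing import Any, Dict


def format_errors_for_feedback(details: Dict[str, Any]) -> str:
    """Render the error-details dict as one numbered line per error (hypothetical helper)."""
    lines = [details["summary"] + ":"]
    for i, err in enumerate(details["errors"], start=1):
        lines.append(f"{i}. {err['location']}: {err['error_message']}")
    return "\n".join(lines)


# Sample input shaped like the documented return value.
sample = {
    "error_count": 1,
    "summary": "Found 1 validation error",
    "errors": [
        {
            "error_message": "spike_magnitude (5.0) must be in [0, 1]",
            "location": "$.entities[0].columns[1]",
        }
    ],
}

feedback = format_errors_for_feedback(sample)
print(feedback)
```

Keeping the location and the message on the same line means each item of feedback states both where the schema is wrong and why.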


FanG-817 commented Dec 3, 2025

Contributor Author

Next steps:

Right now, I have another script, validators.py, that validates additional rules besides those checked by rf.converter.structure(). I need to check whether cuttlefish and rockfish-sdk already have those checks, and whether we need to add them to the rockfish modules directly.

Additional validation tickets have been created in the rockfish repo. We have been adding the missing validations there, so validators.py has been removed from this PR.
