Skip to content
This repository was archived by the owner on Jun 3, 2026. It is now read-only.

Add validation text improvements#230

Open
vakrahul wants to merge 1 commit into
XortexAI:mainfrom
vakrahul:validation-text
Open

Add validation text improvements#230
vakrahul wants to merge 1 commit into
XortexAI:mainfrom
vakrahul:validation-text

Conversation

@vakrahul
Copy link
Copy Markdown
Contributor

@vakrahul vakrahul commented Jun 3, 2026

Text Validation Implementation (Issue #227)

Summary

Implemented missing payload range checks and empty string validation in src/utils/text.py to catch edge cases early and prevent resource wastage.

Implementation Details

Files Modified

  1. src/utils/text.py

    • Added logging import and logger setup

    • Added ValidationError import from src.utils.exceptions

    • Added MAX_STRING_LENGTH = 10 * 1024 * 1024 constant (10 MB limit)

    • Added validate_input_text() utility function that checks:

      • None input → raises ValidationError
      • Non-string types → raises TypeError
      • Empty/whitespace-only strings → raises ValidationError
      • Oversized payloads (>10MB) → logs warning and raises ValidationError
    • Updated 5 parsing functions with validation guard clauses:

      • parse_raw_response_to_classifications()
      • parse_raw_response_to_profiles()
      • parse_raw_response_to_events()
      • parse_raw_response_to_event()
      • parse_raw_response_to_image()
  2. tests/unit/test_utils.py

    • Added 3 test classes with 30+ test cases:
      • TestTextValidationEmptyAndNull: Tests empty/null/whitespace rejection (15 tests)
      • TestTextValidationTypeChecks: Tests type checking (4 tests)
      • TestTextValidationLengthConstraints: Tests max length enforcement (4 tests)
      • TestTextValidationRegressions: Tests valid inputs still work (8+ tests)

Validation Rules

  • Empty Check: Rejects empty strings and whitespace-only input
  • Type Check: Rejects non-string inputs with clear error message
  • Length Check: Rejects strings > 10MB with logging warning
  • Early Exit: All validation happens at function entry points
  • Clear Errors: Each check provides descriptive error messages

Testing Strategy

  • Comprehensive coverage of edge cases: None, empty, whitespace, oversized
  • Type validation for each parsing function
  • Regression tests ensuring valid inputs still parse correctly
  • NO_EVENT keyword handling preserved
  • All 30+ tests grouped into logical test classes

Key Design Decisions

  1. 10MB threshold: Reasonable for text processing pipelines (not too restrictive)
  2. Logging on overflow: Warns about suspicious large inputs before failing
  3. Guard clauses at entry: Prevents downstream processing of invalid data
  4. Consistent error handling: Uses existing ValidationError and TypeError
  5. No performance regression: Validation is lightweight and early-exit friendly

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces input validation to several text parsing functions in src/utils/text.py to check for null, empty, type, and length constraints, along with comprehensive unit tests in tests/unit/test_utils.py. Feedback was provided to correct the terminology in the validation error and warning messages, as len(content) in Python measures characters rather than bytes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/utils/text.py
Comment on lines +46 to +54
if len(content) > MAX_STRING_LENGTH:
logger.warning(
f"{function_name}: Input text exceeds maximum length. "
f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"
)
raise ValidationError(
f"{function_name}: Input text exceeds maximum allowed length "
f"({len(content)} > {MAX_STRING_LENGTH} bytes)"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In Python, len(string) returns the number of characters (code points), not the number of bytes. If the input contains multi-byte characters (such as emojis or non-ASCII text), the actual byte size will be larger than len(content). To avoid misleading log and error messages, we should refer to this limit in terms of characters rather than bytes.

Suggested change
if len(content) > MAX_STRING_LENGTH:
logger.warning(
f"{function_name}: Input text exceeds maximum length. "
f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"
)
raise ValidationError(
f"{function_name}: Input text exceeds maximum allowed length "
f"({len(content)} > {MAX_STRING_LENGTH} bytes)"
)
if len(content) > MAX_STRING_LENGTH:
logger.warning(
f"{function_name}: Input text exceeds maximum length. "
f"Length: {len(content)} characters, Max: {MAX_STRING_LENGTH} characters"
)
raise ValidationError(
f"{function_name}: Input text exceeds maximum allowed length "
f"({len(content)} > {MAX_STRING_LENGTH} characters)"
)

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR adds input validation to the five parse_raw_response_to_* functions in src/utils/text.py, catching None, empty/whitespace-only, wrong-type, and oversized inputs early via a new validate_input_text helper, backed by 30+ unit tests.

  • validate_input_text raises ValidationError for None/empty and TypeError for non-strings; a configurable MAX_STRING_LENGTH (10 \u00d7 1024 \u00d7 1024) guards against oversized payloads.
  • Five parse functions now call validate_input_text as a guard clause at entry, with updated docstrings listing the new exceptions.
  • Test suite adds four well-organized test classes covering rejection of bad inputs and regression coverage for valid inputs including the NO_EVENT sentinel path.

Confidence Score: 4/5

Safe to merge with minor fixes; validation logic is correct and well-tested.

The len() check counts Unicode characters, not UTF-8 bytes, so the '10 MB' comment and 'bytes' labels in the log and error message are inaccurate for multi-byte Unicode input. Additionally, parse_raw_response_to_event runs validate_input_text twice by calling the helper then delegating to parse_raw_response_to_events. Neither issue causes incorrect rejection or acceptance of input.

The len()-vs-bytes labelling in src/utils/text.py and the double-validation path in parse_raw_response_to_event are the two spots worth a second look.

Important Files Changed

Filename Overview
src/utils/text.py Adds validate_input_text helper and guards all five parse functions; len() counts characters not bytes (mislabelled as bytes in constant comment, log, and error), and parse_raw_response_to_event validates twice.
tests/unit/test_utils.py Adds 30+ tests across four well-structured classes; parse_raw_response_to_event is missing from the length-constraint and type-check test classes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Caller] --> B["parse_raw_response_to_*(content)"]
    B --> C["validate_input_text(content, fn_name)"]
    C --> D{content is None?}
    D -->|Yes| E[raise ValidationError: cannot be None]
    D -->|No| F{"isinstance(content, str)?"}
    F -->|No| G[raise TypeError: must be string]
    F -->|Yes| H{"content.strip() falsy?"}
    H -->|Yes| I[raise ValidationError: cannot be empty]
    H -->|No| J{"len > MAX_STRING_LENGTH?"}
    J -->|Yes| K[log warning + raise ValidationError]
    J -->|No| L[validation passed]
    L --> M[parse function body runs]
    M --> N[Return structured result]
Loading

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "Add validation text improvements" | Re-trigger Greptile

Comment thread src/utils/text.py
Comment on lines +20 to +21
# Maximum string length threshold (10 MB)
MAX_STRING_LENGTH = 10 * 1024 * 1024
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The comment, log message, and error message all label this limit as "bytes", but Python's len() on a str counts Unicode code points (characters), not UTF-8 bytes. A string consisting entirely of 4-byte emoji characters would be ~40 MB in UTF-8 yet only count as ~10 M characters — sailing well past the intended 10 MB byte guard. Either use len(content.encode('utf-8')) if bytes truly matter, or update the constant name and all messages to say "characters."

Suggested change
# Maximum string length threshold (10 MB)
MAX_STRING_LENGTH = 10 * 1024 * 1024
# Maximum string length threshold (10 MB, measured in characters)
MAX_STRING_LENGTH = 10 * 1024 * 1024

Fix in Cursor Fix in Codex Fix in Claude Code

Comment thread src/utils/text.py
Comment on lines +47 to +54
logger.warning(
f"{function_name}: Input text exceeds maximum length. "
f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"
)
raise ValidationError(
f"{function_name}: Input text exceeds maximum allowed length "
f"({len(content)} > {MAX_STRING_LENGTH} bytes)"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The log message says "bytes" but len(content) counts characters, not bytes. For multi-byte Unicode the reported number will be smaller than the actual byte size, which is misleading when debugging oversized payloads.

Suggested change
logger.warning(
f"{function_name}: Input text exceeds maximum length. "
f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"
)
raise ValidationError(
f"{function_name}: Input text exceeds maximum allowed length "
f"({len(content)} > {MAX_STRING_LENGTH} bytes)"
)
logger.warning(
f"{function_name}: Input text exceeds maximum length. "
f"Length: {len(content)} chars, Max: {MAX_STRING_LENGTH} chars"
)
raise ValidationError(
f"{function_name}: Input text exceeds maximum allowed length "
f"({len(content)} > {MAX_STRING_LENGTH} chars)"
)

Fix in Cursor Fix in Codex Fix in Claude Code

Comment thread src/utils/text.py
Comment on lines +316 to 319
validate_input_text(content, "parse_raw_response_to_event")

events = parse_raw_response_to_events(content)
return events[0] if events else None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 parse_raw_response_to_event calls validate_input_text and then immediately delegates to parse_raw_response_to_events, which calls validate_input_text a second time on the same content. The validation is redundant — removing it here keeps a single authoritative check inside the function that does the actual parsing.

Suggested change
validate_input_text(content, "parse_raw_response_to_event")
events = parse_raw_response_to_events(content)
return events[0] if events else None
events = parse_raw_response_to_events(content)
return events[0] if events else None

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Cursor Fix in Codex Fix in Claude Code

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant