Add validation text improvements#230
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces input validation to several text parsing functions in src/utils/text.py to check for null, empty, type, and length constraints, along with comprehensive unit tests in tests/unit/test_utils.py. Feedback was provided to correct the terminology in the validation error and warning messages, as len(content) in Python measures characters rather than bytes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if len(content) > MAX_STRING_LENGTH: | ||
| logger.warning( | ||
| f"{function_name}: Input text exceeds maximum length. " | ||
| f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes" | ||
| ) | ||
| raise ValidationError( | ||
| f"{function_name}: Input text exceeds maximum allowed length " | ||
| f"({len(content)} > {MAX_STRING_LENGTH} bytes)" | ||
| ) |
There was a problem hiding this comment.
In Python, len(string) returns the number of characters (code points), not the number of bytes. If the input contains multi-byte characters (such as emojis or non-ASCII text), the actual byte size will be larger than len(content). To avoid misleading log and error messages, we should refer to this limit in terms of characters rather than bytes.
| if len(content) > MAX_STRING_LENGTH: | |
| logger.warning( | |
| f"{function_name}: Input text exceeds maximum length. " | |
| f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes" | |
| ) | |
| raise ValidationError( | |
| f"{function_name}: Input text exceeds maximum allowed length " | |
| f"({len(content)} > {MAX_STRING_LENGTH} bytes)" | |
| ) | |
| if len(content) > MAX_STRING_LENGTH: | |
| logger.warning( | |
| f"{function_name}: Input text exceeds maximum length. " | |
| f"Length: {len(content)} characters, Max: {MAX_STRING_LENGTH} characters" | |
| ) | |
| raise ValidationError( | |
| f"{function_name}: Input text exceeds maximum allowed length " | |
| f"({len(content)} > {MAX_STRING_LENGTH} characters)" | |
| ) |
|
| Filename | Overview |
|---|---|
| src/utils/text.py | Adds validate_input_text helper and guards all five parse functions; len() counts characters not bytes (mislabelled as bytes in constant comment, log, and error), and parse_raw_response_to_event validates twice. |
| tests/unit/test_utils.py | Adds 30+ tests across four well-structured classes; parse_raw_response_to_event is missing from the length-constraint and type-check test classes. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Caller] --> B["parse_raw_response_to_*(content)"]
B --> C["validate_input_text(content, fn_name)"]
C --> D{content is None?}
D -->|Yes| E[raise ValidationError: cannot be None]
D -->|No| F{"isinstance(content, str)?"}
F -->|No| G[raise TypeError: must be string]
F -->|Yes| H{"content.strip() falsy?"}
H -->|Yes| I[raise ValidationError: cannot be empty]
H -->|No| J{"len > MAX_STRING_LENGTH?"}
J -->|Yes| K[log warning + raise ValidationError]
J -->|No| L[validation passed]
L --> M[parse function body runs]
M --> N[Return structured result]
Reviews (1): Last reviewed commit: "Add validation text improvements" | Re-trigger Greptile
| # Maximum string length threshold (10 MB) | ||
| MAX_STRING_LENGTH = 10 * 1024 * 1024 |
There was a problem hiding this comment.
The comment, log message, and error message all label this limit as "bytes", but Python's
len() on a str counts Unicode code points (characters), not UTF-8 bytes. A string consisting entirely of 4-byte emoji characters would be ~40 MB in UTF-8 yet only count as ~10 M characters — sailing well past the intended 10 MB byte guard. Either use len(content.encode('utf-8')) if bytes truly matter, or update the constant name and all messages to say "characters."
| # Maximum string length threshold (10 MB) | |
| MAX_STRING_LENGTH = 10 * 1024 * 1024 | |
| # Maximum string length threshold (10 MB, measured in characters) | |
| MAX_STRING_LENGTH = 10 * 1024 * 1024 |
| logger.warning( | ||
| f"{function_name}: Input text exceeds maximum length. " | ||
| f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes" | ||
| ) | ||
| raise ValidationError( | ||
| f"{function_name}: Input text exceeds maximum allowed length " | ||
| f"({len(content)} > {MAX_STRING_LENGTH} bytes)" | ||
| ) |
There was a problem hiding this comment.
The log message says
"bytes" but len(content) counts characters, not bytes. For multi-byte Unicode the reported number will be smaller than the actual byte size, which is misleading when debugging oversized payloads.
| logger.warning( | |
| f"{function_name}: Input text exceeds maximum length. " | |
| f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes" | |
| ) | |
| raise ValidationError( | |
| f"{function_name}: Input text exceeds maximum allowed length " | |
| f"({len(content)} > {MAX_STRING_LENGTH} bytes)" | |
| ) | |
| logger.warning( | |
| f"{function_name}: Input text exceeds maximum length. " | |
| f"Length: {len(content)} chars, Max: {MAX_STRING_LENGTH} chars" | |
| ) | |
| raise ValidationError( | |
| f"{function_name}: Input text exceeds maximum allowed length " | |
| f"({len(content)} > {MAX_STRING_LENGTH} chars)" | |
| ) |
| validate_input_text(content, "parse_raw_response_to_event") | ||
|
|
||
| events = parse_raw_response_to_events(content) | ||
| return events[0] if events else None |
There was a problem hiding this comment.
parse_raw_response_to_event calls validate_input_text and then immediately delegates to parse_raw_response_to_events, which calls validate_input_text a second time on the same content. The validation is redundant — removing it here keeps a single authoritative check inside the function that does the actual parsing.
| validate_input_text(content, "parse_raw_response_to_event") | |
| events = parse_raw_response_to_events(content) | |
| return events[0] if events else None | |
| events = parse_raw_response_to_events(content) | |
| return events[0] if events else None |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Text Validation Implementation (Issue #227)
Summary
Implemented missing payload range checks and empty string validation in
src/utils/text.pyto catch edge cases early and prevent resource wastage.Implementation Details
Files Modified
src/utils/text.py
Added
loggingimport and logger setupAdded
ValidationErrorimport fromsrc.utils.exceptionsAdded
MAX_STRING_LENGTH = 10 * 1024 * 1024constant (10 MB limit)Added
validate_input_text()utility function that checks:ValidationErrorTypeErrorValidationErrorValidationErrorUpdated 5 parsing functions with validation guard clauses:
parse_raw_response_to_classifications()parse_raw_response_to_profiles()parse_raw_response_to_events()parse_raw_response_to_event()parse_raw_response_to_image()tests/unit/test_utils.py
TestTextValidationEmptyAndNull: Tests empty/null/whitespace rejection (15 tests)TestTextValidationTypeChecks: Tests type checking (4 tests)TestTextValidationLengthConstraints: Tests max length enforcement (4 tests)TestTextValidationRegressions: Tests valid inputs still work (8+ tests)Validation Rules
Testing Strategy
Key Design Decisions
ValidationErrorandTypeError