Add validation text improvements by vakrahul · Pull Request #230 · XortexAI/XMem

vakrahul · 2026-06-03T18:05:54Z

Text Validation Implementation (Issue #227)

Summary

Implemented missing payload range checks and empty string validation in src/utils/text.py to catch edge cases early and prevent resource wastage.

Implementation Details

Files Modified

src/utils/text.py
- Added logging import and logger setup
- Added ValidationError import from src.utils.exceptions
- Added MAX_STRING_LENGTH = 10 * 1024 * 1024 constant (10 MB limit)
- Added validate_input_text() utility function that checks:
  - None input → raises ValidationError
  - Non-string types → raises TypeError
  - Empty/whitespace-only strings → raises ValidationError
  - Oversized payloads (>10MB) → logs warning and raises ValidationError
- Updated 5 parsing functions with validation guard clauses:
  - parse_raw_response_to_classifications()
  - parse_raw_response_to_profiles()
  - parse_raw_response_to_events()
  - parse_raw_response_to_event()
  - parse_raw_response_to_image()
tests/unit/test_utils.py
- Added 3 test classes with 30+ test cases:
  - TestTextValidationEmptyAndNull: Tests empty/null/whitespace rejection (15 tests)
  - TestTextValidationTypeChecks: Tests type checking (4 tests)
  - TestTextValidationLengthConstraints: Tests max length enforcement (4 tests)
  - TestTextValidationRegressions: Tests valid inputs still work (8+ tests)

Validation Rules

Empty Check: Rejects empty strings and whitespace-only input
Type Check: Rejects non-string inputs with clear error message
Length Check: Rejects strings > 10MB with logging warning
Early Exit: All validation happens at function entry points
Clear Errors: Each check provides descriptive error messages

Testing Strategy

Comprehensive coverage of edge cases: None, empty, whitespace, oversized
Type validation for each parsing function
Regression tests ensuring valid inputs still parse correctly
NO_EVENT keyword handling preserved
All 30+ tests grouped into logical test classes

Key Design Decisions

10MB threshold: Reasonable for text processing pipelines (not too restrictive)
Logging on overflow: Warns about suspicious large inputs before failing
Guard clauses at entry: Prevents downstream processing of invalid data
Consistent error handling: Uses existing ValidationError and TypeError
No performance regression: Validation is lightweight and early-exit friendly

gemini-code-assist

Code Review

This pull request introduces input validation to several text parsing functions in src/utils/text.py to check for null, empty, type, and length constraints, along with comprehensive unit tests in tests/unit/test_utils.py. Feedback was provided to correct the terminology in the validation error and warning messages, as len(content) in Python measures characters rather than bytes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-03T18:06:51Z

+    if len(content) > MAX_STRING_LENGTH:
+        logger.warning(
+            f"{function_name}: Input text exceeds maximum length. "
+            f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"
+        )
+        raise ValidationError(
+            f"{function_name}: Input text exceeds maximum allowed length "
+            f"({len(content)} > {MAX_STRING_LENGTH} bytes)"
+        )


In Python, len(string) returns the number of characters (code points), not the number of bytes. If the input contains multi-byte characters (such as emojis or non-ASCII text), the actual byte size will be larger than len(content). To avoid misleading log and error messages, we should refer to this limit in terms of characters rather than bytes.

Suggested change

if len(content) > MAX_STRING_LENGTH:

logger.warning(

f"{function_name}: Input text exceeds maximum length. "

f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"

)

raise ValidationError(

f"{function_name}: Input text exceeds maximum allowed length "

f"({len(content)} > {MAX_STRING_LENGTH} bytes)"

)

if len(content) > MAX_STRING_LENGTH:

logger.warning(

f"{function_name}: Input text exceeds maximum length. "

f"Length: {len(content)} characters, Max: {MAX_STRING_LENGTH} characters"

)

raise ValidationError(

f"{function_name}: Input text exceeds maximum allowed length "

f"({len(content)} > {MAX_STRING_LENGTH} characters)"

)

greptile-apps · 2026-06-03T18:09:23Z

Greptile Summary

This PR adds input validation to the five parse_raw_response_to_* functions in src/utils/text.py, catching None, empty/whitespace-only, wrong-type, and oversized inputs early via a new validate_input_text helper, backed by 30+ unit tests.

validate_input_text raises ValidationError for None/empty and TypeError for non-strings; a configurable MAX_STRING_LENGTH (10 \u00d7 1024 \u00d7 1024) guards against oversized payloads.
Five parse functions now call validate_input_text as a guard clause at entry, with updated docstrings listing the new exceptions.
Test suite adds four well-organized test classes covering rejection of bad inputs and regression coverage for valid inputs including the NO_EVENT sentinel path.

Confidence Score: 4/5

Safe to merge with minor fixes; validation logic is correct and well-tested.

The len() check counts Unicode characters, not UTF-8 bytes, so the '10 MB' comment and 'bytes' labels in the log and error message are inaccurate for multi-byte Unicode input. Additionally, parse_raw_response_to_event runs validate_input_text twice by calling the helper then delegating to parse_raw_response_to_events. Neither issue causes incorrect rejection or acceptance of input.

The len()-vs-bytes labelling in src/utils/text.py and the double-validation path in parse_raw_response_to_event are the two spots worth a second look.

Important Files Changed

Filename	Overview
src/utils/text.py	Adds validate_input_text helper and guards all five parse functions; len() counts characters not bytes (mislabelled as bytes in constant comment, log, and error), and parse_raw_response_to_event validates twice.
tests/unit/test_utils.py	Adds 30+ tests across four well-structured classes; parse_raw_response_to_event is missing from the length-constraint and type-check test classes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Caller] --> B["parse_raw_response_to_*(content)"]
    B --> C["validate_input_text(content, fn_name)"]
    C --> D{content is None?}
    D -->|Yes| E[raise ValidationError: cannot be None]
    D -->|No| F{"isinstance(content, str)?"}
    F -->|No| G[raise TypeError: must be string]
    F -->|Yes| H{"content.strip() falsy?"}
    H -->|Yes| I[raise ValidationError: cannot be empty]
    H -->|No| J{"len > MAX_STRING_LENGTH?"}
    J -->|Yes| K[log warning + raise ValidationError]
    J -->|No| L[validation passed]
    L --> M[parse function body runs]
    M --> N[Return structured result]

_{Reviews (1): Last reviewed commit: "Add validation text improvements" | Re-trigger Greptile}

greptile-apps · 2026-06-03T18:09:28Z

+# Maximum string length threshold (10 MB)
+MAX_STRING_LENGTH = 10 * 1024 * 1024


The comment, log message, and error message all label this limit as "bytes", but Python's len() on a str counts Unicode code points (characters), not UTF-8 bytes. A string consisting entirely of 4-byte emoji characters would be ~40 MB in UTF-8 yet only count as ~10 M characters — sailing well past the intended 10 MB byte guard. Either use len(content.encode('utf-8')) if bytes truly matter, or update the constant name and all messages to say "characters."

Suggested change

# Maximum string length threshold (10 MB)

MAX_STRING_LENGTH = 10 * 1024 * 1024

# Maximum string length threshold (10 MB, measured in characters)

MAX_STRING_LENGTH = 10 * 1024 * 1024

greptile-apps · 2026-06-03T18:09:28Z

+        logger.warning(
+            f"{function_name}: Input text exceeds maximum length. "
+            f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"
+        )
+        raise ValidationError(
+            f"{function_name}: Input text exceeds maximum allowed length "
+            f"({len(content)} > {MAX_STRING_LENGTH} bytes)"
+        )


The log message says "bytes" but len(content) counts characters, not bytes. For multi-byte Unicode the reported number will be smaller than the actual byte size, which is misleading when debugging oversized payloads.

Suggested change

logger.warning(

f"{function_name}: Input text exceeds maximum length. "

f"Length: {len(content)} bytes, Max: {MAX_STRING_LENGTH} bytes"

)

raise ValidationError(

f"{function_name}: Input text exceeds maximum allowed length "

f"({len(content)} > {MAX_STRING_LENGTH} bytes)"

)

logger.warning(

f"{function_name}: Input text exceeds maximum length. "

f"Length: {len(content)} chars, Max: {MAX_STRING_LENGTH} chars"

)

raise ValidationError(

f"{function_name}: Input text exceeds maximum allowed length "

f"({len(content)} > {MAX_STRING_LENGTH} chars)"

)

greptile-apps · 2026-06-03T18:09:30Z

+    validate_input_text(content, "parse_raw_response_to_event")
+
    events = parse_raw_response_to_events(content)
    return events[0] if events else None


parse_raw_response_to_event calls validate_input_text and then immediately delegates to parse_raw_response_to_events, which calls validate_input_text a second time on the same content. The validation is redundant — removing it here keeps a single authoritative check inside the function that does the actual parsing.

Suggested change

validate_input_text(content, "parse_raw_response_to_event")

events = parse_raw_response_to_events(content)

return events[0] if events else None

events = parse_raw_response_to_events(content)

return events[0] if events else None

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Add validation text improvements

84fd3ae

vakrahul requested review from ishaanxgupta and ved015 as code owners June 3, 2026 18:05

github-actions Bot added the tests label Jun 3, 2026

gemini-code-assist Bot reviewed Jun 3, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add validation text improvements#230

Add validation text improvements#230
vakrahul wants to merge 1 commit into
XortexAI:mainfrom
vakrahul:validation-text

vakrahul commented Jun 3, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Uh oh!

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		# Maximum string length threshold (10 MB)
		MAX_STRING_LENGTH = 10 * 1024 * 1024

Conversation

vakrahul commented Jun 3, 2026

Text Validation Implementation (Issue #227)

Summary

Implementation Details

Files Modified

Validation Rules

Testing Strategy

Key Design Decisions

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant