Skip to content

v0.9.3: Robustness improvements, security hardening, and documentation enhancements#43

Merged
biostochastics merged 10 commits into
mainfrom
development
Feb 2, 2026
Merged

v0.9.3: Robustness improvements, security hardening, and documentation enhancements#43
biostochastics merged 10 commits into
mainfrom
development

Conversation

@biostochastics
Copy link
Copy Markdown
Owner

Summary

This PR includes significant improvements across multiple areas:

Parser & Documentation

  • Enhanced documentation extraction across 9+ tree-sitter parsers (SQL, GraphQL, HCL, GLSL, HLSL, Solidity, WAT, Crystal, Elixir)
  • Extended CommentPatterns with 16+ language entries for single-line and block comments
  • BaseParser robustness improvements (8 fixes including regex injection prevention, bounds checking, state management)

Security Hardening

  • Path traversal prevention with symlink escape detection
  • Semgrep version exact matching to prevent spoofing
  • Apiiro commit hash verification
  • Refined secrets pattern detection to reduce false positives
  • Binary detection with Latin-1 fallback for legacy encodings

Bug Fixes

  • Prevented premature temp directory cleanup for GitHub repos
  • Fixed CLI keys command Rich table truncation with --show-values
  • Fixed test assertions for dashboard mode output suppression
  • Corrected Apiiro ruleset test mocks with proper commit hash

CI/CD

  • Disabled AI summary tests (requires API keys not available in CI)
  • Added tests badge (1550+ passing)

Documentation

  • Comprehensive docstring improvements across 7+ files
  • Google-style docstring standardization
  • Exception attribute documentation with examples

Test plan

  • All 1550+ tests pass locally
  • Pre-commit hooks pass (ruff, mypy, formatting)
  • CI workflow updated to skip API-dependent tests

- Add explicit OpenAI API key validation during provider initialization
- Fix ProcessPoolExecutor resource leak with proper exception handling
- Replace unsafe dict unpacking with Pydantic model_validate() in worker
- Fix jsonschema import bug that never actually verified availability
- Increase PBKDF2 iterations to 210,000 per OWASP 2024 recommendations
- Replace magic numbers with named constants (KILOBYTE, MEGABYTE)
- Add GitHub token best practices documentation to README
- Update type hints to modern Python 3.10+ syntax in scripts/tools
- Return TemporaryDirectory object from collect_git_repo() so caller
  manages cleanup lifecycle, preventing deletion before validation/parsing
- Add finally block in run_codeconcat() to ensure cleanup after processing
- Update tests to expect None instead of empty string on error returns
- Add tempfile import to main.py for proper type annotation
…sers

- Add doc_comments queries to 9 parsers: SQL, GraphQL, HCL, GLSL, HLSL,
  Solidity, WAT, Crystal, and Elixir
- Extend CommentPatterns in pattern_library.py with 16+ language entries
  for single-line and block comments
- Add PHPDoc tag processing using clean_jsdoc_tags for consistent
  @param/@return extraction
- Implement Elixir @doc/@moduledoc attribute extraction with proper
  module attribute handling
- Update Julia parser to capture both triple-quoted docstrings and
  line/block comments
BaseParser Improvements (8 fixes):
- Fixed IndexError in extract_docstring() with bounds checking
- Fixed regex injection in _create_pattern() with re.escape()
- Fixed brace counting in strings with _count_braces_outside_strings()
- Fixed type annotations (Pattern[str], str | None)
- Added _reset() method for parser state management

CLI Fixes:
- Fixed Rich table truncating API keys with --show-values flag
- Fixed test assertions for dashboard mode output suppression

Test Fixes:
- Corrected binary file detection test expectations (Latin-1 fallback)
- Fixed Apiiro ruleset commit hash mock values
- Added test_doc_extraction_improvements.py
- Added test_security_hardening.py

Documentation:
- Comprehensive docstring improvements across 7+ files
- Google-style docstring standardization
- Exception attribute documentation
- Add test_doc_extraction_improvements.py with 32 tests covering:
  - Elixir @doc/@moduledoc extraction
  - Julia triple-quoted docstrings, line comments, and block comments
  - PHP PHPDoc with tag processing
  - GraphQL description extraction
  - SQL, HCL, Solidity, WAT, Crystal comment handling
  - CommentPatterns class verification
  - doc_comments query presence in all enhanced parsers
- Update test_tree_sitter_graphql_parser.py to expect 6 queries
  (including the new doc_comments query)
@codeant-ai
Copy link
Copy Markdown
Contributor

codeant-ai Bot commented Feb 2, 2026

CodeAnt AI is reviewing your PR.

@codeant-ai
Copy link
Copy Markdown
Contributor

codeant-ai Bot commented Feb 2, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Compression metrics bug
    The code computes original_lines after item.content has been replaced by the compressed content, making the "original" count incorrect. That leads to wrong compression percentages and inaccurate messages. Also the percent calculation divides by original_lines without guarding against zero, which can raise ZeroDivisionError for empty files.

  • Inconsistent path usage
    After validating and resolving resolved_path you continue to use Path(file_path) for filesystem operations (stat, existence checks, reporter.add_finding, and passing the path to security checks). This creates TOCTOU and inconsistency risks (resolved path vs original may differ, symlink handling may be bypassed). Prefer using the canonical resolved_path for further operations to avoid race conditions and surprising behavior.

  • Regex sensitivity
    The updated exec pattern adds word-boundary anchors and a longer alternation. This can change matching characteristics (case sensitivity, boundary behavior around punctuation) and may produce false negatives for some variants (e.g., mixed-case tokens or tokens adjacent to non-word characters). Verify the pattern matches intended token forms across target languages and consider whether case-insensitive matching is required.

  • Potential sensitive data exposure
    The generated markdown includes full file paths and file summaries taken
    verbatim from file.file_path and file.summary. That may leak
    repository-internal paths or long/large summaries. Consider sanitizing
    /truncating output and respecting a redact/allowlist policy.

  • Import compatibility
    The new import from tree_sitter import Query is unconditional. Older or alternate
    tree-sitter installations may not expose Query (similar to how QueryCursor
    required a guarded import). This can cause an ImportError at runtime on some
    environments. Consider providing a fallback or guard to preserve backward
    compatibility.

  • Flaky Test Risk
    The new binary-detection test relies on implementation-specific heuristics (UTF-8 decode fallback to Latin-1 and a percentage threshold for control characters). Small sample sizes and platform/file-system differences can make the test flaky: a tiny file of only control bytes may exceed the control-character threshold and be detected as binary even if the intent is to treat them as text after a Latin-1 fallback.

  • Tight Implementation Assumption
    The test assumes concrete behavior details (e.g., that files with high bytes are always treated as text after Latin-1 fallback and that ASCII control bytes are treated as text). These assertions tightly couple the test to the current internal algorithm and could break if detection thresholds or fallback logic are tuned for security.

  • Ambiguous symlink assertion
    The assertion that verifies symlink handling iterates result items and uses assert "external_link" not in path or result.get("verified") is False. This is hard to reason about and may pass in unexpected cases. Prefer explicit checks for presence / verification status of the symlink key (relative path) to avoid false-negatives or flakiness.

  • Platform-dependent symlink creation
    Several tests rely on creating symlinks and skip with pytest.skip if OSError occurs. These tests may be skipped on Windows or CI where symlink creation requires privileges. Consider alternative ways to test symlink behaviour or mark tests explicitly as platform-dependent to avoid implicit test-suite variability.

  • Hardcoded Commit Hash
    The test sets a hardcoded commit hash for git rev-parse output. If the canonical APIIRO_RULESET_COMMIT
    constant (used by production code) changes, this test will diverge and become brittle. Prefer referencing
    the authoritative constant or deriving the value so the test stays correct when the expected commit is updated.

Comment thread codeconcat/base_types.py
format: str = Field(
"markdown", description="Output format: 'markdown', 'json', 'xml', or 'text'"
)
xml_processing_instructions: bool = Field(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The format field is documented as supporting only 'markdown', 'json', 'xml', or 'text', but there is no validation enforcing this, so invalid formats can slip through to the writer selection logic and cause runtime errors or undefined behaviour; validating against VALID_FORMATS at model level prevents this. [possible bug]

Severity Level: Major ⚠️
- ❌ Writer selection may raise at runtime.
- ⚠️ CLI user sees unclear crash when choosing invalid format.
- ⚠️ Automation/config validation lacks early failure.
Suggested change
xml_processing_instructions: bool = Field(
@field_validator("format", mode="before")
@classmethod
def _validate_format(cls, value: str | None) -> str:
"""Ensure output format is one of the supported values and normalise casing."""
if value is None or str(value).strip() == "":
return "markdown"
normalised = str(value).strip().lower()
if normalised not in VALID_FORMATS:
allowed = ", ".join(sorted(VALID_FORMATS))
raise ValueError(
f"Invalid output format '{value}'. Must be one of: {allowed}."
)
return normalised
Steps of Reproduction ✅
1. Create a configuration that sets format to an unsupported value (e.g., format: "html")
in the CLI/YAML and load it into CodeConCatConfig. The fields shown in
codeconcat/base_types.py:794 define output and format.

2. With the current model there is no validator for format, so CodeConCatConfig.format
will contain "html" (or any arbitrary string) after instantiation.

3. When the writer selection code later branches on config.format to pick a renderer
(writer selection locations: writer modules that import CodeConCatConfig and switch on
.format), an unexpected string causes no matching writer selection branch and will trigger
either a runtime KeyError/AttributeError or fall through to an error path.

4. The proposed validator _validate_format fails fast during config instantiation,
preventing invalid formats from propagating to the writer selection stage and producing a
clear ValueError on startup instead of a less explicit runtime failure when generating
output.
Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** codeconcat/base_types.py
**Line:** 798:798
**Comment:**
	*Possible Bug: The `format` field is documented as supporting only `'markdown'`, `'json'`, `'xml'`, or `'text'`, but there is no validation enforcing this, so invalid formats can slip through to the writer selection logic and cause runtime errors or undefined behaviour; validating against `VALID_FORMATS` at model level prevents this.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

("lmstudio", "LM Studio"),
("llamacpp_server", "llama.cpp Server"),
("local_server", "Local OpenAI-Compatible"),
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The list_keys command does not include the llamacpp provider in its providers table, while other commands (set, reset, change-password, export) now treat llamacpp as a valid provider. This means a user can successfully store an API key for llamacpp but will never see it listed in the "Configured API Keys" table, which hides real configuration state and is confusing/inconsistent behavior. Adding llamacpp to the table fixes this mismatch so that all providers that can be managed by the keys commands are also shown when listing keys. [logic error]

Severity Level: Major ⚠️
- ⚠️ CLI 'keys list' hides stored llama.cpp keys causing confusion.
- ⚠️ Inconsistent provider surface across set/reset/export commands.
- ❌ Users may think keys lost when listing returns nothing.
Suggested change
]
("llamacpp", "llama.cpp (deprecated)"),
Steps of Reproduction ✅
1. Store an API key for the llama.cpp provider:

   - Run the CLI command `codeconcat keys set llamacpp` (handled by set_key in
   codeconcat/cli/commands/keys.py).

   - The set_key function verifies valid_providers which includes "llamacpp" (see
   keys.py:130-156, valid_providers list around line 141-156).

   - Manager stores the key via APIKeyManager.set_key (set_key uses manager.set_key at
   keys.py:186-189).

2. List configured API keys:

   - Run the CLI command `codeconcat keys list` which calls list_keys in
   codeconcat/cli/commands/keys.py.

   - list_keys constructs the providers table from the providers list defined at
   keys.py:89-103 and then iterates over it (see loop at keys.py:105-118).

3. Observe the mismatch:

   - Because "llamacpp" is not present in the list_keys providers table (keys.py:89-103),
   the list command never queries manager.get_key("llamacpp") and thus never displays the
   stored llama.cpp key, even though it exists in storage (set_key stored it at
   keys.py:186-189).

4. Concrete verification pointers (from repository grep/read):

   - Presence of "llamacpp" in set/reset/export flows: see valid_providers in set_key
   (keys.py:141-156), providers list in reset_keys/include in export_keys (keys.py:260-276
   and keys.py:467-481).

   - Absence in list_keys providers: keys.py:89-103 (the existing_code snippet above).

   - Therefore a stored key for "llamacpp" will be accepted by set_key but not shown by
   list_keys; the proposed change adds ("llamacpp", "llama.cpp (deprecated)") to
   keys.py:89-103 to align behavior.
Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** codeconcat/cli/commands/keys.py
**Line:** 103:103
**Comment:**
	*Logic Error: The `list_keys` command does not include the `llamacpp` provider in its `providers` table, while other commands (`set`, `reset`, `change-password`, `export`) now treat `llamacpp` as a valid provider. This means a user can successfully store an API key for `llamacpp` but will never see it listed in the "Configured API Keys" table, which hides real configuration state and is confusing/inconsistent behavior. Adding `llamacpp` to the table fixes this mismatch so that all providers that can be managed by the keys commands are also shown when listing keys.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

@codeant-ai
Copy link
Copy Markdown
Contributor

codeant-ai Bot commented Feb 2, 2026

CodeAnt AI finished reviewing your PR.

- Eliminate silent failures by narrowing exception handlers in security.py,
  main.py, and local_collector.py to catch specific exceptions with logging
- Remove unittest.mock from production code in keys.py, replace with direct
  module patching for getpass override
- Fix test suite: correct skip conditions, fixture names, and assertions in
  tree-sitter and parser tests
- Add format validator to CodeConCatConfig for output format normalization
- Fix setup_semgrep to return False on version mismatch
- Update docstrings in base_parser.py and openai_provider.py for accuracy
- Add types-cachetools to pre-commit mypy additional_dependencies
Tests for cache key generation, token estimation, and cost calculation
instantiate OpenAIProvider which validates API key on init. Skip these
in CI where API keys are not available.
Add ParserInitializationError handling to WAT and Crystal tests in
test_doc_extraction_improvements.py so they skip gracefully in CI
where these grammars fail to build (requires git clone and C compiler).
@biostochastics biostochastics merged commit 8cb4da8 into main Feb 2, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant