v0.9.3: Robustness improvements, security hardening, and documentation enhancements by biostochastics · Pull Request #43 · biostochastics/CodeConCat

biostochastics · 2026-02-02T07:23:57Z

Summary

This PR includes significant improvements across multiple areas:

Parser & Documentation

Enhanced documentation extraction across 9+ tree-sitter parsers (SQL, GraphQL, HCL, GLSL, HLSL, Solidity, WAT, Crystal, Elixir)
Extended CommentPatterns with 16+ language entries for single-line and block comments
BaseParser robustness improvements (8 fixes including regex injection prevention, bounds checking, state management)

Security Hardening

Path traversal prevention with symlink escape detection
Semgrep version exact matching to prevent spoofing
Apiiro commit hash verification
Refined secrets pattern detection to reduce false positives
Binary detection with Latin-1 fallback for legacy encodings

Bug Fixes

Prevented premature temp directory cleanup for GitHub repos
Fixed CLI keys command Rich table truncation with --show-values
Fixed test assertions for dashboard mode output suppression
Corrected Apiiro ruleset test mocks with proper commit hash

CI/CD

Disabled AI summary tests (requires API keys not available in CI)
Added tests badge (1550+ passing)

Documentation

Comprehensive docstring improvements across 7+ files
Google-style docstring standardization
Exception attribute documentation with examples

Test plan

All 1550+ tests pass locally
Pre-commit hooks pass (ruff, mypy, formatting)
CI workflow updated to skip API-dependent tests

- Add explicit OpenAI API key validation during provider initialization - Fix ProcessPoolExecutor resource leak with proper exception handling - Replace unsafe dict unpacking with Pydantic model_validate() in worker - Fix jsonschema import bug that never actually verified availability - Increase PBKDF2 iterations to 210,000 per OWASP 2024 recommendations - Replace magic numbers with named constants (KILOBYTE, MEGABYTE) - Add GitHub token best practices documentation to README - Update type hints to modern Python 3.10+ syntax in scripts/tools

- Return TemporaryDirectory object from collect_git_repo() so caller manages cleanup lifecycle, preventing deletion before validation/parsing - Add finally block in run_codeconcat() to ensure cleanup after processing - Update tests to expect None instead of empty string on error returns - Add tempfile import to main.py for proper type annotation

@return

…sers - Add doc_comments queries to 9 parsers: SQL, GraphQL, HCL, GLSL, HLSL, Solidity, WAT, Crystal, and Elixir - Extend CommentPatterns in pattern_library.py with 16+ language entries for single-line and block comments - Add PHPDoc tag processing using clean_jsdoc_tags for consistent @param/@return extraction - Implement Elixir @doc/@moduledoc attribute extraction with proper module attribute handling - Update Julia parser to capture both triple-quoted docstrings and line/block comments

BaseParser Improvements (8 fixes): - Fixed IndexError in extract_docstring() with bounds checking - Fixed regex injection in _create_pattern() with re.escape() - Fixed brace counting in strings with _count_braces_outside_strings() - Fixed type annotations (Pattern[str], str | None) - Added _reset() method for parser state management CLI Fixes: - Fixed Rich table truncating API keys with --show-values flag - Fixed test assertions for dashboard mode output suppression Test Fixes: - Corrected binary file detection test expectations (Latin-1 fallback) - Fixed Apiiro ruleset commit hash mock values - Added test_doc_extraction_improvements.py - Added test_security_hardening.py Documentation: - Comprehensive docstring improvements across 7+ files - Google-style docstring standardization - Exception attribute documentation

@moduledoc

- Add test_doc_extraction_improvements.py with 32 tests covering: - Elixir @doc/@moduledoc extraction - Julia triple-quoted docstrings, line comments, and block comments - PHP PHPDoc with tag processing - GraphQL description extraction - SQL, HCL, Solidity, WAT, Crystal comment handling - CommentPatterns class verification - doc_comments query presence in all enhanced parsers - Update test_tree_sitter_graphql_parser.py to expect 6 queries (including the new doc_comments query)

codeant-ai · 2026-02-02T07:24:02Z

CodeAnt AI is reviewing your PR.

codeant-ai · 2026-02-02T07:28:10Z

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review Compression metrics bug The code computes `original_lines` after `item.content` has been replaced by the compressed content, making the "original" count incorrect. That leads to wrong compression percentages and inaccurate messages. Also the percent calculation divides by `original_lines` without guarding against zero, which can raise ZeroDivisionError for empty files. Inconsistent path usage After validating and resolving `resolved_path` you continue to use `Path(file_path)` for filesystem operations (stat, existence checks, reporter.add_finding, and passing the path to security checks). This creates TOCTOU and inconsistency risks (resolved path vs original may differ, symlink handling may be bypassed). Prefer using the canonical `resolved_path` for further operations to avoid race conditions and surprising behavior. Regex sensitivity The updated exec pattern adds word-boundary anchors and a longer alternation. This can change matching characteristics (case sensitivity, boundary behavior around punctuation) and may produce false negatives for some variants (e.g., mixed-case tokens or tokens adjacent to non-word characters). Verify the pattern matches intended token forms across target languages and consider whether case-insensitive matching is required. Potential sensitive data exposure The generated markdown includes full file paths and file summaries taken verbatim from `file.file_path` and `file.summary`. That may leak repository-internal paths or long/large summaries. Consider sanitizing /truncating output and respecting a redact/allowlist policy. Import compatibility The new import `from tree_sitter import Query` is unconditional. Older or alternate tree-sitter installations may not expose `Query` (similar to how `QueryCursor` required a guarded import). This can cause an ImportError at runtime on some environments. Consider providing a fallback or guard to preserve backward compatibility. Flaky Test Risk The new binary-detection test relies on implementation-specific heuristics (UTF-8 decode fallback to Latin-1 and a percentage threshold for control characters). Small sample sizes and platform/file-system differences can make the test flaky: a tiny file of only control bytes may exceed the control-character threshold and be detected as binary even if the intent is to treat them as text after a Latin-1 fallback. Tight Implementation Assumption The test assumes concrete behavior details (e.g., that files with high bytes are always treated as text after Latin-1 fallback and that ASCII control bytes are treated as text). These assertions tightly couple the test to the current internal algorithm and could break if detection thresholds or fallback logic are tuned for security. Ambiguous symlink assertion The assertion that verifies symlink handling iterates result items and uses `assert "external_link" not in path or result.get("verified") is False`. This is hard to reason about and may pass in unexpected cases. Prefer explicit checks for presence / verification status of the symlink key (relative path) to avoid false-negatives or flakiness. Platform-dependent symlink creation Several tests rely on creating symlinks and skip with pytest.skip if OSError occurs. These tests may be skipped on Windows or CI where symlink creation requires privileges. Consider alternative ways to test symlink behaviour or mark tests explicitly as platform-dependent to avoid implicit test-suite variability. Hardcoded Commit Hash The test sets a hardcoded commit hash for git rev-parse output. If the canonical `APIIRO_RULESET_COMMIT` constant (used by production code) changes, this test will diverge and become brittle. Prefer referencing the authoritative constant or deriving the value so the test stays correct when the expected commit is updated.

codeant-ai · 2026-02-02T07:30:34Z

+    format: str = Field(
+        "markdown", description="Output format: 'markdown', 'json', 'xml', or 'text'"
+    )
+    xml_processing_instructions: bool = Field(


Suggestion: The format field is documented as supporting only 'markdown', 'json', 'xml', or 'text', but there is no validation enforcing this, so invalid formats can slip through to the writer selection logic and cause runtime errors or undefined behaviour; validating against VALID_FORMATS at model level prevents this. [possible bug]

Severity Level: Major ⚠️

- ❌ Writer selection may raise at runtime. - ⚠️ CLI user sees unclear crash when choosing invalid format. - ⚠️ Automation/config validation lacks early failure.

Suggested change

xml_processing_instructions: bool = Field(

@field_validator("format", mode="before")

@classmethod

def _validate_format(cls, value: str | None) -> str:

"""Ensure output format is one of the supported values and normalise casing."""

if value is None or str(value).strip() == "":

return "markdown"

normalised = str(value).strip().lower()

if normalised not in VALID_FORMATS:

allowed = ", ".join(sorted(VALID_FORMATS))

raise ValueError(

f"Invalid output format '{value}'. Must be one of: {allowed}."

)

return normalised

Steps of Reproduction ✅

1. Create a configuration that sets format to an unsupported value (e.g., format: "html") in the CLI/YAML and load it into CodeConCatConfig. The fields shown in codeconcat/base_types.py:794 define output and format. 2. With the current model there is no validator for format, so CodeConCatConfig.format will contain "html" (or any arbitrary string) after instantiation. 3. When the writer selection code later branches on config.format to pick a renderer (writer selection locations: writer modules that import CodeConCatConfig and switch on .format), an unexpected string causes no matching writer selection branch and will trigger either a runtime KeyError/AttributeError or fall through to an error path. 4. The proposed validator _validate_format fails fast during config instantiation, preventing invalid formats from propagating to the writer selection stage and producing a clear ValueError on startup instead of a less explicit runtime failure when generating output.

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** codeconcat/base_types.py **Line:** 798:798 **Comment:** *Possible Bug: The `format` field is documented as supporting only `'markdown'`, `'json'`, `'xml'`, or `'text'`, but there is no validation enforcing this, so invalid formats can slip through to the writer selection logic and cause runtime errors or undefined behaviour; validating against `VALID_FORMATS` at model level prevents this. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

codeant-ai · 2026-02-02T07:30:34Z

+        ("lmstudio", "LM Studio"),
+        ("llamacpp_server", "llama.cpp Server"),
+        ("local_server", "Local OpenAI-Compatible"),
    ]


Suggestion: The list_keys command does not include the llamacpp provider in its providers table, while other commands (set, reset, change-password, export) now treat llamacpp as a valid provider. This means a user can successfully store an API key for llamacpp but will never see it listed in the "Configured API Keys" table, which hides real configuration state and is confusing/inconsistent behavior. Adding llamacpp to the table fixes this mismatch so that all providers that can be managed by the keys commands are also shown when listing keys. [logic error]

Severity Level: Major ⚠️

- ⚠️ CLI 'keys list' hides stored llama.cpp keys causing confusion. - ⚠️ Inconsistent provider surface across set/reset/export commands. - ❌ Users may think keys lost when listing returns nothing.

Suggested change

]

("llamacpp", "llama.cpp (deprecated)"),

Steps of Reproduction ✅

1. Store an API key for the llama.cpp provider: - Run the CLI command `codeconcat keys set llamacpp` (handled by set_key in codeconcat/cli/commands/keys.py). - The set_key function verifies valid_providers which includes "llamacpp" (see keys.py:130-156, valid_providers list around line 141-156). - Manager stores the key via APIKeyManager.set_key (set_key uses manager.set_key at keys.py:186-189). 2. List configured API keys: - Run the CLI command `codeconcat keys list` which calls list_keys in codeconcat/cli/commands/keys.py. - list_keys constructs the providers table from the providers list defined at keys.py:89-103 and then iterates over it (see loop at keys.py:105-118). 3. Observe the mismatch: - Because "llamacpp" is not present in the list_keys providers table (keys.py:89-103), the list command never queries manager.get_key("llamacpp") and thus never displays the stored llama.cpp key, even though it exists in storage (set_key stored it at keys.py:186-189). 4. Concrete verification pointers (from repository grep/read): - Presence of "llamacpp" in set/reset/export flows: see valid_providers in set_key (keys.py:141-156), providers list in reset_keys/include in export_keys (keys.py:260-276 and keys.py:467-481). - Absence in list_keys providers: keys.py:89-103 (the existing_code snippet above). - Therefore a stored key for "llamacpp" will be accepted by set_key but not shown by list_keys; the proposed change adds ("llamacpp", "llama.cpp (deprecated)") to keys.py:89-103 to align behavior.

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** codeconcat/cli/commands/keys.py **Line:** 103:103 **Comment:** *Logic Error: The `list_keys` command does not include the `llamacpp` provider in its `providers` table, while other commands (`set`, `reset`, `change-password`, `export`) now treat `llamacpp` as a valid provider. This means a user can successfully store an API key for `llamacpp` but will never see it listed in the "Configured API Keys" table, which hides real configuration state and is confusing/inconsistent behavior. Adding `llamacpp` to the table fixes this mismatch so that all providers that can be managed by the keys commands are also shown when listing keys. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

codeant-ai · 2026-02-02T07:30:39Z

CodeAnt AI finished reviewing your PR.

- Eliminate silent failures by narrowing exception handlers in security.py, main.py, and local_collector.py to catch specific exceptions with logging - Remove unittest.mock from production code in keys.py, replace with direct module patching for getpass override - Fix test suite: correct skip conditions, fixture names, and assertions in tree-sitter and parser tests - Add format validator to CodeConCatConfig for output format normalization - Fix setup_semgrep to return False on version mismatch - Update docstrings in base_parser.py and openai_provider.py for accuracy - Add types-cachetools to pre-commit mypy additional_dependencies

Tests for cache key generation, token estimation, and cost calculation instantiate OpenAIProvider which validates API key on init. Skip these in CI where API keys are not available.

Add ParserInitializationError handling to WAT and Crystal tests in test_doc_extraction_improvements.py so they skip gracefully in CI where these grammars fail to build (requires git clone and C compiler).

biostochastics added 7 commits January 28, 2026 19:02

docs: add tests badge (1550+ passing)

08ad874

ci: disable AI summary tests (requires API keys)

6901de2

codeant-ai Bot reviewed Feb 2, 2026

View reviewed changes

biostochastics added 3 commits February 1, 2026 23:41

ci: skip AI provider tests that require API keys

74a60d0

Tests for cache key generation, token estimation, and cost calculation instantiate OpenAIProvider which validates API key on init. Skip these in CI where API keys are not available.

ci: skip WAT/Crystal parser tests when grammar unavailable

977c6e6

Add ParserInitializationError handling to WAT and Crystal tests in test_doc_extraction_improvements.py so they skip gracefully in CI where these grammars fail to build (requires git clone and C compiler).

biostochastics merged commit 8cb4da8 into main Feb 2, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.9.3: Robustness improvements, security hardening, and documentation enhancements#43

v0.9.3: Robustness improvements, security hardening, and documentation enhancements#43
biostochastics merged 10 commits into
mainfrom
development

biostochastics commented Feb 2, 2026

Uh oh!

codeant-ai Bot commented Feb 2, 2026

Uh oh!

codeant-ai Bot commented Feb 2, 2026

Uh oh!

codeant-ai Bot Feb 2, 2026

Uh oh!

codeant-ai Bot Feb 2, 2026

Uh oh!

codeant-ai Bot commented Feb 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-    xml_processing_instructions: bool = Field(
+    @field_validator("format", mode="before")
+    @classmethod
+    def _validate_format(cls, value: str | None) -> str:
+        """Ensure output format is one of the supported values and normalise casing."""
+        if value is None or str(value).strip() == "":
+            return "markdown"
+        normalised = str(value).strip().lower()
+        if normalised not in VALID_FORMATS:
+            allowed = ", ".join(sorted(VALID_FORMATS))
+            raise ValueError(
+                f"Invalid output format '{value}'. Must be one of: {allowed}."
+            )
+        return normalised

Conversation

biostochastics commented Feb 2, 2026

Summary

Parser & Documentation

Security Hardening

Bug Fixes

CI/CD

Documentation

Test plan

Uh oh!

codeant-ai Bot commented Feb 2, 2026

Uh oh!

codeant-ai Bot commented Feb 2, 2026

Nitpicks 🔍

Uh oh!

codeant-ai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot commented Feb 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant