v0.9.3: Robustness improvements, security hardening, and documentation enhancements#43
Conversation
- Add explicit OpenAI API key validation during provider initialization - Fix ProcessPoolExecutor resource leak with proper exception handling - Replace unsafe dict unpacking with Pydantic model_validate() in worker - Fix jsonschema import bug that never actually verified availability - Increase PBKDF2 iterations to 210,000 per OWASP 2024 recommendations - Replace magic numbers with named constants (KILOBYTE, MEGABYTE) - Add GitHub token best practices documentation to README - Update type hints to modern Python 3.10+ syntax in scripts/tools
- Return TemporaryDirectory object from collect_git_repo() so caller manages cleanup lifecycle, preventing deletion before validation/parsing - Add finally block in run_codeconcat() to ensure cleanup after processing - Update tests to expect None instead of empty string on error returns - Add tempfile import to main.py for proper type annotation
…sers - Add doc_comments queries to 9 parsers: SQL, GraphQL, HCL, GLSL, HLSL, Solidity, WAT, Crystal, and Elixir - Extend CommentPatterns in pattern_library.py with 16+ language entries for single-line and block comments - Add PHPDoc tag processing using clean_jsdoc_tags for consistent @param/@return extraction - Implement Elixir @doc/@moduledoc attribute extraction with proper module attribute handling - Update Julia parser to capture both triple-quoted docstrings and line/block comments
BaseParser Improvements (8 fixes): - Fixed IndexError in extract_docstring() with bounds checking - Fixed regex injection in _create_pattern() with re.escape() - Fixed brace counting in strings with _count_braces_outside_strings() - Fixed type annotations (Pattern[str], str | None) - Added _reset() method for parser state management CLI Fixes: - Fixed Rich table truncating API keys with --show-values flag - Fixed test assertions for dashboard mode output suppression Test Fixes: - Corrected binary file detection test expectations (Latin-1 fallback) - Fixed Apiiro ruleset commit hash mock values - Added test_doc_extraction_improvements.py - Added test_security_hardening.py Documentation: - Comprehensive docstring improvements across 7+ files - Google-style docstring standardization - Exception attribute documentation
- Add test_doc_extraction_improvements.py with 32 tests covering: - Elixir @doc/@moduledoc extraction - Julia triple-quoted docstrings, line comments, and block comments - PHP PHPDoc with tag processing - GraphQL description extraction - SQL, HCL, Solidity, WAT, Crystal comment handling - CommentPatterns class verification - doc_comments query presence in all enhanced parsers - Update test_tree_sitter_graphql_parser.py to expect 6 queries (including the new doc_comments query)
|
CodeAnt AI is reviewing your PR. |
Nitpicks 🔍
|
| format: str = Field( | ||
| "markdown", description="Output format: 'markdown', 'json', 'xml', or 'text'" | ||
| ) | ||
| xml_processing_instructions: bool = Field( |
There was a problem hiding this comment.
Suggestion: The format field is documented as supporting only 'markdown', 'json', 'xml', or 'text', but there is no validation enforcing this, so invalid formats can slip through to the writer selection logic and cause runtime errors or undefined behaviour; validating against VALID_FORMATS at model level prevents this. [possible bug]
Severity Level: Major ⚠️
- ❌ Writer selection may raise at runtime.
- ⚠️ CLI user sees unclear crash when choosing invalid format.
- ⚠️ Automation/config validation lacks early failure.| xml_processing_instructions: bool = Field( | |
| @field_validator("format", mode="before") | |
| @classmethod | |
| def _validate_format(cls, value: str | None) -> str: | |
| """Ensure output format is one of the supported values and normalise casing.""" | |
| if value is None or str(value).strip() == "": | |
| return "markdown" | |
| normalised = str(value).strip().lower() | |
| if normalised not in VALID_FORMATS: | |
| allowed = ", ".join(sorted(VALID_FORMATS)) | |
| raise ValueError( | |
| f"Invalid output format '{value}'. Must be one of: {allowed}." | |
| ) | |
| return normalised | |
Steps of Reproduction ✅
1. Create a configuration that sets format to an unsupported value (e.g., format: "html")
in the CLI/YAML and load it into CodeConCatConfig. The fields shown in
codeconcat/base_types.py:794 define output and format.
2. With the current model there is no validator for format, so CodeConCatConfig.format
will contain "html" (or any arbitrary string) after instantiation.
3. When the writer selection code later branches on config.format to pick a renderer
(writer selection locations: writer modules that import CodeConCatConfig and switch on
.format), an unexpected string causes no matching writer selection branch and will trigger
either a runtime KeyError/AttributeError or fall through to an error path.
4. The proposed validator _validate_format fails fast during config instantiation,
preventing invalid formats from propagating to the writer selection stage and producing a
clear ValueError on startup instead of a less explicit runtime failure when generating
output.Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** codeconcat/base_types.py
**Line:** 798:798
**Comment:**
*Possible Bug: The `format` field is documented as supporting only `'markdown'`, `'json'`, `'xml'`, or `'text'`, but there is no validation enforcing this, so invalid formats can slip through to the writer selection logic and cause runtime errors or undefined behaviour; validating against `VALID_FORMATS` at model level prevents this.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.| ("lmstudio", "LM Studio"), | ||
| ("llamacpp_server", "llama.cpp Server"), | ||
| ("local_server", "Local OpenAI-Compatible"), | ||
| ] |
There was a problem hiding this comment.
Suggestion: The list_keys command does not include the llamacpp provider in its providers table, while other commands (set, reset, change-password, export) now treat llamacpp as a valid provider. This means a user can successfully store an API key for llamacpp but will never see it listed in the "Configured API Keys" table, which hides real configuration state and is confusing/inconsistent behavior. Adding llamacpp to the table fixes this mismatch so that all providers that can be managed by the keys commands are also shown when listing keys. [logic error]
Severity Level: Major ⚠️
- ⚠️ CLI 'keys list' hides stored llama.cpp keys causing confusion.
- ⚠️ Inconsistent provider surface across set/reset/export commands.
- ❌ Users may think keys lost when listing returns nothing.| ] | |
| ("llamacpp", "llama.cpp (deprecated)"), |
Steps of Reproduction ✅
1. Store an API key for the llama.cpp provider:
- Run the CLI command `codeconcat keys set llamacpp` (handled by set_key in
codeconcat/cli/commands/keys.py).
- The set_key function verifies valid_providers which includes "llamacpp" (see
keys.py:130-156, valid_providers list around line 141-156).
- Manager stores the key via APIKeyManager.set_key (set_key uses manager.set_key at
keys.py:186-189).
2. List configured API keys:
- Run the CLI command `codeconcat keys list` which calls list_keys in
codeconcat/cli/commands/keys.py.
- list_keys constructs the providers table from the providers list defined at
keys.py:89-103 and then iterates over it (see loop at keys.py:105-118).
3. Observe the mismatch:
- Because "llamacpp" is not present in the list_keys providers table (keys.py:89-103),
the list command never queries manager.get_key("llamacpp") and thus never displays the
stored llama.cpp key, even though it exists in storage (set_key stored it at
keys.py:186-189).
4. Concrete verification pointers (from repository grep/read):
- Presence of "llamacpp" in set/reset/export flows: see valid_providers in set_key
(keys.py:141-156), providers list in reset_keys/include in export_keys (keys.py:260-276
and keys.py:467-481).
- Absence in list_keys providers: keys.py:89-103 (the existing_code snippet above).
- Therefore a stored key for "llamacpp" will be accepted by set_key but not shown by
list_keys; the proposed change adds ("llamacpp", "llama.cpp (deprecated)") to
keys.py:89-103 to align behavior.Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** codeconcat/cli/commands/keys.py
**Line:** 103:103
**Comment:**
*Logic Error: The `list_keys` command does not include the `llamacpp` provider in its `providers` table, while other commands (`set`, `reset`, `change-password`, `export`) now treat `llamacpp` as a valid provider. This means a user can successfully store an API key for `llamacpp` but will never see it listed in the "Configured API Keys" table, which hides real configuration state and is confusing/inconsistent behavior. Adding `llamacpp` to the table fixes this mismatch so that all providers that can be managed by the keys commands are also shown when listing keys.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.|
CodeAnt AI finished reviewing your PR. |
- Eliminate silent failures by narrowing exception handlers in security.py, main.py, and local_collector.py to catch specific exceptions with logging - Remove unittest.mock from production code in keys.py, replace with direct module patching for getpass override - Fix test suite: correct skip conditions, fixture names, and assertions in tree-sitter and parser tests - Add format validator to CodeConCatConfig for output format normalization - Fix setup_semgrep to return False on version mismatch - Update docstrings in base_parser.py and openai_provider.py for accuracy - Add types-cachetools to pre-commit mypy additional_dependencies
Tests for cache key generation, token estimation, and cost calculation instantiate OpenAIProvider which validates API key on init. Skip these in CI where API keys are not available.
Add ParserInitializationError handling to WAT and Crystal tests in test_doc_extraction_improvements.py so they skip gracefully in CI where these grammars fail to build (requires git clone and C compiler).
Summary
This PR includes significant improvements across multiple areas:
Parser & Documentation
Security Hardening
Bug Fixes
CI/CD
Documentation
Test plan