Static complexity analysis and cross-language comparison for C, C++, Rust, Java, Python, R, Perl, and Fortran. Built to catch bugs and incomplete translations between an original codebase and its Rust port.
This code was generated by an LLM without a specific source. It is released under the MIT license, but there is a risk that it contains code copied from an unknown source. No analysis has been performed to rule this out.
cargo build --release
# The binary lives at target/release/ccc-rs
# 1. Analyze a C tree and a Rust tree into JSON reports.
./target/release/ccc-rs analyze path/to/c_src -l c --recurse -o c.json
./target/release/ccc-rs analyze path/to/rust_src -l rust --recurse -o rust.json
# 2. Compare, sorted by deviation (most concerning first).
./target/release/ccc-rs compare rust.json c.json --top 25
# 3. List C functions with no Rust counterpart (and partial/stubbed matches).
./target/release/ccc-rs missing rust.json c.json
# 4. Rank functions within one report by complexity.
./target/release/ccc-rs sort c.json --by composite --top 25
# 5. Diff constants (magic numbers, strings) per matched function.
./target/release/ccc-rs constants-diff rust.json c.json
# 6. Train a linear+heuristic model and predict Rust metrics from C metrics.
./target/release/ccc-rs predict train pairs_dir/ --model model.json
./target/release/ccc-rs predict apply --model model.json --source c.json --against rust.json
# 7. Compare structs/classes/records across the reports. Features are
# field counts bucketed by type category (int/float/pointer/string/…).
./target/release/ccc-rs compare-structs rust.json c.json --top 25
./target/release/ccc-rs missing-structs rust.json c.json
# 8. Starting from a Rust and/or original-language function, walk recursive
# upstream caller sets on both sides, flag translation-table mismatches,
# and rank translated upstream pairs by complexity mismatch.
./target/release/ccc-rs upstream rust.json c.json --rust-fn leaf_caller
./target/release/ccc-rs upstream rust.json c.json --other-fn mm_map_frag_core
# 9. Compare the translated call graphs globally to find structural rewiring:
# changed caller/callee neighborhoods, missing translated edges, and
# recursion-group mismatches.
./target/release/ccc-rs call-graph-diff rust.json c.json

`pairs_dir/` for training must contain files named `<base>.rust.json` and `<base>.c.json` (or `.cpp.json`) for each matched pair.
| Language | `-l` value | File extensions | Grammar |
|---|---|---|---|
| C | `c` | `.c`, `.h` | tree-sitter-c |
| C++ | `cpp` | `.cc`, `.cpp`, `.cxx`, `.hpp`, `.hh`, `.hxx` | tree-sitter-cpp |
| Rust | `rust` | `.rs` | tree-sitter-rust |
| Java | `java` | `.java` | tree-sitter-java |
| Python | `python` | `.py` | tree-sitter-python |
| R | `r` | `.r`, `.R` | tree-sitter-r |
| Perl | `perl` | `.pl`, `.pm`, `.t` | tree-sitter-perl-next |
| Fortran | `fortran` | `.f`, `.f90`, `.f95`, `.f03`, `.f08`, `.for`, `.ftn` | tree-sitter-fortran |
Language is auto-detected from file extension when -l is omitted. Cross-language compare works across any pair (see the earlier minimap2 and fastqc-rs examples — both Rust↔C and Rust↔Java).
- Rust: `#[no_mangle]` / `#[link_name = "..."]` extracted into `original_name` so FFI bindings match the foreign symbol automatically. Rust doc comments using ``Matches C++ `Qualified::name(args)` `` are also extracted into `original_name` for faithful ports that keep idiomatic Rust names.
- C/C++: treats both `goto` and `gnu_asm_expression` as signals; `asm!` / inline asm lines counted separately as `loc_asm`.
- Java: method polymorphism (same method name across many classes) produces many same-name matches; use a mapping file to disambiguate.
- Python: `elif_clause` counted flat, not nested. `yield` / `raise_statement` both count as early returns.
- R: functions are anonymous; the name comes from the enclosing `<-` / `=` / `->` assignment. `||` / `&&` / `|` / `&` all treated as short-circuit for cyclomatic purposes.
- Perl: `last` / `next` / `redo` classified as `goto_count` (they're non-local jumps). `elsif` is a flat decision. POD blocks count as comments.
- Fortran: `elseif_clause` / `elsewhere_clause` / `case_statement` arms all flat. `cycle` / `exit` → goto, `return` → return, `stop` → return.
ccc-rs <subcommand>
| Subcommand | Purpose |
|---|---|
| `analyze <path>` | Parse a file or directory into a JSON `Report`. `-l c\|…` |
| `compare <rust.json> <other.json>` | Matches functions and lists top deviations. Output columns read `metric(other_value -> rust_value Δ=weighted_contribution)`. Flags: `--mapping map.toml`, `--top N`, `--format table\|…` |
| `missing <rust.json> <other.json>` | Functions in C not matched to anything in Rust (plus "partial": matched, but Rust LOC is a stub-sized fraction of C). `--stub-loc-ratio 0.2` (default). |
| `sort <report.json>` | Sort functions in one report. `--by cognitive\|…` |
| `constants-diff <rust.json> <other.json>` | Per matched pair, shows integer/float/string/char/bool constants present on only one side. Ranked by divergence score. |
| `compare-structs <rust.json> <other.json>` | Matches structs/classes by name and ranks by deviation. Features are per-type-category field counts (int, float, bool, char, string, pointer, array, collection, other) plus total field count. Flags: `--mapping`, `--top N`, `--format table\|json`. |
| `missing-structs <rust.json> <other.json>` | Structs present in `other` but not matched on the Rust side (and vice versa). |
| `predict train <pairs_dir> --model model.json` | Fits one linear model per target metric via closed-form OLS over matched pairs. |
| `predict apply --model model.json --source c.json [--against rust.json]` | Predicts expected Rust metrics from C; with `--against`, also reports z-scores of actual-minus-predicted for outlier detection. |
| `order <path>` | Emit functions in bottom-up porting order as CSV (callees before callers). `path` is a source file, source directory (use `--recurse`), or a `report.json`. Mutually recursive groups are labelled so they can be translated together. Flags: `-l`, `--recurse`, `-o file.csv`, `--strict`, `--merge prev.csv`. |
| `order-annotate <csv> --source other.json --rust rust.json` | Append Rust counterpart columns (`rust_name`, `rust_file`, `rust_line_start`, `match_strategy`) to a CSV from `order`. Accepts `--mapping map.toml`. |
| `upstream <rust.json> <other.json>` | Resolve a Rust seed and/or original-language seed, compute recursive upstream caller sets on both sides, flag non-overlap through the 1:1 pairing table, and list translated upstream pairs ordered by mismatch. Selectors: `--rust-fn`/`--rust-path`/`--rust-line`/`--rust-class` and the corresponding `--other-*` flags. If only one side is supplied, the counterpart seed is inferred from the pairing table when possible. Accepts `--mapping map.toml`, `--strict`, `--format table\|json`. |
| `call-graph-diff <rust.json> <other.json>` | Compare the translated call graphs globally. Reports translated direct-call edges that exist only on one side, per-pair caller/callee neighborhood mismatches, and recursion/SCC-shape differences. Ordered by most structurally mismatched pairs first. Accepts `--mapping map.toml`, `--strict`, `--format table\|json`. |
# 1. Emit the porting order for a C tree (or a pre-built report).
./target/release/ccc-rs order path/to/c_src --recurse -l c -o order.csv
# 2. Edit `translated` from FALSE to TRUE as each function is ported.
# 3. Re-run later (e.g. after source edits) and preserve the flags:
./target/release/ccc-rs order path/to/c_src --recurse -l c --merge order.csv -o order.csv
# 4. Join against a Rust report to see which Rust function each row maps to:
./target/release/ccc-rs analyze path/to/rust_src -l rust --recurse -o rust.json
./target/release/ccc-rs analyze path/to/c_src -l c --recurse -o c.json
./target/release/ccc-rs order-annotate order.csv --source c.json --rust rust.json -o annotated.csv

CSV schema:

| column | meaning |
|---|---|
| `name` | function name |
| `file` | source file (as recorded in the report) |
| `line_start` | first line of the function |
| `scc_id` | blank for non-recursive functions; shared integer for members of a recursion group |
| `scc_kind` | `self` for direct self-recursion, `mutual` for mutual recursion, else blank |
| `translated` | starts `FALSE`; edit to `TRUE` as you port. Preserved across re-runs via `--merge`. |
Callee resolution is name-based — Call.callee strings are reduced to the bare identifier (see "Known limitations"). Same-named functions in different files cause ambiguity; by default order adds an edge to every candidate (a safe over-approximation: it can pull a dependency earlier, never later). Pass --strict to drop ambiguous edges instead. The stderr summary reports counts for ambiguous and unresolved call sites so you know how much name-only resolution is costing you.
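As a rough illustration of "callees before callers", the sketch below does a plain post-order walk over a name-keyed call map. It is not the tool's implementation: `bottom_up_order` and the map shape are hypothetical, recursion groups are not labelled (a back edge is simply skipped), and every name is assumed to be unambiguous.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sketch: emit each function only after all of its callees.
/// Callees that have no entry of their own are emitted as leaves.
fn bottom_up_order<'a>(calls: &BTreeMap<&'a str, Vec<&'a str>>) -> Vec<String> {
    fn visit<'a>(
        name: &'a str,
        calls: &BTreeMap<&'a str, Vec<&'a str>>,
        seen: &mut BTreeSet<&'a str>,
        out: &mut Vec<String>,
    ) {
        if !seen.insert(name) {
            return; // already emitted, or still on the stack (recursion)
        }
        if let Some(callees) = calls.get(name) {
            for &callee in callees {
                visit(callee, calls, seen, out);
            }
        }
        // Post-order: all reachable callees have been pushed before this.
        out.push(name.to_string());
    }

    let mut seen = BTreeSet::new();
    let mut out = Vec::new();
    for &name in calls.keys() {
        visit(name, calls, &mut seen, &mut out);
    }
    out
}
```

In the default (non-strict) mode described above, an ambiguous call would add an edge to every same-named candidate; with this map shape that corresponds to listing each candidate in the callee vector.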
When names don't match cleanly across the port, provide a mapping (TOML or JSON):
# map.toml
[[entries]]
rust = "parse_header"
other = "mm_parse_header"
[[entries]]
rust = "Aligner::map_frag"
other = "mm_map_frag_core"

`--mapping map.toml` is accepted by `compare`, `missing`, `constants-diff`, and `predict apply`.
If the same function name appears in several modules (e.g. a `decode`
helper in every message type), add a `rust_path` and/or `other_path`
constraint. Each is a path suffix matched on path components against
`Location.file`, so you write the relative path the way you'd see it in
the repo:
[[entries]]
rust = "decode"
rust_path = "format/messages/datatype.rs"
other = "H5O__dtype_decode"
[[entries]]
rust = "decode"
rust_path = "format/messages/link.rs"
other = "H5O__link_decode"
[[entries]]
rust = "decompress"
rust_path = "filters/scaleoffset.rs"
other = "H5Z__scaleoffset_decompress"
other_path = "src/H5Zscaleoffset.c"

A bare `rust = "name"` (no `rust_path`) keeps the previous behavior:
the first unused candidate by name is paired. Path matching is on whole
path components, so `messages/datatype.rs` matches
`/abs/.../src/format/messages/datatype.rs` but `atype.rs` does not.
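Component-wise suffix matching of this kind is what the standard library's `Path::ends_with` provides, so the rule can be sketched in a few lines. The helper name `path_suffix_matches` is hypothetical, and whether ccc-rs uses `Path::ends_with` internally is an assumption.

```rust
use std::path::Path;

/// Hypothetical helper: true when `suffix` equals the trailing path
/// components of `file`. `Path::ends_with` compares whole components,
/// so a partial file name such as `atype.rs` never matches.
fn path_suffix_matches(file: &str, suffix: &str) -> bool {
    Path::new(file).ends_with(suffix)
}
```

With this helper, `messages/datatype.rs` matches a file recorded as `/abs/repo/src/format/messages/datatype.rs`, while `atype.rs` does not.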
When the same name appears several times in one file (e.g. three
`new` methods across `impl` blocks in a single `model.rs`), path
matching can't tell them apart. Prefer pinning by enclosing class:
[[entries]]
rust = "new"
rust_path = "src/model.rs"
rust_class = "Cluster" # matches `impl Cluster { fn new ... }`
other = "__init__"
other_path = "gecco/model.py"
other_class = "Cluster" # matches Python `class Cluster: def __init__`

`rust_class` / `other_class` compare against `FunctionAnalysis.enclosing_type`,
which is the impl-target in Rust (the trait-impl target, not the trait name:
`impl Display for Cluster` lands under `Cluster`),
and the nearest class ancestor in Python/Java/C++. A bare name like
`Cluster` matches exactly; use `""` to require no enclosing type
(i.e. a free function / module-level fn).
Class pinning survives adding/moving/reordering functions within a
file. A line-pinning fallback is also available via rust_line /
other_line (compared against Location.line_start) for cases where
no class applies — but expect to revisit those entries when source
shifts.
Functions are matched Rust ↔ other, in priority order. The chosen strategy is recorded per pair:
- Mapping: explicit entry in the mapping file.
- FfiAttribute: Rust `#[no_mangle]`, `#[link_name = "…"]`, or ``Matches C++ `...` `` doc comments equal the other-language function name. Extracted into the Rust report's `original_name` field.
- QualifiedMethod: C++ `Type::method` matches Rust `impl Type { fn method(...) }`; constructors match `new`/`default`, destructors match `drop` when present.
- ExactName: identical names.
- Normalized: snake/camelCase folded, trivial suffixes like `_impl`, `_inner`, `_rs`, `_c` stripped.
- Fingerprint: same `(arity, return_count, log2(loc))` and a shared token of ≥ 4 chars. Deliberately conservative; spurious matches from short names were the #1 noise source.
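To make the Normalized strategy concrete, here is a minimal sketch of one way to fold names before comparing them. The function `normalize_name` and its exact rules (ASCII only, one suffix stripped) are assumptions for illustration, not the tool's actual normalizer.

```rust
/// Hypothetical normalizer: folds camelCase to snake_case, lowercases,
/// then strips at most one trailing suffix from a fixed list.
fn normalize_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    let mut prev_lower = false;
    for ch in name.chars() {
        if ch.is_ascii_uppercase() {
            // Insert a separator at a lower-to-upper boundary (parseHeader).
            if prev_lower {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
            prev_lower = false;
        } else {
            out.push(ch);
            prev_lower = ch.is_ascii_lowercase() || ch.is_ascii_digit();
        }
    }
    // Suffix list taken from the strategy description above.
    for suffix in ["_impl", "_inner", "_rs", "_c"] {
        if let Some(stripped) = out.strip_suffix(suffix) {
            if !stripped.is_empty() {
                return stripped.to_string();
            }
        }
    }
    out
}
```

Two functions would then be paired under this strategy when their normalized names are equal, for example `parseHeader` on one side and `parse_header_impl` on the other.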
Stored in FunctionAnalysis.metrics:
- `loc_code`, `loc_comments`, `loc_asm`: lines attributed to code, comments, inline asm respectively.
- `inputs`, `outputs`: parameter count, return arity (tuple/out-params flattened).
- `branches`, `loops`: raw counts.
- `max_loop_nesting`, `max_if_nesting`, `max_combined_nesting`.
- `calls_unique`, `calls_total`.
- `cyclomatic`: McCabe, base 1 + one per decision point.
- `cognitive`: Sonar-style; penalizes nesting; else-if chains do not compound (fixed bug).
- `halstead`: `{n1, n2, big_n1, big_n2, volume, difficulty}`.
- `early_returns`, `goto_count`, `unsafe_blocks`.
- `binary_operators`: the binary operator set, i.e. occurrence counts for the arithmetic (`+ - * / %`), shift (`<< >>`), bitwise (`& | ^ ~`), and logical (`&& || !`) operators. Counted by symbol from each operator node, so a binary `a & b` lands in `bit_and` even in languages that treat it as element-wise-logical, while a unary `*p` / `&x` / `-x` is excluded (only `~` / `!` are recorded from prefix position). The 14 counts feed the `compare` deviation score (each as its own `op_*` dimension, default weight 0.5) and prediction features. This is especially important for tracking floating-point problems: a mismatch in the arithmetic counts (`+ - * /`) between an original and its port is a strong signal that the order or set of float operations changed, which is exactly the kind of reassociation, dropped term, or substituted operation that silently shifts rounding and precision.
Also captured per function: enclosing_type (impl-target in Rust, class in Python/Java/C++; None for free functions), constants (each with kind, textual form, parsed value, byte span), calls (callee name, count, span), types_used, signature, attributes (free-form language-specific bag: static, inline, no_mangle, cfg, etc.).
Top level Report:
{
"schema_version": 1,
"language": "c" | "cpp" | "rust" | "java" | "python" | "r" | "perl" | "fortran" | "unknown",
"source_file": "path/to/file",
"source_hash": "16-hex-char FNV-1a",
"functions": [FunctionAnalysis, ...],
"structs": [StructAnalysis, ...]
}

`StructAnalysis` records one struct/class/record/union/derived_type:
{
"name": "Point",
"kind": "struct" | "class" | "union" | "record" | "interface" | "enum" | "derived_type",
"location": { ... },
"fields": [
{"name": "x", "ty": {"text": "f64"}, "category": "float"},
{"name": "label", "ty": {"text": "String"}, "category": "string"}
],
"metrics": {
"field_count": 7,
"int_count": 1, "float_count": 2, "bool_count": 0, "char_count": 0,
"string_count": 1, "pointer_count": 1, "array_count": 1,
"collection_count": 1, "other_count": 0
},
"attributes": { "pub": "true" }
}

Each field's `category` is one of `int`, `float`, `bool`, `char`, `string`, `pointer`, `array`, `collection`, `other`. This is a language-neutral bucketing of the textual type, so that a Rust `u32`, a C `uint32_t`, and a Python `int` all land in `int`. See `classify_type` in `src/core.rs`.
Constant is tagged:
{"kind": "int", "value": 255, "text": "0xFF", "span": [start, end]}
{"kind": "float", "value": 3.14, "text": "3.14", "span": [start, end]}
{"kind": "string", "value": "hi", "span": [start, end]}
{"kind": "char", "value": "\\n", "span": [start, end]}
{"kind": "bool", "value": true, "span": [start, end]}

`schema_version` is a breaking-change gate: bump it when fields are removed or semantics change.
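The `source_hash` field in the top-level `Report` is described as a 16-hex-char FNV-1a value, which suggests the 64-bit variant. A minimal sketch of that hash follows. The function name `fnv1a_hex` is hypothetical, and whether the tool hashes raw file bytes or some normalized form is not stated, so raw bytes are assumed here.

```rust
/// 64-bit FNV-1a over raw bytes, rendered as 16 lowercase hex characters.
/// Assumption: the report hashes the file contents byte for byte.
fn fnv1a_hex(bytes: &[u8]) -> String {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b); // FNV-1a: xor first, then multiply
        hash = hash.wrapping_mul(PRIME);
    }
    format!("{hash:016x}")
}
```

Comparing the stored value against a freshly computed one is enough to tell whether a report is stale relative to the source file it was built from.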
- Add a module `src/lang_<name>.rs` following `src/lang_c.rs` as a template. Declare it in `src/lib.rs`.
- Add the matching tree-sitter grammar to `Cargo.toml` (e.g. `tree-sitter-java`).
- Implement `walker::LanguageSpec`: map tree-sitter node kinds to `NodeClass`, and provide `function_name`, `call_callee`, `signature`, optionally `original_name` / `attributes`.
- Implement `analyzer::LanguageAnalyzer`: parse with tree-sitter, call `walker::collect_functions` then `walker::analyze_function` per node, then `walker::finalize_early_returns`.
- Register in `src/main.rs::build_registry()` and add a `LangArg` variant.
The walker is language-agnostic; you only tell it how to classify nodes.
- Tree-sitter sees tokens, not semantics. C macros, templates, and preprocessor-heavy code produce approximate metrics. A `static inline` function in a SIMD header like `_mm_setzero_si128` shows up as a regular function. Filter these out by path or attribute when auditing.
- `call_expression` callee naming is language-specific. For Rust, `foo::bar::baz()` is reduced to `baz`, `"s".into()` to `into`, `x.y()` to `y`. Use spans if you need the original text.
- Fingerprint matching is conservative by design. Functions without clear name overlap go to the `missing` list even if they are real translations. Provide a mapping file for those.
- Prediction model is per-metric OLS with a small ridge term. With few training pairs the residuals are tiny (near-memorization); add more pairs for meaningful z-scores.
- Integer constants are stored as `i64` (not `i128`) because `serde_json` can't round-trip `i128` without extra features. Oversized C constants are wrapped.
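For the last limitation, one plausible reading of "wrapped" is two's-complement reinterpretation of a 64-bit unsigned literal. The sketch below shows that reading only. `parse_int_wrapping` is a hypothetical helper, it handles just decimal and `0x` literals without C suffixes such as `u` or `UL`, and the tool's real behavior may differ.

```rust
/// Hypothetical parser: reads a decimal or 0x-prefixed literal as u64 and
/// reinterprets the bits as i64, so values above i64::MAX wrap negative.
/// Literals that do not fit in 64 bits at all yield None.
fn parse_int_wrapping(text: &str) -> Option<i64> {
    let (digits, radix) = match text
        .strip_prefix("0x")
        .or_else(|| text.strip_prefix("0X"))
    {
        Some(hex) => (hex, 16),
        None => (text, 10),
    };
    u64::from_str_radix(digits, radix).ok().map(|v| v as i64)
}
```

Under this reading a C constant such as `0xFFFFFFFFFFFFFFFF` is stored as `-1`, which is worth keeping in mind when a constants diff shows an unexpected negative value.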
Cargo.toml
src/lib.rs public module list
src/main.rs CLI (binary "ccc-rs")
src/core.rs shared types, JSON schema, versioning
src/analyzer.rs LanguageAnalyzer trait + Registry
src/walker.rs generic tree-sitter visitor + LanguageSpec trait
src/lang_c.rs C/C++ analyzer
src/lang_rust.rs Rust analyzer
src/lang_java.rs Java analyzer
src/lang_python.rs Python analyzer
src/lang_r.rs R analyzer
src/lang_perl.rs Perl analyzer
src/lang_fortran.rs Fortran analyzer
src/compare/ matching, deviation, constants_diff, sort
src/predict/ OLS linear model + heuristic rules
Adding a new language means one new lang_<name>.rs that implements LanguageSpec, plus a registry entry.
Use tree-sitter as the uniform backbone — same visitor shape across languages, Rust bindings, robust to partial code. Keep libclang as an optional backend for C/C++ when macro-accurate results are needed (feature-gated). Rust-side, syn gives better type info than tree-sitter for the Rust analyzer specifically.
struct Report {
schema_version: u32,
language: Language,
source_file: PathBuf,
source_hash: String, // so compare can warn on staleness
functions: Vec<FunctionAnalysis>,
}
struct FunctionAnalysis {
name: String,
original_name: Option<String>, // Rust: from #[link_name]/#[no_mangle]/mapping file
mangled: Option<String>,
location: Location, // file, byte range, line range, col
signature: Signature, // inputs: Vec<Param>, outputs: Vec<TypeRef>
metrics: Metrics,
constants: Vec<Constant>, // with kind + textual form + source span
calls: Vec<Call>, // callee name + count + span
types_used: Vec<TypeRef>, // locals, fields touched, generics
attributes: BTreeMap<String, String>, // free-form extension bag per language
}
struct Metrics {
loc_code: u32,
loc_comments: u32,
loc_asm: u32, // inline asm blocks / asm! macros
inputs: u32,
outputs: u32, // tuple/out-params flattened
branches: u32, // if/else-if/match arms/ternary/&&/||
loops: u32,
max_loop_nesting: u32,
max_if_nesting: u32,
max_combined_nesting: u32,
calls_unique: u32,
calls_total: u32,
// extras worth adding:
cyclomatic: u32,
cognitive: u32, // Sonar-style; penalizes nesting
halstead: Halstead, // n1,n2,N1,N2 → volume, difficulty
early_returns: u32,
goto_count: u32, // goto (C), cycle/exit (Fortran), last/next/redo (Perl) — non-local jumps
unsafe_blocks: u32, // Rust only
binary_operators: BinaryOperatorSet, // +,-,*,/,%,<<,>>,&,|,^,~,&&,||,! counted by symbol
}

The `attributes` bag lets each language stash language-specific flags (e.g. `virtual`, `template`, `async`) without bloating the core struct.
trait LanguageAnalyzer {
fn language(&self) -> Language;
fn extensions(&self) -> &[&str];
fn analyze_file(&self, path: &Path) -> Result<Report>;
fn analyze_source(&self, src: &str, path: &Path) -> Result<Report>;
}

Each implementation is essentially a tree-sitter visitor that emits `FunctionAnalysis` per function node. Shared helpers in `complexity-core` handle nesting stacks, constant literal parsing, and comment/code line counting from byte ranges.
Separate from analysis, in complexity-compare. Strategies tried in order:
- Explicit mapping file (YAML/TOML): `{ rust: "parse_header", c: "ph_parse" }`.
- FFI/doc original name: `#[no_mangle]`, `#[link_name]`, or Rust doc comments of the form ``Matches C++ `Qualified::name(args)` `` → use `original_name` directly.
- Qualified method convention: C++ `Type::method` ↔ Rust `impl Type { fn method(...) }`, with constructors mapped to `new`/`default` and destructors to `drop` when implemented.
- Name normalization: snake/camel-fold, strip `_impl`, `_inner`, module prefixes.
- Signature + metric fingerprint: arity, return kind, LOC bucket (used to break ties).
Output always records which strategy matched, so the user can audit.
ccc-rs analyze <path> [-l <lang>] [-o report.json] [--recurse]
ccc-rs compare <rust.json> <other.json> [--mapping map.yaml]
[--sort deviation|name] [--top N] [--format table|json]
ccc-rs missing <rust.json> <other.json> # in other, not in rust
ccc-rs sort <report.json> [--by cognitive|cyclomatic|combined-nesting|loc|composite]
ccc-rs constants-diff <rust.json> <other.json> # grouped by function, by kind
ccc-rs predict train --pairs dir/ --model model.json
ccc-rs predict apply --model model.json --source other.json [--against rust.json]
--format json everywhere so results chain into CI / dashboards.
Per matched pair, compute a weighted normalized difference:
dev = Σ_i w_i * |m_rust_i - m_other_i| / max(1, scale_i)
where `scale_i` is the 95th percentile of that metric across the file (so a 10-line function with 2 loops isn't dwarfed by a 500-line one), and the weights default to `{cyclomatic: 2, cognitive: 2, combined_nesting: 2, calls_total: 1, loc: 1, constants: 1.5}`. Weights are configurable via TOML. Sort in descending order and show the top N with a side-by-side metric table: these are the functions most likely to be mistranslated.
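The formula translates almost directly into code. The sketch below is illustrative only: the `Dimension` struct and `deviation` function are hypothetical names, and the per-metric scale is passed in rather than computed from a report.

```rust
/// One metric dimension of a matched pair (hypothetical shape).
struct Dimension {
    weight: f64, // w_i
    rust: f64,   // m_rust_i
    other: f64,  // m_other_i
    scale: f64,  // scale_i, e.g. a 95th-percentile value of this metric
}

/// dev = sum over i of w_i * |m_rust_i - m_other_i| / max(1, scale_i)
fn deviation(dims: &[Dimension]) -> f64 {
    dims.iter()
        .map(|d| d.weight * (d.rust - d.other).abs() / d.scale.max(1.0))
        .sum()
}
```

For example, a cyclomatic pair of 12 vs 8 with weight 2 and scale 16 contributes 0.5, while a LOC pair of 40 vs 50 with weight 1 and a scale below 1 contributes the full 10, because the denominator is clamped at 1.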
Set difference over matched-name keys. Also flag partial matches: the function exists, but its metric deviation is above a threshold or it looks like a stub (e.g. LOC < 20% of the original). Partial matches often signal incomplete translation more reliably than absent functions do.
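A minimal sketch of this classification, using only the LOC-ratio stub test, is shown below. The `Status` enum, the `classify` function, and the two map shapes (other-side LOC by name, and Rust LOC keyed by the matched other-side name) are assumptions for illustration.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Status {
    Missing, // no Rust counterpart matched
    Partial, // matched, but the Rust body is stub-sized
    Ported,  // matched and of comparable size
}

/// Hypothetical classifier over already-matched names.
fn classify(
    other_loc: &BTreeMap<&str, u32>,
    rust_loc_by_other: &BTreeMap<&str, u32>,
    stub_ratio: f64,
) -> BTreeMap<String, Status> {
    let mut out = BTreeMap::new();
    for (name, &loc) in other_loc {
        let status = match rust_loc_by_other.get(*name) {
            None => Status::Missing,
            Some(&r) if f64::from(r) < stub_ratio * f64::from(loc) => Status::Partial,
            Some(_) => Status::Ported,
        };
        out.insert(name.to_string(), status);
    }
    out
}
```

With a ratio of 0.2, a 50-line original matched to a 4-line Rust function is reported as partial, and an original with no match at all is reported as missing.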
Useful sort keys:
- `cognitive`: best single "is this hard to read" signal
- `cyclomatic`: classic
- `combined-nesting * loc`: surfaces deeply-nested mid-sized functions that humans struggle with
- `halstead.difficulty`: catches expression-heavy code without much control flow
- `composite`: default; z-score sum of cognitive, combined-nesting, calls_total, constants_count
Group per matched function. For each constant kind (int/float/string/char/bool), compare multisets:
- Exact equality → OK.
- Same kind, different value → potential translation bug (magic number drift).
- Missing on one side → highlight (often indicates a branch was dropped or an error message lost).
- Integer radix differences collapsed (`0xFF` vs `255` are equal).
Output: per function, a three-column diff (rust-only, both, other-only) sorted by function deviation.
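The multiset comparison behind that three-column output can be sketched for the integer kind as follows. The names `multiset` and `diff_constants` are hypothetical. Constants are assumed to be already parsed to `i64`, which is what makes `0xFF` and `255` collapse to the same key.

```rust
use std::collections::BTreeMap;

/// Count occurrences of each parsed constant value.
fn multiset(values: &[i64]) -> BTreeMap<i64, usize> {
    let mut counts = BTreeMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
}

/// Returns (rust_only, both, other_only), each sorted by value and
/// repeating a value once per unmatched or shared occurrence.
fn diff_constants(rust: &[i64], other: &[i64]) -> (Vec<i64>, Vec<i64>, Vec<i64>) {
    let r = multiset(rust);
    let o = multiset(other);
    let mut keys: Vec<i64> = r.keys().chain(o.keys()).copied().collect();
    keys.sort_unstable();
    keys.dedup();
    let (mut rust_only, mut both, mut other_only) = (Vec::new(), Vec::new(), Vec::new());
    for k in keys {
        let rc = r.get(&k).copied().unwrap_or(0);
        let oc = o.get(&k).copied().unwrap_or(0);
        let shared = rc.min(oc);
        both.extend(std::iter::repeat(k).take(shared));
        rust_only.extend(std::iter::repeat(k).take(rc - shared));
        other_only.extend(std::iter::repeat(k).take(oc - shared));
    }
    (rust_only, both, other_only)
}
```

A value appearing in one side's only-column next to a near neighbor in the other's (say `4096` vs `4095`) is the "same kind, different value" case that hints at magic number drift.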
complexity-predict trains one linear model per target metric (cyclomatic_rust ~ f(C features), cognitive_rust ~ …, etc.), using matched pairs from an existing translated codebase as training data.
Features for each function pair:
- All source-language metrics
- LOC bucket (log-binned) and goto count
- Counts of: `switch` cases, macro expansions flagged, pointer-heavy signatures, inline asm
- Indicator: function is `static` / has `extern "C"`
struct Model {
per_metric: BTreeMap<MetricName, LinearFit>, // coeffs + intercept + residual std
heuristics: Vec<HeuristicRule>,
}
struct LinearFit { coefs: Vec<f64>, intercept: f64, feature_order: Vec<String>, rmse: f64 }

Heuristic adjustments applied after the linear step (ordered, composable):

- C `switch` over small int → Rust `match`: branches stay ~equal, cyclomatic drops by ~1 (default arm).
- C `goto cleanup` pattern → Rust `?`: early-returns += N, branches -= N.
- C macro with embedded control flow → Rust inline code: expect LOC inflation on Rust side.
- C `malloc`/`free` pairs → Rust drop: calls_total -= 2k, unsafe_blocks likely 0.
The model outputs predicted Rust metrics + residual std per function. When a real Rust file is supplied via --against, flag functions where (actual - predicted) / residual_std > 2.5 — these are statistical outliers: the translation did something unexpectedly divergent. This complements raw deviation, which doesn't know which differences are normal for this codebase.
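The outlier test is a standardized residual compared against a fixed cutoff. The sketch below follows the one-sided form written above, with the 2.5 threshold taken from the text. The function names are hypothetical, and a two-sided check would compare the absolute value instead.

```rust
/// Standardized residual; None when the residual spread is not positive.
fn z_score(actual: f64, predicted: f64, residual_std: f64) -> Option<f64> {
    if residual_std > 0.0 {
        Some((actual - predicted) / residual_std)
    } else {
        None
    }
}

/// One-sided check: flags functions whose actual metric exceeds the
/// prediction by more than 2.5 residual standard deviations.
fn is_outlier(actual: f64, predicted: f64, residual_std: f64) -> bool {
    matches!(z_score(actual, predicted, residual_std), Some(z) if z > 2.5)
}
```

Guarding against a zero residual spread matters in practice, since a model trained on very few pairs can fit them almost exactly.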
Train with OLS (use linfa or nalgebra + closed-form) — keeps the model interpretable and the coefficients inspectable as JSON.
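For intuition, closed-form OLS with a single feature needs no linear algebra crate at all. The sketch below fits `y ≈ slope * x + intercept` from paired samples. It is a simplified stand-in with a hypothetical name (`fit_line`); the model described here uses several features per target metric, which requires solving the normal equations instead.

```rust
/// Closed-form least squares for one feature.
/// Returns (slope, intercept), or None if the fit is undefined.
fn fit_line(xs: &[f64], ys: &[f64]) -> Option<(f64, f64)> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;
    // Sum of squared deviations of x, and of cross deviations of x and y.
    let sxx: f64 = xs.iter().map(|x| (x - mean_x).powi(2)).sum();
    if sxx == 0.0 {
        return None; // all x identical: slope is not identifiable
    }
    let sxy: f64 = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| (x - mean_x) * (y - mean_y))
        .sum();
    let slope = sxy / sxx;
    Some((slope, mean_y - slope * mean_x))
}
```

Here `xs` could be the C cyclomatic values and `ys` the Rust ones for a set of matched pairs; the two returned numbers are exactly what would be stored as an inspectable coefficient and intercept.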
The only per-language work is:
- A tree-sitter grammar dependency.
- An implementation of `LanguageAnalyzer` (~one file, mostly a visitor).
- Optional language-specific attributes in the `attributes` bag.
- Optional heuristic rules in the prediction model.
Compare/sort/diff/predict commands work unchanged because they consume only the shared JSON schema. Java, Python, R, Perl, and Fortran were each added as a single src/lang_<name>.rs file following this pattern.
- Version the JSON schema (`schema_version` field): you'll change it.
- Store source ranges, not just line numbers: lets you re-open in editor / build rich HTML reports later.
- Record each constant's textual form alongside its parsed value: `0xFF` vs `255` is sometimes the thing you want to see in the diff.
- Don't try to resolve `#include`/macros for the C analyzer v1. Analyze at the token/AST level only; add libclang later as an opt-in backend when the macro blindness becomes the limiting factor.
- Matching by name is brittle across large refactors: the mapping file escape hatch is essential; design for it from day one, not retrofitted.