ECO: Add C++ Graph Edit Distance implementation with MCS optimization#3813
ECO: Add C++ Graph Edit Distance implementation with MCS optimization#3813alirezazd wants to merge 26 commits intogoogle:mainfrom
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
167e123 to
24d50c2
Compare
grebe
left a comment
There was a problem hiding this comment.
First of all: thank you! This is exciting to see. Looks like a lot of good work and a big improvement.
Here's some initial feedback, I'm still working my way through. I think I have enough here to get the ball rolling, though, so I'm sending it out. Feel free to reach out directly or go back and forth here if anything I said was unclear or if you disagree.
There was a problem hiding this comment.
Deps shouldn't be under xls/, they should be managed from dependency_support/ or (ideally) BCR. tinyxml2 looks like it's in BCR. scipy_lsap.cpp is a bit trickier b/c our python deps aren't super well integrated into bazel (it's essentially pip installed, which afaik largely hides the internals). We typically try to avoid holding on to third_party/ source in here. I believe OR-tools has some implementations of linear assignment stuff and we already pull it in as a dep - is a pain to move to using it?
There was a problem hiding this comment.
I initially used OR-tools LASP solver but the performance took a hit. I'll do more experiments and will try to replace the external LSAP solver with something more convenient.
Regarding TinyXML, I think it would be nice to get rid custom parses (ir2nx, ir2gxl) and solely use XLS IR and generate graph directly from it before we merge this PR and after the current code is stable.
There was a problem hiding this comment.
Re: using XLS IR, I think it's fine to land this w/ tinyxml as long as you use the BCR version, but agree it would be better not to have custom parsing.
|
@grebe Ready for next round of review. I addressed most of the comments but the cost functions are still in a separate header, if you prefer them to live inside ged header, I can move them in a follow-up commit. |
There was a problem hiding this comment.
Re: using XLS IR, I think it's fine to land this w/ tinyxml as long as you use the BCR version, but agree it would be better not to have custom parsing.
grebe
left a comment
There was a problem hiding this comment.
Sorry for the long wait.
Hopefully this isn't too annoying, but... I think it would be good to move this into contrib (xls/contrib/eco).
We're getting close!
- Move graph data structures: graph.h/cpp -> graph.h/cc - Move GXL parser: gxl_parser.h/cpp -> gxl_parser.h/cc - Move LAP solver: lap.h/cc -> lap_solver.h/cc - Move GED algorithm: ged.h/cc - Move MCS algorithm: mcs.h/cpp -> mcs.h/cc - Move cost functions: cost_functions.h -> ged_cost_functions.h - Move main binary: main.cpp -> ged_main.cc - Move IR patch gen: ir_patch_gen.h/cc - Update all header guards to XLS_ECO_* format - Update all includes to use xls/eco/ prefix - Rename .cpp -> .cc for XLS convention This flattens the directory structure and makes the codebase more consistent with XLS conventions. The mcs_ged/ directory now only contains BUILD file and third-party libs/.
- Move tinyxml2 files into libs/tinyxml2/ subdirectory - Add scipy_lap in libs/scipy_lap/ subdirectory - Update BUILD paths for new lib structure - Each third-party lib now has its own directory
- graph.h/cc: XLSGraph, XLSNode, XLSEdge for representing XLS IR as graphs - gxl_parser.h/cc: Parse GXL (Graph eXchange Language) format - Foundation for cpp graph edit distance algorithm
- lap_solver.cc: Wrapper for scipy LAP solver - Used by GED algorithm for optimal node matching
- ged.h/cc: Core GED algorithm with cost matrix and LAP integration - Supports node/edge substitution, insertion, deletion costs - Handles pinned nodes (nodes that must match)
- Add cc_library targets: graph, gxl_parser, lap_solver, ged, mcs - Add cc_binary: ged_main - Add ir_patch_gen_lib target - All GED infrastructure now buildable
- mcs.h/cc: Find maximum common subgraph between two graphs - Preprocessing step that accelerates GED by identifying node correspondences - Reduces GED search space significantly
- ged_main.cc: CLI tool to compute GED between two GXL graphs - Supports MCS preprocessing option - Outputs edit operations and cost
- ir_patch_gen.h/cc: Convert GED results to IR patch protobuf - ir_patch.proto: Updated proto definitions - patch_ir.h/cc/main.cc: Apply patches to XLS IR - ir_patch_gen.py: Python utilities for patch generation
- Replace requirement() with @xls_pip_deps// for old Python tools - Update Receive node construction to match new upstream API signature - Add payload_type parameter required by new Receive constructor
This commit contains two types of changes: 1. Upstream updates to ir2nx.py merged during rebase (large diff) 2. Our deprecation TODOs added to Python GED toolchain: - ir_diff.py: NetworkX GED → ged.h/cc (10-100x faster) - ir_diff_main.py: Python tool → ged_main.cc CLI - ir_patch_gen.py: Python patcher → patch_ir.h/cc - ir2nx.py: NetworkX converter → xls_ir_to_cytoscape.cc - ir2gxl.py: GXL exporter → gxl_parser.h/cc (from upstream)
Replace the SciPy LSAP path with a native C++ LSAP solver. Switch GED cost construction from sparse-style handling to dense matrix paths. Add C++ ECO components for GED, MCS, graph modeling, GXL parsing, and patch generation. Introduce eco-specific Starlark build defs and wire patching plus equivalence validation targets. Standardize main-flag handling and add optional execution statistics reporting. Add ECO regression fixtures and test targets for crc32, riscv, apfloat, fir, histogram, and vector_core flows.
Use path-based equivalence report flag and write reports for mismatch results Replace run_shell wrapper with direct Bazel run Remove short flag aliases from patch_ir_main Normalize ECO build rule args/report handling and clean stale target references
Fixed init parsing bug in IrParser Fixed a critical bug in MakePruneFunction
|
@grebe Sorry for late response I was busy with some academic tasks. I applied most of the recent comments, currently working on moving the P.S. The test |
Relocate the ECO package from xls/eco to xls/contrib/eco and update references. This includes Bazel labels, Starlark loads, C++ include paths, Python imports, runfiles paths, and tooling path filters.
There was a problem hiding this comment.
This should use BCR rather than checking this in
| "Path to the IR file to patch."); // NOLINT | ||
| namespace { | ||
|
|
||
| using namespace xls; |
There was a problem hiding this comment.
Nit: more typically, we'll do
namespace xls {
namespace {
...
}
}
|
@grebe Do you have a preference here:
Happy to go either way, |
This PR adds a C++ implementation of Graph Edit Distance (GED) for the ECO system, replacing the existing Python/NetworkX approach.
What's included
ged_main) and IR patch generationRegarding the new C++ GED
The Python version works but is slow on large graphs. This C++ implementation is faster, integrates directly with XLS IR APIs, and supports advanced features like pinned nodes and MCS optimization.
Migration
Python tools (ir_diff_main.py, ir2gxl.py) are marked for deprecation but still functional. They'll be removed in a future PR once the C++ version is proven stable.
Testing
Builds cleanly and includes tests with real DSLX examples (crc32, apfloat_fmac, riscv_simple, etc).
@grebe Please consider reviewing this PR. I tried to organize it into separate commits for convenience.