An open-source FP8 tensor accelerator — SystemVerilog RTL → synthesis → place-and-route → 7nm GDSII, on a 100% open-source toolchain. Named after Ptah, the Egyptian creator god and patron of craftsmen & architects.
🟢 First GDS is out — a full multi-tile FP8 matmul runs end-to-end through real SystemVerilog (chip_top), bit-exact against a golden numpy model, and the mac_cell tile has a complete, signoff-clean 7nm GDSII on ASAP7: 250 MHz with +1928 ps setup / +13 ps hold slack (zero violations, zero masked), DRC clean, 750 µm². Phase 6 of 10 done. Watch this repo: the fight with real silicon physics lands here as it happens.
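As a rough reading of the timing numbers above, the clock period and the implied headroom can be worked out by hand. The sketch below assumes the setup slack is measured against the full clock period with no extra uncertainty margin, which may not match the actual constraints used for signoff; `timing_summary` is a hypothetical helper, not part of the repo.

```python
def timing_summary(freq_mhz, setup_slack_ps):
    """Return (clock period, time used by the worst path, implied fmax).

    Assumption: slack = period - worst-path time, with no clock
    uncertainty or derating folded in.
    """
    period_ps = 1e6 / freq_mhz            # 1e12 ps/s divided by (freq_mhz * 1e6 Hz)
    used_ps = period_ps - setup_slack_ps  # time consumed by the critical path
    fmax_mhz = 1e6 / used_ps              # frequency at which setup slack reaches zero
    return period_ps, used_ps, fmax_mhz


period_ps, used_ps, fmax_mhz = timing_summary(250.0, 1928.0)
print(f"period = {period_ps:.0f} ps, critical path = {used_ps:.0f} ps, "
      f"implied fmax = {fmax_mhz:.0f} MHz")
```

Under that assumption the 250 MHz target gives a 4000 ps period, so the worst path uses about 2072 ps. The +13 ps hold slack is independent of clock frequency, so it does not change with the target.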

The routed mac_cell macro — RTL→GDSII on a laptop, 100% open tools (full story in docs/HARDENING.md)
| Document | Contents |
|---|---|
| 📐 PLAN.md | Roadmap contract: architecture decisions, 10 phases, risks |
| ✅ STEPS.md | Live execution checklist, ticked per PR |
| 🏛️ docs/ARCHITECTURE.md | Block diagram, memory spaces, engines, module map |
| 📜 docs/ISA.md | The six instructions, barriers, canonical kernels |
| 🛠️ docs/DEVELOPMENT.md | Read before contributing — workflow & numeric contracts |
| 📊 docs/ENGINEERING.md | Honest status + differentiators vs prior art |
- config.py: single source of truth for every design parameter
- golden/: bit-exact fp8 (e4m3 and e5m2) encode/decode + matmul reference
- pymodel/: full cycle-level machine; e2e matmuls bit-exact, REPEAT K-loops, async-STORE overlap proven
- rtl/: 13 SystemVerilog modules, from the fp8/fp32 arithmetic leaves up to chip_top, every one verified bit-exact under Verilator + cocotb
- chip_top.sv: the whole accelerator; push an instruction stream, and it runs a multi-tile matmul and writes results to DRAM, bit-exact vs golden
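To make the fp8 decode and matmul reference idea concrete, here is a minimal sketch in plain Python. It assumes the OCP FP8 bit layouts (e4m3 with bias 7 and no infinities, e5m2 with bias 15 and IEEE-style inf/NaN) and a sequential fp32 accumulator. The names `decode_fp8`, `to_f32`, and `fp8_matmul_ref` are hypothetical, and the actual golden/ model may differ in NaN handling, rounding, or accumulation order.

```python
import math
import struct

# Assumed layouts (OCP FP8 convention, not confirmed against golden/):
#   e4m3: 1 sign, 4 exponent, 3 mantissa bits, bias 7, no infinities,
#         only the all-ones exponent + all-ones mantissa pattern is NaN
#   e5m2: 1 sign, 5 exponent, 2 mantissa bits, bias 15, IEEE-style inf/NaN
FORMATS = {"e4m3": (4, 3, 7), "e5m2": (5, 2, 15)}


def decode_fp8(byte, fmt="e4m3"):
    """Decode one fp8 byte (0..255) to a Python float."""
    exp_bits, man_bits, bias = FORMATS[fmt]
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    if exp == exp_max:
        if fmt == "e4m3":
            if man == (1 << man_bits) - 1:
                return math.nan  # other e4m3 codes with this exponent are normal numbers
        else:
            return sign * math.inf if man == 0 else math.nan
    if exp == 0:  # zero and subnormals: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def to_f32(x):
    """Round a Python float (double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp8_matmul_ref(a, b, fmt="e4m3"):
    """Multiply an M x K by a K x N matrix of fp8 bytes, accumulating in fp32.

    Assumption: products are added one at a time in k order and rounded
    to fp32 after every add; a real datapath may use a different order.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for t in range(k):
                prod = decode_fp8(a[i][t], fmt) * decode_fp8(b[t][j], fmt)
                acc = to_f32(acc + prod)
            out[i][j] = acc
    return out


# e4m3 bytes: 0x38 = 1.0, 0x40 = 2.0, 0xC0 = -2.0, 0x30 = 0.5
A = [[0x38, 0x40], [0xC0, 0x30]]
B = [[0x38, 0x30], [0x40, 0x38]]
C = fp8_matmul_ref(A, B)
print(C)  # [[5.0, 2.5], [-1.0, -0.5]]
```

Bit-exact comparison against hardware only works if the reference and the RTL agree on every rounding point, which is why the accumulation order is spelled out explicitly here rather than delegated to a library matmul.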
# Python side (no HW tools needed)
pip install numpy pytest && pytest # 37 tests, < 1 s
# RTL side — 13 units incl. the full chip
sudo apt-get install verilator && pip install 'cocotb<2'
cd rtl/tb && make all_leaves # 52 RTL tests

89 tests green (52 RTL + 37 Python).
Made with ❤️ by Lord1Egypt