Releases: MeridianAlgo/FinAI
v1.0.2 - Observability / Loss Graphs
Meridian.AI v1.0.2 — Observability / Loss Graphs
Fixes empty/sparse Comet loss graphs. An audit of the Comet project found many runs logged 0 or 1 loss points, so the graphs were effectively blank.
Root cause
- Loss was only logged every 5 optimizer steps; runs that died early (OOM/429) or stalled after a few steps produced 0–1 points.
- Each hourly run was a separate Comet experiment, so there was no continuous training curve.
Fixes
- Log metrics every optimizer step (console printing stays throttled).
- Always log an initial datapoint at the first forward pass, so even a killed run contributes a point.
- One continuous experiment across runs by default: the trainer resumes a persistent Comet experiment (key in checkpoint/comet_experiment.json), so loss/perplexity form a single continuous curve over global_step. Set COMET_CONTINUOUS=0 for old behaviour.
- Added ewc_loss metric; tokens_per_sec logged every step so throughput stalls are visible.
- log_steps configurable via LOG_STEPS.
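The two logging fixes above can be sketched roughly as follows. This is an illustration, not the trainer's actual code: the helper names, the "experiment_key" JSON field, and the default of 5 for log_steps are assumptions, and the metric sink is a plain callable so the sketch runs without Comet installed.

```python
import json
import uuid
from pathlib import Path


def load_or_create_experiment_key(path: Path, continuous: bool = True) -> str:
    """Return a stable experiment key when continuous, else a fresh one per run."""
    if continuous and path.exists():
        # Reuse the key saved by an earlier run so metrics extend one curve.
        return json.loads(path.read_text())["experiment_key"]
    key = uuid.uuid4().hex  # 32 hex characters
    if continuous:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"experiment_key": key}))
    return key


class StepLogger:
    """Send metrics to a sink on every step; print only every `log_steps` steps."""

    def __init__(self, sink, log_steps: int = 5, echo=print):
        self.sink = sink
        self.log_steps = max(1, log_steps)
        self.echo = echo

    def log(self, global_step: int, metrics: dict) -> None:
        self.sink(metrics, global_step)        # remote metrics: every optimizer step
        if global_step % self.log_steps == 0:  # console output: throttled
            self.echo(f"step {global_step}: {metrics}")
```

With Comet, the sink could plausibly be a small wrapper such as `lambda m, s: experiment.log_metrics(m, step=s)`; keeping it a callable makes the throttling logic easy to test in isolation.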
Known issue surfaced
Some runs do only ~5–7 optimizer steps in the 80-min window (vs ~120 healthy) — a throughput collapse, likely dataset-stream stalls. Now visible via the per-step tokens_per_sec metric; next to investigate.
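For scale, ~120 steps in 80 minutes is about 40 s per step, while 5–7 steps is roughly 11–16 minutes per step. A per-step throughput metric makes that gap visible; a minimal sketch follows, in which the class name, the stall ratio of 0.1, and the injectable clock are all assumptions made for illustration.

```python
import time


class ThroughputMeter:
    """Compute tokens per second between consecutive optimizer steps."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last = clock()

    def step(self, tokens: int) -> float:
        now = self.clock()
        elapsed = max(now - self.last, 1e-9)  # guard against division by zero
        self.last = now
        return tokens / elapsed


def is_stalled(tokens_per_sec: float, healthy_tokens_per_sec: float,
               ratio: float = 0.1) -> bool:
    """Flag a step whose throughput falls below a fraction of the healthy rate."""
    return tokens_per_sec < ratio * healthy_tokens_per_sec
```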
Full changelog: https://github.com/MeridianAlgo/FinAI/blob/main/docs/CHANGELOG.md
v1.0.1 - HF 429 Hotfix
Meridian.AI v1.0.1 — HuggingFace 429 Hotfix
Fixes the hourly CI run repeatedly dying with HTTP 429 (Too Many Requests) while downloading the Qwen/Qwen2.5-0.5B base model.
Root cause
When the checkpoint pull didn't land a local model.safetensors, train.py re-downloaded the base model from HuggingFace on every run. Shared GitHub Actions IPs are aggressively rate-limited, so config.json HEAD requests returned 429, exhausted the built-in retries, and the run crashed with OSError: couldn't connect. Nothing cached the base model between runs.
Fixes
- Persistent HF cache: workflow caches HF_HOME via actions/cache. Base model + tokenizer download once and are reused; on 429, transformers falls back to the cached copy.
- Resilient loaders: base-model and tokenizer loads pass HF_TOKEN, retry with exponential backoff, and fall back to local_files_only=True.
- Tokenizer prefers the checkpoint: loads from ./checkpoint when present (zero Hub calls).
- Checkpoint pull retries 3× with backoff; added HF_HUB_DOWNLOAD_TIMEOUT=60.
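The resilient-loader pattern described above might look like the sketch below. It is not the code in train.py: the function name, the retry count of 3, and the 2-second base delay are assumptions, and the loader is any callable that accepts a local_files_only keyword.

```python
import time


def load_with_backoff(loader, retries: int = 3, base_delay: float = 2.0,
                      sleep=time.sleep):
    """Try an online load with exponential backoff, then fall back to the cache."""
    for attempt in range(retries):
        try:
            return loader(local_files_only=False)
        except OSError:
            # Wait base_delay, 2*base_delay, ... between online attempts.
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    # Last resort: use whatever is already cached, with no Hub requests.
    return loader(local_files_only=True)
```

With transformers, the loader could be a wrapper around from_pretrained that forwards local_files_only and passes the HF_TOKEN value as token. If the cache is also empty, the final call raises and the run still fails, which is why the persistent HF_HOME cache matters.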
Full changelog: https://github.com/MeridianAlgo/FinAI/blob/main/docs/CHANGELOG.md
v1.0.0 - Production
Meridian.AI v1.0.0 — Production
First production release of Meridian.AI — a finance-specialized Qwen2.5-0.5B that continually fine-tunes itself every hour on free GitHub Actions infrastructure using Elastic Weight Consolidation (EWC).
All earlier tagged builds (v1.0.0-smollm2, v2.0.0-qwen, v5.1.0, v5.1.1, v6.0.0) were pre-production test / research iterations and have been retired.
Fixed — hourly CI dying with exit code 143
- Root cause: the run was SIGTERM-killed (128 + 15) by the runner during the first backward pass. The forward pass succeeded ([CASCADE CHECK] Initial Loss ... printed), then backprop pushed peak RAM past the ~16 GB ceiling. The kill hit the whole step process tree, so the workflow exit 0 safety net never ran.
- Why guards missed it: RAM guards only check between micro-steps and cannot intercept a spike inside a single backward(). The v6.0.0 BLOCK_SIZE 256 → 512 jump is what started tripping this.
- Fix: BLOCK_SIZE 512 → 384, SOFT_RAM_GB 12.5 → 11.0 (earlier truncation), MAX_RAM_GB 14.5 → 14.0 (more headroom).
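The exit-code arithmetic and the between-micro-step guard can be sketched as below. Both function names are hypothetical, and the guard's semantics (truncate at the soft limit, stop at the hard limit) are an assumption inferred from the "earlier truncation" and "more headroom" notes, not the trainer's confirmed behaviour.

```python
import signal


def signal_from_exit_code(code: int):
    """Return the signal name for a shell exit code of 128 + N, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None
    return None


def ram_guard_action(rss_gb: float, soft_gb: float = 11.0,
                     max_gb: float = 14.0) -> str:
    """Decide what to do between micro-steps based on current RAM usage."""
    if rss_gb >= max_gb:
        return "stop"      # assumed: save and exit cleanly before the runner kills us
    if rss_gb >= soft_gb:
        return "truncate"  # assumed: shorten upcoming batches to shed memory
    return "continue"
```

A guard like this only runs between micro-steps, so it cannot react to a spike inside a single backward() call; lowering BLOCK_SIZE reduces the spike itself, while the lower thresholds leave more margin before it.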
Docs
- README bumped to 1.0.0 Production with a new Training Status & Observability section and an exit-143 troubleshooting entry.
- docs/training_pipeline.md and docs/setup_and_usage.md refreshed to v1.0.0 defaults.
- CHANGELOG reframes all prior versions as pre-production test builds.
Model checkpoints: https://huggingface.co/meridianal/FinAI
Full changelog: see CHANGELOG.md