Releases: MeridianAlgo/FinAI

v1.0.2 - Observability / Loss Graphs

05 Jun 19:31

Meridian.AI v1.0.2 — Observability / Loss Graphs

Fixes empty/sparse Comet loss graphs. An audit of the Comet project found many runs logged 0 or 1 loss points, so the graphs were effectively blank.

Root cause

  • Loss was only logged every 5 optimizer steps; runs that died early (OOM/429) or stalled after a few steps produced 0–1 points.
  • Each hourly run was a separate Comet experiment, so there was no continuous training curve.

Fixes

  • Log metrics every optimizer step (console printing stays throttled).
  • Always log an initial datapoint at the first forward pass, so even a killed run contributes a point.
  • One continuous experiment across runs by default: the trainer resumes a persistent Comet experiment (its key is stored in checkpoint/comet_experiment.json), so loss and perplexity form a single continuous curve over global_step. Set COMET_CONTINUOUS=0 to restore the old one-experiment-per-run behaviour.
  • Added ewc_loss metric; tokens_per_sec logged every step so throughput stalls are visible.
  • log_steps configurable via LOG_STEPS.
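
A minimal sketch of the logging pattern described above, not the actual train.py code. It assumes a tracker object that exposes log_metric(name, value, step=...), and it assumes the key file holds a JSON object with an "experiment_key" field. The helper names and the file layout are hypothetical.

```python
import json
import os
from pathlib import Path


def continuous_enabled(env=None):
    """COMET_CONTINUOUS=0 opts out; any other value (or unset) keeps one curve."""
    env = os.environ if env is None else env
    return env.get("COMET_CONTINUOUS", "1") != "0"


def load_experiment_key(path):
    """Return the stored experiment key, or None if the file is missing or unreadable."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return None
    # Assumed layout: {"experiment_key": "..."}
    return data.get("experiment_key") if isinstance(data, dict) else None


def save_experiment_key(path, key):
    """Persist the key next to the checkpoint so the next run can resume it."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps({"experiment_key": key}))


def log_step(logger, global_step, metrics, log_steps):
    """Send metrics to the tracker on every step; print only every `log_steps` steps."""
    for name, value in metrics.items():
        logger.log_metric(name, value, step=global_step)
    if global_step % log_steps == 0:
        summary = ", ".join(f"{k}={v:.4f}" for k, v in metrics.items())
        print(f"step {global_step}: {summary}")
```

Calling log_step once right after the first forward pass, before any optimizer step completes, would give the initial datapoint that a killed run still contributes. Keeping the console throttle separate from the tracker call is what lets the graph be dense while the CI log stays short.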

Known issue surfaced

Some runs complete only ~5–7 optimizer steps in the 80-min window, versus ~120 for a healthy run. This is a throughput collapse, likely caused by dataset-stream stalls. It is now visible through the per-step tokens_per_sec metric and is the next item to investigate.
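
To put the step counts above in per-step terms, here is a small sketch of the arithmetic plus one possible way to flag stalled steps from a tokens_per_sec series. The median baseline and the 0.25 ratio are illustrative assumptions, not values from the project.

```python
import statistics


def seconds_per_step(window_minutes, steps):
    """Average wall-clock seconds per optimizer step over a run window."""
    return window_minutes * 60 / steps


def tokens_per_sec(tokens_in_step, step_seconds):
    """Throughput for one step; 0.0 if no time was measured."""
    return tokens_in_step / step_seconds if step_seconds > 0 else 0.0


def stalled_steps(rates, ratio=0.25):
    """Indices of steps whose throughput falls below `ratio` times the median rate."""
    if not rates:
        return []
    threshold = ratio * statistics.median(rates)  # hypothetical stall threshold
    return [i for i, rate in enumerate(rates) if rate < threshold]


print(seconds_per_step(80, 120))  # 40.0 seconds per step for a healthy run
print(seconds_per_step(80, 5))    # 960.0 seconds per step for a collapsed run
```

A median baseline only works when most steps in the series are healthy. For a run that is stalled almost throughout, comparing against a healthy run's rate would be the better reference.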

Full changelog: https://github.com/MeridianAlgo/FinAI/blob/main/docs/CHANGELOG.md

v1.0.1 - HF 429 Hotfix

05 Jun 19:19

Meridian.AI v1.0.1 — HuggingFace 429 Hotfix

Fixes the hourly CI run repeatedly dying with HTTP 429 (Too Many Requests) while downloading the Qwen/Qwen2.5-0.5B base model.

Root cause

When the checkpoint pull didn't land a local model.safetensors, train.py re-downloaded the base model from HuggingFace every run. Shared GitHub Actions IPs are aggressively rate-limited, so config.json HEAD requests returned 429, exhausted the built-in retries, and the run crashed with OSError: couldn't connect. Nothing cached the base model between runs.

Fixes

  • Persistent HF cache — workflow caches HF_HOME via actions/cache. Base model + tokenizer download once and are reused; on 429, transformers falls back to the cached copy.
  • Resilient loaders — base-model and tokenizer loads pass HF_TOKEN, retry with exponential backoff, and fall back to local_files_only=True.
  • Tokenizer prefers the checkpoint — loads from ./checkpoint when present (zero Hub calls).
  • Checkpoint pull retries 3× with backoff; added HF_HUB_DOWNLOAD_TIMEOUT=60.
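
The resilient-loader behaviour above can be sketched roughly as follows. This is an illustration under assumptions, not the project's loader: load_with_fallback and its load_fn argument are hypothetical names, and the retry count and base delay are placeholders.

```python
import time


def load_with_fallback(load_fn, retries=3, base_delay=2.0, sleep=time.sleep):
    """Try an online load with exponential backoff, then fall back to the local cache.

    `load_fn(local_files_only=...)` is a caller-supplied callable (hypothetical)
    that performs the real model or tokenizer load.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return load_fn(local_files_only=False)
        except OSError as err:  # the release notes report the Hub failure as an OSError
            last_error = err
            if attempt < retries - 1:
                # Delay doubles on each failed attempt: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** attempt)
    try:
        # Last resort: use whatever is already in the persistent cache.
        return load_fn(local_files_only=True)
    except OSError as err:
        raise err from last_error
```

With transformers, load_fn could wrap a from_pretrained call that forwards local_files_only and passes the HF_TOKEN value through the token argument. Injecting sleep makes the backoff schedule easy to check without real waiting.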

Full changelog: https://github.com/MeridianAlgo/FinAI/blob/main/docs/CHANGELOG.md

v1.0.0 - Production

05 Jun 18:59

Meridian.AI v1.0.0 — Production

First production release of Meridian.AI — a finance-specialized Qwen2.5-0.5B that continually fine-tunes itself every hour on free GitHub Actions infrastructure using Elastic Weight Consolidation (EWC).

All earlier tagged builds (v1.0.0-smollm2, v2.0.0-qwen, v5.1.0, v5.1.1, v6.0.0) were pre-production test / research iterations and have been retired.

Fixed — hourly CI dying with exit code 143

  • Root cause: the run was SIGTERM-killed (128 + 15) by the runner during the first backward pass — the forward pass succeeded ([CASCADE CHECK] Initial Loss ... printed), then backprop pushed peak RAM past the ~16 GB ceiling. The kill hit the whole step process tree, so the workflow exit 0 safety net never ran.
  • Why guards missed it: RAM guards only check between micro-steps and cannot intercept a spike inside a single backward(). The v6.0.0 BLOCK_SIZE 256 → 512 jump is what started tripping this.
  • Fix: BLOCK_SIZE 512 → 384, SOFT_RAM_GB 12.5 → 11.0 (earlier truncation), MAX_RAM_GB 14.5 → 14.0 (more headroom).
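
For reference, a short sketch of how exit code 143 decodes to SIGTERM, together with a between-micro-steps guard using the new thresholds. The guard's return values and the "stop" action at the hard limit are assumptions for illustration. The release notes only state that the soft limit triggers earlier truncation.

```python
import signal


def describe_exit_code(code):
    """Map a shell exit status above 128 to the signal that ended the process."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name  # 143 - 128 = 15
        except ValueError:
            return f"unknown signal {code - 128}"
    return "exited normally" if code == 0 else f"error exit {code}"


def ram_guard(used_gb, soft_gb=11.0, max_gb=14.0):
    """Decide what to do between micro-steps.

    This check only runs between micro-steps, so it cannot see a memory
    spike that happens inside a single backward() call.
    """
    if used_gb >= max_gb:
        return "stop"      # assumed action at the hard limit
    if used_gb >= soft_gb:
        return "truncate"  # soft limit: truncate earlier to shed memory
    return "ok"
```

Because the guard cannot interrupt backward(), lowering BLOCK_SIZE is the change that reduces the peak itself. The lower thresholds only make the guard react sooner between micro-steps.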

Docs

  • README bumped to 1.0.0 Production with a new Training Status & Observability section and an exit-143 troubleshooting entry.
  • docs/training_pipeline.md and docs/setup_and_usage.md refreshed to v1.0.0 defaults.
  • CHANGELOG reframes all prior versions as pre-production test builds.

Model checkpoints: https://huggingface.co/meridianal/FinAI

Full changelog: see CHANGELOG.md