05 Jun 19:31

41b096f

v1.0.2 - Observability / Loss Graphs

Meridian.AI v1.0.2 — Observability / Loss Graphs

Fixes empty/sparse Comet loss graphs. An audit of the Comet project found many runs logged 0 or 1 loss points, so the graphs were effectively blank.

Root cause

Loss was only logged every 5 optimizer steps; runs that died early (OOM/429) or stalled after a few steps produced 0–1 points.
Each hourly run was a separate Comet experiment, so there was no continuous training curve.
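
To make the effect of the old cadence concrete, the sketch below counts how many loss points a run would emit under two policies: logging only on every 5th optimizer step, versus logging one initial point and then every step. The function names and the assumption that steps are numbered from 1 are illustrative, not taken from the trainer.

```python
def points_logged_every_n(steps_completed: int, log_steps: int = 5) -> int:
    """Count points when a metric is logged only on steps divisible by log_steps.

    Assumes steps are numbered from 1 (an assumption about the old trainer).
    """
    return sum(1 for step in range(1, steps_completed + 1) if step % log_steps == 0)


def points_logged_per_step(steps_completed: int, forward_pass_ran: bool = True) -> int:
    """Count points when an initial datapoint is logged at the first forward
    pass and one more point is logged after every optimizer step."""
    initial = 1 if forward_pass_ran else 0
    return initial + steps_completed


if __name__ == "__main__":
    # Columns: steps completed, points under old policy, points under new policy.
    for steps in (0, 3, 7, 120):
        print(steps, points_logged_every_n(steps), points_logged_per_step(steps))
```

Under these assumptions a run that dies after 3 steps contributes no points with the old policy and 4 with the new one, which matches the 0–1 points seen in the audit.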

Fixes

Log metrics every optimizer step (console printing stays throttled).
Always log an initial datapoint at the first forward pass, so even a killed run contributes a point.
One continuous experiment across runs by default: the trainer resumes a persistent Comet experiment (key stored in checkpoint/comet_experiment.json), so loss and perplexity form a single continuous curve over global_step. Set COMET_CONTINUOUS=0 to restore the old behaviour of one experiment per run.
Added ewc_loss metric; tokens_per_sec logged every step so throughput stalls are visible.
log_steps configurable via LOG_STEPS.
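
One way the persistent experiment key could be handled is sketched below. It covers only the bookkeeping, reusing a key stored next to the checkpoint or creating a new one. The JSON field name `experiment_key`, the helper names, and the use of a UUID hex string as the key are assumptions; the actual file layout in the repository may differ.

```python
import json
import os
import uuid
from pathlib import Path
from typing import Tuple


def continuous_enabled() -> bool:
    """Continuous mode is on unless COMET_CONTINUOUS is set to "0"."""
    return os.environ.get("COMET_CONTINUOUS", "1") != "0"


def load_or_create_experiment_key(path: Path, continuous: bool = True) -> Tuple[str, bool]:
    """Return (experiment_key, resumed).

    In continuous mode an existing key file is reused so that every run
    appends to the same experiment. Otherwise a fresh key is generated.
    """
    if continuous and path.exists():
        try:
            key = json.loads(path.read_text())["experiment_key"]  # hypothetical field name
            return key, True
        except (json.JSONDecodeError, KeyError, TypeError):
            pass  # unreadable file: fall through and start a new experiment

    key = uuid.uuid4().hex  # 32 hex characters
    if continuous:
        # Persist the key so the next hourly run can resume the same experiment.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"experiment_key": key}))
    return key, False
```

A trainer could call `load_or_create_experiment_key(Path("checkpoint/comet_experiment.json"), continuous_enabled())` at startup and, when `resumed` is true, hand the key to the Comet SDK's resume mechanism (for example `comet_ml.ExistingExperiment(previous_experiment=key)`), logging every metric against `global_step` so the curve stays continuous.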

Known issue surfaced

Some runs complete only ~5–7 optimizer steps in the 80-minute window (vs ~120 for a healthy run), a throughput collapse likely caused by dataset-stream stalls. It is now visible via the per-step tokens_per_sec metric and is next to be investigated.
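
The size of the collapse follows from the step counts above. The sketch below works through the arithmetic and shows one possible way to flag a stall from tokens_per_sec. The tokens-per-step figure and the 25% threshold are hypothetical values chosen for illustration, not numbers from the project.

```python
def seconds_per_step(window_minutes: float, steps: int) -> float:
    """Average wall-clock seconds per optimizer step over a run window."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return window_minutes * 60.0 / steps


def tokens_per_sec(tokens_in_step: int, step_seconds: float) -> float:
    """Throughput for one optimizer step."""
    return tokens_in_step / step_seconds


def is_stalled(current_tps: float, baseline_tps: float, ratio: float = 0.25) -> bool:
    """Flag a step whose throughput is below `ratio` of a healthy baseline.

    The default ratio is an arbitrary illustrative threshold.
    """
    return current_tps < ratio * baseline_tps


if __name__ == "__main__":
    healthy = seconds_per_step(80, 120)   # about 120 steps in 80 minutes
    collapsed = seconds_per_step(80, 6)   # about 6 steps in 80 minutes
    print(healthy, collapsed, collapsed / healthy)

    tokens = 4000  # hypothetical tokens per optimizer step
    print(is_stalled(tokens_per_sec(tokens, collapsed), tokens_per_sec(tokens, healthy)))
```

With these inputs a healthy run averages 40 seconds per step and a collapsed run 800 seconds per step, a 20× slowdown that a per-step throughput metric makes easy to spot.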

Full changelog: https://github.com/MeridianAlgo/FinAI/blob/main/docs/CHANGELOG.md

05 Jun 19:19

MeridianAlgo-Developer

v1.0.1

92266c9

v1.0.1 - HF 429 Hotfix

Meridian.AI v1.0.1 — HuggingFace 429 Hotfix

Fixes the hourly CI run repeatedly dying with HTTP 429 (Too Many Requests) while downloading the Qwen/Qwen2.5-0.5B base model.

Root cause

When the checkpoint pull didn't land a local model.safetensors, train.py re-downloaded the base model from HuggingFace on every run. Shared GitHub Actions IPs are aggressively rate-limited, so config.json HEAD requests returned 429, exhausted the built-in retries, and the run crashed with OSError: couldn't connect. Nothing cached the base model between runs.

Fixes

Persistent HF cache — workflow caches HF_HOME via actions/cache. Base model + tokenizer download once and are reused; on 429, transformers falls back to the cached copy.
Resilient loaders — base-model and tokenizer loads pass HF_TOKEN, retry with exponential backoff, and fall back to local_files_only=True.
Tokenizer prefers the checkpoint — loads from ./checkpoint when present (zero Hub calls).
Checkpoint pull retries 3× with backoff; added HF_HUB_DOWNLOAD_TIMEOUT=60.
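
The retry-then-fall-back pattern described in these fixes can be sketched roughly as follows. This is not the code in train.py: `load_with_fallback`, its parameters, and the choice to retry only on `OSError` are assumptions made for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def load_with_fallback(
    load: Callable[[bool], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call load(local_files_only=False) with exponential backoff, then fall
    back to load(local_files_only=True) if every online attempt fails.

    `load` takes a single bool (local_files_only) and returns the loaded object.
    """
    for attempt in range(retries):
        try:
            return load(False)
        except OSError:
            # Wait 1x, 2x, 4x ... base_delay between online attempts.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    # Last resort: use whatever is already in the local cache.
    return load(True)


if __name__ == "__main__":
    calls: List[bool] = []

    def fake_loader(local_files_only: bool) -> str:
        """Simulates a Hub that always returns 429 unless the cache is used."""
        calls.append(local_files_only)
        if not local_files_only:
            raise OSError("simulated 429")
        return "cached-model"

    print(load_with_fallback(fake_loader, sleep=lambda _: None), calls)
```

With transformers, `load` might be something like `lambda local: AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", token=os.environ.get("HF_TOKEN"), local_files_only=local)`. The fallback only helps if the cache was populated by an earlier run, which is what the persistent HF_HOME cache provides.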

Full changelog: https://github.com/MeridianAlgo/FinAI/blob/main/docs/CHANGELOG.md

05 Jun 18:59

MeridianAlgo-Developer

v1.0.0

bab080f

v1.0.0 - Production

Meridian.AI v1.0.0 — Production

First production release of Meridian.AI — a finance-specialized Qwen2.5-0.5B that continually fine-tunes itself every hour on free GitHub Actions infrastructure using Elastic Weight Consolidation (EWC).

All earlier tagged builds (v1.0.0-smollm2, v2.0.0-qwen, v5.1.0, v5.1.1, v6.0.0) were pre-production test / research iterations and have been retired.

Fixed — hourly CI dying with `exit code 143`

Root cause: the runner SIGTERM-killed the run (exit code 143 = 128 + 15) during the first backward pass. The forward pass succeeded ([CASCADE CHECK] Initial Loss ... printed), then backprop pushed peak RAM past the ~16 GB ceiling. The kill hit the whole step process tree, so the workflow's exit 0 safety net never ran.
Why guards missed it: RAM guards only check between micro-steps and cannot intercept a spike inside a single backward(). The v6.0.0 BLOCK_SIZE 256 → 512 jump is what started tripping this.
Fix: BLOCK_SIZE 512 → 384, SOFT_RAM_GB 12.5 → 11.0 (earlier truncation), MAX_RAM_GB 14.5 → 14.0 (more headroom).
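
The 128 + 15 arithmetic follows the common shell convention that a process killed by signal N is reported with exit status 128 + N. A small sketch for decoding such codes when triaging CI logs is shown below; `describe_exit_code` is a hypothetical helper, not part of the project.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    Values above 128 are treated as "killed by signal (code - 128)",
    following the usual shell convention.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # not a signal known on this platform
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error {code}"


if __name__ == "__main__":
    print(describe_exit_code(143))
```

For exit code 143 this reports SIGTERM, the signal named in the root cause above.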

Docs

README bumped to 1.0.0 Production with a new Training Status & Observability section and an exit-143 troubleshooting entry.
docs/training_pipeline.md and docs/setup_and_usage.md refreshed to v1.0.0 defaults.
CHANGELOG reframes all prior versions as pre-production test builds.

Model checkpoints: https://huggingface.co/meridianal/FinAI

Full changelog: see CHANGELOG.md

Releases: MeridianAlgo/FinAI
