A flagship orchestration of Llama-2-13B, fine-tuned on the FineWeb-Edu corpus with integrated Hybrid RAG.
InfoSage AI is a professional-grade AI ecosystem designed to bridge the gap between enterprise-scale LLM training and privacy-conscious local inference. By leveraging Meta's Llama-2-13B architecture and a 1-million-sample fine-tune of the FineWeb-Edu dataset, InfoSage provides a high-fidelity educational assistant that runs entirely on your hardware.
Training Guide • Asset Migration • Local Setup • Technical Specs
The current LLM landscape is bifurcated between high-latency API models and underpowered local alternatives. InfoSage AI introduces a third path: Cloud-to-Local Hybridization.
We use NVIDIA H100 clusters in the cloud for precision QLoRA fine-tuning, then deploy the resulting adapters to local RTX hardware using 4-bit quantization. This allows a 13-billion-parameter model to fit comfortably within an 8 GB VRAM envelope while maintaining enterprise-grade reasoning.
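The arithmetic behind that envelope can be sanity-checked in a few lines (a back-of-envelope sketch; the nominal parameter count and headroom figures are round illustrative numbers, not measurements):

```python
# Rough check that a 4-bit quantized 13B model fits an 8 GB card.
PARAMS = 13_000_000_000      # nominal Llama-2-13B parameter count
BITS_PER_WEIGHT = 4          # NF4 stores each weight in 4 bits

weight_gib = PARAMS * BITS_PER_WEIGHT / 8 / 2**30
print(f"quantized weights: ~{weight_gib:.2f} GiB")  # ~6.05 GiB

# That leaves roughly 1.3 GiB of headroom under a 7,500 MiB VRAM cap
# for the KV-cache, activations, and quantization constants (double
# quantization shrinks those constants to a fraction of a bit per weight).
```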
| Feature | Technical Implementation | Impact |
|---|---|---|
| Hybrid RAG | Multi-layered FAISS + Live HF Streaming | Grounded answers that minimize "memory drift". |
| QLoRA Mastery | 4-bit NF4 + Double Quantization | 13B model runs comfortably on 8GB VRAM. |
| Precision Post-Processing | Automated Word Segmentation | Eliminates fine-tuning sub-word artifacts. |
| Electric Azure UI | Flask + Liquid Glass Design System | Premium, high-fidelity dashboard experience. |
| Hardware Telemetry | Real-time VRAM & Inference Monitoring | Full transparency into system resource usage. |
| Cloud-to-Local Link | Google Drive Automated Checkpointing | Seamless migration of multi-GB model weights. |
InfoSage is orchestrated through a distributed pipeline that separates heavy-lifting compute from daily interactive inference.
```mermaid
graph TD
    subgraph "Phase 1: Cloud Intelligence (H100 80GB)"
        A[FineWeb-Edu 1M Stream] --> B{Streaming Tokenizer}
        B --> C[QLoRA H100 Trainer]
        C --> D[Rank-32 LoRA Adapters]
    end
    subgraph "Phase 2: Data Engineering"
        E[100K Educational Passages] --> F[Sentence-Transformers]
        F --> G[(Local FAISS Vector Store)]
    end
    subgraph "Phase 3: Inference Engine (RTX Series)"
        G <--> H[Orchestration Layer]
        D --> H
        I[Live HuggingFace Search] <--> H
        H --> J[Electric Azure UX]
    end
```
To achieve the reasoning depth of a 13-billion-parameter model, InfoSage leverages enterprise hardware. Follow these steps exactly to replicate the training results.
- Open Google Colab.
- Upload the `train.ipynb` file from this repository.
- Critical Hardware Selection: Go to `Runtime` > `Change runtime type` and select the H100 GPU. This model is specifically tuned for H100 VRAM and Tensor Core architectures.
The pipeline utilizes the `HuggingFaceFW/fineweb-edu` dataset.
- Streaming Mode: The script is configured to stream 1,000,000 samples. This bypasses Colab's limited disk space by processing data directly from HuggingFace servers.
- Specialized Subset: We focus on the high-quality educational score subset to ensure the model's "student-first" reasoning.
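A minimal sketch of how such a capped, score-filtered stream might look (the `score` field exists on FineWeb-Edu rows, but the 3.0 cutoff and this helper are illustrative assumptions, not the notebook's exact code):

```python
def stream_educational(records, limit=1_000_000, min_score=3.0):
    """Yield up to `limit` records whose educational score clears the cutoff.

    FineWeb-Edu rows carry a classifier score; `min_score` is an assumed
    threshold here, not necessarily the notebook's setting.
    """
    taken = 0
    for rec in records:
        if taken >= limit:
            return
        if rec.get("score", 0.0) >= min_score:
            yield rec
            taken += 1

# Hypothetical usage against the real dataset (requires `datasets`):
# from datasets import load_dataset
# ds = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
# for rec in stream_educational(ds):
#     ...  # tokenize on the fly -- no local disk copy needed
```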
- Mount Google Drive: The first cell will prompt for Drive access. This is essential for saving the model.
- Hardware Diagnostics: The internal `hw_diag` section will verify that thermal/memory limits are optimal for the H100.
- QLoRA Execution: Run all cells. The training takes approximately 45-60 minutes for 5,000 steps.
- Automatic Persistence: Upon completion, the script will create a directory in your Google Drive: `My Drive/fineweb_edu_llama2_13b/`
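Why that checkpoint is only a few hundred megabytes rather than ~26 GB of full weights: QLoRA saves just the rank-32 adapter matrices. A rough count, using Llama-2-13B's published shape (hidden size 5120, 40 layers); which projection modules are actually adapted is an assumption for illustration:

```python
def lora_params(d_in, d_out, rank=32):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

HIDDEN = 5120   # Llama-2-13B hidden size
LAYERS = 40     # Llama-2-13B decoder layers

# Assuming the four attention projections are adapted in every layer:
per_layer = 4 * lora_params(HIDDEN, HIDDEN)
total = LAYERS * per_layer
print(total)  # 52428800 trainable params -- about 0.4% of 13B
```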
Once the training in Colab finishes, you must migrate the "brain" and "memory" of the system to your local machine.
Navigate to your Google Drive and find the following files in `/fineweb_edu_llama2_13b/`:
From `final_model/`:

- `adapter_model.safetensors` (the trained adapter weights)
- `adapter_config.json` (the LoRA adapter configuration)
- `tokenizer.json` / `tokenizer_config.json` (required for text processing)
From `rag_index/`:

- `faiss_index.bin` (the vector database)
- `passages.npy` (the raw text corpus for RAG)
You must create the following folder structure in the root of your cloned repository:
```
fineweb-edu-llm-training/
├── out/
│   ├── final_model/     <-- Place safetensors & config here
│   └── rag_index/       <-- Place faiss_index.bin & passages.npy here
├── gui/                 <-- (Included in repo)
├── chat_llm.py          <-- (Included in repo)
└── train.ipynb          <-- (Used in Colab)
```
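A small helper (not part of the repository) can verify the layout before you launch; the paths mirror the tree above:

```python
from pathlib import Path

REQUIRED_ASSETS = [
    "out/final_model/adapter_model.safetensors",
    "out/final_model/adapter_config.json",
    "out/rag_index/faiss_index.bin",
    "out/rag_index/passages.npy",
]

def missing_assets(repo_root="."):
    """Return the required files that are absent under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_ASSETS if not (root / p).is_file()]

# Example: print whatever still needs to be copied from Google Drive
# for path in missing_assets():
#     print("missing:", path)
```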
The local runtime is optimized for Windows/NVIDIA environments.
```bash
# Install the high-performance local stack
pip install torch transformers datasets faiss-cpu sentence-transformers peft bitsandbytes accelerate wordsegment flask tqdm
```

InfoSage offers two modes of interaction:
**Option A: The Dashboard (Flagship Experience).** Provides the full Azure glassmorphism experience with hardware telemetry.
```bash
python gui/app.py
# Open: http://localhost:5000 in your browser
```

**Option B: Terminal Mode.** Low-latency console interaction for developers and automated testing.
```bash
python chat_llm.py
```

InfoSage doesn't just "remember"; it researches. The Layered Retrieval system operates as follows:
- Tier 1 (Instant): Queries the local FAISS vector store. If a similar passage is found with >0.5 confidence, it is injected as context.
- Tier 2 (Live Search): If local results are weak, the engine performs a live, keyword-filtered search across the full FineWeb-Edu dataset on HuggingFace.
- Tier 3 (Synthesis): The Llama-2-13B model synthesizes a grounded answer using the retrieved "Educational Context" blocks.
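The tier-1/tier-2 handoff can be sketched in plain Python (cosine scoring stands in for the FAISS lookup; the 0.5 threshold comes from the text, everything else is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_context(query_vec, index_vecs, passages, live_search, threshold=0.5):
    """Tier 1: local vector store; Tier 2: live-search fallback."""
    best_score, best_passage = max(
        (cosine(query_vec, v), p) for v, p in zip(index_vecs, passages)
    )
    if best_score > threshold:
        return best_passage          # Tier 1 hit: inject as context
    return live_search()             # Tier 2: query the live dataset instead
```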
Fine-tuned models often suffer from spacing artifacts ("joinedwords"). InfoSage integrates the `wordsegment` library into the post-generation pipeline: every response is dynamically parsed and reconstructed to ensure human-grade readability at 15-20 tokens/sec.
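The idea can be illustrated with a toy longest-match segmenter (the real pipeline uses `wordsegment`'s corpus-backed `load()`/`segment()` functions; this greedy dictionary version is only a sketch):

```python
def segment_greedy(text, vocab, max_word_len=12):
    """Split run-together text by greedily matching the longest known word."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in vocab or j == i + 1:   # fall back to 1 char
                words.append(text[i:j])
                i = j
                break
    return words

vocab = {"joined", "words", "fine", "tuning", "model"}
print(segment_greedy("joinedwords", vocab))  # ['joined', 'words']

# With the real library (pip install wordsegment):
# from wordsegment import load, segment
# load()
# segment("joinedwords")
```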
| Parameter | Configuration Detail |
|---|---|
| Foundation Model | Llama-2-13B (NousResearch community mirror) |
| Quantization | 4-bit NormalFloat (NF4) with Double Quantization |
| LoRA Config | Rank 32 / Alpha 64 / Dropout 0.05 |
| VRAM Buffer | Specifically capped at 7,500 MiB for 8GB cards |
| Attention Policy | Flash Attention 2 (H100) / SDPA (Local Inference) |
| Training Duration | 5,000 steps / 1,000,000 samples subset |
| Optimizer | AdamW 8-bit (Paged) |
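The 7,500 MiB cap in the table is the kind of limit transformers/accelerate can enforce through a `max_memory` map; a hedged sketch (the CPU budget and commented loading call are illustrative, not the repository's exact code):

```python
# Cap GPU 0 at 7,500 MiB so an 8 GB card keeps headroom for the display
# and KV-cache; anything that doesn't fit is offloaded to CPU RAM.
max_memory = {0: "7500MiB", "cpu": "16GiB"}

# Hypothetical loading call (requires transformers + accelerate):
# model = AutoModelForCausalLM.from_pretrained(
#     "NousResearch/Llama-2-13b-hf",   # community mirror named in the specs
#     quantization_config=bnb_config,  # the 4-bit NF4 config
#     device_map="auto",
#     max_memory=max_memory,
# )
```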
- `gui/`: Full-stack dashboard assets (Flask backend, Azure frontend).
- `chat_llm.py`: The orchestration kernel. Manages quantization, RAG routing, and post-processing.
- `train.ipynb`: H100-only training pipeline for Google Colab.
- `build_rag_index.py`: Data engineering tool for FAISS index creation.
InfoSage AI is an open-source research project released under the MIT License.
Developed with ❤️ for privacy-conscious intelligence using HuggingFace Transformers, PEFT, and a lot of GPU hours.