llm-dataset

Here are 8 public repositories matching this topic...

Nooxus-AI / NOO-Verified-Global-Entities

The Anti-Hallucination data layer for B2B Sourcing. Deep-verified global supply chain entities designed for RAG and LLM instruction tuning.

supply-chain data-engineering knowledge-graph b2b ai-agents jsonl rag chatml agentic-ai llm-dataset anti-hallucination sft-dataset supply-chain-llm

Updated Mar 19, 2026

jimstratus / forum-extractor

A comprehensive Python tool for extracting, processing, and analyzing RPG scenarios from the Era of the Imperial Republic (EOTIR) forums. Features automated web scraping, NLP-powered content analysis, character extraction, timeline generation, and LLM dataset preparation with an interactive HTML dashboard.

nlp rpg web-scraping data-extraction markdown-converter content-analysis invision-community scenario-analysis forum-scraper llm-dataset eotir

Updated Jan 20, 2026
Python

zachurban / HousingMind

AI-powered Q&A system for U.S. affordable housing policy, using RAG over 2,500+ HUD documents and 24 CFR.

housing semantic-search govtech housing-affordability public-housing housing-data housing-policy document-ai affordable-housing llm-dataset hud-compliance rad-conversion voucher-programs ai-in-government

Updated Jan 27, 2026
HTML

sumomomomomo / uma-voice-dataset-creator

python dataset dataset-generation umamusume tts-dataset llm-dataset

Updated Feb 19, 2026
Python

sandy-sp / gittxt

Gittxt is an AI-focused CLI and plugin tool for extracting, filtering, and packaging text from GitHub repos. Build LLM-compatible datasets, prep code for prompt engineering, and power AI workflows with structured .txt, .json, .md, or .zip outputs.

text-extraction developer-tools structured-data data-pipeline cli-tool ai-ready open-source-llm llm-dataset ai-preprocessing repo-analysis github-scanner prompt-data code-to-llm

Updated Aug 25, 2025
Python

gafnts / kleister-nda-preparation

Prepares the Kleister NDA dataset for LLM inference. Validates labels against a Pydantic schema and delivers partitioned Parquet files with co-located PDFs.

kie pydantic key-information-extraction document-ai llm-dataset

Updated Apr 1, 2026
Python

ikbal-nayem / bd-law-dataset

This repository aims to provide a structured and easily accessible dataset of laws in Bangladesh. The data is primarily sourced from the Bangladesh Law (BDLAW) website.

law dataset laws llm-dataset

Updated Jul 12, 2025

stagproject / mirelia-patent-market

Autonomous MCP server for M2M patent intelligence. Delivers structured JSON datasets (CPC A-H) enriched with biz_value_prop, tech stacks, and importance scoring. Supports instant autonomous data purchasing via ROSE cryptocurrency.

Updated Apr 3, 2026
Python
