This project processes student documents (PDFs, PowerPoint files) and uses OpenAI models to generate structured "context" reports for each student. The output is saved as JSON with phone numbers as keys.
What it does:
- Takes student files (resumes, SWOT analyses, career plans) as PDFs or PPTX
- Extracts text content using PyMuPDF and python-pptx
- Sends extracted content to OpenAI (GPT-3.5-turbo or GPT-4o-mini)
- Generates a structured profile with skills, experiences, strengths, weaknesses, goals, etc.
- Saves output as JSON with phone number as the key
pip install -r requirements.txtKey dependencies:
openai- OpenAI API clientpymupdf(fitz) - PDF extractionpdfplumber- PDF fallback extractionpython-pptx- PowerPoint extractionsqlalchemy- Database ORMpydantic- Data validationclick- CLI frameworkrich- Pretty terminal outputpython-dotenv- Environment variables
Create a .env file in the project root:
cp .env.example .envEdit .env and add your OpenAI API key:
# Required
OPENAI_API_KEY=your_openai_api_key_here
# Optional (defaults shown)
OPENAI_MODEL=gpt-3.5-turbo
DATABASE_URL=sqlite:///data/db/guidance.db
LOG_LEVEL=INFO
MAX_FILE_SIZE_MB=100python -m src.cli init-dbTo reset an existing database:
python -m src.cli init-db --resetpython -m src.cli add-student --name "John Doe" --grade "9-12"Options:
--name(required): Student's full name--grade: Grade band (e.g., "9-12", "college")--email: Student's email address
python -m src.cli process-document --student-id 1 --file /path/to/resume.pdf
python -m src.cli process-document --student-id 1 --file /path/to/swot.pdf
python -m src.cli process-document --student-id 1 --file /path/to/career_plan.pptxSupported formats: .pdf, .pptx, .ppt
python -m src.cli generate-context --student-id 1This sends the extracted content to OpenAI and generates a structured profile.
# View as formatted table
python -m src.cli show-context --student-id 1
# View as JSON
python -m src.cli show-context --student-id 1 --format json
# View as YAML
python -m src.cli show-context --student-id 1 --format yamlExport single student:
python -m src.cli export-student --student-id 1 --format json --output student_1.jsonExport all students to Excel:
python -m src.cli export-contexts --output all_contexts.xlsxExport all students to individual files:
python -m src.cli export-all-students --output-dir exports/The main output format uses phone numbers as keys. To generate this:
python create_phone_json.pyThis script:
- Reads student contexts from
all_students.json - Maps email prefixes to phone numbers from
emails_names_phones.txt - Outputs
all_students_ph.jsonwith phone numbers as keys
After generating contexts for all students:
# First export all to JSON
python -m src.cli export-all-students --output-dir exports/ --format json
# Then run the phone mapping script
python create_phone_json.pyThe final JSON output (all_students_ph.json) has this structure:
{
"9876543210": {
"student_id": 1,
"profile": {
"name": "John Doe",
"skills": ["Python", "Data Analysis", "Communication"],
"experiences": ["Internship at XYZ Corp", "Research Project"],
"education": ["B.Sc Computer Science"],
"strengths": ["Problem-solving", "Team collaboration"],
"weaknesses": ["Time management"],
"opportunities": ["Industry certifications", "Graduate studies"],
"threats": ["Competitive job market"],
"career_goals": ["Software Engineer at tech company"],
"interests": ["AI/ML", "Web Development"],
"achievements": ["Dean's List", "Hackathon Winner"]
},
"artifacts_summary": {
"total_artifacts": 3,
"artifact_types": ["resume", "swot_analysis", "career_plan"],
"files": ["resume.pdf", "swot.pdf", "career_plan.pptx"]
},
"generated_at": "2025-01-07T12:00:00.000000",
"model_used": "gpt-4o-mini"
}
}For processing many students at once, use:
python process_all_students.pyThis script:
- Reads student info from a source file
- Creates student records in the database
- Processes all associated documents
- Generates contexts for each student
- Exports to JSON
| Command | Description |
|---|---|
init-db |
Initialize or reset the database |
add-student |
Register a new student |
list-students |
View all registered students |
process-document |
Extract text from PDF/PPTX |
show-artifacts |
List processed documents for a student |
show-content |
View extracted text elements |
generate-context |
Generate AI profile from documents |
show-context |
Display generated context |
export-contexts |
Export all contexts to Excel |
export-student |
Export single student context |
export-all-students |
Batch export individual files |
Add --help to any command for detailed options:
python -m src.cli generate-context --helpguidance-agent/
├── src/
│ ├── cli.py # Main CLI interface
│ ├── context_generator.py # OpenAI integration
│ ├── query_handler.py # Query processing
│ ├── database/
│ │ ├── connection.py # DB connection
│ │ └── models.py # Database models
│ └── extractors/
│ ├── pdf_extractor.py # PDF text extraction
│ └── pptx_extractor.py # PowerPoint extraction
├── data/
│ ├── db/ # SQLite database
│ ├── contexts/ # Generated YAML contexts
│ ├── uploads/ # Original documents
│ └── artifacts/ # Processed copies
├── exports/ # Exported files
├── .env # Environment variables (create this)
├── .env.example # Environment template
└── requirements.txt # Python dependencies
Create a .env file with your API key (see Prerequisites).
Run process-document before generate-context.
The system uses PyMuPDF first, then falls back to pdfplumber. For scanned PDFs (images), OCR is not currently supported.
Close any other processes using the database, or use --reset with init-db.
# 1. Setup
pip install -r requirements.txt
cp .env.example .env
# Edit .env and add OPENAI_API_KEY
# 2. Initialize
python -m src.cli init-db
# 3. Add student
python -m src.cli add-student --name "Jane Smith" --email "jane@example.com"
# 4. Process documents
python -m src.cli process-document --student-id 1 --file docs/jane_resume.pdf
python -m src.cli process-document --student-id 1 --file docs/jane_swot.pdf
# 5. Generate context
python -m src.cli generate-context --student-id 1
# 6. View result
python -m src.cli show-context --student-id 1 --format json
# 7. Export
python -m src.cli export-student --student-id 1 --format json --output jane_context.json