VigyanShaala-Tech/assignment-context-extraction

How to Run the Student Context Generator

This project processes student documents (PDFs, PowerPoint files) and uses OpenAI models to generate structured "context" reports for each student. The output is saved as JSON with phone numbers as keys.

Overview

What it does:

  1. Takes student files (resumes, SWOT analyses, career plans) as PDF or PPTX files
  2. Extracts text content using PyMuPDF and python-pptx
  3. Sends extracted content to OpenAI (GPT-3.5-turbo or GPT-4o-mini)
  4. Generates a structured profile with skills, experiences, strengths, weaknesses, goals, etc.
  5. Saves output as JSON with phone number as the key

Prerequisites

1. Install Dependencies

pip install -r requirements.txt

Key dependencies:

  • openai - OpenAI API client
  • pymupdf (fitz) - PDF extraction
  • pdfplumber - PDF fallback extraction
  • python-pptx - PowerPoint extraction
  • sqlalchemy - Database ORM
  • pydantic - Data validation
  • click - CLI framework
  • rich - Pretty terminal output
  • python-dotenv - Environment variables

2. Set Up Environment Variables

Create a .env file in the project root:

cp .env.example .env

Edit .env and add your OpenAI API key:

# Required
OPENAI_API_KEY=your_openai_api_key_here

# Optional (defaults shown)
OPENAI_MODEL=gpt-3.5-turbo
DATABASE_URL=sqlite:///data/db/guidance.db
LOG_LEVEL=INFO
MAX_FILE_SIZE_MB=100
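
In application code, these settings might be read with the defaults shown above. The sketch below is illustrative (the `load_settings` helper is an assumption, not the project's actual API) and uses only the standard library; in the real project, python-dotenv loads the .env file into the environment first:

```python
import os

def load_settings() -> dict:
    """Read configuration from environment variables, applying the defaults above."""
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY not set; create a .env file first")
    return {
        "api_key": api_key,
        "model": os.getenv("OPENAI_MODEL", "gpt-3.5-turbo"),
        "database_url": os.getenv("DATABASE_URL", "sqlite:///data/db/guidance.db"),
        "log_level": os.getenv("LOG_LEVEL", "INFO"),
        "max_file_size_mb": int(os.getenv("MAX_FILE_SIZE_MB", "100")),
    }
```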

Quick Start

Step 1: Initialize the Database

python -m src.cli init-db

To reset an existing database:

python -m src.cli init-db --reset

Step 2: Add a Student

python -m src.cli add-student --name "John Doe" --grade "9-12"

Options:

  • --name (required): Student's full name
  • --grade: Grade band (e.g., "9-12", "college")
  • --email: Student's email address

Step 3: Process Student Documents

python -m src.cli process-document --student-id 1 --file /path/to/resume.pdf
python -m src.cli process-document --student-id 1 --file /path/to/swot.pdf
python -m src.cli process-document --student-id 1 --file /path/to/career_plan.pptx

Supported formats: .pdf, .pptx, .ppt
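
Internally, documents are routed to an extractor by file extension. A hypothetical dispatch could look like this (the extractor names here are assumptions for illustration, not the project's actual function names):

```python
from pathlib import Path

# Map each supported extension to an extractor name (illustrative only).
EXTRACTORS = {
    ".pdf": "pdf_extractor",
    ".pptx": "pptx_extractor",
    ".ppt": "pptx_extractor",
}

def pick_extractor(file_path: str) -> str:
    """Return the extractor responsible for this file, or raise for unsupported types."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix}") from None
```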

Step 4: Generate Context (AI Analysis)

python -m src.cli generate-context --student-id 1

This sends the extracted content to OpenAI and generates a structured profile.

Step 5: View the Generated Context

# View as formatted table
python -m src.cli show-context --student-id 1

# View as JSON
python -m src.cli show-context --student-id 1 --format json

# View as YAML
python -m src.cli show-context --student-id 1 --format yaml

Step 6: Export Contexts

Export single student:

python -m src.cli export-student --student-id 1 --format json --output student_1.json

Export all students to Excel:

python -m src.cli export-contexts --output all_contexts.xlsx

Export all students to individual files:

python -m src.cli export-all-students --output-dir exports/

Generating Phone-Keyed JSON

The main output format uses phone numbers as keys. To generate this:

Method 1: Using the Script

python create_phone_json.py

This script:

  1. Reads student contexts from all_students.json
  2. Maps email prefixes to phone numbers from emails_names_phones.txt
  3. Outputs all_students_ph.json with phone numbers as keys
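
The mapping step can be sketched as follows. This is a simplified illustration, not the real create_phone_json.py: the 'email,name,phone' line format for emails_names_phones.txt is an assumption, and the actual script may differ in file handling and error checking:

```python
def rekey_by_phone(contexts: dict, mapping_lines: list) -> dict:
    """Re-key email-prefix-keyed contexts by phone number.

    Assumes each mapping line looks like 'email,name,phone' (an assumed
    format for emails_names_phones.txt, not a documented one).
    """
    prefix_to_phone = {}
    for line in mapping_lines:
        email, _name, phone = [part.strip() for part in line.split(",")]
        prefix_to_phone[email.split("@")[0]] = phone

    # Keep only students whose email prefix has a known phone number.
    return {
        prefix_to_phone[prefix]: context
        for prefix, context in contexts.items()
        if prefix in prefix_to_phone
    }
```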

Method 2: Manual Export

After generating contexts for all students:

# First export all to JSON
python -m src.cli export-all-students --output-dir exports/ --format json

# Then run the phone mapping script
python create_phone_json.py

Output Format

The final JSON output (all_students_ph.json) has this structure:

{
  "9876543210": {
    "student_id": 1,
    "profile": {
      "name": "John Doe",
      "skills": ["Python", "Data Analysis", "Communication"],
      "experiences": ["Internship at XYZ Corp", "Research Project"],
      "education": ["B.Sc Computer Science"],
      "strengths": ["Problem-solving", "Team collaboration"],
      "weaknesses": ["Time management"],
      "opportunities": ["Industry certifications", "Graduate studies"],
      "threats": ["Competitive job market"],
      "career_goals": ["Software Engineer at tech company"],
      "interests": ["AI/ML", "Web Development"],
      "achievements": ["Dean's List", "Hackathon Winner"]
    },
    "artifacts_summary": {
      "total_artifacts": 3,
      "artifact_types": ["resume", "swot_analysis", "career_plan"],
      "files": ["resume.pdf", "swot.pdf", "career_plan.pptx"]
    },
    "generated_at": "2025-01-07T12:00:00.000000",
    "model_used": "gpt-4o-mini"
  }
}
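
Consumers of this file can look a student up directly by phone number; for example (standard library only, with `load_profile` as a hypothetical helper name):

```python
import json

def load_profile(path, phone):
    """Return the profile dict for a given phone number, or None if absent."""
    with open(path, encoding="utf-8") as f:
        students = json.load(f)
    entry = students.get(phone)
    return entry["profile"] if entry else None
```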

Batch Processing

For processing many students at once, use:

python process_all_students.py

This script:

  1. Reads student info from a source file
  2. Creates student records in the database
  3. Processes all associated documents
  4. Generates contexts for each student
  5. Exports to JSON

CLI Command Reference

Command              Description
-------              -----------
init-db              Initialize or reset the database
add-student          Register a new student
list-students        View all registered students
process-document     Extract text from PDF/PPTX
show-artifacts       List processed documents for a student
show-content         View extracted text elements
generate-context     Generate AI profile from documents
show-context         Display generated context
export-contexts      Export all contexts to Excel
export-student       Export single student context
export-all-students  Batch export individual files

Add --help to any command for detailed options:

python -m src.cli generate-context --help

Directory Structure

guidance-agent/
├── src/
│   ├── cli.py                 # Main CLI interface
│   ├── context_generator.py   # OpenAI integration
│   ├── query_handler.py       # Query processing
│   ├── database/
│   │   ├── connection.py      # DB connection
│   │   └── models.py          # Database models
│   └── extractors/
│       ├── pdf_extractor.py   # PDF text extraction
│       └── pptx_extractor.py  # PowerPoint extraction
├── data/
│   ├── db/                    # SQLite database
│   ├── contexts/              # Generated YAML contexts
│   ├── uploads/               # Original documents
│   └── artifacts/             # Processed copies
├── exports/                   # Exported files
├── .env                       # Environment variables (create this)
├── .env.example               # Environment template
└── requirements.txt           # Python dependencies

Troubleshooting

"OPENAI_API_KEY not set"

Create a .env file with your API key (see Prerequisites).

"No artifacts found for student"

Run process-document before generate-context.

PDF extraction issues

The system uses PyMuPDF first, then falls back to pdfplumber. For scanned PDFs (images), OCR is not currently supported.
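The fallback chain described above can be modeled generically: try each backend in order and return the first non-empty result. This sketch takes the extractor functions as arguments so it stays library-agnostic; the real code calls PyMuPDF and pdfplumber directly:

```python
def extract_with_fallback(path, extractors):
    """Try each extractor in turn; return the first non-empty text it yields.

    `extractors` is an ordered list of callables, e.g. a PyMuPDF-based
    function followed by a pdfplumber-based one.
    """
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # extraction error: fall through to the next backend
        if text and text.strip():
            return text
    raise ValueError(f"No extractor produced text for {path} (scanned PDF? OCR is unsupported)")
```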

Database locked errors

Close any other processes using the database, or use --reset with init-db.


Example Workflow

# 1. Setup
pip install -r requirements.txt
cp .env.example .env
# Edit .env and add OPENAI_API_KEY

# 2. Initialize
python -m src.cli init-db

# 3. Add student
python -m src.cli add-student --name "Jane Smith" --email "jane@example.com"

# 4. Process documents
python -m src.cli process-document --student-id 1 --file docs/jane_resume.pdf
python -m src.cli process-document --student-id 1 --file docs/jane_swot.pdf

# 5. Generate context
python -m src.cli generate-context --student-id 1

# 6. View result
python -m src.cli show-context --student-id 1 --format json

# 7. Export
python -m src.cli export-student --student-id 1 --format json --output jane_context.json
