Merged
78 changes: 64 additions & 14 deletions README.md
@@ -2,6 +2,8 @@

CiteMe is a modern, full-stack application designed to help researchers and academics manage their citations and references efficiently. The system provides intelligent citation suggestions, reference management, and seamless integration with academic databases.

🌐 **Live Demo**: [CiteMe Editor](https://cite-me-wpre.vercel.app/editor)

## 🚀 Features

- **Smart Citation Suggestions**: AI-powered citation recommendations based on your research context
@@ -11,6 +13,23 @@ CiteMe is a modern, full-stack application designed to help researchers and acad
- **Modern UI**: Responsive and intuitive user interface
- **API Integration**: Seamless integration with academic databases and search engines

## 📁 Project Structure

```
CiteMe/
├── frontend/ # Vue.js 3 frontend application
│ ├── src/ # Source code
│ ├── public/ # Static assets
│ ├── e2e/ # End-to-end tests
│ └── dist/ # Production build
├── backend/
│ ├── mainService/ # Core citation service
│ └── metricsService/ # Analytics and metrics service
├── .github/ # GitHub workflows and templates
├── docker-compose.yml # Docker services configuration
└── README.md # Project documentation
```

## 🏗️ Architecture

The application is built using a microservices architecture with three main components:
@@ -46,29 +65,60 @@ The application is built using a microservices architecture with three main comp
- Node.js 20+ (for local frontend development)
- Python 3.11+ (for local backend development)

### Running with Docker
### Running with Docker Compose (Recommended for Local Development)

1. Clone the repository:
```bash
git clone https://github.com/yourusername/citeme.git
cd citeme
```

2. Create a `.env` file in the root directory with necessary environment variables:
```env
# Add your environment variables here
```
2. Create `.env` files in both service directories:
- `backend/mainService/.env`
- `backend/metricsService/.env`
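
   Both services ship a `.env.example` to copy from. For the main service the file looks roughly like this (key names come from `backend/mainService/.env.example`; the values here are placeholders):

   ```env
   GROQ_API_KEY=your-groq-api-key
   GOOGLE_API_KEY=your-gemini-api-key
   MIXBREAD_API_KEY=your-mixbread-api-key
   PINECONE_API_KEY=your-pinecone-api-key
   GPSE_API_KEY=your-programmable-search-api-key
   CX=your-custom-search-engine-id
   AZURE_MODELS_ENDPOINT=https://your-azure-models-endpoint
   # Optional: if unset, credibility metrics are not fetched
   CREDIBILITY_API_URL=http://localhost:9050/api/v1/credibility/batch
   ```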

3. Build and run the backend services using Docker Compose:
3. Build and run the services using Docker Compose:
```bash
docker-compose up --build
```

The backend services will be available at:
- Main Service: http://localhost:8000
- Metrics Service: http://localhost:8001
The services will be available at:
- Main Service: http://localhost:9020
- Metrics Service: http://localhost:9050

### Running Services Individually

If you need to run services separately:

1. Create the Docker network:
```bash
docker network create cite_me
```

2. Run the Metrics Service:
```bash
cd backend/metricsService
docker build -t metrics_service .
docker run -p 9050:8000 \
--name ms \
--network cite_me \
--env-file .env \
metrics_service
```

3. Run the Main Service:
```bash
cd backend/mainService
docker build -t main_service .
docker run -p 9020:8000 \
--name mbs \
--network cite_me \
--env-file .env \
-e CREDIBILITY_API_URL=http://ms:8000/api/v1/credibility/batch \
main_service
```

### Local Development
### Local Development Without Docker

#### Frontend
```bash
@@ -83,7 +133,7 @@ cd backend/mainService
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --reload
uvicorn app:app --reload --port 9020
```

#### Metrics Service
@@ -92,14 +142,14 @@ cd backend/metricsService
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --reload
uvicorn src.main:app --reload --port 9050
```

## 📚 API Documentation

Once the services are running, you can access the API documentation at:
- Main Service: http://localhost:8000/docs
- Metrics Service: http://localhost:8001/docs
- Main Service: http://localhost:9020/docs
- Metrics Service: http://localhost:9050/docs

## 🧪 Testing

44 changes: 43 additions & 1 deletion backend/mainService/.env.example
@@ -4,4 +4,46 @@ GROQ_API_KEY= # your groq api key
GOOGLE_API_KEY= # your gemini google api key
MIXBREAD_API_KEY= # your mixbread api key
PINECONE_API_KEY= # your pinecone api key
AZURE_MODELS_ENDPOINT = # your azure model endpoint for citation generation
AZURE_MODELS_ENDPOINT = # your azure model endpoint for citation generation
CREDIBILITY_API_URL = # your credibility api url


# NOTE:
# CREDIBILITY_API_URL is the URL of the credibility API used to fetch
# credibility metrics for sources. Optional: if it is not set, credibility
# metrics are simply not fetched.
#
# AZURE_MODELS_ENDPOINT is the endpoint of the Azure-hosted model used to
# generate citations for the sources. Required.
#
# MIXBREAD_API_KEY is the API key for the Mixbread API used to rerank the
# sources. Required.
#
# PINECONE_API_KEY is the API key for the Pinecone API used to store the
# source embeddings. Required.
#
# GPSE_API_KEY is the API key for the Google Programmable Search Engine API
# used to search the web. Required.
#
# GOOGLE_API_KEY is the API key for the Google Gemini API, used to merge the
# chunks of cited text returned by the Azure model. Required.
#
# CX is the custom search engine ID for the Google Programmable Search Engine.
#
# Any of these services can be swapped for your own implementation. For
# instance, to use Gemini rather than an Azure-hosted model to generate the
# in-text citations and references, write your own cite function/module and
# replace the Azure cite function in the citation service file.
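
As a concrete illustration of that swap, a hypothetical drop-in cite function might look like the following (the function name and return shape are assumptions, not the project's actual interface):

```python
from typing import Dict, List

def cite_with_custom_model(text: str, sources: List[Dict], citation_style: str = "APA") -> Dict:
    """Generate in-text citations and a reference list from ranked sources.

    This stub only formats author/date pairs; replace the body with a call
    to any LLM (e.g. Gemini) that returns the same shape, then wire it in
    where the Azure cite function is used in the citation service.
    """
    references = [
        f"{s.get('author_name', 'Unknown')} ({s.get('publication_date', 'n.d.')}). "
        f"{s.get('title', 'Untitled')}."
        for s in sources
    ]
    return {"text": text, "references": references, "style": citation_style}
```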









21 changes: 18 additions & 3 deletions backend/mainService/Dockerfile
@@ -3,21 +3,36 @@ FROM python:3.11-slim
WORKDIR /app

# Install system dependencies
# build-essential provides the compilers and headers needed to build Python package dependencies from source.
# Clearing the downloaded apt package lists afterwards keeps the image small.
RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*

# Set the PATH environment variable to include /app
ENV PATH="/app:${PATH}"

# Copy requirements first to leverage Docker cache
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY . .
# Copy the source code
COPY ./scripts/ /app/scripts/
COPY ./src/ /app/src/
COPY ./app.py /app/app.py
COPY ./__init__.py /app/__init__.py

# Create a directory for runtime configuration
RUN mkdir -p /app/config

# Install playwright
RUN playwright install && playwright install-deps

# Expose the port the app runs on
EXPOSE 8000


# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
4 changes: 4 additions & 0 deletions backend/mainService/app.py
@@ -12,6 +12,9 @@
from src.scraper.async_content_scraper import AsyncContentScraper
from fastapi.middleware.cors import CORSMiddleware
import nltk
from src.utils.concurrent_resources import cleanup_resources



origins = [
"http://localhost:5173", # Frontend running on localhost (React, Vue, etc.)
@@ -37,6 +40,7 @@ async def startup_event(app: FastAPI):
await app.state.playwright_driver.quit()
await app.state.pc.cleanup()
await AsyncHTTPClient.close_session()
cleanup_resources() # Clean up thread pool and other concurrent resources


app = FastAPI(lifespan=startup_event)
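
The lifespan hook above ties resource cleanup to application shutdown. A dependency-free sketch of the same pattern (names are illustrative; FastAPI's real lifespan receives the app and yields between startup and shutdown):

```python
import asyncio
from contextlib import asynccontextmanager

def cleanup_resources(app):
    # Stand-in for shutting down thread pools and other shared resources
    app["cleaned"] = True

@asynccontextmanager
async def lifespan(app):
    # --- startup: initialize shared state ---
    app["ready"] = True
    yield
    # --- shutdown: release everything startup created ---
    cleanup_resources(app)

async def main():
    app = {}  # stand-in for the FastAPI app.state
    async with lifespan(app):
        assert app["ready"]  # requests would be served here
    return app

print(asyncio.run(main()))  # → {'ready': True, 'cleaned': True}
```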
1 change: 1 addition & 0 deletions backend/mainService/requirements.txt
@@ -23,4 +23,5 @@ urllib3==2.3.0
lxml==5.3.0
google-genai
redis>=4.2.0
uvicorn

15 changes: 7 additions & 8 deletions backend/mainService/src/config/config.py
@@ -98,18 +98,17 @@ class LlmConfig:
# Concurrency and Performance
@dataclass
class ConcurrencyConfig:
"""Configuration class for concurrency settings.
"""Configuration class for concurrency settings."""

Contains settings that control parallel processing and thread management."""
"""Configuration for concurrency and performance settings."""

"""
This is the number of concurrent workers that will be used for parallel and concurrent operations.
"""
# General concurrency settings
DEFAULT_CONCURRENT_WORKERS: int = (os.cpu_count() // 2) + 1

HANDLE_INDEX_DELETE_WORKERS: int = 2

# Credibility service specific settings
CREDIBILITY_MAX_THREADS: int = 4 # Maximum threads for credibility calculations
CREDIBILITY_MAX_CONCURRENT: int = 8 # Maximum concurrent operations
CREDIBILITY_BATCH_SIZE: int = 4 # Size of processing batches


@dataclass
class ModelConfig:
2 changes: 1 addition & 1 deletion backend/mainService/src/scraper/async_content_scraper.py
@@ -99,7 +99,7 @@ async def __aenter__(self):
async def __aexit__(self, exc_type, exc_val, exc_tb):
try:
if self._context:
self.scraper_driver.quit()
await self.scraper_driver.quit()
await self._context.close()
except Exception as e:
# Log the exception even if it occurred during cleanup
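
The fix above adds the missing `await` on `quit()` in `__aexit__`. A minimal sketch of why that matters (class and attribute names are illustrative stand-ins for the real driver):

```python
import asyncio

class FakeDriver:
    """Minimal stand-in for the async browser driver."""
    def __init__(self):
        self.closed = False

    async def quit(self):
        self.closed = True  # simulates an async browser shutdown

class Scraper:
    def __init__(self, driver):
        self.scraper_driver = driver

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        # Without `await`, quit() just returns an unawaited coroutine and
        # the driver never actually shuts down -- the bug fixed above.
        await self.scraper_driver.quit()

async def main():
    driver = FakeDriver()
    async with Scraper(driver):
        pass
    return driver.closed

print(asyncio.run(main()))  # → True
```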
92 changes: 38 additions & 54 deletions backend/mainService/src/services/citation_service.py
@@ -368,64 +368,58 @@ async def _generate_citations(
max_tokens=LLMEC.QUERY_TOKEN_SIZE,
overlap_percent=5
)
# RAG + Rerank
# RAG + Rerank
results = await self.process_queries(queries)
logger.info(f"size of reranked results:{len(results)}\n\n")
filtered_results = filter_mixbread_results(results)
logger.info(f"size of filtered results:{len(filtered_results)}\n\n")
# Generate citation

sources_with_scores = [
{
"title": result.get("title", ""),
"link": result.get("link", "") or result.get("url", ""),
"domain": result.get("domain", ""),
"journal": result.get("journal_title", ""),
"citation_doi": result.get("citation_doi", ""),
"citation_references": result.get("references", [""]),
"publication_date": result.get("publication_date", ""),
"author_name": result.get("author_name", "") or result.get("author", "") or result.get("authors", ""),
"abstract": result.get("abstract", ""),
"issn": result.get("issn", ""),
"type": result.get("type", ""),
"rerank_score": result.get("score", 0)
} for result in filtered_results
]

credibility_task = get_credibility_metrics(sources_with_scores)
citation_task = Citation(source=filtered_results).cite(
text=queries,
citation_style=style
)

sources_for_credibility = [
{
"title": result.get(
"title", ""), "link": result.get(
"link", "") or result.get(
"url", ""), "domain": result.get(
"domain", ""), "journal": result.get(
"journal_title", ""), "citation_doi": result.get(
"citation_doi", ""), "citation_references": result.get(
"references", [""]), "publication_date": result.get(
"publication_date", ""), "author_name": result.get(
"author_name", "") or result.get(
"author", "") or result.get(
"authors", ""), "abstract": result.get(
"abstract", ""), "issn": result.get(
"issn", ""), "type": result.get(
"type", "")} for result in filtered_results]
credibility_task = get_credibility_metrics(sources_for_credibility)

# Wait for both tasks to complete
citation_result, credibility_metrics = await asyncio.gather(
citation_task,
credibility_task,
return_exceptions=True
)
# Start both tasks but handle credibility metrics first
credibility_metrics = await asyncio.gather(credibility_task, return_exceptions=True)

if isinstance(credibility_metrics[0], Exception):
logger.exception(f"Credibility metrics failed: {str(credibility_metrics[0])}")
credibility_metrics = []
else:
credibility_metrics = credibility_metrics[0]

# Calculate scores immediately after getting credibility metrics
scores = await calculate_overall_score(credibility_metrics, sources_with_scores,
rerank_weight=0.6, credibility_weight=0.4)

sources = [
item["data"] for item in credibility_metrics if item["status"] == "success"
] if credibility_metrics else []

citation_result = await citation_task
if isinstance(citation_result, Exception):
logger.exception(f"Citation generation failed: {str(citation_result)}")
raise CitationGenerationError("Failed to generate citations")

if isinstance(credibility_metrics, Exception):
logger.exception(f"Credibility metrics failed: {str(credibility_metrics)}")
credibility_metrics = []

# Calculate overall credibility score
overall_score = calculate_overall_score(credibility_metrics)

# Extract source details from credibility metrics
sources = []
if credibility_metrics:
sources = [
item["data"] for item in credibility_metrics if item["status"] == "success"]

# Structure the final response
return {
"result": citation_result,
"overall_score": overall_score,
"overall_score": scores["overall_score"],
"sources": sources
}

@@ -436,13 +430,3 @@ async def _generate_citations(
logger.exception(f"Unexpected error in citation generation: {str(e)}")
return False


# TODO: store unique top 2 reranked results in a set
# TODO: feed the above to an llm as context to generate a citation
# TODO: Break the user content into large batches and ask the llm to generate a citation for each sentence/paragraph in a batch in the requested format
# TODO: store the citation in a database
# TODO: return the content with intext citations and reference list in json format
# TODO: annotate code
# TODO: clean up code: Get rid of code smells,magic numbers and bottlenecks
# TODO: add tests
# TODO: add more error handling
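
The refactored flow above fetches credibility metrics with `asyncio.gather(..., return_exceptions=True)`, degrades gracefully when they fail, and blends them with rerank scores (weights 0.6 and 0.4). A toy sketch of that pattern (function names and data shapes are illustrative, not the service's real API):

```python
import asyncio

async def fetch_credibility(sources):
    # Stand-in for the metrics-service call; may raise on network failure
    return [{"status": "success", "data": {"credibility": 0.8}} for _ in sources]

def overall_score(credibility, rerank, credibility_weight=0.4, rerank_weight=0.6):
    # Weighted blend of rerank and credibility scores, as in the diff above
    return rerank_weight * rerank + credibility_weight * credibility

async def main():
    sources = [{"title": "Paper A", "rerank_score": 0.9}]
    results = await asyncio.gather(fetch_credibility(sources), return_exceptions=True)
    if isinstance(results[0], Exception):
        metrics = []  # degrade gracefully instead of failing the request
    else:
        metrics = results[0]
    return round(overall_score(metrics[0]["data"]["credibility"],
                               sources[0]["rerank_score"]), 2)

print(asyncio.run(main()))  # → 0.86
```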