Merged
78 changes: 64 additions & 14 deletions README.md
@@ -2,6 +2,8 @@

CiteMe is a modern, full-stack application designed to help researchers and academics manage their citations and references efficiently. The system provides intelligent citation suggestions, reference management, and seamless integration with academic databases.

🌐 **Live Demo**: [CiteMe Editor](https://cite-me-wpre.vercel.app/editor)

## 🚀 Features

- **Smart Citation Suggestions**: AI-powered citation recommendations based on your research context
@@ -11,6 +13,23 @@ CiteMe is a modern, full-stack application designed to help researchers and acad
- **Modern UI**: Responsive and intuitive user interface
- **API Integration**: Seamless integration with academic databases and search engines

## 📁 Project Structure

```
CiteMe/
├── frontend/ # Vue.js 3 frontend application
│ ├── src/ # Source code
│ ├── public/ # Static assets
│ ├── e2e/ # End-to-end tests
│ └── dist/ # Production build
├── backend/
│ ├── mainService/ # Core citation service
│ └── metricsService/ # Analytics and metrics service
├── .github/ # GitHub workflows and templates
├── docker-compose.yml # Docker services configuration
└── README.md # Project documentation
```

## 🏗️ Architecture

The application is built using a microservices architecture with three main components:
@@ -46,29 +65,60 @@ The application is built using a microservices architecture with three main comp
- Node.js 20+ (for local frontend development)
- Python 3.11+ (for local backend development)

### Running with Docker
### Running with Docker Compose (Recommended for Local Development)

1. Clone the repository:
```bash
git clone https://github.com/yourusername/citeme.git
cd citeme
```

2. Create a `.env` file in the root directory with necessary environment variables:
```env
# Add your environment variables here
```
2. Create `.env` files in both service directories:
- `backend/mainService/.env`
- `backend/metricsService/.env`
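
   Both services ship a `.env.example` to copy from. For the main service the file looks roughly like this (key names come from `backend/mainService/.env.example`; the values here are placeholders):

   ```env
   GROQ_API_KEY=your-groq-api-key
   GOOGLE_API_KEY=your-gemini-api-key
   MIXBREAD_API_KEY=your-mixbread-api-key
   PINECONE_API_KEY=your-pinecone-api-key
   GPSE_API_KEY=your-programmable-search-api-key
   CX=your-custom-search-engine-id
   AZURE_MODELS_ENDPOINT=https://your-azure-models-endpoint
   # Optional: if unset, credibility metrics are not fetched
   CREDIBILITY_API_URL=http://localhost:9050/api/v1/credibility/batch
   ```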

3. Build and run the backend services using Docker Compose:
3. Build and run the services using Docker Compose:
```bash
docker-compose up --build
```

The backend services will be available at:
- Main Service: http://localhost:8000
- Metrics Service: http://localhost:8001
The services will be available at:
- Main Service: http://localhost:9020
- Metrics Service: http://localhost:9050

### Running Services Individually

If you need to run services separately:

1. Create the Docker network:
```bash
docker network create cite_me
```

2. Run the Metrics Service:
```bash
cd backend/metricsService
docker build -t metrics_service .
docker run -p 9050:8000 \
--name ms \
--network cite_me \
--env-file .env \
metrics_service
```

3. Run the Main Service:
```bash
cd backend/mainService
docker build -t main_service .
docker run -p 9020:8000 \
--name mbs \
--network cite_me \
--env-file .env \
-e CREDIBILITY_API_URL=http://ms:8000/api/v1/credibility/batch \
main_service
```

### Local Development
### Local Development Without Docker

#### Frontend
```bash
@@ -83,7 +133,7 @@ cd backend/mainService
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --reload
uvicorn app:app --reload --port 9020
```

#### Metrics Service
@@ -92,14 +142,14 @@ cd backend/metricsService
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --reload
uvicorn src.main:app --reload --port 9050
```

## 📚 API Documentation

Once the services are running, you can access the API documentation at:
- Main Service: http://localhost:8000/docs
- Metrics Service: http://localhost:8001/docs
- Main Service: http://localhost:9020/docs
- Metrics Service: http://localhost:9050/docs

## 🧪 Testing

44 changes: 43 additions & 1 deletion backend/mainService/.env.example
@@ -4,4 +4,46 @@ GROQ_API_KEY= # your groq api key
GOOGLE_API_KEY= # your gemini google api key
MIXBREAD_API_KEY= # your mixbread api key
PINECONE_API_KEY= # your pinecone api key
AZURE_MODELS_ENDPOINT = # your azure model endpoint for citation generation
AZURE_MODELS_ENDPOINT = # your azure model endpoint for citation generation
CREDIBILITY_API_URL = # your credibility api url


# NOTE:
# CREDIBILITY_API_URL is the URL of the credibility API used to fetch
# credibility metrics for sources. Optional: if it is not set, credibility
# metrics are simply not fetched.
#
# AZURE_MODELS_ENDPOINT is the endpoint of the Azure-hosted model used to
# generate citations for the sources. Required.
#
# MIXBREAD_API_KEY is the API key for the Mixbread API used to rerank the
# sources. Required.
#
# PINECONE_API_KEY is the API key for the Pinecone API used to store the
# source embeddings. Required.
#
# GPSE_API_KEY is the API key for the Google Programmable Search Engine API
# used to search the web. Required.
#
# GOOGLE_API_KEY is the API key for the Google Gemini API, used to merge the
# chunks of cited text returned by the Azure model. Required.
#
# CX is the custom search engine ID for the Google Programmable Search Engine.
#
# Any of these services can be swapped for your own implementation. For
# instance, to use Gemini rather than an Azure-hosted model to generate the
# in-text citations and references, write your own cite function/module and
# replace the Azure cite function in the citation service file.
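
As a concrete illustration of that swap, a hypothetical drop-in cite function might look like the following (the function name and return shape are assumptions, not the project's actual interface):

```python
from typing import Dict, List

def cite_with_custom_model(text: str, sources: List[Dict], citation_style: str = "APA") -> Dict:
    """Generate in-text citations and a reference list from ranked sources.

    This stub only formats author/date pairs; replace the body with a call
    to any LLM (e.g. Gemini) that returns the same shape, then wire it in
    where the Azure cite function is used in the citation service.
    """
    references = [
        f"{s.get('author_name', 'Unknown')} ({s.get('publication_date', 'n.d.')}). "
        f"{s.get('title', 'Untitled')}."
        for s in sources
    ]
    return {"text": text, "references": references, "style": citation_style}
```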









21 changes: 18 additions & 3 deletions backend/mainService/Dockerfile
@@ -3,21 +3,36 @@ FROM python:3.11-slim
WORKDIR /app

# Install system dependencies
# build-essential provides the compilers and headers needed to build Python package dependencies from source.
# Clearing the downloaded apt package lists afterwards keeps the image small.
RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*

# Set the PATH environment variable to include /app
ENV PATH="/app:${PATH}"

# Copy requirements first to leverage Docker cache
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY . .
# Copy the source code
COPY ./scripts/ /app/scripts/
COPY ./src/ /app/src/
COPY ./app.py /app/app.py
COPY ./__init__.py /app/__init__.py

# Create a directory for runtime configuration
RUN mkdir -p /app/config

# Install playwright
RUN playwright install && playwright install-deps

# Expose the port the app runs on
EXPOSE 8000


# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
4 changes: 4 additions & 0 deletions backend/mainService/app.py
@@ -12,6 +12,9 @@
from src.scraper.async_content_scraper import AsyncContentScraper
from fastapi.middleware.cors import CORSMiddleware
import nltk
from src.utils.concurrent_resources import cleanup_resources



origins = [
"http://localhost:5173", # Frontend running on localhost (React, Vue, etc.)
@@ -37,6 +40,7 @@ async def startup_event(app: FastAPI):
await app.state.playwright_driver.quit()
await app.state.pc.cleanup()
await AsyncHTTPClient.close_session()
cleanup_resources() # Clean up thread pool and other concurrent resources


app = FastAPI(lifespan=startup_event)
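
The lifespan hook above ties resource cleanup to application shutdown. A dependency-free sketch of the same pattern (names are illustrative; FastAPI's real lifespan receives the app and yields between startup and shutdown):

```python
import asyncio
from contextlib import asynccontextmanager

def cleanup_resources(app):
    # Stand-in for shutting down thread pools and other shared resources
    app["cleaned"] = True

@asynccontextmanager
async def lifespan(app):
    # --- startup: initialize shared state ---
    app["ready"] = True
    yield
    # --- shutdown: release everything startup created ---
    cleanup_resources(app)

async def main():
    app = {}  # stand-in for the FastAPI app.state
    async with lifespan(app):
        assert app["ready"]  # requests would be served here
    return app

print(asyncio.run(main()))  # → {'ready': True, 'cleaned': True}
```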
1 change: 1 addition & 0 deletions backend/mainService/requirements.txt
@@ -23,4 +23,5 @@ urllib3==2.3.0
lxml==5.3.0
google-genai
redis>=4.2.0
uvicorn

15 changes: 7 additions & 8 deletions backend/mainService/src/config/config.py
@@ -98,18 +98,17 @@ class LlmConfig:
# Concurrency and Performance
@dataclass
class ConcurrencyConfig:
"""Configuration class for concurrency settings.
"""Configuration class for concurrency settings."""

Contains settings that control parallel processing and thread management."""
"""Configuration for concurrency and performance settings."""

"""
This is the number of concurrent workers that will be used for parallel and concurrent operations.
"""
# General concurrency settings
DEFAULT_CONCURRENT_WORKERS: int = (os.cpu_count() // 2) + 1

HANDLE_INDEX_DELETE_WORKERS: int = 2

# Credibility service specific settings
CREDIBILITY_MAX_THREADS: int = 4 # Maximum threads for credibility calculations
CREDIBILITY_MAX_CONCURRENT: int = 8 # Maximum concurrent operations
CREDIBILITY_BATCH_SIZE: int = 4 # Size of processing batches


@dataclass
class ModelConfig:
2 changes: 1 addition & 1 deletion backend/mainService/src/scraper/async_content_scraper.py
@@ -99,7 +99,7 @@ async def __aenter__(self):
async def __aexit__(self, exc_type, exc_val, exc_tb):
try:
if self._context:
self.scraper_driver.quit()
await self.scraper_driver.quit()
await self._context.close()
except Exception as e:
# Log the exception even if it occurred during cleanup
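
The fix above adds the missing `await` on `quit()` in `__aexit__`. A minimal sketch of why that matters (class and attribute names are illustrative stand-ins for the real driver):

```python
import asyncio

class FakeDriver:
    """Minimal stand-in for the async browser driver."""
    def __init__(self):
        self.closed = False

    async def quit(self):
        self.closed = True  # simulates an async browser shutdown

class Scraper:
    def __init__(self, driver):
        self.scraper_driver = driver

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        # Without `await`, quit() just returns an unawaited coroutine and
        # the driver never actually shuts down -- the bug fixed above.
        await self.scraper_driver.quit()

async def main():
    driver = FakeDriver()
    async with Scraper(driver):
        pass
    return driver.closed

print(asyncio.run(main()))  # → True
```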
92 changes: 38 additions & 54 deletions backend/mainService/src/services/citation_service.py
@@ -368,64 +368,58 @@ async def _generate_citations(
max_tokens=LLMEC.QUERY_TOKEN_SIZE,
overlap_percent=5
)
# RAG + Rerank
# RAG + Rerank
results = await self.process_queries(queries)
logger.info(f"size of reranked results:{len(results)}\n\n")
filtered_results = filter_mixbread_results(results)
logger.info(f"size of filtered results:{len(filtered_results)}\n\n")
# Generate citation

sources_with_scores = [
{
"title": result.get("title", ""),
"link": result.get("link", "") or result.get("url", ""),
"domain": result.get("domain", ""),
"journal": result.get("journal_title", ""),
"citation_doi": result.get("citation_doi", ""),
"citation_references": result.get("references", [""]),
"publication_date": result.get("publication_date", ""),
"author_name": result.get("author_name", "") or result.get("author", "") or result.get("authors", ""),
"abstract": result.get("abstract", ""),
"issn": result.get("issn", ""),
"type": result.get("type", ""),
"rerank_score": result.get("score", 0)
} for result in filtered_results
]

credibility_task = get_credibility_metrics(sources_with_scores)
citation_task = Citation(source=filtered_results).cite(
text=queries,
citation_style=style
)

sources_for_credibility = [
{
"title": result.get(
"title", ""), "link": result.get(
"link", "") or result.get(
"url", ""), "domain": result.get(
"domain", ""), "journal": result.get(
"journal_title", ""), "citation_doi": result.get(
"citation_doi", ""), "citation_references": result.get(
"references", [""]), "publication_date": result.get(
"publication_date", ""), "author_name": result.get(
"author_name", "") or result.get(
"author", "") or result.get(
"authors", ""), "abstract": result.get(
"abstract", ""), "issn": result.get(
"issn", ""), "type": result.get(
"type", "")} for result in filtered_results]
credibility_task = get_credibility_metrics(sources_for_credibility)

# Wait for both tasks to complete
citation_result, credibility_metrics = await asyncio.gather(
citation_task,
credibility_task,
return_exceptions=True
)
# Start both tasks but handle credibility metrics first
credibility_metrics = await asyncio.gather(credibility_task, return_exceptions=True)

if isinstance(credibility_metrics[0], Exception):
logger.exception(f"Credibility metrics failed: {str(credibility_metrics[0])}")
credibility_metrics = []
else:
credibility_metrics = credibility_metrics[0]

# Calculate scores immediately after getting credibility metrics
scores = await calculate_overall_score(credibility_metrics, sources_with_scores,
rerank_weight=0.6, credibility_weight=0.4)

sources = [
item["data"] for item in credibility_metrics if item["status"] == "success"
] if credibility_metrics else []

citation_result = await citation_task
if isinstance(citation_result, Exception):
logger.exception(f"Citation generation failed: {str(citation_result)}")
raise CitationGenerationError("Failed to generate citations")

if isinstance(credibility_metrics, Exception):
logger.exception(f"Credibility metrics failed: {str(credibility_metrics)}")
credibility_metrics = []

# Calculate overall credibility score
overall_score = calculate_overall_score(credibility_metrics)

# Extract source details from credibility metrics
sources = []
if credibility_metrics:
sources = [
item["data"] for item in credibility_metrics if item["status"] == "success"]

# Structure the final response
return {
"result": citation_result,
"overall_score": overall_score,
"overall_score": scores["overall_score"],
"sources": sources
}

@@ -436,13 +430,3 @@ async def _generate_citations(
logger.exception(f"Unexpected error in citation generation: {str(e)}")
return False


# TODO: store unique top 2 reranked results in a set
# TODO: feed the above to an llm as context to generate a citation
# TODO: Break the user content into large batches and ask the llm to generate a citation for each sentence/paragraph in a batch in the requested format
# TODO: store the citation in a database
# TODO: return the content with intext citations and reference list in json format
# TODO: annotate code
# TODO: clean up code: Get rid of code smells,magic numbers and bottlenecks
# TODO: add tests
# TODO: add more error handling
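
The refactored flow above fetches credibility metrics with `asyncio.gather(..., return_exceptions=True)`, degrades gracefully when they fail, and blends them with rerank scores (weights 0.6 and 0.4). A toy sketch of that pattern (function names and data shapes are illustrative, not the service's real API):

```python
import asyncio

async def fetch_credibility(sources):
    # Stand-in for the metrics-service call; may raise on network failure
    return [{"status": "success", "data": {"credibility": 0.8}} for _ in sources]

def overall_score(credibility, rerank, credibility_weight=0.4, rerank_weight=0.6):
    # Weighted blend of rerank and credibility scores, as in the diff above
    return rerank_weight * rerank + credibility_weight * credibility

async def main():
    sources = [{"title": "Paper A", "rerank_score": 0.9}]
    results = await asyncio.gather(fetch_credibility(sources), return_exceptions=True)
    if isinstance(results[0], Exception):
        metrics = []  # degrade gracefully instead of failing the request
    else:
        metrics = results[0]
    return round(overall_score(metrics[0]["data"]["credibility"],
                               sources[0]["rerank_score"]), 2)

print(asyncio.run(main()))  # → 0.86
```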