Context embedding based suggestions by VaibhavA123 · Pull Request #289 · komalharshita/DevPath

VaibhavA123 · 2026-05-18T11:38:06Z

The offline preprocessing script vectorized and serialized all dataset profiles.
Dot-product similarity calculations, verified locally, return matching indices in under 15 ms.
A graceful fallback keeps search working if the external API hits its rate limit.
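
For reviewers, here is a minimal sketch of what such a fallback could look like. It is not the code in this PR: `EmbeddingUnavailable`, `keyword_search`, and `search_with_fallback` are hypothetical names used only for illustration, and the semantic search function is passed in as a parameter so that no external API is assumed.

```python
from typing import Callable, List, Sequence


class EmbeddingUnavailable(Exception):
    """Hypothetical error: the embedding API failed or was rate limited."""


def keyword_search(query: str, projects: Sequence[str]) -> List[int]:
    """Rank project indices by how many query words each project text shares."""
    terms = set(query.lower().split())
    scored = []
    for idx, text in enumerate(projects):
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, idx))
    # Highest overlap first; ties keep the original project order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


def search_with_fallback(
    query: str,
    projects: Sequence[str],
    semantic_search: Callable[[str], List[int]],
) -> List[int]:
    """Try semantic search first; fall back to keyword matching on failure."""
    try:
        return semantic_search(query)
    except EmbeddingUnavailable:
        return keyword_search(query, projects)
```

Catching one narrow exception type, rather than every exception, keeps unrelated bugs visible instead of silently routing them to the keyword path.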


## Summary [required]

This PR upgrades the discovery engine from literal keyword matching to full-context semantic embeddings, addressing the "vocabulary mismatch" limitation. Both the user input and the full project details are converted to vector representations, so the app matches on conceptual meaning rather than exact word overlap. This adds semantic intent recognition and avoids "zero-match" results, since every query is ranked by similarity instead of being filtered by keyword. The change also keeps to DevPath’s lightweight, "no-database" design philosophy: the pre-computed project vectors are stored in a static serialized file inside the repository.

## Related Issue [required]

Closes #249

## Type of Change [required]

- [ ] Bug fix — resolves a broken behaviour
- [x] Feature — adds new functionality
- [ ] Data — adds new projects to `data/projects.json`
- [ ] Documentation — updates docs, README, or code comments only
- [ ] Style — CSS or visual changes only, no logic change
- [ ] Refactor — restructures code without changing behaviour
- [ ] Test — adds or updates tests

## What Was Changed [required]

| File | Change made |
|------|-------------|
| `utils/embedding_helpers.py` | New utility script containing cosine similarity math logic and external API routing to generate vector strings. |
| `scripts/generate_embeddings.py` | Offline utility script used to concatenate full project fields, compile their text vectors, and serialize them. |
| `data/project_embeddings.pkl` | Added a static binary serialized storage file containing pre-computed vector matrices for all curated repository paths. |
| `routes/main_routes.py` | Updated the recommendation route to intercept user form input, generate its vector on the fly, and run a matrix dot-product sort against the static file. |
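
The ranking step described in the table can be sketched as follows. This is an illustrative, standard-library-only version and not the PR's implementation: the function names `cosine_similarity` and `rank_projects` are assumptions, and the vectors are plain Python lists rather than whatever structure `data/project_embeddings.pkl` actually holds.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_projects(
    query_vec: Sequence[float],
    project_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[int]:
    """Return the indices of the top_k projects most similar to the query."""
    scores = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(project_vecs)
    ]
    # Highest similarity first; ties keep the original project order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:top_k]]
```

If the stored project vectors are normalized to unit length ahead of time, the cosine reduces to a plain dot product, which matches the "dot-product sort" wording above.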

## How to Test This PR [required]

1. Check out this branch: `git checkout context_embedding_based_suggestions`
2. Ensure your environment variables include your external API verification keys:
   ```bash
   export GEMINI_API_KEY="your_api_key_here"
   ```

vercel · 2026-05-18T11:38:11Z

@VaibhavA123 is attempting to deploy a commit to the komalsony234-1530's projects Team on Vercel.

A member of the Team first needs to authorize it.

komalharshita

Thanks for the contribution. This PR introduces a very ambitious semantic recommendation system using Gemini embeddings, Redis-based interaction history, Kafka event logging, and cosine similarity matching. There is clearly a significant amount of engineering effort and thoughtful system design behind the implementation.

However, this PR is currently not merge-ready because it introduces a major architectural expansion that goes far beyond a standard feature addition for the current DevPath stack.

Main concerns:

- The PR introduces substantial new infrastructure dependencies (Redis, Kafka, Gemini embedding APIs) without corresponding setup, configuration, or documentation for maintainers or contributors.
- No tests were added for the new APIs, semantic matching pipeline, Redis/Kafka integrations, or failure scenarios.
- The feature currently lacks graceful fallback behavior when external systems or services fail.
- Using pickle-based embedding persistence introduces maintainability and safety concerns.
- The recommendation pipeline currently loads embeddings and computes similarity synchronously per request, which may not scale well.
- This level of infrastructure change likely requires architectural discussion and design approval before merging.
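
To make the pickle and per-request loading concerns concrete, here is one possible direction, offered as a sketch rather than a required design. It stores the vectors as JSON, which cannot execute code when loaded, and caches the parsed file so it is read once per process. The function names `save_embeddings` and `load_embeddings` and the name-to-vector layout are assumptions for illustration only.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Dict, List


def save_embeddings(path: str, embeddings: Dict[str, List[float]]) -> None:
    """Write project-name -> vector pairs as plain JSON (assumed layout)."""
    Path(path).write_text(json.dumps(embeddings), encoding="utf-8")


@lru_cache(maxsize=1)
def load_embeddings(path: str) -> Dict[str, List[float]]:
    """Read the JSON file once; later calls with the same path hit the cache."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    # Coerce every value to float so malformed entries fail loudly here.
    return {name: [float(x) for x in vec] for name, vec in raw.items()}
```

JSON files are larger than a binary format, so this trade-off mainly suits a small, curated dataset. Because `lru_cache` returns the same dictionary object on every call, callers should treat it as read-only.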

Positive note:
The semantic matching concept itself is strong, and the separation of preprocessing/helpers shows good engineering intent. However, this would be better approached as a phased architecture initiative with:

- proper dependency setup
- environment documentation
- testing strategy
- fallback handling
- deployment planning
- scalability considerations

At the moment, the implementation is too large and infrastructure-heavy to merge safely in its current form.

VaibhavA123 added 2 commits May 18, 2026 16:51

- implement_redis_kafka_rate_limit (c23eccf)
- context_based_searching (c3eab3c)

github-actions Bot added type:performance type:security gssoc-2026 labels May 18, 2026

komalharshita added the need review label May 19, 2026

komalharshita requested changes May 19, 2026

VaibhavA123 wants to merge 2 commits into komalharshita:main from VaibhavA123:context_embedding_based_suggestions
