Conversation
Added a section outlining the task for the search system.
Added a section detailing the current situation with BioPortal as the ontology database, including its benefits and trade-offs.
Expanded the search design document with detailed requirements, proposed approach, implementation strategy, and design principles for the retrieval and reranking system.
Clarified search requirements and added references for models used in ranking and retrieval strategy.
Summary of Changes
Hello @tekrajchhetri, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a foundational design document for a new search system tailored for StructSense and Dandi. The initiative aims to mitigate current dependencies on BioPortal by establishing an independent, high-performance search capability. The proposed architecture details a hybrid retrieval and reranking strategy, integrating advanced techniques like vector search and various encoder models, all encapsulated within a modular, API-first service designed for scalability and extensibility.
Code Review
This pull request introduces a comprehensive design document for a new search system. The document is well-structured, outlining the motivation, requirements, and a solid two-stage (retrieval and re-ranking) architectural approach. My review includes a few suggestions to fix minor typos, an invalid tag in the Mermaid diagram, and placeholder content in the references to enhance the document's accuracy and clarity.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Before going into further detail, let's first understand the steps involved. This task typically consists of two main stages: **retrieval** and **reranking**.
In the first stage, **retrieval**, the objective is to identify a subset of potentially relevant candidates from $\mathcal{D}$. This is achieved by maximizing a scoring function $f(q, d)$, which estimates the relevance between the query and each candidate document:
Can you please make this more specific to this case?
Is d a specific concept? Is D the set of all ontologies that are provided?
Note that at this stage we want to prioritize high recall and computational efficiency.
In the second stage, **reranking**, the retrieved candidates $\mathcal{D}_K$ are re-evaluated using a more expressive (and often computationally expensive) relevance model $g(q, d)$. The goal is to refine the initial ordering by more precisely estimating relevance:
What is D_k? How do you get one? How is it related to d^*?
@djarecka D is the candidate set as described above, and D_K is the subset.
Yes, I understand that D_K is a subset, but how do you get this subset?
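One way to make the D_K question concrete is a minimal sketch of the two stages, assuming each d is the text label of a candidate concept. The scorers `f_overlap` and `g_jaccard`, the sample labels, and the query are hypothetical stand-ins chosen for illustration, not the models or data proposed in the design document. Under that assumption, D_K is the K highest-scoring candidates under the cheap score f, and d^* is the top candidate in D_K under the costlier score g.

```python
from __future__ import annotations

import heapq
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a deliberately simple tokenizer."""
    return set(text.lower().split())


def f_overlap(q: str, d: str) -> float:
    """Hypothetical cheap first-stage score f(q, d): number of shared tokens."""
    return float(len(tokens(q) & tokens(d)))


def g_jaccard(q: str, d: str) -> float:
    """Hypothetical stand-in for the second-stage score g(q, d): Jaccard similarity."""
    tq, td = tokens(q), tokens(d)
    union = tq | td
    return len(tq & td) / len(union) if union else 0.0


def retrieve(q: str, docs: Sequence[str], k: int, f: Scorer = f_overlap) -> list[str]:
    """Stage 1: D_K is the k candidates in D with the highest f(q, d)."""
    return heapq.nlargest(k, docs, key=lambda d: f(q, d))


def rerank(q: str, candidates: Sequence[str], g: Scorer = g_jaccard) -> list[str]:
    """Stage 2: reorder D_K by g(q, d), best first."""
    return sorted(candidates, key=lambda d: g(q, d), reverse=True)


# Made-up candidate labels standing in for D.
D = [
    "neuron cell body",
    "pyramidal neuron of the cerebral cortex",
    "cerebral cortex",
    "glial cell",
]
q = "cerebral cortex neuron"

D_K = retrieve(q, D, k=3)   # subset of D selected by f
ranked = rerank(q, D_K)     # same subset, reordered by g
d_star = ranked[0]          # best candidate after reranking
```

In this toy run the first stage drops the candidate with no token overlap, and the second stage changes which candidate comes first, which is the role reranking plays in the two-stage setup.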
2. **Keyword-based retrieval**: The system must also support fast and efficient keyword-based search.
3. **Generalizability**: The implementation should be easily adaptable to other use cases with minimal or no additional effort.
## Proposed approach
Could you please make the connection between the concepts described in the previous section (e.g., Keyword-based retrieval, Contextualized retrieval) and the terms used in the diagram?
@djarecka What do you mean by "make the connection"? Do you want me to include mentions like BM25 for keyword-based retrieval?
It's really not meant to be tricky. The diagram simply doesn't have the terms that you spent time introducing in the previous section, e.g., Contextualized retrieval, Keyword-based retrieval, etc. I just thought that it might be useful to create the connection between the diagram and the introduction you wrote.
Addressed: #76
This design document provides a high-level architectural overview and design considerations (algorithms + system implementation) for search for StructSense + Dandi.