Fix json op issue#85

Open
puja-trivedi wants to merge 24 commits into improvement from fix_json_op_issue

Conversation

@puja-trivedi
Contributor

@puja-trivedi puja-trivedi commented Mar 2, 2026

Resolving issue #84

This should be merged after #65 is merged

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a robust framework for benchmarking Named Entity Recognition (NER) capabilities of different large language models within the biomedical domain. It integrates a new dataset, a data processing utility, and a suite of evaluation results and analysis tools. The changes aim to provide a clear and quantifiable assessment of model performance, highlighting the impact of human-in-the-loop processes on NER accuracy and ontology alignment.

Highlights

  • New Benchmarking Dataset: Introduced a new benchmarking dataset for evaluating StructSense, including named entity recognition (NER) annotations from NCBI Disease and JNLPBA biomedical corpora.
  • BIO-tagged Data Conversion Utility: Added a Python utility script (bio_txt.py) to convert BIO-tagged TSV/CSV data into continuous text and JSONL entity mappings, handling various parsing complexities.
  • Comprehensive NER Evaluation Results: Incorporated extensive Named Entity Recognition (NER) evaluation results for various Large Language Models (LLMs) such as Claude 3.7 Sonnet, GPT-4o-mini, and DeepSeek V3 0324 on biomedical literature.
  • Performance Metrics and Human-in-the-Loop Analysis: Provided detailed performance metrics, including token usage, entity detection rates, ontology mapping completeness, and judge scores, comparing model performance both with and without human-in-the-loop (HIL) intervention.
  • New Analysis Scripts and Configurations: Included new Python analysis scripts and YAML configuration files to support comprehensive NER evaluation, visualization of results, and detailed statistical analysis of model performance.
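The BIO-to-text conversion described in the highlights above can be sketched roughly as follows. This is a minimal illustration only; bio_txt.py's actual input parsing, edge-case handling, and JSONL schema may differ.

```python
# Minimal sketch of a BIO-to-text conversion; bio_txt.py's actual input
# parsing, edge-case handling, and JSONL schema may differ.
def bio_to_text_and_entities(rows):
    """rows: iterable of (token, tag) pairs in BIO format.

    Returns the reconstructed text and a list of entity dicts suitable
    for serialization to JSONL."""
    tokens, entities = [], []
    current = None  # (label, start_token_index) of the open entity, if any

    def close(end):
        label, start = current
        entities.append({"label": label, "text": " ".join(tokens[start:end])})

    for i, (token, tag) in enumerate(rows):
        tokens.append(token)
        if tag.startswith("B-"):
            if current:
                close(i)  # a new B- tag closes any open entity
            current = (tag[2:], i)
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            pass  # continuation of the open entity
        else:
            if current:
                close(i)  # an "O" tag (or mismatched I-) closes the entity
            current = None
    if current:
        close(len(tokens))
    return " ".join(tokens), entities
```

Each entity dict can then be written out with one `json.dumps` call per line to produce the JSONL mapping.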


Changelog
  • evaluation/benchmark/readme.md
    • Documented the new benchmarking dataset for StructSense and its structure.
  • evaluation/benchmark/script/bio_txt.py
    • Implemented a Python script for converting BIO-tagged data to continuous text and JSONL formats.
  • evaluation/combined_all_token_cost_data/old/combined_all_csv.csv
    • Added a comprehensive CSV file detailing token usage, cost, and speed for various LLMs across multiple NER tasks.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/Integrating-brainstem-token-usage-with-hl.csv
    • Added token usage data for NER evaluation on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_claudesonet_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_deepseek_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_gpt_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/Integrating-brainstem-token-usage-without-hl.csv
    • Added token usage data for NER evaluation on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_claudesonet_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_deepseek_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_gpt_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_analysis_report.txt
    • Added a text report summarizing comprehensive NER analysis results for the 'Latent-circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_summary_table.csv
    • Added a CSV table summarizing NER evaluation metrics for the 'Latent-circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entities_missing_ontology.csv
    • Added a CSV file listing entities that lacked ontology information.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_with_hil.csv
    • Added a CSV file summarizing the entity pool for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_without_hil.csv
    • Added a CSV file summarizing the entity pool for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_detailed_statistics.csv
    • Added a CSV file with detailed judge score statistics for NER evaluations.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_with_hil.csv
    • Added a CSV file with judge score statistics for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_without_hil.csv
    • Added a CSV file with judge score statistics for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/label_distribution_statistics.csv
    • Added a CSV file detailing the distribution of entity labels.
  • evaluation/ner/old/evaluation/Latent-circuit/results/location_statistics.csv
    • Added a CSV file with statistics on entity detection by paper location.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_with_hil.csv
    • Added a CSV file listing entities missed by models for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_without_hil.csv
    • Added a CSV file listing entities missed by models for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/model_rankings.csv
    • Added a CSV file containing model rankings based on NER evaluation.
  • evaluation/ner/old/evaluation/Latent-circuit/results/ontology_coverage_summary.csv
    • Added a CSV file summarizing ontology coverage in NER evaluations.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_with_hil.csv
    • Added a CSV file listing entities shared by all models for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_without_hil.csv
    • Added a CSV file listing entities shared by all models for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_claudesonet_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_deepseek_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_gpt_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_token_usage_with_hil_latent.csv
    • Added token usage data for NER evaluation on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_claudesonet_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_deepseek_s41593-025-01869-7_without_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_gpt_s41593-025-01869-7_without_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_token_usage_without_hil_latent.csv
    • Added token usage data for NER evaluation on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/ner_config_claudesonet.yaml
    • Added YAML configuration for the Claude 3.7 Sonnet NER agent.
  • evaluation/ner/old/ner_config_deepseek.yaml
    • Added YAML configuration for the DeepSeek V3 0324 NER agent.
  • evaluation/ner/old/ner_config_gpt.yaml
    • Added YAML configuration for the GPT-4o-mini NER agent.
  • evaluation/notebook/README.md
    • Documented the token cost and speed analysis script.
  • evaluation/notebook/ner_comprehensive_summary.py
    • Added a Python script for generating comprehensive NER evaluation summaries and visualizations.
  • evaluation/notebook/ner_data_loader.py
    • Added a Python script for loading and preprocessing NER evaluation data.
  • evaluation/notebook/ner_entity_pool_analysis.py
    • Added a Python script for analyzing entity detection performance and false negatives.
  • evaluation/notebook/ner_judge_score_analysis.py
    • Added a Python script for analyzing judge scores and quality assessments in NER.
  • evaluation/notebook/ner_label_distribution.py
    • Added a Python script for analyzing the distribution of entity labels.
  • evaluation/notebook/ner_location_analysis.py
    • Added a Python script for analyzing entity detection patterns across paper locations.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_cost_violin.svg
    • Added an SVG plot visualizing cost distribution for the 'Integrating brainstem' task with human-in-the-loop.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_violin.svg
    • Added an SVG plot visualizing speed distribution for the 'Integrating brainstem' task with human-in-the-loop.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_vs_cost.svg
    • Added an SVG plot visualizing speed versus cost for the 'Integrating brainstem' task with human-in-the-loop.
Activity
  • The pull request was created by puja-trivedi.
  • New benchmarking data and scripts for NER evaluation were added.
  • Comprehensive evaluation results for various LLMs were included.
  • Analysis and visualization scripts for NER performance metrics were introduced.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new benchmarking dataset for StructSense, including a Python script (bio_txt.py) to convert BIO-tagged data into continuous text and JSONL formats. It also adds numerous evaluation results for Named Entity Recognition (NER) tasks, specifically for the 'Integrating brainstem' and 'Latent-circuit' papers, comparing various LLMs (Claude 3.7 Sonnet, DeepSeek V3 0324, GPT-4o-mini) both with and without Human-in-the-Loop (HIL). These results are presented in new CSV and JSON files, detailing token usage, costs, speed, entity detection, ontology mapping, and judge scores.

Additionally, new Python scripts (ner_comprehensive_summary.py, ner_data_loader.py, ner_entity_pool_analysis.py, ner_judge_score_analysis.py, ner_label_distribution.py, ner_location_analysis.py, ner_ontology_analysis.py) are added to analyze and visualize these NER evaluation metrics, along with a README.md for token cost analysis.

Review comments highlight several issues: redundant nested judge_ner_terms keys in JSON output files, an inconsistency in schema definitions where paper_title is specified as a string but examples show it as an array, hardcoded file paths in analysis scripts, and extraneous \ No newline at end of file artifacts in CSV files.
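As an illustration of the nested-key issue mentioned above, a redundant wrapper could be collapsed with a pass like the following. This is purely illustrative; the actual JSON layout and any fix applied in this PR may differ.

```python
# Illustrative only: collapses a redundantly nested judge_ner_terms key,
# e.g. {"judge_ner_terms": {"judge_ner_terms": [...]}}
#   -> {"judge_ner_terms": [...]}.
def flatten_judge_terms(record: dict) -> dict:
    if "judge_ner_terms" not in record:
        return record
    inner = record["judge_ner_terms"]
    # Unwrap repeatedly in case the key is nested more than once.
    while isinstance(inner, dict) and "judge_ner_terms" in inner:
        inner = inner["judge_ner_terms"]
    return {**record, "judge_ner_terms": inner}
```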

Note: Security Review is unavailable for this PR.

Comment on lines +522 to +523
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"
medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.
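A minimal sketch of the suggested argparse approach; the argument name and default path here are illustrative, not taken from the actual scripts.

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI for the analysis scripts; --output-dir replaces the
    # hardcoded path so other evaluation sets can reuse the script.
    parser = argparse.ArgumentParser(description="NER evaluation analysis")
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("evaluation/ner/evaluation/Latent-circuit/results"),
        help="Directory where result files are written",
    )
    return parser
```

The scripts would then call `build_parser().parse_args()` and write everything under `args.output_dir` instead of the hardcoded location.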

Comment on lines +510 to +511
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

Comment on lines +317 to +318
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,1236,495,0.000482,47.0 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,300,262,0.000202,69.8 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59016,634,0.00923,67.6 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59239,262,0.00904,80.6 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO No newline at end of file

medium

The file ends with the text \ No newline at end of file. This appears to be an artifact from a version control or diff tool and should be removed. This extra text can cause parsing errors in tools that expect a clean CSV format.

"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59239,262,0.00904,80.6,tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO

@tekrajchhetri
Collaborator

@puja-trivedi could you confirm that you've tested and everything is working as expected?

@tekrajchhetri tekrajchhetri changed the base branch from improvement to evaluation_data March 11, 2026 16:46
@tekrajchhetri tekrajchhetri changed the base branch from evaluation_data to improvement March 11, 2026 16:47
@tekrajchhetri tekrajchhetri marked this pull request as draft March 12, 2026 20:42
@djarecka
Contributor

@tekrajchhetri - what is the stage of this PR?

@tekrajchhetri
Collaborator

@tekrajchhetri - what is the stage of this PR?
@djarecka It's WIP.

@djarecka
Contributor

And do we know what the issue is? At the beginning I understood that this was an issue with passing information from one agent to another agent/the output due to some key mismatches, etc. Is this the case here? Or is the agent possibly just not finding anything interesting?

@djarecka
Contributor

@tekrajchhetri - even after merging #65, this PR has 277 files changed. Should anything else be merged first?

Could you please create a PR that has only relevant changes?

@tekrajchhetri
Collaborator

@tekrajchhetri - even after merging #65, this PR has 277 files changed. Should anything else be merged first?

Could you please create a PR that has only relevant changes?

@djarecka the 277 file changes are coming from the evaluation directory, which can be ignored as it only contains the old evaluation files plus some config files -- something I had communicated in the meeting. I will see what I can do to remove those file changes it's showing.

Regarding the changes, some of the things are absolutely necessary, like the local concept mapping tool, and hence they are included here.

@tekrajchhetri tekrajchhetri marked this pull request as ready for review March 18, 2026 16:44
@tekrajchhetri
Collaborator

And do we know what the issue is? At the beginning I understood that this was an issue with passing information from one agent to another agent/the output due to some key mismatches, etc. Is this the case here? Or is the agent possibly just not finding anything interesting?

Yes, we do, and they are:

  • First, we realized from @puja-trivedi's output that we had different tasks detected, which should not have happened, as different tasks have different post-processing strategies. We've fixed it -- detect the task once, preserve it, and use it throughout the pipeline (see _get_detected_task_type in app.py, where we cache the result).
  • Key mismatch. LLMs produce different keys -- despite explicit instructions, we did get variations of what we asked for, but we have now updated the code to look for entities agnostic of the keys they hold at any stage (see the call to _detect_container_key in app.py), and also check promote_stage_output_to_canonical in postprocessing, where we promote the different keys for consistency.

@tekrajchhetri
Collaborator

Update: After the PR #67 and #69 merges, we no longer have 277 file changes as before.

@tekrajchhetri
Collaborator

@puja-trivedi @djarecka can you test another PDF from your side so that we can merge this?

The parameter was renamed to input_source in the signature but the
method body still referenced source, causing a TypeError when
called from the CLI.
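One way the mismatch described in the commit message above surfaces as a TypeError, in miniature. The class and values are simplified; StructSenseFlow's real signature differs.

```python
class Flow:
    # After the rename, the signature accepts input_source only, so any
    # caller still passing the old keyword fails:
    #   Flow(source="paper.pdf")  ->  TypeError: unexpected keyword argument
    def __init__(self, input_source):
        self.source = input_source
```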
@puja-trivedi
Contributor Author

@tekrajchhetri - There is an inconsistency with the parameter naming: source vs. input_source. I created a PR for this change: #97

Fix StructSenseFlow.__init__ parameter name: input_source -> source
@tekrajchhetri
Collaborator

Since @puja-trivedi is able to produce output, we can merge this PR into improvement, and improvement into main. The question of the smaller number of entities depends on the model choice and other parameters -- issues independent of the output being saved.

Let me know if @djarecka or @puja-trivedi have any concerns.

@djarecka
Contributor

@tekrajchhetri - the test stopped working due to an issue with poetry, can you please update?

@tekrajchhetri
Collaborator

@tekrajchhetri - the test stopped working due to an issue with poetry, can you please update?

@djarecka can you share the logs? It's unclear what's causing the issue. Also, this means that we will always have an issue if the test fails just because we updated the dependencies.

@tekrajchhetri
Collaborator

Before merging this PR, we should merge the following PRs in order.
