Handle weird whitespace unstructured from Crossref #1407

Open
lfoppiano wants to merge 3 commits into master from bugfix/issue-849

@lfoppiano
Member

This should fix #849

Signed-off-by: Luca Foppiano <luca@foppiano.org>
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses issue #849 by making citation parsing more robust to Crossref “unstructured” references containing unusual whitespace (e.g., NBSP-only strings), aiming to avoid HTTP 500 errors in processCitationList.

Changes:

  • Normalize input text earlier in CitationParser.processingStringMultiple so NBSP and similar characters are treated as whitespace.
  • Add null-guard handling around batch citation parsing results in patent ReferenceExtractor.
  • Make Engine.processRawReferences resilient to a null/empty result list from batch citation parsing.
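The first change in the list above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the class and method names are hypothetical, and mapping every Unicode space character (including line breaks) to a plain space is an assumption made for illustration. It relies on the fact that `Character.isWhitespace` returns false for non-breaking spaces such as U+00A0, while `Character.isSpaceChar` returns true for them, which is why an NBSP-only string can slip past an ordinary blank check.

```java
// Hypothetical sketch, not GROBID code: normalize Unicode whitespace
// before deciding whether a raw reference string is blank.
class WhitespaceSketch {

    // Replaces every whitespace or space-separator code point with ' ',
    // then trims. Assumption: line breaks may be flattened to spaces too.
    static String normalizeWhitespace(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // isWhitespace misses U+00A0, U+2007 and U+202F; isSpaceChar catches them.
            if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
                sb.append(' ');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString().trim();
    }

    // True when nothing but whitespace remains after normalization.
    static boolean isBlankReference(String text) {
        return normalizeWhitespace(text).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isBlankReference("\u00A0\u00A0"));
        System.out.println(normalizeWhitespace("\u00A0Smith\u20072020\u00A0"));
    }
}
```

Running the blank check on the normalized string rather than on the raw input is what would let an NBSP-only reference be rejected early instead of reaching the parser.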

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| grobid-core/src/main/java/org/grobid/core/engines/patent/ReferenceExtractor.java | Adds null checks around batch citation parser outputs before building BibDataSets. |
| grobid-core/src/main/java/org/grobid/core/engines/Engine.java | Uses CollectionUtils.isEmpty to safely handle null/empty batch parsing outputs. |
| grobid-core/src/main/java/org/grobid/core/engines/CitationParser.java | Normalizes Unicode whitespace before blank-checks; adds guard in processingReferenceSection for null batch results. |
Comments suppressed due to low confidence (1)

grobid-core/src/main/java/org/grobid/core/engines/Engine.java:175

  • processRawReferences may receive a results list containing null elements (citation parser inserts null for blank/whitespace-only references when other references in the batch are non-empty). If consolidate == 0, this method returns the list as-is and downstream code (e.g., REST processCitationList) will NPE when serializing; if consolidate != 0, the consolidation loop will NPE on bib.getReference(). Consider normalizing this method’s output to never contain null entries (e.g., replace with empty BiblioItem and set the raw reference) and ensure the consolidation stage skips/handles blank items while preserving original ordering.
        List<BiblioItem> results = parsers.getCitationParser().processingStringMultiple(references, 0);
        if (CollectionUtils.isEmpty(results))
            return finalResults;

        // consolidation in a second stage to take advantage of parallel calls
        if (consolidate == 0) {
            return results;
        } else { 
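
The suggestion in this comment (never return null entries, keep the original ordering) could look roughly like the sketch below. It is a standalone illustration under stated assumptions: `Item` is a hypothetical stand-in for BiblioItem that models only the raw reference string, and `fillNulls` is an invented helper name, not a method from the code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: replace null parser results with empty placeholders
// so the output list has the same size and order as the input.
class NullPlaceholderSketch {

    // Stand-in for BiblioItem (assumption): raw reference plus a parsed flag.
    static class Item {
        final String reference;
        final boolean parsed;

        Item(String reference, boolean parsed) {
            this.reference = reference;
            this.parsed = parsed;
        }
    }

    // Returns one non-null Item per raw reference, in input order.
    static List<Item> fillNulls(List<String> rawReferences, List<Item> parsed) {
        List<Item> out = new ArrayList<>(rawReferences.size());
        for (int i = 0; i < rawReferences.size(); i++) {
            // Tolerate a null or short result list from the parser.
            Item item = (parsed != null && i < parsed.size()) ? parsed.get(i) : null;
            if (item == null) {
                // Placeholder keeps the raw string but is marked as unparsed.
                item = new Item(rawReferences.get(i), false);
            }
            out.add(item);
        }
        return out;
    }
}
```

With placeholders in place, a serializer can iterate the list without null checks, and position i in the output still corresponds to input reference i.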

Signed-off-by: Luca Foppiano <luca@foppiano.org>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

grobid-core/src/main/java/org/grobid/core/engines/Engine.java:184

  • Skipping null entries during consolidation changes the cardinality and index alignment of the returned list (finalResults is built only from bibDataSetResults). The REST API docs for citation lists state results are returned “in the same order”; consider preserving placeholders/indices (e.g., keep an index map when consolidating only non-null items, then rebuild a result list matching the input size/order).
            // prepare for set consolidation
            List<BibDataSet> bibDataSetResults = new ArrayList<BibDataSet>();
            for (BiblioItem bib : results) {
                if (bib == null)
                    continue;
                BibDataSet bds = new BibDataSet();
                bds.setResBib(bib);
                bds.setRawBib(bib.getReference());
                bibDataSetResults.add(bds);
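
The index-map idea mentioned in this comment can be sketched as follows. This is a generic illustration, not the PR's implementation: `consolidateKeepingOrder` is a hypothetical name, and the consolidation step is passed in as a function that is assumed to return exactly one output per input, in the same order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: consolidate only the non-null items, then rebuild
// a list whose size and order match the original input.
class OrderPreservingSketch {

    static <T> List<T> consolidateKeepingOrder(
            List<T> results, Function<List<T>, List<T>> consolidateBatch) {
        // Remember the original position of every non-null item.
        List<Integer> positions = new ArrayList<>();
        List<T> nonNull = new ArrayList<>();
        for (int i = 0; i < results.size(); i++) {
            T item = results.get(i);
            if (item != null) {
                positions.add(i);
                nonNull.add(item);
            }
        }

        // Assumption: the batch step returns one output per input, in order.
        List<T> consolidated = consolidateBatch.apply(nonNull);
        if (consolidated.size() != nonNull.size()) {
            throw new IllegalStateException("consolidation changed the item count");
        }

        // Start from all-null placeholders and write results back by position.
        List<T> out = new ArrayList<>(Collections.<T>nCopies(results.size(), null));
        for (int k = 0; k < positions.size(); k++) {
            out.set(positions.get(k), consolidated.get(k));
        }
        return out;
    }
}
```

Skipped entries stay null here; combining this with placeholder objects instead of nulls would be one way to satisfy both the ordering concern and the null-safety concern raised earlier.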

Signed-off-by: Luca Foppiano <luca@foppiano.org>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


Development

Successfully merging this pull request may close these issues.

HTTP 500 for processCitationList on non-breaking whitespace string

2 participants