Handle weird whitespace unstructured from Crossref #1407
Signed-off-by: Luca Foppiano <luca@foppiano.org>
Pull request overview
This PR addresses issue #849 by making citation parsing more robust to Crossref “unstructured” references containing unusual whitespace (e.g., NBSP-only strings), aiming to avoid HTTP 500 errors in processCitationList.
Changes:
- Normalize input text earlier in `CitationParser.processingStringMultiple` so NBSP and similar characters are treated as whitespace.
- Add null-guard handling around batch citation parsing results in patent `ReferenceExtractor`.
- Make `Engine.processRawReferences` resilient to a `null`/empty result list from batch citation parsing.
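For illustration, the whitespace normalization described in the first change could look roughly like the sketch below. This is not the code from the PR: `WhitespaceSketch`, `normalizeWhitespace`, and `isBlank` are hypothetical names, and the only library calls used are the standard `Character.isWhitespace` and `Character.isSpaceChar`. The point is that `Character.isWhitespace` alone does not cover NBSP (U+00A0), so a blank-check based only on it would miss NBSP-only strings.

```java
// Hypothetical sketch, not GROBID's actual implementation.
final class WhitespaceSketch {
    private WhitespaceSketch() {}

    // Maps every Unicode whitespace or space-separator code point
    // (including NBSP, U+00A0) to a plain ASCII space.
    static String normalizeWhitespace(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // isWhitespace excludes non-breaking spaces; isSpaceChar includes them.
            if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
                sb.append(' ');
            } else {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    // A blank-check that treats NBSP-only input as blank.
    static boolean isBlank(String text) {
        return normalizeWhitespace(text).trim().isEmpty();
    }
}
```

With such a helper, a Crossref "unstructured" value made only of NBSP characters would be detected as blank before reaching the parser.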
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| grobid-core/src/main/java/org/grobid/core/engines/patent/ReferenceExtractor.java | Adds null checks around batch citation parser outputs before building BibDataSets. |
| grobid-core/src/main/java/org/grobid/core/engines/Engine.java | Uses CollectionUtils.isEmpty to safely handle null/empty batch parsing outputs. |
| grobid-core/src/main/java/org/grobid/core/engines/CitationParser.java | Normalizes Unicode whitespace before blank-checks; adds guard in processingReferenceSection for null batch results. |
Comments suppressed due to low confidence (1)
grobid-core/src/main/java/org/grobid/core/engines/Engine.java:175
`processRawReferences` may receive a `results` list containing `null` elements (citation parser inserts `null` for blank/whitespace-only references when other references in the batch are non-empty). If `consolidate == 0`, this method returns the list as-is and downstream code (e.g., REST `processCitationList`) will NPE when serializing; if `consolidate != 0`, the consolidation loop will NPE on `bib.getReference()`. Consider normalizing this method’s output to never contain `null` entries (e.g., replace with empty `BiblioItem` and set the raw reference) and ensure the consolidation stage skips/handles blank items while preserving original ordering.
List<BiblioItem> results = parsers.getCitationParser().processingStringMultiple(references, 0);
if (CollectionUtils.isEmpty(results))
    return finalResults;
// consolidation in a second stage to take advantage of parallel calls
if (consolidate == 0) {
    return results;
} else {
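One way to read the suggestion above (never returning `null` entries) is sketched below. It is only an assumption about how such a normalization might be written: `Item` is a hypothetical stand-in for `BiblioItem`, and `fillNulls` is not a GROBID method. Each `null` slot is replaced by a placeholder that carries the raw reference, so the output keeps the input size and order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Item stands in for BiblioItem.
final class NullPlaceholderSketch {
    private NullPlaceholderSketch() {}

    static final class Item {
        final String reference; // raw reference string
        final boolean parsed;   // false for placeholders

        Item(String reference, boolean parsed) {
            this.reference = reference;
            this.parsed = parsed;
        }
    }

    // Returns a list aligned with rawReferences that contains no null entries.
    static List<Item> fillNulls(List<String> rawReferences, List<Item> parsedItems) {
        List<Item> out = new ArrayList<>(rawReferences.size());
        for (int i = 0; i < rawReferences.size(); i++) {
            Item item = (parsedItems != null && i < parsedItems.size())
                    ? parsedItems.get(i)
                    : null;
            if (item == null) {
                // Placeholder keeps the raw text so callers can still serialize it.
                item = new Item(rawReferences.get(i), false);
            }
            out.add(item);
        }
        return out;
    }
}
```

Downstream code can then serialize every entry without a null check and use the `parsed` flag to skip placeholders during consolidation.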
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
grobid-core/src/main/java/org/grobid/core/engines/Engine.java:184
Skipping `null` entries during consolidation changes the cardinality and index alignment of the returned list (`finalResults` is built only from `bibDataSetResults`). The REST API docs for citation lists state results are returned “in the same order”; consider preserving placeholders/indices (e.g., keep an index map when consolidating only non-null items, then rebuild a result list matching the input size/order).
// prepare for set consolidation
List<BibDataSet> bibDataSetResults = new ArrayList<BibDataSet>();
for (BiblioItem bib : results) {
    if (bib == null)
        continue;
    BibDataSet bds = new BibDataSet();
    bds.setResBib(bib);
    bds.setRawBib(bib.getReference());
    bibDataSetResults.add(bds);
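The index-map idea from this comment could be sketched as follows. This is a generic illustration rather than the fix adopted in the PR: `consolidatePreservingOrder` and the `batchConsolidate` callback are hypothetical, and the sketch assumes the callback returns one result per input item, in the same order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of consolidating only non-null items while
// keeping the output aligned with the input list.
final class OrderPreservingConsolidation {
    private OrderPreservingConsolidation() {}

    static <T> List<T> consolidatePreservingOrder(
            List<T> results, Function<List<T>, List<T>> batchConsolidate) {
        List<T> nonNull = new ArrayList<>();
        List<Integer> positions = new ArrayList<>(); // original index of each non-null item
        for (int i = 0; i < results.size(); i++) {
            T item = results.get(i);
            if (item != null) {
                nonNull.add(item);
                positions.add(i);
            }
        }

        // Assumption: same size and order as the nonNull list.
        List<T> consolidated = batchConsolidate.apply(nonNull);

        // Start from a copy so null placeholders stay at their positions.
        List<T> out = new ArrayList<>(results);
        for (int k = 0; k < positions.size(); k++) {
            out.set(positions.get(k), consolidated.get(k));
        }
        return out;
    }
}
```

The returned list always has the same size as the input, so a caller matching results to input references by index is unaffected by blank entries.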
grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessString.java
grobid-core/src/main/java/org/grobid/core/engines/patent/ReferenceExtractor.java
grobid-core/src/main/java/org/grobid/core/engines/patent/ReferenceExtractor.java
grobid-core/src/test/java/org/grobid/core/engines/CitationParserNullHandlingTest.java
grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessString.java
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
This should fix #849