Merged
91 changes: 82 additions & 9 deletions doc/Consolidation.md
@@ -1,14 +1,14 @@
# Consolidation

In GROBID, we call __consolidation__ the use of an external bibliographical service to correct and complement the results extracted by the tool. GROBID usually extracts a core of bibliographical information in a relatively reliable manner, which can be used to match complete bibliographical records made available by these services.

Consolidation has two main benefits:

* The consolidation service very significantly improves the retrieval of header information (+0.12 to +0.13 in F1-score, e.g. from an average F1-score of 74.59 for all fields with Ratcliff/Obershelp similarity at 0.95 to 88.89, using biblio-glutton and GROBID version `0.5.6` on the PMC 1943 dataset; see the more recent [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)).

* The consolidation service matches the extracted bibliographical references against known publications and complements the parsed references with various metadata, in particular the DOI, making it possible to build a citation graph and to link the extracted references to external services.

Consolidation also includes the CrossRef Funder Registry, used to enrich the extracted funder information.

GROBID supports two consolidation services:

@@ -18,7 +18,7 @@ GROBID supports two consolidation services:

## CrossRef REST API

The advantage of __CrossRef__ is that it is available without any further installation. However, it has a limited query rate (in practice around 25 queries per second), which makes scaling impossible when bibliographical references from several documents are processed in parallel. In addition, the metadata it provides is limited to what is available at CrossRef.

For using [reliably and politely the CrossRef REST API](https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service), it is highly recommended to add a contact email to the queries. This is done in GROBID by modifying the config file under `grobid-home/config/grobid.yaml`:

@@ -29,7 +29,7 @@ consolidation:
timeoutSec: 10
```

Without this email, the service might be unreliable, with occasional query failures over time. GROBID's usage of the CrossRef REST API respects the query rate that the service indicates dynamically in each response. Therefore, no issues should be reported by CrossRef via this email.

In case you are a lucky Crossref Metadata Plus subscriber, you can set your authorization token in the config file under `grobid-home/config/grobid.yaml` as follows:

@@ -42,6 +42,63 @@ consolidation:

According to Crossref, the token will ensure that said requests get directed to a pool of machines that are reserved for "Plus" SLA users (note: of course the above token is fake).
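Putting both settings together, the relevant fragment of `grobid-home/config/grobid.yaml` would look like the sketch below (the key names `mailto` and `token` come from the configuration options described above; the values are placeholders):

```yaml
consolidation:
  crossref:
    mailto: "a-contact-email@example.org"
    token: "AAAA-BBBB-CCCC"   # placeholder, not a real token
```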

### Concurrency

GROBID automatically adjusts concurrency based on the CrossRef API tier detected from the configuration:

| Tier | Condition | Initial Concurrent Requests |
|------|-----------|----------------------------|
| Public | No `mailto`, no `token` | 1 |
| Polite | `mailto` set | 3 |
| Plus | `token` set | 50 |

These initial values are further tuned at runtime using the `x-concurrency-limit` header returned by CrossRef API responses.

When a Plus tier token is configured, GROBID validates it at startup by making a lightweight request (`/works?rows=0`) to CrossRef. If the token is not recognized as Plus tier (e.g. expired or invalid), GROBID automatically falls back to Polite concurrency (3) if `mailto` is set, or Public (1) otherwise, and logs a warning. If CrossRef is unreachable at startup, the Plus tier default is kept since the token cannot be proven invalid.
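The tier selection above can be sketched as follows (an illustrative model, not GROBID's actual Java implementation):

```python
def initial_concurrency(mailto=None, token=None):
    """Starting number of concurrent CrossRef requests per detected tier.

    Runtime tuning via the x-concurrency-limit header is not modeled here.
    """
    if token:       # Plus tier: reserved pool
        return 50
    if mailto:      # Polite tier: identified client
        return 3
    return 1        # Public tier: anonymous, minimal concurrency
```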

### Performance with CrossRef Consolidation

When citation consolidation is enabled, the CrossRef API becomes the dominant factor in processing time. Below are benchmarks from processing 10,000 PDF documents with `processFulltextDocument` and `consolidateCitations=1`:

| Metric | Polite Tier | Plus Tier |
|--------|------------|-----------|
| Total runtime | ~162,277 sec (~45 hours) | ~42,755 sec (~12 hours) |
| Throughput | 0.06 docs/sec | 0.23 docs/sec |
| Average time per document | 17.02 sec/doc | 4.31 sec/doc |
| Failed documents | 467/10,000 (4.7%) | 85/10,000 (0.85%) |

The Plus tier is approximately **3.8x faster** and produces **~5.5x fewer errors** compared to the Polite tier. For any batch processing beyond a few hundred documents with citation consolidation, the Plus tier is strongly recommended.

!!! warning "Increase client timeout when using consolidation"
With consolidation enabled, GROBID takes significantly longer to process each document. The default client timeout (e.g. 60 seconds in the Python client) is far too low — individual documents with many references can take well over a minute. **Increase the client timeout to 200–600 seconds** to avoid unnecessary timeout errors. For example, in the Python client's `config.json`, set `"timeout": 300`.
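For instance, assuming the `config.json` layout of the Python client (the `grobid_server` key name is taken from that client's sample configuration and may differ across versions):

```json
{
    "grobid_server": "http://localhost:8070",
    "timeout": 300
}
```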

### Rate Limiting and Backoff

When CrossRef returns HTTP 429 (rate limit exceeded), GROBID applies exponential backoff with jitter ("full jitter" strategy):

- Base delay: 1 second, exponentially increased (`base * 2^attempt`), capped at 60 seconds
- Each retry sleeps for a random duration in `[0, d]`, where `d` is the current capped exponential delay; this spreads retries across time and avoids synchronized retry bursts (the thundering herd problem)
- During backoff, concurrency is reduced to 1 (serialized requests)
- On the next successful response, backoff resets and concurrency is restored

GROBID also reads the `x-api-pool` header from responses to identify which CrossRef pool is being used.
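The backoff schedule described above can be sketched as follows (the constants match the documented values; this is an illustration, not GROBID's actual Java code):

```python
import random

BASE = 1.0   # base delay in seconds
CAP = 60.0   # maximum delay in seconds

def backoff_delay(attempt):
    """Full-jitter delay before retry `attempt` (0-based):
    a random duration in [0, min(CAP, BASE * 2**attempt)]."""
    ceiling = min(CAP, BASE * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```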

### Post-Validation

By default, GROBID post-validates CrossRef results against the source metadata to filter false positives (the CrossRef API is a search API and may return inexact matches). This validation compares the first author surname using fuzzy matching.

For biblio-glutton, post-validation is always skipped because glutton handles validation internally.

For CrossRef, post-validation can be disabled via configuration:

```yaml
consolidation:
crossref:
postValidation: false
```

When post-validation is disabled, all CrossRef results are accepted. This can be useful for testing or when the results are post-processed by another system.
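The surname comparison can be sketched as below (the matcher and the 0.8 threshold are assumptions for illustration, not GROBID's exact implementation):

```python
from difflib import SequenceMatcher

def surnames_match(extracted, candidate, threshold=0.8):
    """Accept a CrossRef candidate only if the first-author surnames
    are similar enough under a fuzzy string comparison."""
    ratio = SequenceMatcher(None, extracted.lower(), candidate.lower()).ratio()
    return ratio >= threshold
```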

### Timeouts

Both consolidation services support configurable timeouts to control how long GROBID waits for external API responses:
@@ -55,14 +112,30 @@ consolidation:
```
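A sketch of the corresponding configuration, assuming the `timeoutSec` key shown in the CrossRef fragment earlier applies to both services:

```yaml
consolidation:
  glutton:
    timeoutSec: 10
  crossref:
    timeoutSec: 10
```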

!!! warning "Be careful when setting low timeout values"
    Setting timeout values too low may cause request failures. For CrossRef, be particularly careful: aggressive querying with short timeouts at high volume may result in being banned from the service.

## TEI Output

When consolidation is performed, the resulting TEI output includes attributes on `<biblStruct>` elements to indicate the consolidation status and which service was used:

- `status="consolidated"` — the bibliographic item was matched and enriched by the consolidation service
- `status="extracted"` — the bibliographic item was extracted by GROBID only (no consolidation match)
- `source="crossref"` or `source="glutton"` — which consolidation service provided the match

These attributes appear on both header `<biblStruct>` (in `<sourceDesc>`) and citation `<biblStruct>` elements.

Example:
```xml
<biblStruct status="consolidated" source="crossref">
...
</biblStruct>
```
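A downstream consumer can read these attributes with any XML library; a minimal sketch (the TEI fragment below is a stand-in for real GROBID output):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <biblStruct status="consolidated" source="crossref"/>
</TEI>"""

root = ET.fromstring(sample)
for bibl in root.iter(TEI_NS + "biblStruct"):
    print(bibl.get("status"), bibl.get("source"))
# prints: consolidated crossref
```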

## biblio-glutton

This service presents several advantages compared to the CrossRef service. biblio-glutton can scale as required by adding more Elasticsearch nodes, allowing the processing of several PDFs per second. The metadata provided by the service is richer: in addition to the CrossRef metadata, biblio-glutton also returns the PubMed and PubMed Central identifiers, ISTEX identifiers, PII, and the URL of the Open Access version of the full text following the Unpaywall dataset. Finally, the bibliographical reference matching is [slightly more reliable](https://github.com/kermitt2/biblio-glutton#matching-accuracy).

Unfortunately, you need to install the service yourself, including loading and indexing the bibliographical resources, as documented [here](https://github.com/kermitt2/biblio-glutton#building-the-bibliographical-data-look-up-and-matching-databases). Note that a [docker container](https://github.com/kermitt2/biblio-glutton#running-with-docker) is available.

After installing biblio-glutton, you need to select the glutton matching service in the `grobid-home/config/grobid.yaml` file, with its URL, for instance:

25 changes: 24 additions & 1 deletion doc/Frequently-asked-questions.md
@@ -66,7 +66,30 @@ Occasionally, [people have reported](https://github.com/kermitt2/grobid/issues/1
To resolve this issue, there are several options:
- Check that you are running the proper image for your hardware. If you are not sure, use the image `grobid/grobid:0.8.2-crf` which is the most lightweight and fastest image.
- Make sure you don't send too many requests at the same time, as this can overload the server. If you are using the Grobid Python client, you can set the `n` parameter to a lower value (e.g. 1 or 2) to limit the number of concurrent requests.
- Increase the timeout value in your client. If you are using the Grobid Python client, you can set the `timeout` parameter to a higher value (e.g. 90 seconds) in the `config.json` to give the server more time to respond.
- Increase the timeout value in your client. If you are using the Grobid Python client, you can set the `timeout` parameter to a higher value (e.g. 90 seconds) in the `config.json` to give the server more time to respond. **If consolidation is enabled** (`consolidateHeader` or `consolidateCitations`), you should increase the timeout much further — to **200–600 seconds** — because each document now waits for multiple external API lookups on top of PDF processing.

## The service is slow to process PDFs

If GROBID seems unusually slow when processing documents, the most common cause is **consolidation**. When consolidation is enabled (e.g. `consolidateHeader=1` or `consolidateCitations=1`), GROBID queries an external bibliographical service (CrossRef or biblio-glutton) for each bibliographic item, which adds significant latency — especially for documents with many references.

A few things to check:

1. **Are you using consolidation?** If you don't need DOI resolution or metadata enrichment, you can disable consolidation by setting `consolidateHeader=0` and `consolidateCitations=0` in your API calls. This will make processing much faster.

2. **Have you configured a polite email?** If you are using CrossRef consolidation, you should set a contact email in `grobid-home/config/grobid.yaml`:

```yaml
consolidation:
crossref:
mailto: your-email@example.org
```

Without a `mailto`, requests go through the CrossRef **public** pool with minimal concurrency (1 request at a time). Adding an email enables access to the **polite** pool with higher concurrency (up to 3), resulting in noticeably faster consolidation. See the [Consolidation documentation](Consolidation.md) for more details on tiers.

3. **Consider a CrossRef Plus subscription for medium-scale processing.** If you need consolidation but don't want to self-host biblio-glutton, a [CrossRef Metadata Plus](https://www.crossref.org/services/metadata-retrieval/metadata-plus/) subscription can help significantly. Benchmarks on 10,000 documents show the Plus tier is ~3.8x faster and has ~5.5x fewer errors compared to the Polite tier. See the [Consolidation documentation](Consolidation.md#performance-with-crossref-consolidation) for detailed numbers.

4. **Consider biblio-glutton for large-scale processing.** The CrossRef API has rate limits that make it impractical for processing large batches with citation consolidation enabled. If you need to process many PDFs with full citation consolidation, consider using [biblio-glutton](https://github.com/kermitt2/biblio-glutton), which can scale horizontally and is not subject to external rate limits.
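As a sketch, consolidation can be disabled when calling the REST service directly (the helper below is hypothetical; the endpoint and parameter names are GROBID's documented ones):

```python
def build_params(consolidate_header=False, consolidate_citations=False):
    """Hypothetical helper building form parameters for a
    processFulltextDocument request, with consolidation off by default."""
    return {
        "consolidateHeader": "1" if consolidate_header else "0",
        "consolidateCitations": "1" if consolidate_citations else "0",
    }

# Usage with the `requests` library (server URL assumed to be a local install):
# import requests
# with open("paper.pdf", "rb") as f:
#     r = requests.post("http://localhost:8070/api/processFulltextDocument",
#                       data=build_params(), files={"input": f}, timeout=300)
```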


## How to override the grobid configuration file when running via docker?

17 changes: 16 additions & 1 deletion grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java
@@ -391,6 +391,9 @@ public String toString() {
// Source (whether the data was consolidated)
private String status = CONSOLIDATION_STATUS_EXTRACTED;

// Which consolidation service was used (e.g. "crossref" or "glutton")
private String consolidationService = null;

// All the tokens that are considered noise will be collected here
private List<String> discardedPieces = new ArrayList<>();
private List<List<LayoutToken>> discardedPiecesTokens = new ArrayList<>();
@@ -2316,7 +2319,10 @@ public String toTEI(int n, int indent, GrobidAnalysisConfig config) {
if (withCoords)
tei.append(TEIFormatter.getCoordsAttribute(coordinates, withCoords)).append(" ");

tei.append("status=\"" + getStatus() + "\" ").append(" ");
tei.append("status=\"" + getStatus() + "\"");
if (getConsolidationService() != null)
tei.append(" source=\"" + getConsolidationService() + "\"");
tei.append(" ");

if (!StringUtils.isEmpty(language)) {
if (n == -1) {
@@ -4461,6 +4467,7 @@ else if (bibo.getFullAuthors().size() == 1) {
}
}
bib.setStatus(bibo.getStatus());
bib.setConsolidationService(bibo.getConsolidationService());
}

/**
@@ -4638,6 +4645,14 @@ public void setStatus(String status) {
this.status = status;
}

public String getConsolidationService() {
return consolidationService;
}

public void setConsolidationService(String consolidationService) {
this.consolidationService = consolidationService;
}

public String getConflictStmt() {
return conflictStmt;
}
@@ -422,7 +422,12 @@ else if (biblio.getE_Year().length() == 4)
tei.append("\t\t\t\t<availability status=\"unknown\"><licence/></availability>\n");
tei.append("\t\t\t</publicationStmt>\n");
}
tei.append("\t\t\t<sourceDesc>\n\t\t\t\t<biblStruct>\n\t\t\t\t\t<analytic>\n");
tei.append("\t\t\t<sourceDesc>\n\t\t\t\t<biblStruct");
if (biblio.getStatus() != null)
tei.append(" status=\"" + biblio.getStatus() + "\"");
if (biblio.getConsolidationService() != null)
tei.append(" source=\"" + biblio.getConsolidationService() + "\"");
tei.append(">\n\t\t\t\t\t<analytic>\n");

// authors + affiliation
//biblio.createAuthorSet();