Merged
91 changes: 82 additions & 9 deletions doc/Consolidation.md
@@ -1,14 +1,14 @@
# Consolidation

In GROBID, we call __consolidation__ the use of an external bibliographical service to correct and complement the results extracted by the tool. GROBID usually extracts a core of bibliographical information in a relatively reliable manner, which can be used to match complete bibliographical records made available by these services.

Consolidation has two main benefits:

* The consolidation service very significantly improves the retrieval of header information (+0.12 to +0.13 in F1-score, e.g. from an average F1-score of 74.59 for all fields with Ratcliff/Obershelp similarity at 0.95 to 88.89, using biblio-glutton and GROBID version `0.5.6` on the PMC 1943 dataset; see the more recent [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)).

* The consolidation service matches the extracted bibliographical references against known publications and complements the parsed references with various metadata, in particular the DOI, making it possible to build a citation graph and to link the extracted references to external services.

Consolidation also includes the CrossRef Funder Registry, used to enrich the extracted funder information.

GROBID supports two consolidation services:

@@ -18,7 +18,7 @@ GROBID supports two consolidation services:

## CrossRef REST API

The advantage of __CrossRef__ is that it is available without any further installation. However, it has a limited query rate (in practice around 25 queries per second), which makes scaling impossible when bibliographical references from several documents are processed in parallel. In addition, the metadata it provides is limited to what is available at CrossRef.

For using [reliably and politely the CrossRef REST API](https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service), it is highly recommended to add a contact email to the queries. This is done in GROBID by modifying the config file under `grobid-home/config/grobid.yaml`:

@@ -29,7 +29,7 @@ consolidation:
timeoutSec: 10
```

Without this email, the service might be unreliable, with occasional query failures over time. GROBID's usage of the CrossRef REST API respects the query rate that the service indicates dynamically in each response. Therefore, no issues should be reported by CrossRef via this email.

In case you are a lucky Crossref Metadata Plus subscriber, you can set your authorization token in the config file under `grobid-home/config/grobid.yaml` as follows:

@@ -42,6 +42,63 @@ consolidation:

According to Crossref, the token will ensure that said requests get directed to a pool of machines that are reserved for "Plus" SLA users (note: of course the above token is fake).
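Putting both settings together, the relevant fragment of `grobid-home/config/grobid.yaml` would look like the sketch below (the key names `mailto` and `token` come from the configuration options described above; the values are placeholders):

```yaml
consolidation:
  crossref:
    mailto: "a-contact-email@example.org"
    token: "AAAA-BBBB-CCCC"   # placeholder, not a real token
```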

### Concurrency

GROBID automatically adjusts concurrency based on the CrossRef API tier detected from the configuration:

| Tier | Condition | Initial Concurrent Requests |
|------|-----------|----------------------------|
| Public | No `mailto`, no `token` | 1 |
| Polite | `mailto` set | 3 |
| Plus | `token` set | 50 |

These initial values are further tuned at runtime using the `x-concurrency-limit` header returned by CrossRef API responses.

When a Plus tier token is configured, GROBID validates it at startup by making a lightweight request (`/works?rows=0`) to CrossRef. If the token is not recognized as Plus tier (e.g. expired or invalid), GROBID automatically falls back to Polite concurrency (3) if `mailto` is set, or Public (1) otherwise, and logs a warning. If CrossRef is unreachable at startup, the Plus tier default is kept since the token cannot be proven invalid.
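The tier selection above can be sketched as follows (an illustrative model, not GROBID's actual Java implementation):

```python
def initial_concurrency(mailto=None, token=None):
    """Starting number of concurrent CrossRef requests per detected tier.

    Runtime tuning via the x-concurrency-limit header is not modeled here.
    """
    if token:       # Plus tier: reserved pool
        return 50
    if mailto:      # Polite tier: identified client
        return 3
    return 1        # Public tier: anonymous, minimal concurrency
```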

### Performance with CrossRef Consolidation

When citation consolidation is enabled, the CrossRef API becomes the dominant factor in processing time. Below are benchmarks from processing 10,000 PDF documents with `processFulltextDocument` and `consolidateCitations=1`:

| Metric | Polite Tier | Plus Tier |
|--------|------------|-----------|
| Total runtime | ~162,277 sec (~45 hours) | ~42,755 sec (~12 hours) |
| Throughput | 0.06 docs/sec | 0.23 docs/sec |
| Average time per document | 17.02 sec/doc | 4.31 sec/doc |
| Failed documents | 467/10,000 (4.7%) | 85/10,000 (0.85%) |

The Plus tier is approximately **3.8x faster** and produces **~5.5x fewer errors** compared to the Polite tier. For any batch processing beyond a few hundred documents with citation consolidation, the Plus tier is strongly recommended.

!!! warning "Increase client timeout when using consolidation"
With consolidation enabled, GROBID takes significantly longer to process each document. The default client timeout (e.g. 60 seconds in the Python client) is far too low — individual documents with many references can take well over a minute. **Increase the client timeout to 200–600 seconds** to avoid unnecessary timeout errors. For example, in the Python client's `config.json`, set `"timeout": 300`.
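For instance, assuming the `config.json` layout of the Python client (the `grobid_server` key name is taken from that client's sample configuration and may differ across versions):

```json
{
    "grobid_server": "http://localhost:8070",
    "timeout": 300
}
```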

### Rate Limiting and Backoff

When CrossRef returns HTTP 429 (rate limit exceeded), GROBID applies exponential backoff with jitter ("full jitter" strategy):

- Base delay: 1 second, exponentially increased (`base * 2^attempt`), capped at 60 seconds
- Each retry sleeps for a random duration in `[0, d]`, where `d` is the current capped exponential delay; this spreads retries across time and avoids synchronized retry bursts (the thundering herd problem)
- During backoff, concurrency is reduced to 1 (serialized requests)
- On the next successful response, backoff resets and concurrency is restored

GROBID also reads the `x-api-pool` header from responses to identify which CrossRef pool is being used.
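The backoff schedule described above can be sketched as follows (the constants match the documented values; this is an illustration, not GROBID's actual Java code):

```python
import random

BASE = 1.0   # base delay in seconds
CAP = 60.0   # maximum delay in seconds

def backoff_delay(attempt):
    """Full-jitter delay before retry `attempt` (0-based):
    a random duration in [0, min(CAP, BASE * 2**attempt)]."""
    ceiling = min(CAP, BASE * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```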

### Post-Validation

By default, GROBID post-validates CrossRef results against the source metadata to filter false positives (the CrossRef API is a search API and may return inexact matches). This validation compares the first author surname using fuzzy matching.

For biblio-glutton, post-validation is always skipped because glutton handles validation internally.

For CrossRef, post-validation can be disabled via configuration:

```yaml
consolidation:
crossref:
postValidation: false
```

When post-validation is disabled, all CrossRef results are accepted. This can be useful for testing or when the results are post-processed by another system.
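The surname comparison can be sketched as below (the matcher and the 0.8 threshold are assumptions for illustration, not GROBID's exact implementation):

```python
from difflib import SequenceMatcher

def surnames_match(extracted, candidate, threshold=0.8):
    """Accept a CrossRef candidate only if the first-author surnames
    are similar enough under a fuzzy string comparison."""
    ratio = SequenceMatcher(None, extracted.lower(), candidate.lower()).ratio()
    return ratio >= threshold
```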

### Timeouts

Both consolidation services support configurable timeouts to control how long GROBID waits for external API responses:
@@ -55,14 +112,30 @@ consolidation:
```
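A sketch of the corresponding configuration, assuming the `timeoutSec` key shown in the CrossRef fragment earlier applies to both services:

```yaml
consolidation:
  glutton:
    timeoutSec: 10
  crossref:
    timeoutSec: 10
```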

!!! warning "Be careful when setting low timeout values"
    Setting timeout values too low may cause request failures. For CrossRef, be particularly careful: aggressive querying with short timeouts at high volume may result in being banned from the service.

## TEI Output

When consolidation is performed, the resulting TEI output includes attributes on `<biblStruct>` elements to indicate the consolidation status and which service was used:

- `status="consolidated"` — the bibliographic item was matched and enriched by the consolidation service
- `status="extracted"` — the bibliographic item was extracted by GROBID only (no consolidation match)
- `source="crossref"` or `source="glutton"` — which consolidation service provided the match

These attributes appear on both header `<biblStruct>` (in `<sourceDesc>`) and citation `<biblStruct>` elements.

Example:
```xml
<biblStruct status="consolidated" source="crossref">
...
</biblStruct>
```
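A downstream consumer can read these attributes with any XML library; a minimal sketch (the TEI fragment below is a stand-in for real GROBID output):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <biblStruct status="consolidated" source="crossref"/>
</TEI>"""

root = ET.fromstring(sample)
for bibl in root.iter(TEI_NS + "biblStruct"):
    print(bibl.get("status"), bibl.get("source"))
# prints: consolidated crossref
```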

## biblio-glutton

This service presents several advantages compared to the CrossRef service. biblio-glutton can scale as required by adding more Elasticsearch nodes, allowing the processing of several PDFs per second. The metadata provided by the service is richer: in addition to the CrossRef metadata, biblio-glutton also returns the PubMed and PubMed Central identifiers, ISTEX identifiers, PII, and the URL of the Open Access version of the full text following the Unpaywall dataset. Finally, the bibliographical reference matching is [slightly more reliable](https://github.com/kermitt2/biblio-glutton#matching-accuracy).

Unfortunately, you need to install the service yourself, including loading and indexing the bibliographical resources, as documented [here](https://github.com/kermitt2/biblio-glutton#building-the-bibliographical-data-look-up-and-matching-databases). Note that a [docker container](https://github.com/kermitt2/biblio-glutton#running-with-docker) is available.

After installing biblio-glutton, you need to select the glutton matching service in the `grobid-home/config/grobid.yaml` file, with its URL, for instance:

25 changes: 24 additions & 1 deletion doc/Frequently-asked-questions.md
@@ -66,7 +66,30 @@ Occasionally, [people have reported](https://github.com/kermitt2/grobid/issues/1
To resolve this issue, there are several options:
- Check that you are running the proper image for your hardware. If you are not sure, use the image `grobid/grobid:0.8.2-crf` which is the most lightweight and fastest image.
- Make sure you don't send too many requests at the same time, as this can overload the server. If you are using the Grobid Python client, you can set the `n` parameter to a lower value (e.g. 1 or 2) to limit the number of concurrent requests.
- Increase the timeout value in your client. If you are using the Grobid Python client, you can set the `timeout` parameter to a higher value (e.g. 90 seconds) in the `config.json` to give the server more time to respond.
- Increase the timeout value in your client. If you are using the Grobid Python client, you can set the `timeout` parameter to a higher value (e.g. 90 seconds) in the `config.json` to give the server more time to respond. **If consolidation is enabled** (`consolidateHeader` or `consolidateCitations`), you should increase the timeout much further — to **200–600 seconds** — because each document now waits for multiple external API lookups on top of PDF processing.

## The service is slow to process PDFs

If GROBID seems unusually slow when processing documents, the most common cause is **consolidation**. When consolidation is enabled (e.g. `consolidateHeader=1` or `consolidateCitations=1`), GROBID queries an external bibliographical service (CrossRef or biblio-glutton) for each bibliographic item, which adds significant latency — especially for documents with many references.

A few things to check:

1. **Are you using consolidation?** If you don't need DOI resolution or metadata enrichment, you can disable consolidation by setting `consolidateHeader=0` and `consolidateCitations=0` in your API calls. This will make processing much faster.

2. **Have you configured a polite email?** If you are using CrossRef consolidation, you should set a contact email in `grobid-home/config/grobid.yaml`:

```yaml
consolidation:
crossref:
mailto: your-email@example.org
```

Without a `mailto`, requests go through the CrossRef **public** pool with minimal concurrency (1 request at a time). Adding an email enables access to the **polite** pool with higher concurrency (up to 3), resulting in noticeably faster consolidation. See the [Consolidation documentation](Consolidation.md) for more details on tiers.

3. **Consider a CrossRef Plus subscription for medium-scale processing.** If you need consolidation but don't want to self-host biblio-glutton, a [CrossRef Metadata Plus](https://www.crossref.org/services/metadata-retrieval/metadata-plus/) subscription can help significantly. Benchmarks on 10,000 documents show the Plus tier is ~3.8x faster and has ~5.5x fewer errors compared to the Polite tier. See the [Consolidation documentation](Consolidation.md#performance-with-crossref-consolidation) for detailed numbers.

4. **Consider biblio-glutton for large-scale processing.** The CrossRef API has rate limits that make it impractical for processing large batches with citation consolidation enabled. If you need to process many PDFs with full citation consolidation, consider using [biblio-glutton](https://github.com/kermitt2/biblio-glutton), which can scale horizontally and is not subject to external rate limits.
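As a sketch, consolidation can be disabled when calling the REST service directly (the helper below is hypothetical; the endpoint and parameter names are GROBID's documented ones):

```python
def build_params(consolidate_header=False, consolidate_citations=False):
    """Hypothetical helper building form parameters for a
    processFulltextDocument request, with consolidation off by default."""
    return {
        "consolidateHeader": "1" if consolidate_header else "0",
        "consolidateCitations": "1" if consolidate_citations else "0",
    }

# Usage with the `requests` library (server URL assumed to be a local install):
# import requests
# with open("paper.pdf", "rb") as f:
#     r = requests.post("http://localhost:8070/api/processFulltextDocument",
#                       data=build_params(), files={"input": f}, timeout=300)
```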


## How to override the grobid configuration file when running via docker?

17 changes: 16 additions & 1 deletion grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java
@@ -391,6 +391,9 @@ public String toString() {
// Source (whether the data was consolidated)
private String status = CONSOLIDATION_STATUS_EXTRACTED;

// Which consolidation service was used (e.g. "crossref" or "glutton")
private String consolidationService = null;

// All the tokens that are considered noise will be collected here
private List<String> discardedPieces = new ArrayList<>();
private List<List<LayoutToken>> discardedPiecesTokens = new ArrayList<>();
@@ -2316,7 +2319,10 @@ public String toTEI(int n, int indent, GrobidAnalysisConfig config) {
if (withCoords)
tei.append(TEIFormatter.getCoordsAttribute(coordinates, withCoords)).append(" ");

tei.append("status=\"" + getStatus() + "\" ").append(" ");
tei.append("status=\"" + getStatus() + "\"");
if (getConsolidationService() != null)
tei.append(" source=\"" + getConsolidationService() + "\"");
tei.append(" ");

if (!StringUtils.isEmpty(language)) {
if (n == -1) {
@@ -4461,6 +4467,7 @@ else if (bibo.getFullAuthors().size() == 1) {
}
}
bib.setStatus(bibo.getStatus());
bib.setConsolidationService(bibo.getConsolidationService());
}

/**
@@ -4638,6 +4645,14 @@ public void setStatus(String status) {
this.status = status;
}

public String getConsolidationService() {
return consolidationService;
}

public void setConsolidationService(String consolidationService) {
this.consolidationService = consolidationService;
}

public String getConflictStmt() {
return conflictStmt;
}
@@ -422,7 +422,12 @@ else if (biblio.getE_Year().length() == 4)
tei.append("\t\t\t\t<availability status=\"unknown\"><licence/></availability>\n");
tei.append("\t\t\t</publicationStmt>\n");
}
tei.append("\t\t\t<sourceDesc>\n\t\t\t\t<biblStruct>\n\t\t\t\t\t<analytic>\n");
tei.append("\t\t\t<sourceDesc>\n\t\t\t\t<biblStruct");
if (biblio.getStatus() != null)
tei.append(" status=\"" + biblio.getStatus() + "\"");
if (biblio.getConsolidationService() != null)
tei.append(" source=\"" + biblio.getConsolidationService() + "\"");
tei.append(">\n\t\t\t\t\t<analytic>\n");

// authors + affiliation
//biblio.createAuthorSet();