This project evaluates evaluation methods for automatic summarization on long financial documents from the 2023 Financial Narrative Summarization (FNS) task. It uses three datasets of annual reports and their gold summaries:
- English
- Greek
- Spanish
Due to access limitations, only the Greek dataset is publicly available, so a script has been added to download it. To get the Greek dataset, run the data collection script.
You can get statistics for all the datasets (the original datasets and the generated candidate summaries) and their texts by running the stats scripts. The text statistics extracted are:
- spaCy token count
- spaCy sentence count
- BERT token count
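For a rough picture of how these could be computed, here is a minimal sketch (assuming spaCy with a per-language model and a HuggingFace BERT tokenizer; the model names below are placeholders, and the project's own tokenization handling lives in `src/modules/tokenizer.py`):

```python
# Minimal sketch (not the project's implementation) of the three statistics.
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")                         # assumed English model
bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

def text_stats(text: str) -> dict:
    doc = nlp(text)
    return {
        "spacy_tokens": len(doc),                 # spaCy token count
        "spacy_sentences": len(list(doc.sents)),  # spaCy sentence count
        "bert_tokens": len(bert.tokenize(text)),  # BERT token count
    }

print(text_stats("Revenue grew by 12% in 2023. Costs fell."))
```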
To create noisy summaries from existing ones, run the summary corruption script. To evaluate your summaries, run the summary evaluation script.
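For intuition about what "noisy" means here, corruption can be as simple as the toy function below (purely illustrative; the actual strategies live in `src/modules/summary_corruptor.py` and may differ):

```python
# Toy noise insertion (illustrative only): randomly duplicate words.
import random

def corrupt(summary: str, noise_rate: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    noisy = []
    for word in summary.split():
        noisy.append(word)
        if rng.random() < noise_rate:  # duplicate roughly 10% of the words
            noisy.append(word)
    return " ".join(noisy)

print(corrupt("Net profit rose sharply during the fiscal year."))
```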
```
thesis/
├── data/                 # Source documents, gold summaries, candidate summaries (generated by pipelines/generate.py)
├── evaluation_methods/   # Evaluation methods
├── notebooks/            # Analyses of the results
├── results/              # Evaluation results from pipelines/evaluate.py
├── samples/              # Example untracked files
├── src/
│   ├── modules/
│   │   ├── data_collector.py     # Data collection
│   │   ├── summary_corruptor.py  # Noise insertion
│   │   ├── summary_generator.py  # Candidate summaries generator
│   │   ├── summary_evaluator.py  # Evaluator
│   │   └── tokenizer.py          # Tokenization handling
│   ├── pipelines/
│   │   ├── collect.py            # Data collection pipeline
│   │   ├── generate.py           # Candidate summaries generation pipeline
│   │   └── evaluate.py           # Evaluation pipeline
│   └── utils/                    # Helper functions for the main modules
│       ├── summary_corruptor_utils.py
│       ├── summary_evaluator_utils.py
│       └── visualization.py
├── main.py
└── README.md
```

This project uses uv for its dependencies. Follow the official installation instructions depending on your OS.
To set up the project:

- Create a `data/` directory at the root. If you don't have a dataset available, you can run the data collection script to get a dataset of Greek annual reports and their gold summaries.
- Create a `.env` file at the root based on the `samples/sample.env` file.
- Create a `conf/config.yaml` file inside the `src/` directory.
To generate and/or evaluate candidate summaries, configure the language variables in the `.env` file:

- `LANGUAGE`: `English`, `Greek`, or `Spanish`
- `SUMMARY_VER`:
  - English: `_1`
  - Greek: `_2`
  - Spanish: `_GS1`
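For example, a `.env` set up for the Greek dataset might look like the following (assuming the usual `KEY=value` dotenv format; `samples/sample.env` is the authoritative template):

```
LANGUAGE=Greek
SUMMARY_VER=_2
```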
The metrics used in this project are:
| Metric type | Metric name | How to set up |
|---|---|---|
| N-gram-based | ROUGE-1, ROUGE-2 | Comes with the project |
| N-gram-graph-based | AutoSummENG, MeMoG, NPowER | Contact G. Giannakopoulos |
| Embeddings-based | BERTScore | Comes with the project |
| Embeddings-based | BARTScore | Clone repository |
| Model-based | BLEURT | Clone repository |
| Model-based | FactCC | Model is used via HuggingFace |
| Model-based | LongDocFACTScore | Comes with the project |
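As a quick sanity check of the metrics that come with the project, the ROUGE and BERTScore variants can be exercised roughly as follows (an illustrative sketch assuming the `rouge-score` and `bert-score` packages; the project's own evaluator in `src/modules/summary_evaluator.py` may configure them differently):

```python
# Illustrative only: score one candidate against one reference.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidate = "The company reported higher annual revenue."
reference = "Annual revenue increased, the company reported."

# ROUGE-1 / ROUGE-2 (n-gram overlap)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
print(scorer.score(reference, candidate))

# BERTScore (embedding similarity); returns precision, recall, F1 tensors
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(F1.mean().item())
```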
Create an `evaluation_methods` directory at the root:

```
cd thesis
mkdir evaluation_methods
cd evaluation_methods
```

Clone the repositories inside the `evaluation_methods` directory, following the instructions of each repository:
```
git clone <the-eval-metric-repo>
```

To install the project itself:

```
git clone https://github.com/stefsyrsiri/thesis.git
cd thesis
uv sync --locked
```

To use the scripts, ensure that all the requirements are met.
```
# Run the pipeline with selected steps:
uv run main.py [--collect] [--generate] [--evaluate] [--all]
```

To collect the Greek annual reports dataset, run the main script with the `--collect` flag:
```
uv run main.py --collect
```

To merge all the datasets (all languages, all document types) and store them in a single place, run the following:
```
uv run main.py --merge-datasets
```

To get the text statistics for the merged dataset, run the following:
```
uv run main.py --stats
```

To create noisy summaries, run the main script with the `--generate` flag:
```
uv run main.py --generate
```

You can optionally use the `--truncate` flag to truncate the summaries at 512 tokens before applying the noise. This is useful if you need to evaluate summaries with metrics that have token input limits.
Note: Even with truncation, the exact length of the final text cannot be predetermined; since the noise is inserted after truncation, the text will exceed the 512-token limit and be truncated automatically by the metric. However, truncating before noise insertion limits how much the metric cuts off at inference time, making the results more reliable.
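The order of operations is the point here. A minimal sketch of the pre-truncation step (hypothetical helper name, assuming a HuggingFace BERT tokenizer; the project's corruption module handles this internally):

```python
# Sketch only: cut a summary to 512 BERT tokens before noise insertion.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

def truncate_to_512(text: str) -> str:
    ids = tok.encode(text, add_special_tokens=False)[:512]
    return tok.decode(ids)

truncated = truncate_to_512("...")  # a gold summary goes here
# Noise is inserted after this point, so the final text can still drift
# past 512 tokens and be clipped by the metric at inference time.
```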
To evaluate summaries, run the main script with the `--evaluate` flag. When running the evaluation script, you need to specify whether you're going to use CPU-bound or GPU-bound metrics by using either the `--cpu` or `--gpu` flag. If you want to use a reference-free metric, such as LongDocFACTScore, you need to add the `--no-refs` flag.

Note: LongDocFACTScore is GPU-bound because it depends on BARTScore, so you need both the `--gpu` and `--no-refs` flags.
```
uv run main.py --evaluate
```

Optional flags:

- `--all`: run all the steps (collect, generate, evaluate)
- `--subset`: run your selected step for `[subset]` documents, where `subset` is a positive integer
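For orientation, a minimal `argparse` skeleton consistent with the flags above could look like this (a sketch under assumptions, not the actual `main.py` wiring):

```python
# Sketch of the CLI surface described above (see main.py for the real wiring).
import argparse

parser = argparse.ArgumentParser(description="Summarization evaluation pipeline")
parser.add_argument("--collect", action="store_true", help="collect the Greek dataset")
parser.add_argument("--generate", action="store_true", help="generate noisy candidate summaries")
parser.add_argument("--evaluate", action="store_true", help="evaluate candidate summaries")
parser.add_argument("--all", action="store_true", help="run collect, generate, and evaluate")
parser.add_argument("--merge-datasets", action="store_true", help="merge all datasets in one place")
parser.add_argument("--stats", action="store_true", help="text statistics for the merged dataset")
parser.add_argument("--truncate", action="store_true", help="truncate to 512 tokens before noise")
parser.add_argument("--cpu", action="store_true", help="use CPU-bound metrics")
parser.add_argument("--gpu", action="store_true", help="use GPU-bound metrics")
parser.add_argument("--no-refs", action="store_true", help="use reference-free metrics")
parser.add_argument("--subset", type=int, help="run on this many documents")
args = parser.parse_args()
```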
Examples:

```
# Collect (Greek) data
uv run main.py --collect

# Generate noisy summaries
uv run main.py --generate --truncate

# Evaluate summaries
uv run main.py --evaluate --gpu --subset 10

# Run all steps: collect, generate, evaluate
uv run main.py --all
```