MWE Identification in Bioinformatics Research Articles and Dispersion Profiling Across IMRaD

This repository contains the code used in the paper:

“Beyond Single Words: MWE Identification in Bioinformatics Research Articles and Dispersion Profiling Across IMRaD”

It supports two main workflows:

Corpus building: filter bioinformatics articles from a JATS XML collection and segment each article into main IMRaD sections (Abstract, Introduction, Materials and Methods, Results, Discussion, Conclusions when available).
MWE extraction: extract Multiword Expressions (MWEs) from the resulting corpus using:
- UD-based extraction (via dependency parsing)
- USAS-based extraction (via PyMUSAS / UCREL semantic tagging)
- (Optionally) additional list-based / terminology resources (MeSH, AFL/ARTES) stored in mwes-lists/

The segmented corpus itself is distributed separately (see Corpus access below).

Corpus access

The corpus (already segmented into IMRaD sections) is available for download here: https://doi.org/10.6084/m9.figshare.31215955

Repository structure

.
├── corpus-building/     # Python scripts to collect, filter, and structure the corpus
├── mwe-extraction/      # Python scripts to extract MWEs (UD and USAS)
├── mwes-lists/          # lists of MWEs from AFL, ARTES and the MeSH controlled vocabulary thesaurus
├── LICENSE              # Code license
├── README.md
└── requirements.txt

Installation

git clone git@github.com:jurgigi/BioMONO.git

Install dependencies

pip install -r requirements.txt

Download model

spaCy model:

python -m spacy download en_core_web_sm

Methods overview

Corpus presentation

The corpus, BioMONO_en, is derived from the PLOS allofplos collection (JATS XML). Articles tagged with the bioinformatics subject are kept and then segmented into IMRaD sections using JATS section titles and tags, producing one plain-text file per section per article.

PLOS allofplos: https://github.com/PLOS/allofplos

MWE extraction approaches

MWEs are extracted using complementary automated methods:

UD-based MWEs: extracted from dependency parses using relations commonly associated with multiword constructions:
- compound (incl. nominal compounds)
- compound:prt (phrasal verbs)
- fixed (grammaticalized fixed expressions)
- flat (headless flat constructions)
- flat:foreign (foreign sequences)

Parsing is performed with Stanza: https://github.com/stanfordnlp/stanza
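As an illustration of the relation-based idea above, here is a minimal sketch that groups each head token with its dependents attached through one of the listed relations. It is not the repository's extract_mwes_from_conllu_folder.py; the function names (read_sentences, extract_mwes) and the toy sentence are assumptions made for the example, and chained attachments (A → B → C) are reported as separate pairs rather than merged.

```python
# Relations listed above as markers of multiword constructions.
MWE_RELATIONS = {"compound", "compound:prt", "fixed", "flat", "flat:foreign"}


def read_sentences(conllu_text):
    """Yield sentences as lists of (id, form, head, deprel) tuples."""
    sentence = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comments
            continue
        cols = line.split("\t")
        # Skip malformed rows, multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


def extract_mwes(sentence):
    """Return surface strings: each head plus its MWE-relation dependents."""
    groups = {}
    for tok_id, _form, head, deprel in sentence:
        if deprel in MWE_RELATIONS and head > 0:
            groups.setdefault(head, set()).update({head, tok_id})
    forms = {tok_id: form for tok_id, form, _head, _deprel in sentence}
    return [
        " ".join(forms[i] for i in sorted(groups[head]) if i in forms)
        for head in sorted(groups)
    ]


# Toy sentence (hypothetical parse), written space-separated and converted to tabs.
ROWS = [
    "1 We we PRON _ _ 2 nsubj _ _",
    "2 used use VERB _ _ 0 root _ _",
    "3 gene gene NOUN _ _ 5 compound _ _",
    "4 expression expression NOUN _ _ 5 compound _ _",
    "5 data data NOUN _ _ 2 obj _ _",
    "6 . . PUNCT _ _ 2 punct _ _",
]
SAMPLE = "\n".join("\t".join(row.split(" ")) for row in ROWS) + "\n"

for sent in read_sentences(SAMPLE):
    print(extract_mwes(sent))
```

On the toy sentence this prints a single candidate, "gene expression data", because both compound dependents attach to the same head.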

USAS-based MWEs: extracted via PyMUSAS, which exposes UCREL’s USAS semantic resources and includes MWE tagging support: https://github.com/UCREL/pymusas
MeSH / AFL / ARTES lists: stored in mwes-lists/ for optional list-based matching in downstream analyses.

MeSH: https://www.nlm.nih.gov/mesh/meshhome.html

AFL: https://www.eapfoundation.com/vocab/academic/afl/

ARTES: https://artes.app.univ-paris-diderot.fr/

How to use

Corpus building (XML → IMRaD TXT)

Goal: from a folder of JATS XML files, keep only those whose subject contains “bioinformatics”, then extract IMRaD sections into section-specific output folders.

python corpus-building/corpus_build.py \
  /path/to/xml_folder \
  /path/to/output_imrad_txt \
  --subject bioinformatics
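
For readers who want to see the idea behind this step, the following is a minimal sketch of subject filtering and IMRaD segmentation using only the Python standard library. It is not the code of corpus_build.py. The element names (subject, abstract, body, sec, title, p) are standard JATS, but the keyword mapping IMRAD_KEYWORDS and the helper names are assumptions made for illustration; the repository's own rules may differ (for example, a combined "Results and Discussion" title is filed under Results here).

```python
import xml.etree.ElementTree as ET

# Hypothetical title keywords per IMRaD section (an assumption, not the repo's rules).
IMRAD_KEYWORDS = {
    "Introduction": ("introduction", "background"),
    "Materials and Methods": ("method",),
    "Results": ("result",),
    "Discussion": ("discussion",),
    "Conclusions": ("conclusion",),
}


def flat_text(element):
    """Concatenate all text inside an element and normalize whitespace."""
    return " ".join("".join(element.itertext()).split())


def has_subject(root, subject):
    """True if any <subject> element contains the given string (case-insensitive)."""
    needle = subject.lower()
    return any(needle in flat_text(el).lower() for el in root.iter("subject"))


def classify_title(title):
    """Map a section title to an IMRaD label, or None if no keyword matches."""
    low = title.lower()
    for label, keywords in IMRAD_KEYWORDS.items():
        if any(k in low for k in keywords):
            return label
    return None


def segment(root):
    """Return {section label: text} for the abstract and top-level body sections."""
    sections = {}
    abstract = root.find(".//abstract")
    if abstract is not None:
        sections["Abstract"] = flat_text(abstract)
    body = root.find("body")
    if body is None:
        return sections
    for sec in body.findall("sec"):
        title = sec.find("title")
        label = classify_title(flat_text(title)) if title is not None else None
        if label is None:
            continue
        text = " ".join(flat_text(p) for p in sec.iter("p"))
        sections[label] = (sections.get(label, "") + " " + text).strip()
    return sections


# Tiny hand-written JATS-like document used only to exercise the sketch.
SAMPLE_XML = (
    "<article><front><article-meta><article-categories><subj-group>"
    "<subject>Computational biology</subject><subj-group>"
    "<subject>Bioinformatics</subject></subj-group></subj-group>"
    "</article-categories><abstract><p>Short abstract.</p></abstract>"
    "</article-meta></front><body>"
    "<sec><title>Introduction</title><p>Intro text.</p></sec>"
    "<sec><title>Materials and Methods</title>"
    "<p>We did <italic>things</italic>.</p></sec>"
    "</body></article>"
)

root = ET.fromstring(SAMPLE_XML)
if has_subject(root, "bioinformatics"):
    for name, text in segment(root).items():
        print(name, "->", text)
```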

UD-based MWEs (TXT → CoNLL-U → JSON)

1) Parse section texts with Stanza (TXT → CoNLL-U)

python mwe-extraction/parse_txt_folder_to_conllu.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/conllu/Introduction \
  --download_if_missing \
  --use_gpu

Optional: use a domain package (if available in your Stanza setup):

python mwe-extraction/parse_txt_folder_to_conllu.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/conllu/Introduction \
  --biomed genia \
  --download_if_missing \
  --use_gpu

2) Extract UD MWEs (CoNLL-U → JSON)

Per-file JSON outputs (default):

python mwe-extraction/extract_mwes_from_conllu_folder.py \
  --input_dir /path/to/conllu/Introduction \
  --output_dir /path/to/ud_mwes_json/Introduction

Single aggregated JSON for the folder:

python mwe-extraction/extract_mwes_from_conllu_folder.py \
  --input_dir /path/to/conllu/Introduction \
  --output_dir /path/to/ud_mwes_json/Introduction \
  --aggregate

USAS-based MWEs (TXT → JSON)

This step extracts only the MWEs that PyMUSAS detects in each input .txt file.

Per-file JSON (default):

python mwe-extraction/pymusas_extract_mwes_txt_folder.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/usas_mwes_json/Introduction \
  --use_gpu

Single aggregated JSON for the folder:

python mwe-extraction/pymusas_extract_mwes_txt_folder.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/usas_mwes_json/Introduction \
  --aggregate \
  --agg_name all_usas_mwes_introduction.json \
  --use_gpu

Dispersion analysis

Dispersion can be computed once MWEs are extracted. The paper reports:

Document Frequency (DF) and DF%
Gries’ DP, quantifying deviation from an equal-share baseline:

$$DP = \frac{1}{2}\sum_{i=1}^{N}\left|p_i - s_i\right|$$

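where N is the number of documents in the section, pᵢ is the observed proportion of an MWE’s occurrences in document i, and sᵢ is the expected proportion under the baseline (operationalized as the document’s share of tokens in the section). DP is 0 when occurrences are distributed exactly in line with document size and approaches 1 when they are concentrated in a small part of the section.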
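
The two measures can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the script used for the paper: the function name dispersion and the input layout (one dictionary of per-document MWE counts, one of per-document token counts) are hypothetical, and the numbers are toy values.

```python
def dispersion(mwe_counts, doc_tokens):
    """Return (DF, DF%, DP) for one MWE within one section.

    mwe_counts: {doc_id: occurrences of the MWE}; missing documents count as 0.
    doc_tokens: {doc_id: number of tokens}; defines the set of N documents.
    """
    total_hits = sum(mwe_counts.get(doc, 0) for doc in doc_tokens)
    total_tokens = sum(doc_tokens.values())
    if total_hits == 0 or total_tokens == 0:
        raise ValueError("the MWE must occur at least once in a non-empty section")

    # Document frequency: number of documents containing the MWE.
    df = sum(1 for doc in doc_tokens if mwe_counts.get(doc, 0) > 0)
    df_percent = 100.0 * df / len(doc_tokens)

    # DP: half the summed absolute difference between observed shares (p_i)
    # and expected shares (s_i, the document's share of tokens).
    dp = 0.5 * sum(
        abs(mwe_counts.get(doc, 0) / total_hits - doc_tokens[doc] / total_tokens)
        for doc in doc_tokens
    )
    return df, df_percent, dp


# Toy section with three documents of 1000, 1000 and 2000 tokens.
tokens = {"a": 1000, "b": 1000, "c": 2000}

concentrated = dispersion({"a": 4}, tokens)               # all hits in one document
proportional = dispersion({"a": 1, "b": 1, "c": 2}, tokens)  # hits follow document size

print(concentrated)
print(proportional)
```

In the concentrated case the observed shares are (1, 0, 0) against expected shares (0.25, 0.25, 0.5), so DP = 0.5 × (0.75 + 0.25 + 0.5) = 0.75 with DF = 1. In the proportional case observed and expected shares coincide, so DP = 0 with DF = 3.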

How to cite this work

@inproceedings{giraud-gargett-2026-beyond,
    title = "Beyond Single Words: {MWE} Identification in Bioinformatics Research Articles and Dispersion Profiling Across {IMR}a{D}",
    author = "Giraud, Jurgi  and
      Gargett, Andrew",
    editor = {Ojha, Atul Kr.  and
      Mititelu, Verginica Barbu  and
      Constant, Mathieu  and
      Stoyanova, Ivelina  and
      Do{\u{g}}ru{\"o}z, A. Seza  and
      Rademaker, Alexandre},
    booktitle = "Proceedings of the 22nd Workshop on Multiword Expressions ({MWE} 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.mwe-1.10/",
    doi = "10.18653/v1/2026.mwe-1.10",
    pages = "86--95",
    ISBN = "979-8-89176-363-0",
    abstract = "Multiword Expressions (MWEs) are pervasive in scientific writing, and in specialized domains they include both multiword terminology (e.g., noun compounds) and recurrent academic phrasing. This study profiles MWEs in a large corpus of bioinformatics research articles segmented by IMRaD sections. Building on recent multi-method approaches to scientific MWE identification, we extract MWEs using complementary automated strategies (semantic matching, dependency parsing, controlled vocabularies, and academic formula lists) and compare the resulting inventories by size, form, and IMRaD section distribution. We further quantify cross-document dispersion using document frequency and Gries' DP to distinguish widely reused expressions from items concentrated in a small subset of articles. Results show that bioinformatics MWEs are predominantly short and nominal, but that extraction methods differ in the extent to which they recover discourse and reporting phraseology. Dispersion is strongly long-tailed across sections with most MWEs being document-specific, while a smaller recurrent core aligns with section function and is enriched for conventional templates and standardized multiword terms. Overall, the findings argue for combining complementary identification methods with dispersion profiling to characterize domain ``multiwordness'' in a principled and section-sensitive way."
}
