jurgigi/BioMONO

MWE Identification in Bioinformatics Research Articles and Dispersion Profiling Across IMRaD

This repository contains the code used in the paper:

“Beyond Single Words: MWE Identification in Bioinformatics Research Articles and Dispersion Profiling Across IMRaD”

It supports two main workflows:

  1. Corpus building: filter bioinformatics articles from a JATS XML collection and segment each article into main IMRaD sections (Abstract, Introduction, Materials and Methods, Results, Discussion, Conclusions when available).
  2. MWE extraction: extract Multiword Expressions (MWEs) from the resulting corpus using:
    • UD-based extraction (via dependency parsing)
    • USAS-based extraction (via PyMUSAS / UCREL semantic tagging)
    • (Optionally) additional list-based / terminology resources (MeSH, AFL/ARTES) stored in mwes-lists/

The segmented corpus itself is distributed separately (see Corpus access below).


Corpus access

The corpus (already segmented into IMRaD sections) is available for download here: https://doi.org/10.6084/m9.figshare.31215955


Repository structure

.
├── corpus-building/     # Python scripts to collect, filter, and structure the corpus
├── mwe-extraction/      # Python scripts to extract MWEs (UD and USAS)
├── mwes-lists/          # lists of MWEs from AFL, ARTES and the MeSH controlled vocabulary thesaurus
├── LICENSE              # Code license
├── README.md
└── requirements.txt

Installation

git clone git@github.com:jurgigi/BioMONO.git

Install dependencies

pip install -r requirements.txt

Download model

spaCy model:

python -m spacy download en_core_web_sm

Methods overview

Corpus presentation

BioMONO_en is derived from the PLOS allofplos collection (JATS XML). Articles belonging to the bioinformatics subject are filtered and then segmented into IMRaD sections using JATS section titles / tags, producing one plain-text file per section per article.

PLOS allofplos: https://github.com/PLOS/allofplos
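
As a rough illustration of title-based segmentation, the sketch below maps top-level JATS `<sec>` titles to IMRaD labels with the standard library. The title patterns, the function names (`classify_title`, `segment_article`), and the first-match rule are assumptions made for this example; the rules in corpus-building/ may differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical title patterns (assumption): not taken from the repository.
IMRAD_PATTERNS = {
    "Introduction": re.compile(r"^(introduction|background)\b", re.I),
    "Materials and Methods": re.compile(
        r"^(materials?\s+and\s+methods?|methods?)\b", re.I
    ),
    "Results": re.compile(r"^results?\b", re.I),
    "Discussion": re.compile(r"^discussion\b", re.I),
    "Conclusions": re.compile(r"^conclusions?\b", re.I),
}


def classify_title(title):
    """Return the first IMRaD label whose pattern matches the title, else None."""
    cleaned = title.strip()
    for label, pattern in IMRAD_PATTERNS.items():
        if pattern.search(cleaned):
            return label
    return None


def _flat_text(element):
    """Concatenate all text inside an element and normalize whitespace."""
    return " ".join("".join(element.itertext()).split())


def segment_article(xml_text):
    """Map IMRaD labels to plain text for one JATS article string."""
    root = ET.fromstring(xml_text)
    sections = {}
    abstract = root.find(".//abstract")
    if abstract is not None:
        sections["Abstract"] = _flat_text(abstract)
    # Only top-level body sections are classified; subsections stay with their parent.
    for sec in root.findall("./body/sec"):
        title_el = sec.find("title")
        if title_el is None:
            continue
        label = classify_title("".join(title_el.itertext()))
        if label is None:
            continue  # e.g. "Supporting Information"
        paragraphs = [_flat_text(p) for p in sec.iter("p")]
        sections[label] = "\n".join(paragraphs)
    return sections
```

With these patterns a combined heading such as "Results and Discussion" is filed under Results, because the first matching pattern wins; a real pipeline would need an explicit policy for such cases.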


MWE extraction approaches

MWEs are extracted using complementary automated methods:

  • UD-based MWEs: extracted from dependency parses using relations commonly associated with multiword constructions:
    • compound (incl. nominal compounds)
    • compound:prt (phrasal verbs)
    • fixed (grammaticalized fixed expressions)
    • flat (headless flat constructions)
    • flat:foreign (foreign sequences)

Parsing is performed with Stanza: https://github.com/stanfordnlp/stanza

  • USAS-based MWEs: extracted via PyMUSAS, which exposes UCREL’s USAS semantic resources and includes MWE tagging support: https://github.com/UCREL/pymusas

  • MeSH / AFL / ARTES lists: stored in mwes-lists/ for optional list-based matching in downstream analyses.

MeSH: https://www.nlm.nih.gov/mesh/meshhome.html

AFL: https://www.eapfoundation.com/vocab/academic/afl/

ARTES: https://artes.app.univ-paris-diderot.fr/
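
List-based matching is left to downstream analyses, so the following is only one possible approach: a greedy, longest-match-first lookup of list entries in a tokenized text. The function name `match_list_mwes`, the case-insensitive comparison, and the assumption that the lists are available as plain strings are all illustrative choices, not the repository's implementation.

```python
def match_list_mwes(tokens, mwe_list):
    """Find non-overlapping list entries in a token sequence.

    tokens: list of token strings.
    mwe_list: iterable of multiword entries, e.g. "polymerase chain reaction".
    Returns (start, end, text) tuples with an exclusive end index.
    """
    # Keep only entries with at least two whitespace-separated words.
    entries = {tuple(entry.lower().split()) for entry in mwe_list}
    entries = {entry for entry in entries if len(entry) > 1}
    if not entries:
        return []
    longest = max(len(entry) for entry in entries)
    lowered = [token.lower() for token in tokens]

    matches = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate window first, down to two tokens.
        for n in range(min(longest, len(lowered) - i), 1, -1):
            if tuple(lowered[i:i + n]) in entries:
                matches.append((i, i + n, " ".join(tokens[i:i + n])))
                i += n  # skip past the match so matches never overlap
                break
        else:
            i += 1
    return matches
```

Longest-match-first means that "polymerase chain reaction" is preferred over a shorter entry such as "chain reaction" when both are in the list.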


How to use

Corpus building (XML → IMRaD TXT)

Goal: from a folder of JATS XML files, keep only those whose subject contains “bioinformatics”, then extract IMRaD sections into section-specific output folders.

python corpus-building/corpus_build.py \
  /path/to/xml_folder \
  /path/to/output_imrad_txt \
  --subject bioinformatics
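
The subject filter can be pictured as a case-insensitive substring test over the `<subject>` elements of each JATS file. This is a minimal sketch under that assumption; `has_subject` is a hypothetical helper, not a function from corpus_build.py.

```python
import xml.etree.ElementTree as ET


def has_subject(xml_text, keyword="bioinformatics"):
    """Return True if any <subject> element contains the keyword (case-insensitive)."""
    root = ET.fromstring(xml_text)
    needle = keyword.casefold()
    return any(
        needle in "".join(subject.itertext()).casefold()
        for subject in root.iter("subject")
    )
```

Because `root.iter("subject")` walks the whole tree, nested `<subj-group>` hierarchies are covered without extra handling.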

UD-based MWEs (TXT → CoNLL-U → JSON)

1) Parse section texts with Stanza (TXT → CoNLL-U)

python mwe-extraction/parse_txt_folder_to_conllu.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/conllu/Introduction \
  --download_if_missing \
  --use_gpu

Optional: use a domain package (if available in your Stanza setup):

python mwe-extraction/parse_txt_folder_to_conllu.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/conllu/Introduction \
  --biomed genia \
  --download_if_missing \
  --use_gpu

2) Extract UD MWEs (CoNLL-U → JSON)

Per-file JSON outputs (default):

python mwe-extraction/extract_mwes_from_conllu_folder.py \
  --input_dir /path/to/conllu/Introduction \
  --output_dir /path/to/ud_mwes_json/Introduction

Single aggregated JSON for the folder:

python mwe-extraction/extract_mwes_from_conllu_folder.py \
  --input_dir /path/to/conllu/Introduction \
  --output_dir /path/to/ud_mwes_json/Introduction \
  --aggregate
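
To make the CoNLL-U → JSON step more concrete, here is a small sketch that reads CoNLL-U text and groups tokens linked by the five relations listed under "MWE extraction approaches". The grouping strategy (climbing the head chain so that nested compounds form a single unit) and the output fields are assumptions for illustration; the repository script may group and serialize MWEs differently.

```python
from collections import defaultdict

MWE_RELATIONS = {"compound", "compound:prt", "fixed", "flat", "flat:foreign"}


def read_sentences(conllu_text):
    """Yield sentences as lists of (id, form, head, deprel) tuples."""
    sentence = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2), empty nodes (1.1), and unparsed rows.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


def extract_mwes(sentence):
    """Group tokens connected by MWE relations under their topmost head."""
    tokens = {tid: (form, head, deprel) for tid, form, head, deprel in sentence}
    members = defaultdict(set)
    relations = defaultdict(set)
    for tid, (_, head, deprel) in tokens.items():
        if deprel not in MWE_RELATIONS:
            continue
        top = head
        # Climb while the head is itself an MWE dependent (bounded to avoid cycles).
        for _ in range(len(tokens)):
            if top not in tokens or tokens[top][2] not in MWE_RELATIONS:
                break
            top = tokens[top][1]
        if top not in tokens:
            continue  # head is the artificial root (0) or missing
        members[top].update({tid, top})
        relations[top].add(deprel)
    return [
        {
            "text": " ".join(tokens[t][0] for t in sorted(members[top])),
            "relations": sorted(relations[top]),
        }
        for top in sorted(members)
    ]
```

Tokens are joined in surface order, so a discontinuous phrasal verb such as "pick it up" would be rendered as "pick up".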

USAS-based MWEs (TXT → JSON)

This extracts only MWEs detected by PyMUSAS from each input .txt.

Per-file JSON (default):

python mwe-extraction/pymusas_extract_mwes_txt_folder.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/usas_mwes_json/Introduction \
  --use_gpu

Single aggregated JSON for the folder:

python mwe-extraction/pymusas_extract_mwes_txt_folder.py \
  --input_dir /path/to/output_imrad_txt/Introduction \
  --output_dir /path/to/usas_mwes_json/Introduction \
  --aggregate \
  --agg_name all_usas_mwes_introduction.json \
  --use_gpu
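
The PyMUSAS documentation describes a per-token `pymusas_mwe_indexes` extension holding `(start, end)` token offsets, where a span longer than one token signals an MWE. Assuming those offsets have already been read off a tagged document (one pair per token, exclusive end), the MWE-only filtering step could look like the sketch below. `collect_mwe_spans` and the output fields are hypothetical and are not taken from pymusas_extract_mwes_txt_folder.py.

```python
def collect_mwe_spans(tokens, mwe_indexes):
    """Return each distinct multi-token span once, in order of first appearance.

    tokens: list of token strings for one document.
    mwe_indexes: one (start, end) pair per token, end exclusive (assumed layout).
    """
    seen = set()
    mwes = []
    for start, end in mwe_indexes:
        # Single-token spans are ordinary words, not MWEs.
        if end - start > 1 and (start, end) not in seen:
            seen.add((start, end))
            mwes.append(
                {"start": start, "end": end, "text": " ".join(tokens[start:end])}
            )
    return mwes
```

Deduplicating on the span is needed because every token inside an MWE carries the same offsets, so a naive loop would report the expression once per token.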

Dispersion analysis

Dispersion can be computed once MWEs are extracted. The paper reports:

  • Document Frequency (DF) and DF%
  • Gries’ DP, quantifying deviation from an equal-share baseline:
$$DP = \frac{1}{2}\sum_{i=1}^{N}\left|p_i - s_i\right|$$

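where $p_i$ is the observed proportion of an MWE’s occurrences in document $i$, $s_i$ is the expected proportion under the baseline (operationalized as the document’s share of tokens in the section), and $N$ is the number of documents in the section. DP is 0 when occurrences are distributed exactly in proportion to document size and approaches 1 as they concentrate in a small part of the section.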
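
A minimal sketch of these measures for a single MWE within one section, assuming per-document occurrence counts and per-document token counts are already available as parallel lists. The function names and input layout are illustrative assumptions, not the code used for the paper.

```python
def document_frequency(mwe_counts):
    """Return (DF, DF%) given one MWE's occurrence count in each document."""
    if not mwe_counts:
        raise ValueError("at least one document is required")
    df = sum(1 for count in mwe_counts if count > 0)
    return df, 100.0 * df / len(mwe_counts)


def gries_dp(mwe_counts, doc_sizes):
    """Compute DP = 0.5 * sum(|p_i - s_i|) over documents.

    mwe_counts: occurrences of the MWE in each document.
    doc_sizes: token counts of the same documents, in the same order.
    """
    if len(mwe_counts) != len(doc_sizes):
        raise ValueError("mwe_counts and doc_sizes must have the same length")
    total_hits = sum(mwe_counts)
    total_tokens = sum(doc_sizes)
    if total_hits == 0 or total_tokens == 0:
        raise ValueError("DP is undefined without occurrences or tokens")
    return 0.5 * sum(
        abs(count / total_hits - size / total_tokens)  # |p_i - s_i|
        for count, size in zip(mwe_counts, doc_sizes)
    )
```

For four documents of equal size, counts of `[1, 1, 1, 1]` give DP = 0 (perfectly even), while `[4, 0, 0, 0]` gives DP = 0.75, since all occurrences fall in a document that holds only a quarter of the tokens.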


How to cite this work

@inproceedings{giraud-gargett-2026-beyond,
    title = "Beyond Single Words: {MWE} Identification in Bioinformatics Research Articles and Dispersion Profiling Across {IMR}a{D}",
    author = "Giraud, Jurgi  and
      Gargett, Andrew",
    editor = {Ojha, Atul Kr.  and
      Mititelu, Verginica Barbu  and
      Constant, Mathieu  and
      Stoyanova, Ivelina  and
      Do{\u{g}}ru{\"o}z, A. Seza  and
      Rademaker, Alexandre},
    booktitle = "Proceedings of the 22nd Workshop on Multiword Expressions ({MWE} 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.mwe-1.10/",
    doi = "10.18653/v1/2026.mwe-1.10",
    pages = "86--95",
    ISBN = "979-8-89176-363-0",
    abstract = "Multiword Expressions (MWEs) are pervasive in scientific writing, and in specialized domains they include both multiword terminology (e.g., noun compounds) and recurrent academic phrasing. This study profiles MWEs in a large corpus of bioinformatics research articles segmented by IMRaD sections. Building on recent multi-method approaches to scientific MWE identification, we extract MWEs using complementary automated strategies (semantic matching, dependency parsing, controlled vocabularies, and academic formula lists) and compare the resulting inventories by size, form, and IMRaD section distribution. We further quantify cross-document dispersion using document frequency and Gries' DP to distinguish widely reused expressions from items concentrated in a small subset of articles. Results show that bioinformatics MWEs are predominantly short and nominal, but that extraction methods differ in the extent to which they recover discourse and reporting phraseology. Dispersion is strongly long-tailed across sections with most MWEs being document-specific, while a smaller recurrent core aligns with section function and is enriched for conventional templates and standardized multiword terms. Overall, the findings argue for combining complementary identification methods with dispersion profiling to characterize domain ``multiwordness'' in a principled and section-sensitive way."
}
