How research themes evolve across long-running software-engineering conferences. Built on DBLP metadata enriched with OpenAlex abstracts and clustered with BERTopic.
Live results: https://deen1113.github.io/TopicDrift/
Python ≥ 3.11 (set in pyproject.toml). For an exact, pinned environment:
python -m pip install -r requirements.txt # pinned deps (recommended to reproduce)
python -m pip install -e . # then install this package
make help # full target list

Or run make install for the loose pins from pyproject.toml. A uv.lock is also committed if you use uv.
Pick one depending on what you want.
- Workflow A — slice one or more venues to a human-readable preview CSV per venue.
- Workflow B — fit the global topic model and render the per-scope theme figures.
Both workflows read from the same on-disk caches. Build them once:
make dump # download + parse DBLP XML → dblp_conf.parquet (~30 min, ~1 GB)
make scan # dump + OpenAlex abstract scan + pooled corpus (slow, resumable)

scan depends on dump, so a fresh checkout can go straight to make scan if you want both. Workflow A only needs dump; Workflow B needs scan.
No data is committed to the repo (data/ is gitignored), and no prebuilt snapshot is published — so reproduction is a full rebuild from source via the commands above. There is no fast path to skip it.
Snapshot: once the pipeline has run, make zenodo packages the processed data + figures into a shareable archive (dist/topicdrift-dataset.zip).
Why scan is slow: it covers all of DBLP, and OpenAlex's free tier enforces a daily spend budget (~$1, resets midnight UTC), so a large scan can get throttled and may need more than one run to finish. It is fully resumable — just re-run make scan until it completes. The scan uses OpenAlex's polite pool via a mailto set in topicdrift/utils/http.py; change it to your own email before running.
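The resumable behaviour described above can be pictured with a small sketch. It is not TopicDrift's actual implementation: `resumable_scan`, `BudgetExhausted`, and the one-JSON-file-per-record cache layout are hypothetical names and choices made for illustration. The idea is only that finished records are skipped on re-run and that a budget refusal stops fetching without losing progress.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


class BudgetExhausted(Exception):
    """Hypothetical signal that the API refused further requests for today."""


def resumable_scan(
    ids: Iterable[str],
    fetch: Callable[[str], dict],
    cache_dir: Path,
) -> list[str]:
    """Fetch each id at most once, caching to disk; return the ids still pending."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    pending: list[str] = []
    exhausted = False
    for work_id in ids:
        target = cache_dir / f"{work_id}.json"
        if target.exists():
            continue  # already fetched in an earlier run
        if exhausted:
            pending.append(work_id)  # budget is gone, leave it for the next run
            continue
        try:
            record = fetch(work_id)
        except BudgetExhausted:
            exhausted = True
            pending.append(work_id)
            continue
        # Write to a temp file and rename, so an interrupted run never
        # leaves a half-written cache entry behind.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(record), encoding="utf-8")
        tmp.replace(target)
    return pending
```

Calling the function again with the same `cache_dir` only fetches what the previous call returned as pending, which mirrors re-running make scan until it completes.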
Slice the DBLP dump down to one or more venues, enrich them with OpenAlex, and write a human-readable CSV per venue. Requires make dump.
make venue # default: VENUE=icse
make venue VENUE="icse ase issta msr" # any space-separated DBLP keys
make venue-deep VENUE=icse # adds slow title-pass + ACM DL scrape (needs ACM auth — see Data sources)
make venue INCLUDE_COMPANION=1 # keep companion volumes + workshops

A venue-agnostic main-track filter runs by default: it infers each venue's canonical acronym from the DBLP booktitle distribution, then keeps rows whose booktitle starts with that acronym (including split volumes like ICSE (2) and co-located research tracks like ICSE-SEIP, ICSE (NIER)). Companion volumes, workshop summaries, and *@ICSE-style satellites are dropped. The same rule works for renamed venues; for example, conf/kbse correctly resolves to ASE and keeps both eras.
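A minimal sketch of how such a main-track filter might work is shown here. It is an illustration under stated assumptions, not the project's code: `infer_acronym`, `is_main_track`, and the `companion|workshop` drop pattern are hypothetical, and the real rule may use different markers.

```python
import re
from collections import Counter

# Assumed markers of non-main-track volumes; the real filter may differ.
_DROP = re.compile(r"companion|workshop", re.IGNORECASE)


def infer_acronym(booktitles: list[str]) -> str:
    """Return the most common leading alphabetic token, e.g. 'ICSE'."""
    heads: Counter[str] = Counter()
    for title in booktitles:
        match = re.match(r"[A-Za-z]+", title.strip())
        if match:
            heads[match.group(0).upper()] += 1
    if not heads:
        raise ValueError("no booktitle starts with a letter")
    return heads.most_common(1)[0][0]


def is_main_track(booktitle: str, acronym: str) -> bool:
    """Keep titles that start with the acronym and are not satellites."""
    title = booktitle.strip()
    if "@" in title or _DROP.search(title):
        return False
    # The acronym must be a whole leading token:
    # 'ICSE (2)' and 'ICSE-SEIP' pass, 'ICSEA' does not.
    pattern = rf"{re.escape(acronym)}(?![A-Za-z])"
    return re.match(pattern, title, re.IGNORECASE) is not None
```

Because the acronym comes from the booktitles rather than from the DBLP key, a key such as conf/kbse whose booktitles read ASE would resolve to ASE in this sketch as well.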
Outputs: outputs/tables/<venue>_papers_preview.csv per venue, plus silver-layer parquet at data/interim/<venue>_enriched.parquet.
One global topic space is fit on a stratified sample across every qualifying DBLP conference. The website renders three scopes: a single venue (ICSE), a curated top 10, and the full set. Each scope is a venue filter over that same model. Scope membership lives in config/venues.yaml; edit the lists there to change what each scope shows. Requires make scan.
make corpus # stratified fit sample (fast, no API)
make topics # fit BERTopic + label topics with a local LLM (first run downloads ~6 GB from HuggingFace)
make groups # writes config/topic_groups.conf.proposed.yaml for review (the locked .yaml ships — no edit needed to reproduce)
make apply # stamps the locked grouping into the tables
make figures # writes one HTML per scope into outputs/figures/
make site # also copies the HTML into docs/visualizations/
# or just:
make analysis # corpus → topics → groups → apply → site

Per-scope outputs land at outputs/figures/topic_group_streamgraph_{icse,top10,all}.html and topic_treemap_{icse,top10,all}.html.
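To make the idea of a stratified fit sample concrete, here is a rough sketch. It assumes strata are (venue, year) pairs with a fixed cap per stratum; the README does not say how make corpus actually stratifies, so `stratified_sample`, `per_stratum`, and the dict keys are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(papers: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    """Keep at most `per_stratum` papers for each (venue, year) pair."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for paper in papers:
        strata[(paper["venue"], paper["year"])].append(paper)

    sample: list[dict] = []
    for key in sorted(strata):  # sorted so the output order is deterministic
        group = strata[key]
        if len(group) > per_stratum:
            group = rng.sample(group, per_stratum)
        sample.extend(group)
    return sample
```

Capping each stratum keeps very large venues from dominating the topic space, which is the usual motivation for sampling this way before fitting one model over many venues.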
make status # lists which pipeline artifacts exist on disk
make help # full target list with one-line descriptions
make clean-data # wipe data/ and outputs/ (asks first; loses cached API responses)

topicdrift/ingest/ DBLP fetch, OpenAlex enrichment, ACM DL recovery
topicdrift/analysis/ corpus selection, topic fitting, theme mapping
topicdrift/visualization/ per-scope streamgraph + treemap figures
topicdrift/topic_model.py shared BERTopic wrapper (embed → UMAP → HDBSCAN → label)
config/ venues.yaml (scopes), topic_groups.conf.yaml, stopwords.txt
data/raw/ cached API responses
data/interim/, data/processed/ per-venue and pooled parquet tables
outputs/figures/, outputs/tables/ HTML figures and human-readable previews
docs/ static site served from this repo
data/interim/<venue>_enriched.parquet:
| Column | Type | Source |
|---|---|---|
| dblp_id, dblp_key | str | DBLP |
| title, year, doi, authors, url, ee | mixed | DBLP |
| has_doi, venue | bool, str | derived |
| abstract, has_abstract, text | str | OpenAlex (text = normalised title + abstract) |
| oa_concepts | list[str] | OpenAlex concepts (score ≥ 0.3) |
| citation_count, openalex_id, oa_type | mixed | OpenAlex |

| Source | What we use |
|---|---|
| DBLP | titles, authors, years, DOIs |
| OpenAlex | abstracts, concept tags, citation counts |
ACM Digital Library has no open bulk-metadata API. OpenAlex indexes ACM content and is the practical substitute; venue-deep adds an optional ACM DL scrape pass for hard-to-recover abstracts (cookie auth required).