Link WordNet to MathGloss to make a mathematical WordNet
Investigate the overlap between OEWN and MathGloss.
uv venv
uv pip install wn pyyaml
.venv/bin/python scripts/task1_overlap.py

WordNet data (oewn:2024 and omw-en:2.0) is downloaded once into build/wordnet_data/
and reused on subsequent runs.
Domains covered: mathematics, algebra, arithmetic, geometry, statistics, matrix algebra, logic. Each domain is identified by one or more ILI identifiers that resolve to a synset in OEWN.
1a — domain_topic links (OEWN built-in)
Synsets are counted as belonging to a domain if they carry a domain_topic relation
pointing to the domain's synset (e.g. the (mathematics) synset 06009822-n,
ILI i68341). This is the explicit, hand-curated annotation in OEWN.
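A minimal sketch of the membership test described above, using plain dictionaries instead of the wn API. Only the mathematics synset ID 06009822-n comes from the text; the other IDs and the shape of the `DOMAIN_TOPIC` map are invented for illustration, and the actual script may be organised differently.

```python
# Toy data: synset ID -> IDs of the synsets it points to via domain_topic.
# Only 06009822-n (mathematics) is taken from the text; the rest is hypothetical.
DOMAIN_TOPIC = {
    "toy-0001-a": ["06009822-n"],
    "toy-0002-n": ["06009822-n", "toy-9999-n"],
    "toy-0003-n": [],
}

# Domain name -> set of synset IDs that identify the domain.
DOMAINS = {"mathematics": {"06009822-n"}}


def domains_of(synset_id, domain_topic=DOMAIN_TOPIC, domains=DOMAINS):
    """Return the sorted names of domains this synset links to via domain_topic."""
    targets = set(domain_topic.get(synset_id, []))
    return sorted(name for name, ids in domains.items() if targets & ids)


def in_domain_synsets(domain_topic=DOMAIN_TOPIC, domains=DOMAINS):
    """Union over all domains: every synset with at least one matching link."""
    return {s for s in domain_topic if domains_of(s, domain_topic, domains)}
```

A domain with several identifying synsets is handled by putting all of them in the same set, so a link to any one of them counts.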
1b — WordNet Domains (xwnd)
The external/xwnd-30g/ directory contains per-domain .ppv files (Personalised
PageRank Vectors over WN 3.0 synsets). Because OEWN uses WN 3.1 offsets, synsets
are mapped via the Interlingual Index (ILI): each OEWN synset's ILI is looked up in
omw-en:2.0 (which uses WN 3.0 offsets) to obtain the WN 3.0 ID, which is then
checked against the .ppv file. A score threshold of XWND_THRESHOLD = 0.0001
is applied; the .ppv files cover the full WN 3.0 vocabulary so a threshold is
needed to select the most domain-relevant synsets. Domains without a .ppv file
(algebra, arithmetic, matrix algebra, logic) are only measured via method 1a.
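The ILI-based lookup can be sketched as follows. This is an illustration under assumptions: the .ppv line layout (`<wn30-id> <score>`), the inclusive comparison against the threshold, and all IDs in the example are hypothetical; only the value of XWND_THRESHOLD is taken from the text.

```python
XWND_THRESHOLD = 0.0001  # value stated in the text


def parse_ppv(lines):
    """Parse lines assumed to look like '<wn30-id> <score>' (layout is a guess)."""
    scores = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        scores[parts[0]] = float(parts[1])
    return scores


def xwnd_in_domain(oewn_ili, ili_to_wn30, ppv_scores, threshold=XWND_THRESHOLD):
    """Map OEWN synset -> ILI -> WN 3.0 ID and keep synsets scoring >= threshold."""
    selected = set()
    for synset_id, ili in oewn_ili.items():
        wn30_id = ili_to_wn30.get(ili)
        if wn30_id is None:
            continue  # no WN 3.0 counterpart for this ILI
        if ppv_scores.get(wn30_id, 0.0) >= threshold:
            selected.add(synset_id)
    return selected


# Hypothetical example data.
ppv = parse_ppv(["00001740-n\t0.0005", "00002137-n\t0.00001"])
oewn_ili = {"oewn-a": "i1", "oewn-b": "i2", "oewn-c": "i3"}
ili_to_wn30 = {"i1": "00001740-n", "i2": "00002137-n"}
selected = xwnd_in_domain(oewn_ili, ili_to_wn30, ppv)
```

Here `oewn-b` falls below the threshold and `oewn-c` has no WN 3.0 counterpart, so only `oewn-a` is selected.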
2 — MathGloss overlap
external/MathGloss/data/database.csv provides ~4,800 mathematical concepts with
Wikidata IDs and names drawn from several sources (BCT, Chicago, Clowder, Context,
Mathlib, nLab, PlanetMath). Overlap with OEWN is measured two ways:
- by lemma: any OEWN lemma in a math-domain synset matches a MathGloss name (case-insensitive).
- by Wikidata ID: the Wikidata QID stored in the OEWN YAML matches the QID in
MathGloss. Note that Wikidata IDs are read directly from
external/english-wordnet/src/yaml/ because the wn Python library does not expose them.
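The two matching rules can be combined into a single classification per synset. The sketch below is illustrative: the function name and argument shapes are assumptions, and the return values mirror the lemma / wikidata / both / empty labels used in the output files.

```python
def match_type(lemmas, wn_qids, mg_names, mg_qids):
    """Return 'lemma', 'wikidata', 'both', or '' for one OEWN synset."""
    # Case-insensitive lemma comparison against MathGloss names.
    folded_names = {name.casefold() for name in mg_names}
    by_lemma = any(lemma.casefold() in folded_names for lemma in lemmas)
    # QID comparison: any QID from the OEWN YAML also present in MathGloss.
    by_qid = bool(set(wn_qids) & set(mg_qids))
    if by_lemma and by_qid:
        return "both"
    if by_lemma:
        return "lemma"
    if by_qid:
        return "wikidata"
    return ""


# Toy MathGloss data; Q1057968 (finite group) is taken from the examples in this README.
MG_NAMES = {"Idempotent", "finite group"}
MG_QIDS = {"Q1057968"}
```

For example, `match_type(["idempotent"], [], MG_NAMES, MG_QIDS)` yields `"lemma"` because the names agree after case folding and no QID is available on the OEWN side.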
| Measure | Count |
|---|---|
| OEWN total synsets | 120,630 |
| In math domain — domain_topic (union of all domains) | 331 |
| In math domain — xwnd (union of domains with .ppv) | 2,192 |
| In either method | 2,245 |
| MathGloss entries (unique lemmas) | 7,502 |
| OEWN in-domain synsets matching MathGloss by lemma | 2,569 (34.2%) |
| OEWN in-domain synsets matching MathGloss by Wikidata ID | 40 (0.8%) |
| In-domain synsets NOT in MathGloss | 1,772 (78.9%) |
Wikidata ID coverage across the 2,245 in-domain synsets:
| | Count |
|---|---|
| WN Wikidata ID only | 96 |
| MathGloss Wikidata ID only | 439 |
| Both — same ID | 27 |
| Both — different ID | 7 |
| Neither | 1,676 |
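The five coverage categories partition the in-domain synsets. A sketch of the classification, assuming that "same ID" means the MathGloss QID is among the QIDs found in the OEWN YAML (the function and key names are hypothetical):

```python
from collections import Counter

# Counts reported in the table above.
REPORTED = {
    "wn_only": 96,
    "mg_only": 439,
    "both_same": 27,
    "both_different": 7,
    "neither": 1676,
}


def coverage_category(wn_qids, mg_qid):
    """Classify one synset by which resource supplies a Wikidata QID.

    wn_qids: set of QIDs from the OEWN YAML (possibly empty).
    mg_qid: QID matched from MathGloss, or '' if none.
    Assumption: 'same' means mg_qid is among the OEWN QIDs.
    """
    if wn_qids and mg_qid:
        return "both_same" if mg_qid in wn_qids else "both_different"
    if wn_qids:
        return "wn_only"
    if mg_qid:
        return "mg_only"
    return "neither"


def coverage_counts(rows):
    """Count categories over an iterable of (wn_qids, mg_qid) pairs."""
    return Counter(coverage_category(wn, mg) for wn, mg in rows)


# The categories are mutually exclusive, so they sum to the 2,245 in-domain synsets.
assert sum(REPORTED.values()) == 2245
```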
In both OEWN and MathGloss (matched by lemma, with source(s)):
| OEWN lemma(s) | OEWN domain | MathGloss sources |
|---|---|---|
| idempotent | mathematics | chicago, context, nlab, planetmath |
| commute / transpose | mathematics | bct, chicago, context, nlab, planetmath |
| commutative | mathematics | chicago, nlab, planetmath |
| inverse | mathematics | bct, nlab |
| transposition | algebra | chicago, planetmath |
In OEWN math domains but NOT in MathGloss (WordNet has it, MathGloss does not):
| OEWN lemma(s) | OEWN domain |
|---|---|
| rounding / rounding error | mathematics |
| truncation error | mathematics |
| sampling | statistics |
| bimodal | statistics |
| combinatorial | mathematics |
In MathGloss but NOT in OEWN math domains (MathGloss has it, WordNet does not):
| MathGloss label | Wikidata |
|---|---|
| algebra over a field | Q1000660 |
| discrete Fourier transform | Q1006032 |
| algebraically closed field | Q1047547 |
| topological group | Q1046291 |
| finite group | Q1057968 |
Results are written to build/ on each run.
build/task1_results.txt — human-readable summary of all counts.
build/task1_synsets.tsv — one row per OEWN synset that appears in at least
one math domain (via either method). Columns:
| Column | Description |
|---|---|
| lemma | OEWN lemmas for the synset, \|-separated |
| domain | Domain names from domain_topic links, \|-separated (empty if none) |
| xwndomain | Domain names from xwnd .ppv files, \|-separated (empty if none) |
| ili | Interlingual Index identifier for the synset |
| wn-wd | Wikidata QID(s) from the OEWN YAML, \|-separated (empty if none) |
| mg-wd | Wikidata QID from MathGloss matched to this synset (empty if none) |
| IN-mathgloss | lemma, wikidata, both, or empty: how the synset matches MathGloss |
| sources | MathGloss source(s) where the match was found, \|-separated |
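For downstream use, the TSV can be loaded with the standard library. The sketch below assumes the file starts with a header row carrying the column names listed above; the sample row uses a placeholder ILI and is not real output.

```python
import csv
import io

# Columns documented as |-separated lists.
MULTI_VALUE_COLUMNS = ("lemma", "domain", "xwndomain", "wn-wd", "sources")


def read_synsets(handle):
    """Read task1_synsets.tsv rows, splitting |-separated columns into lists."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in MULTI_VALUE_COLUMNS:
            value = row.get(col) or ""
            row[col] = value.split("|") if value else []
        rows.append(row)
    return rows


# Hypothetical sample; "iTOY" is a placeholder, not a real ILI.
SAMPLE = (
    "lemma\tdomain\txwndomain\tili\twn-wd\tmg-wd\tIN-mathgloss\tsources\n"
    "commute|transpose\tmathematics\t\tiTOY\t\t\tlemma\tbct|chicago\n"
)
rows = read_synsets(io.StringIO(SAMPLE))
```

With a real run, replace the `io.StringIO` object with `open("build/task1_synsets.tsv", newline="", encoding="utf-8")`.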
Measure the semantic similarity of OEWN definitions against MathGloss concept descriptions using sentence embeddings, to validate Task 1 matches and discover new alignments.
uv pip install ollama numpy
# fetch Wikidata descriptions for all MathGloss QIDs (once, ~2 min)
.venv/bin/python scripts/fetch_wikidata_descriptions.py
# run embedding analysis
.venv/bin/python scripts/task2_embeddings.py --model embeddinggemma
.venv/bin/python scripts/task2_embeddings.py --model qwen3-embedding --reuse-embeddings

Embedding models are served via Ollama. Pull the models first:
ollama pull embeddinggemma
ollama pull qwen3-embedding

Embeddings are cached as .npy files in build/; use --reuse-embeddings to skip
re-encoding on subsequent runs with the same model.
Definition sources:
- OEWN: ss.definition() from the wn library, a short English gloss per synset.
- MathGloss: Wikidata descriptions (e.g. "polygon with six sides") fetched via the
  Wikidata API and cached in build/wikidata_descriptions.json. Falls back to the
  Wikidata label when the description is missing (~9% of QIDs).
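The label fallback can be sketched as below. The cache layout (a QID mapped to an object with `label` and `description` keys) is an assumption about build/wikidata_descriptions.json, and the QIDs in the example are placeholders.

```python
def text_for_embedding(entry):
    """Prefer the Wikidata description; fall back to the label if it is missing."""
    description = (entry.get("description") or "").strip()
    if description:
        return description
    return (entry.get("label") or "").strip()


# Hypothetical cache contents; the QIDs are placeholders.
CACHE = {
    "Q_TOY_1": {"label": "hexagon", "description": "polygon with six sides"},
    "Q_TOY_2": {"label": "finite group", "description": ""},
    "Q_TOY_3": {"label": "topological group", "description": None},
}
texts = {qid: text_for_embedding(entry) for qid, entry in CACHE.items()}
```

Treating empty strings and `None` alike keeps the fallback robust to either representation of a missing description.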
Two analyses:
Pairwise scoring — for each of the 473 OEWN synsets matched to a MathGloss QID in Task 1, compute the cosine similarity of their respective embeddings. Low scores reveal false positive lemma matches (e.g. the genetics synset for gene/cistron/factor matching MathGloss's arithmetic factor).
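For reference, the pairwise score is the standard cosine similarity between two embedding vectors. A dependency-free sketch follows; the actual script presumably uses numpy, and returning 0.0 for a zero vector is a choice made here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate inputs
    return dot / (norm_u * norm_v)
```

Identical directions score 1 and orthogonal vectors score 0, which is why a low score between an OEWN gloss and a Wikidata description points to two different senses of the same lemma.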
Cross-corpus retrieval — embed all 2,245 in-domain OEWN synsets and all 4,814 MathGloss concepts, then find the top-k nearest MathGloss concepts for every OEWN synset. Recall@k measures how often the known Task 1 match appears in the top k retrieved results.
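Retrieval and Recall@k can be sketched with toy two-dimensional vectors. This assumes the recall denominator is the set of synsets that have a known Task 1 match; all identifiers and vectors are invented.

```python
import math


def _cos(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k(query, candidates, k):
    """Return the k candidate IDs with the highest cosine similarity to query."""
    ranked = sorted(candidates, key=lambda qid: _cos(query, candidates[qid]), reverse=True)
    return ranked[:k]


def recall_at_k(queries, candidates, gold, k):
    """Fraction of synsets with a known match whose match is in the top k."""
    scored = [s for s in queries if s in gold]
    if not scored:
        return 0.0
    hits = sum(gold[s] in top_k(queries[s], candidates, k) for s in scored)
    return hits / len(scored)


# Toy embeddings: MathGloss concepts (candidates) and OEWN synsets (queries).
candidates = {"Q_A": [1.0, 0.0], "Q_B": [0.0, 1.0], "Q_C": [0.7, 0.7]}
queries = {"s1": [0.9, 0.1], "s2": [0.6, 0.8]}
gold = {"s1": "Q_A", "s2": "Q_B"}  # known Task 1 matches
```

In this toy setup the known match for `s1` is ranked first, while the match for `s2` is ranked second behind `Q_C`, so recall rises from 0.5 at k=1 to 1.0 at k=2.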
Use --sim-threshold T to flag pairs with cosine similarity below T as likely false matches
in the output TSV. Use --wikidata-threshold to set the threshold automatically as the
minimum similarity among Wikidata-confirmed pairs (those where both resources agree on the
QID) — this guarantees all confirmed pairs pass while filtering out the worst false positives.
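A sketch of the automatic threshold. It assumes that Wikidata-confirmed pairs are those whose match type is wikidata or both, and uses the strict comparison documented for the likely_false_match column; the rows marked toy are invented, while the two named similarities are quoted from the results in this README.

```python
CONFIRMED_TYPES = {"wikidata", "both"}  # assumption about what counts as confirmed


def wikidata_threshold(pairs):
    """Minimum cosine similarity among Wikidata-confirmed pairs."""
    confirmed = [p["cosine_sim"] for p in pairs if p["in_mathgloss"] in CONFIRMED_TYPES]
    if not confirmed:
        raise ValueError("no Wikidata-confirmed pairs to derive a threshold from")
    return min(confirmed)


def flag_false_matches(pairs, threshold):
    """Return copies of pairs with likely_false_match set when sim < threshold."""
    return [dict(p, likely_false_match=p["cosine_sim"] < threshold) for p in pairs]


pairs = [
    {"oewn_lemma": "infinitesimal", "in_mathgloss": "both", "cosine_sim": 0.534},
    {"oewn_lemma": "toy-confirmed", "in_mathgloss": "both", "cosine_sim": 0.80},
    {"oewn_lemma": "gene/cistron/factor", "in_mathgloss": "lemma", "cosine_sim": 0.321},
    {"oewn_lemma": "toy-lemma-only", "in_mathgloss": "lemma", "cosine_sim": 0.70},
]
threshold = wikidata_threshold(pairs)
flagged = flag_false_matches(pairs, threshold)
```

Because the threshold equals the lowest confirmed score and the comparison is strict, every confirmed pair passes by construction.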
| Metric | embeddinggemma | qwen3-embedding |
|---|---|---|
| Mean cosine similarity | 0.521 | 0.637 |
| Median | 0.504 | 0.624 |
| Min | 0.177 | 0.310 |
| Pairs ≥ 0.8 | 35 (7%) | 82 (17%) |
| Mean sim — "both" matches | 0.649 | 0.765 |
| Mean sim — lemma-only matches | 0.513 | 0.629 |
| Recall@1 | 19.2% | 27.3% |
| Recall@3 | 29.4% | 36.2% |
| Recall@5 | 32.3% | 40.4% |
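Pairs matched by both lemma and Wikidata ID score markedly higher than lemma-only matches (0.765 vs 0.629 with qwen3-embedding), which suggests that cosine similarity separates reliable matches from doubtful ones. A Recall@5 of 40.4% (qwen3-embedding) means that embeddings alone recover about 2 in 5 known matches without any lexical signal.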
Wikidata-threshold filtering (--wikidata-threshold, qwen3-embedding):
The minimum similarity among the 27 Wikidata-confirmed pairs is 0.534 (the infinitesimal pair, where WN defines it as "a variable with zero as its limit" and MathGloss as "a nonzero positive number smaller than any positive real number"). Using this as a cut-off:
| | Count |
|---|---|
| Wikidata-confirmed passing | 27/27 (100%) |
| Lemma-only passing | 300/446 (67%) |
| Total accepted links | 327/473 (69%) |
| Flagged as likely false matches | 146 (31%) |
The 146 rejected pairs include cases like gene/factor (WN: genetics; MG: arithmetic, sim=0.32), crystal (WN: mineral; MG: fibered categories, sim=0.33), and triangulation (WN: surveying; MG: chess tactic, sim=0.33).
Highest-similarity pairs (qwen3-embedding):
| OEWN lemma(s) | OEWN definition | MathGloss label | Sim |
|---|---|---|---|
| square matrix | a matrix with the same number of rows and columns | square matrix | 0.977 |
| equation | a mathematical statement that two expressions are equal | equation | 0.962 |
| hexagon | a six-sided polygon | hexagon | 0.957 |
| electron | an elementary particle with negative charge | electron | 0.953 |
| irrational number | a real number that cannot be expressed as a rational number | irrational number | 0.930 |
Lowest-similarity pairs — likely false positives (qwen3-embedding):
| OEWN lemma(s) | OEWN definition | MathGloss label | Sim |
|---|---|---|---|
| gene/cistron/factor | (genetics) a segment of DNA… | factor (multiplication operand) | 0.321 |
| crystal | a rock formed by solidification… | crystal (Cartesian sections of fibered categories) | 0.326 |
| triangulation | a method of surveying… | triangulation (chess tactic) | 0.327 |
build/wikidata_descriptions.json — cached Wikidata label and description per QID.
build/task2_pair_scores_{model}.tsv — one row per matched (OEWN synset, MathGloss QID) pair.
| Column | Description |
|---|---|
| ili | OEWN ILI |
| oewn_lemma | OEWN synset lemmas, \|-separated |
| oewn_def | OEWN definition text |
| mg_qid | MathGloss Wikidata QID |
| mg_label | Wikidata label for the QID |
| mg_def | Wikidata description used for embedding |
| cosine_sim | Cosine similarity (0–1) |
| in_mathgloss | Match type from Task 1: lemma, wikidata, or both |
| likely_false_match | True if sim < --sim-threshold (column absent if flag not set) |
| model | Ollama model name |
build/task2_topk_retrieval_{model}.tsv — top-k MathGloss candidates for every
in-domain OEWN synset.
| Column | Description |
|---|---|
| ili | OEWN ILI |
| oewn_lemma | OEWN synset lemmas |
| oewn_def | OEWN definition |
| rank | 1-indexed retrieval rank |
| mg_qid | Retrieved MathGloss QID |
| mg_label | Wikidata label |
| mg_def | Wikidata description |
| cosine_sim | Similarity score |
| is_task1_match | True if this QID matches the Task 1 mg-wd for this synset |
| model | Ollama model name |
build/task2_results_{model}.txt — human-readable summary.