bond-lab/wordnet_math

wordnet_math

Link wordnet to MathGloss to make a mathematical wordnet.

Task One

Measure the overlap between OEWN (Open English WordNet) and MathGloss.

Running

uv venv
uv pip install wn pyyaml
.venv/bin/python scripts/task1_overlap.py

Wordnet data (oewn:2024 and omw-en:2.0) is downloaded once into build/wordnet_data/ and reused on subsequent runs.

Method

Domains covered: mathematics, algebra, arithmetic, geometry, statistics, matrix algebra, logic. Each domain is identified by one or more ILI identifiers that resolve to a synset in OEWN.

1a: domain_topic links (OEWN built-in)

Synsets are counted as belonging to a domain if they carry a domain_topic relation pointing to the domain's synset (e.g. the (mathematics) synset 06009822-n, ILI i68341). This is the explicit, hand-curated annotation in OEWN.

1b: WordNet Domains (xwnd)

The external/xwnd-30g/ directory contains per-domain .ppv files (Personalised PageRank Vectors over WN 3.0 synsets). Because OEWN uses WN 3.1 offsets, synsets are mapped via the Interlingual Index (ILI): each OEWN synset's ILI is looked up in omw-en:2.0 (which uses WN 3.0 offsets) to obtain the WN 3.0 ID, which is then checked against the .ppv file. A score threshold of XWND_THRESHOLD = 0.0001 is applied; the .ppv files cover the full WN 3.0 vocabulary so a threshold is needed to select the most domain-relevant synsets. Domains without a .ppv file (algebra, arithmetic, matrix algebra, logic) are only measured via method 1a.
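The ILI lookup and score threshold described above can be sketched as follows. This is a minimal illustration, not the code in scripts/task1_overlap.py: the .ppv line layout (synset ID, whitespace, score), the inclusive comparison against the threshold, and the helper names parse_ppv and oewn_in_domain are all assumptions, and the WN 3.0 IDs are placeholders.

```python
XWND_THRESHOLD = 0.0001  # value quoted in the method description


def parse_ppv(lines, threshold=XWND_THRESHOLD):
    """Return {wn30_id: score} for entries scoring at or above the threshold.

    Assumed line format: "<wn30-synset-id><whitespace><score>".
    """
    selected = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            score = float(parts[1])
        except ValueError:
            continue  # skip lines whose second field is not a number
        if score >= threshold:
            selected[parts[0]] = score
    return selected


def oewn_in_domain(oewn_to_ili, ili_to_wn30, domain_scores):
    """Map OEWN synsets to WN 3.0 IDs via ILI; keep those scored in the domain."""
    hits = {}
    for oewn_id, ili in oewn_to_ili.items():
        wn30_id = ili_to_wn30.get(ili)  # None when omw-en has no synset for this ILI
        if wn30_id in domain_scores:
            hits[oewn_id] = domain_scores[wn30_id]
    return hits


# Toy data: only the OEWN ID and ILI of the mathematics synset come from the text.
ppv_lines = ["00000001-n\t0.0025", "00000002-n\t0.00000003", ""]
scores = parse_ppv(ppv_lines)
in_domain = oewn_in_domain(
    {"06009822-n": "i68341", "99999999-n": "i0"},
    {"i68341": "00000001-n", "i0": "00000002-n"},
    scores,
)
print(in_domain)
```

Keeping the parsing and the ID mapping separate means the same threshold can be reused for every domain that ships a .ppv file.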

2: MathGloss overlap

external/MathGloss/data/database.csv provides ~4,800 mathematical concepts with Wikidata IDs and names drawn from several sources (BCT, Chicago, Clowder, Context, Mathlib, nLab, PlanetMath). Overlap with OEWN is measured two ways:

  • by lemma: any OEWN lemma in a math-domain synset matches a MathGloss name (case-insensitive).
  • by Wikidata ID: the Wikidata QID stored in the OEWN YAML matches the QID in MathGloss. Note that Wikidata IDs are read directly from external/english-wordnet/src/yaml/ because the wn Python library does not expose them.
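The two matching rules above can be combined into a single match label, using the values that appear in the IN-mathgloss output column (lemma, wikidata, both, or empty). The sketch below is an assumption about how such a classifier might look: build_index and match_type are hypothetical helpers, the synsets are toy inputs, and "both" here only requires that each rule fires, not that both hits point to the same MathGloss entry.

```python
def build_index(mathgloss_rows):
    """Build lookup sets from (name, qid) pairs; names are case-folded."""
    names = {name.casefold() for name, _ in mathgloss_rows}
    qids = {qid for _, qid in mathgloss_rows if qid}
    return names, qids


def match_type(lemmas, synset_qids, names, qids):
    """Classify one synset as 'lemma', 'wikidata', 'both' or '' (no match)."""
    lemma_hit = any(lemma.casefold() in names for lemma in lemmas)
    qid_hit = any(qid in qids for qid in synset_qids)
    if lemma_hit and qid_hit:
        return "both"
    if lemma_hit:
        return "lemma"
    if qid_hit:
        return "wikidata"
    return ""


# MathGloss labels and QIDs taken from the examples in this README.
MATHGLOSS = [("finite group", "Q1057968"), ("topological group", "Q1046291")]
names, qids = build_index(MATHGLOSS)

# Hypothetical synsets, used only to exercise each branch.
print(match_type(["Finite Group"], [], names, qids))            # lemma
print(match_type(["toy lemma"], ["Q1046291"], names, qids))     # wikidata
print(match_type(["finite group"], ["Q1057968"], names, qids))  # both
print(repr(match_type(["sampling"], [], names, qids)))          # ''
```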

Results (oewn:2024, xwnd threshold 0.0001)

| Measure | Count |
| --- | --- |
| OEWN total synsets | 120,630 |
| In math domain: domain_topic (union of all domains) | 331 |
| In math domain: xwnd (union of domains with .ppv) | 2,192 |
| In either method | 2,245 |
| MathGloss entries (unique lemmas) | 7,502 |
| OEWN in-domain synsets matching MathGloss by lemma | 2,569 (34.2%) |
| OEWN in-domain synsets matching MathGloss by Wikidata ID | 40 (0.8%) |
| In-domain synsets NOT in MathGloss | 1,772 (78.9%) |

Wikidata ID coverage across the 2,245 in-domain synsets:

| Wikidata ID present in | Count |
| --- | --- |
| WN Wikidata ID only | 96 |
| MathGloss Wikidata ID only | 439 |
| Both, same ID | 27 |
| Both, different ID | 7 |
| Neither | 1,676 |
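As a quick consistency check on the coverage counts above, the arithmetic can be reproduced in a few lines. The counts are copied from this README; reading the three rows that carry a MathGloss ID as "synsets with a MathGloss match" is an interpretation, not something the tables state outright.

```python
coverage = {
    "WN Wikidata ID only": 96,
    "MathGloss Wikidata ID only": 439,
    "Both, same ID": 27,
    "Both, different ID": 7,
    "Neither": 1676,
}

in_domain_total = 2245   # "In either method"
not_in_mathgloss = 1772  # "In-domain synsets NOT in MathGloss"

# The five categories partition the in-domain synsets.
total = sum(coverage.values())

# Rows in which a MathGloss Wikidata ID is attached to the synset.
with_mathgloss_id = (
    coverage["MathGloss Wikidata ID only"]
    + coverage["Both, same ID"]
    + coverage["Both, different ID"]
)

print(total)                                               # 2245
print(with_mathgloss_id)                                   # 473
print(in_domain_total - not_in_mathgloss)                  # 473
print(round(100 * not_in_mathgloss / in_domain_total, 1))  # 78.9
```

Both routes give 473, which is the number of matched pairs analysed in Task Two.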

Examples

In both OEWN and MathGloss (matched by lemma, with source(s)):

| OEWN lemma(s) | OEWN domain | MathGloss sources |
| --- | --- | --- |
| idempotent | mathematics | chicago, context, nlab, planetmath |
| commute / transpose | mathematics | bct, chicago, context, nlab, planetmath |
| commutative | mathematics | chicago, nlab, planetmath |
| inverse | mathematics | bct, nlab |
| transposition | algebra | chicago, planetmath |

In OEWN math domains but NOT in MathGloss (WordNet has it, MathGloss does not):

| OEWN lemma(s) | OEWN domain |
| --- | --- |
| rounding / rounding error | mathematics |
| truncation error | mathematics |
| sampling | statistics |
| bimodal | statistics |
| combinatorial | mathematics |

In MathGloss but NOT in OEWN math domains (MathGloss has it, WordNet does not):

| MathGloss label | Wikidata |
| --- | --- |
| algebra over a field | Q1000660 |
| discrete Fourier transform | Q1006032 |
| algebraically closed field | Q1047547 |
| topological group | Q1046291 |
| finite group | Q1057968 |

Output files

Results are written to build/ on each run.

build/task1_results.txt — human-readable summary of all counts.

build/task1_synsets.tsv — one row per OEWN synset that appears in at least one math domain (via either method). Columns:

| Column | Description |
| --- | --- |
| lemma | OEWN lemmas for the synset, \|-separated |
| domain | Domain names from domain_topic links, \|-separated (empty if none) |
| xwndomain | Domain names from xwnd .ppv files, \|-separated (empty if none) |
| ili | Interlingual Index identifier for the synset |
| wn-wd | Wikidata QID(s) from the OEWN YAML, \|-separated (empty if none) |
| mg-wd | Wikidata QID from MathGloss matched to this synset (empty if none) |
| IN-mathgloss | lemma, wikidata, both, or empty: how the synset matches MathGloss |
| sources | MathGloss source(s) where the match was found, \|-separated |
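A file with these columns can be loaded with the standard csv module. The sketch below assumes the TSV starts with a header row using exactly the column names listed above, which the README does not confirm; read_synsets is a hypothetical helper and the ILI values in the sample are placeholders.

```python
import csv
import io

# Columns documented above as holding |-separated lists.
MULTI_VALUE_COLUMNS = ("lemma", "domain", "xwndomain", "wn-wd", "sources")


def read_synsets(handle):
    """Read the synset TSV and split the |-separated columns into lists."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in MULTI_VALUE_COLUMNS:
            value = row.get(col) or ""
            row[col] = value.split("|") if value else []
        rows.append(row)
    return rows


# Toy two-row file standing in for build/task1_synsets.tsv.
SAMPLE = (
    "lemma\tdomain\txwndomain\tili\twn-wd\tmg-wd\tIN-mathgloss\tsources\n"
    "commute|transpose\tmathematics\t\ti0\t\t\tlemma\tbct|chicago\n"
    "truncation error\tmathematics\t\ti1\t\t\t\t\n"
)

rows = read_synsets(io.StringIO(SAMPLE))
# Synsets with an empty IN-mathgloss column have no MathGloss match.
unmatched = [r["lemma"] for r in rows if not r["IN-mathgloss"]]
print(unmatched)
```

For the real file, the same function would be called on an open handle, for example `open("build/task1_synsets.tsv", newline="", encoding="utf-8")`.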

Task Two

Measure the semantic similarity of OEWN definitions against MathGloss concept descriptions using sentence embeddings, to validate Task 1 matches and discover new alignments.

Running

uv pip install ollama numpy
# fetch Wikidata descriptions for all MathGloss QIDs (once, ~2 min)
.venv/bin/python scripts/fetch_wikidata_descriptions.py
# run embedding analysis
.venv/bin/python scripts/task2_embeddings.py --model embeddinggemma
.venv/bin/python scripts/task2_embeddings.py --model qwen3-embedding --reuse-embeddings

Embedding models are served via Ollama. Pull the models first:

ollama pull embeddinggemma
ollama pull qwen3-embedding

Embeddings are cached as .npy files in build/; use --reuse-embeddings to skip re-encoding on subsequent runs with the same model.

Method

Definition sources:

  • OEWN: ss.definition() from the wn library — a short English gloss per synset.
  • MathGloss: Wikidata descriptions (e.g. "polygon with six sides") fetched via the Wikidata API and cached in build/wikidata_descriptions.json. Falls back to the Wikidata label when the description is missing (~9% of QIDs).
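The label fallback for MathGloss entries amounts to a small selection rule. The sketch below assumes each cached entry is a mapping with "label" and "description" keys; the actual layout of build/wikidata_descriptions.json is not documented here, and embedding_text is a hypothetical helper name.

```python
def embedding_text(entry):
    """Pick the text to embed: the description, or the label when it is missing."""
    description = (entry.get("description") or "").strip()
    if description:
        return description
    return (entry.get("label") or "").strip()


# Toy entries; the first description is the example quoted above.
print(embedding_text({"label": "hexagon", "description": "polygon with six sides"}))
print(embedding_text({"label": "finite group", "description": None}))
```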

Two analyses:

Pairwise scoring — for each of the 473 OEWN synsets matched to a MathGloss QID in Task 1, compute the cosine similarity of their respective embeddings. Low scores reveal false positive lemma matches (e.g. the genetics synset for gene/cistron/factor matching MathGloss's arithmetic factor).

Cross-corpus retrieval — embed all 2,245 in-domain OEWN synsets and all 4,814 MathGloss concepts, then find the top-k nearest MathGloss concepts for every OEWN synset. Recall@k measures how often the known Task 1 match appears in the top k retrieved results.
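Both analyses reduce to cosine similarity between embedding vectors. The sketch below uses plain Python lists to stay dependency-free (the project itself installs numpy), with toy 2-D vectors in place of real embeddings; the function names and identifiers such as "s1" and "Qa" are illustrative assumptions, and Recall@k is computed only over synsets that have a known match.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, candidates, k):
    """Return the k candidate IDs most similar to the query vector, best first."""
    ranked = sorted(candidates, key=lambda cid: cosine(query, candidates[cid]), reverse=True)
    return ranked[:k]


def recall_at_k(queries, candidates, gold, k):
    """Fraction of synsets with a known match whose match is in their top k."""
    hits = sum(1 for sid, qid in gold.items() if qid in top_k(queries[sid], candidates, k))
    return hits / len(gold)


# Toy embeddings: two OEWN synsets and three MathGloss concepts.
queries = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}
candidates = {"Qa": [0.9, 0.1], "Qb": [0.1, 0.9], "Qc": [0.7, 0.7]}
gold = {"s1": "Qa", "s2": "Qc"}  # known matches, playing the role of Task 1 links

print(round(cosine(queries["s1"], candidates["Qa"]), 3))  # pairwise score
print(top_k(queries["s2"], candidates, 2))                # ['Qb', 'Qc']
print(recall_at_k(queries, candidates, gold, 1))          # 0.5
print(recall_at_k(queries, candidates, gold, 2))          # 1.0
```

With thousands of candidates per query, a vectorised matrix product would be the more practical choice; the loop form is kept here only for readability.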

Use --sim-threshold T to flag pairs with cosine similarity below T as likely false matches in the output TSV. Use --wikidata-threshold to set the threshold automatically as the minimum similarity among Wikidata-confirmed pairs (those where both resources agree on the QID) — this guarantees all confirmed pairs pass while filtering out the worst false positives.
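The automatic threshold can be sketched as taking the minimum score over confirmed pairs and flagging everything strictly below it. This sketch assumes that Wikidata-confirmed pairs are the rows whose match type is "both", and that rows are dicts keyed by the TSV column names; neither is confirmed by the script itself. Only the infinitesimal and gene scores come from the results reported in this README; the other rows are invented for illustration.

```python
def wikidata_threshold(pairs):
    """Minimum cosine similarity among confirmed ('both') pairs."""
    confirmed = [p["cosine_sim"] for p in pairs if p["in_mathgloss"] == "both"]
    if not confirmed:
        raise ValueError("no Wikidata-confirmed pairs to derive a threshold from")
    return min(confirmed)


def flag_false_matches(pairs, threshold):
    """Return copies of the rows with a likely_false_match flag added."""
    return [dict(p, likely_false_match=p["cosine_sim"] < threshold) for p in pairs]


pairs = [
    {"oewn_lemma": "infinitesimal", "in_mathgloss": "both", "cosine_sim": 0.534},
    {"oewn_lemma": "toy-confirmed", "in_mathgloss": "both", "cosine_sim": 0.80},
    {"oewn_lemma": "gene|cistron|factor", "in_mathgloss": "lemma", "cosine_sim": 0.321},
    {"oewn_lemma": "toy-lemma", "in_mathgloss": "lemma", "cosine_sim": 0.60},
]

threshold = wikidata_threshold(pairs)
flagged = flag_false_matches(pairs, threshold)
print(threshold)
print([p["oewn_lemma"] for p in flagged if p["likely_false_match"]])
```

Because the comparison is strict, the lowest-scoring confirmed pair sits exactly at the threshold and still passes, which is what lets every confirmed pair through.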

Results (oewn:2024, Wikidata descriptions, 473 matched pairs)

| Metric | embeddinggemma | qwen3-embedding |
| --- | --- | --- |
| Mean cosine similarity | 0.521 | 0.637 |
| Median | 0.504 | 0.624 |
| Min | 0.177 | 0.310 |
| Pairs ≥ 0.8 | 35 (7%) | 82 (17%) |
| Mean sim, "both" matches | 0.649 | 0.765 |
| Mean sim, lemma-only matches | 0.513 | 0.629 |
| Recall@1 | 19.2% | 27.3% |
| Recall@3 | 29.4% | 36.2% |
| Recall@5 | 32.3% | 40.4% |

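Pairs matched by both lemma and Wikidata ID score markedly higher than lemma-only matches (0.649 vs 0.513 for embeddinggemma, 0.765 vs 0.629 for qwen3-embedding), which suggests that the similarity score tracks match quality. A Recall@5 of 40.4% (qwen3-embedding) means that embeddings alone rank the known Task 1 match among the top five candidates for about 2 in 5 synsets, without any lexical signal.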

Wikidata-threshold filtering (--wikidata-threshold, qwen3-embedding):

The minimum similarity among the 27 Wikidata-confirmed pairs is 0.534 (the infinitesimal pair, where WN defines it as "a variable with zero as its limit" and MathGloss as "a nonzero positive number smaller than any positive real number"). Using this as a cut-off:

| Outcome | Count |
| --- | --- |
| Wikidata-confirmed passing | 27/27 (100%) |
| Lemma-only passing | 300/446 (67%) |
| Total accepted links | 327/473 (69%) |
| Flagged as likely false matches | 146 (31%) |

The 146 rejected pairs include cases like gene/factor (WN: genetics; MG: arithmetic, sim=0.32), crystal (WN: mineral; MG: fibered categories, sim=0.33), and triangulation (WN: surveying; MG: chess tactic, sim=0.33).

Highest-similarity pairs (qwen3-embedding):

| OEWN lemma(s) | OEWN definition | MathGloss label | Sim |
| --- | --- | --- | --- |
| square matrix | a matrix with the same number of rows and columns | square matrix | 0.977 |
| equation | a mathematical statement that two expressions are equal | equation | 0.962 |
| hexagon | a six-sided polygon | hexagon | 0.957 |
| electron | an elementary particle with negative charge | electron | 0.953 |
| irrational number | a real number that cannot be expressed as a rational number | irrational number | 0.930 |

Lowest-similarity pairs — likely false positives (qwen3-embedding):

| OEWN lemma(s) | OEWN definition | MathGloss label | Sim |
| --- | --- | --- | --- |
| gene/cistron/factor | (genetics) a segment of DNA… | factor (multiplication operand) | 0.321 |
| crystal | a rock formed by solidification… | crystal (Cartesian sections of fibered categories) | 0.326 |
| triangulation | a method of surveying… | triangulation (chess tactic) | 0.327 |

Output files

build/wikidata_descriptions.json — cached Wikidata label and description per QID.

build/task2_pair_scores_{model}.tsv — one row per matched (OEWN synset, MathGloss QID) pair.

| Column | Description |
| --- | --- |
| ili | OEWN ILI |
| oewn_lemma | OEWN synset lemmas, \|-separated |
| oewn_def | OEWN definition text |
| mg_qid | MathGloss Wikidata QID |
| mg_label | Wikidata label for the QID |
| mg_def | Wikidata description used for embedding |
| cosine_sim | Cosine similarity (0–1) |
| in_mathgloss | Match type from Task 1: lemma, wikidata, or both |
| likely_false_match | True if sim < --sim-threshold (column absent if flag not set) |
| model | Ollama model name |

build/task2_topk_retrieval_{model}.tsv — top-k MathGloss candidates for every in-domain OEWN synset.

| Column | Description |
| --- | --- |
| ili | OEWN ILI |
| oewn_lemma | OEWN synset lemmas |
| oewn_def | OEWN definition |
| rank | 1-indexed retrieval rank |
| mg_qid | Retrieved MathGloss QID |
| mg_label | Wikidata label |
| mg_def | Wikidata description |
| cosine_sim | Similarity score |
| is_task1_match | True if this QID matches the Task 1 mg-wd for this synset |
| model | Ollama model name |

build/task2_results_{model}.txt — human-readable summary.
