wordnet_math

Link WordNet to MathGloss to make a mathematical wordnet.

Task One

Investigate the overlap between OEWN and MathGloss.

Running

uv venv
uv pip install wn pyyaml
.venv/bin/python scripts/task1_overlap.py

Wordnet data (oewn:2024 and omw-en:2.0) is downloaded once into build/wordnet_data/ and reused on subsequent runs.

Method

Domains covered: mathematics, algebra, arithmetic, geometry, statistics, matrix algebra, logic. Each domain is identified by one or more ILI identifiers that resolve to a synset in OEWN.

1a: domain_topic links (OEWN built-in). Synsets are counted as belonging to a domain if they carry a domain_topic relation pointing to the domain's synset (e.g. the (mathematics) synset 06009822-n, ILI i68341). This is the explicit, hand-curated annotation in OEWN.

1b: WordNet Domains (xwnd). The external/xwnd-30g/ directory contains per-domain .ppv files (Personalised PageRank Vectors over WN 3.0 synsets). Because OEWN uses WN 3.1 offsets, synsets are mapped via the Interlingual Index (ILI): each OEWN synset's ILI is looked up in omw-en:2.0 (which uses WN 3.0 offsets) to obtain the WN 3.0 ID, which is then checked against the .ppv file. A score threshold of XWND_THRESHOLD = 0.0001 is applied: the .ppv files cover the full WN 3.0 vocabulary, so a threshold is needed to select the most domain-relevant synsets. Domains without a .ppv file (algebra, arithmetic, matrix algebra, logic) are measured only via method 1a.
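The ILI bridge and threshold in 1b can be sketched as follows. This is a minimal illustration, not the code in scripts/task1_overlap.py: plain dictionaries stand in for the wn lookups, the function names and toy IDs are hypothetical, and both the assumed .ppv layout (one synset ID and one score per line, separated by whitespace) and the inclusive comparison against the threshold are assumptions.

```python
XWND_THRESHOLD = 0.0001


def parse_ppv(lines):
    """Parse '<wn30_id> <score>' lines into a dict (file layout is an assumption)."""
    scores = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        scores[parts[0]] = float(parts[1])
    return scores


def in_domain(oewn_to_ili, ili_to_wn30, ppv_scores, threshold=XWND_THRESHOLD):
    """Return the OEWN synset IDs whose WN 3.0 counterpart scores >= threshold."""
    selected = set()
    for oewn_id, ili in oewn_to_ili.items():
        wn30_id = ili_to_wn30.get(ili)
        if wn30_id is None:
            continue  # no WN 3.0 synset shares this ILI
        if ppv_scores.get(wn30_id, 0.0) >= threshold:
            selected.add(oewn_id)
    return selected


# Toy data; all IDs and scores below are illustrative only.
ppv = parse_ppv(["06000644-n 0.0123", "00001740-n 0.00000002", ""])
oewn_to_ili = {"oewn-A": "i1", "oewn-B": "i2", "oewn-C": "i3"}
ili_to_wn30 = {"i1": "06000644-n", "i2": "00001740-n"}  # i3 has no WN 3.0 match
result = in_domain(oewn_to_ili, ili_to_wn30, ppv)
```

Only oewn-A survives here: oewn-B falls below the threshold, and oewn-C cannot be mapped to a WN 3.0 ID at all.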

2: MathGloss overlap. The file external/MathGloss/data/database.csv provides ~4,800 mathematical concepts with Wikidata IDs and names drawn from several sources (BCT, Chicago, Clowder, Context, Mathlib, nLab, PlanetMath). Overlap with OEWN is measured two ways:

by lemma: any OEWN lemma in a math-domain synset matches a MathGloss name (case-insensitive).
by Wikidata ID: the Wikidata QID stored in the OEWN YAML matches the QID in MathGloss. Note that Wikidata IDs are read directly from external/english-wordnet/src/yaml/ because the wn Python library does not expose them.
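The two matching rules can be sketched as a single classifier. This is a hedged illustration only: the function name and the toy QIDs are hypothetical, and treating "both" as "a lemma hit and a QID hit on the same synset" is an assumption about how the script combines the two signals.

```python
def match_type(synset_lemmas, synset_qids, mg_names, mg_qids):
    """Classify how one OEWN synset matches MathGloss.

    mg_names: set of lower-cased MathGloss names.
    mg_qids:  set of MathGloss Wikidata QIDs.
    Returns "lemma", "wikidata", "both", or "" (no match).
    """
    lemma_hit = any(lemma.lower() in mg_names for lemma in synset_lemmas)
    qid_hit = any(qid in mg_qids for qid in synset_qids)
    if lemma_hit and qid_hit:
        return "both"
    if lemma_hit:
        return "lemma"
    if qid_hit:
        return "wikidata"
    return ""


# Toy MathGloss data; the QIDs are placeholders, not real Wikidata IDs.
mg_names = {"hexagon", "idempotent"}
mg_qids = {"Q_HEX", "Q_OTHER"}

both = match_type(["Hexagon"], ["Q_HEX"], mg_names, mg_qids)
lemma_only = match_type(["idempotent"], [], mg_names, mg_qids)
none = match_type(["rounding error"], [], mg_names, mg_qids)
```

Lower-casing only the OEWN side works because the MathGloss names are assumed to be lower-cased once when the set is built.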

Results (oewn:2024, xwnd threshold 0.0001)

Measure	Count
OEWN total synsets	120,630
In math domain — domain_topic (union of all domains)	331
In math domain — xwnd (union of domains with .ppv)	2,192
In either method	2,245
MathGloss entries (unique lemmas)	7,502
OEWN in-domain synsets matching MathGloss by lemma	2,569 (34.2%)
OEWN in-domain synsets matching MathGloss by Wikidata ID	40 (0.8%)
In-domain synsets NOT in MathGloss	1,772 (78.9%)

Wikidata ID coverage across the 2,245 in-domain synsets:

	Count
WN Wikidata ID only	96
MathGloss Wikidata ID only	439
Both — same ID	27
Both — different ID	7
Neither	1,676

Examples

In both OEWN and MathGloss (matched by lemma, with source(s)):

OEWN lemma(s)	OEWN domain	MathGloss sources
idempotent	mathematics	chicago, context, nlab, planetmath
commute / transpose	mathematics	bct, chicago, context, nlab, planetmath
commutative	mathematics	chicago, nlab, planetmath
inverse	mathematics	bct, nlab
transposition	algebra	chicago, planetmath

In OEWN math domains but NOT in MathGloss (WordNet has it, MathGloss does not):

OEWN lemma(s)	OEWN domain
rounding / rounding error	mathematics
truncation error	mathematics
sampling	statistics
bimodal	statistics
combinatorial	mathematics

In MathGloss but NOT in OEWN math domains (MathGloss has it, WordNet does not):

MathGloss label	Wikidata
algebra over a field	Q1000660
discrete Fourier transform	Q1006032
algebraically closed field	Q1047547
topological group	Q1046291
finite group	Q1057968

Output files

Results are written to build/ on each run.

build/task1_results.txt — human-readable summary of all counts.

build/task1_synsets.tsv — one row per OEWN synset that appears in at least one math domain (via either method). Columns:

Column	Description
`lemma`	OEWN lemmas for the synset, `|`-separated
`domain`	Domain names from domain_topic links, `|`-separated (empty if none)
`xwndomain`	Domain names from xwnd `.ppv` files, `|`-separated (empty if none)
`ili`	Interlingual Index identifier for the synset
`wn-wd`	Wikidata QID(s) from the OEWN YAML, `|`-separated (empty if none)
`mg-wd`	Wikidata QID from MathGloss matched to this synset (empty if none)
`IN-mathgloss`	`lemma`, `wikidata`, `both`, or empty: how the synset matches MathGloss
`sources`	MathGloss source(s) where the match was found, `|`-separated
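A downstream script might load this file along the following lines. This is a sketch under two assumptions not stated above: that the TSV starts with a header row carrying these column names, and that multi-valued cells use a bare `|` as separator. The helper name and the sample row (including its ILI) are hypothetical.

```python
import csv
import io

# Columns documented as holding several values in one cell.
MULTI_VALUE = ("lemma", "domain", "xwndomain", "wn-wd", "sources")


def read_synsets(handle):
    """Read task1_synsets.tsv-style rows, splitting multi-valued cells into lists."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in MULTI_VALUE:
            value = row.get(col) or ""
            row[col] = value.split("|") if value else []
        rows.append(row)
    return rows


# In-memory sample standing in for build/task1_synsets.tsv.
sample = (
    "lemma\tdomain\txwndomain\tili\twn-wd\tmg-wd\tIN-mathgloss\tsources\n"
    "commute|transpose\tmathematics\t\ti00000\t\t\tlemma\tbct|chicago\n"
)
rows = read_synsets(io.StringIO(sample))
```

To read the real file, pass an open file handle instead of the `io.StringIO` object, for example `open("build/task1_synsets.tsv", newline="", encoding="utf-8")`.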

Task Two

Measure the semantic similarity of OEWN definitions against MathGloss concept descriptions using sentence embeddings, to validate Task 1 matches and discover new alignments.

Running

uv pip install ollama numpy
# fetch Wikidata descriptions for all MathGloss QIDs (once, ~2 min)
.venv/bin/python scripts/fetch_wikidata_descriptions.py
# run embedding analysis
.venv/bin/python scripts/task2_embeddings.py --model embeddinggemma
.venv/bin/python scripts/task2_embeddings.py --model qwen3-embedding --reuse-embeddings

Embedding models are served via Ollama. Pull the models first:

ollama pull embeddinggemma
ollama pull qwen3-embedding

Embeddings are cached as .npy files in build/; use --reuse-embeddings to skip re-encoding on subsequent runs with the same model.

Method

Definition sources:

OEWN: ss.definition() from the wn library — a short English gloss per synset.
MathGloss: Wikidata descriptions (e.g. "polygon with six sides") fetched via the Wikidata API and cached in build/wikidata_descriptions.json. Falls back to the Wikidata label when the description is missing (~9% of QIDs).

Two analyses:

Pairwise scoring — for each of the 473 OEWN synsets matched to a MathGloss QID in Task 1, compute the cosine similarity of their respective embeddings. Low scores reveal false positive lemma matches (e.g. the genetics synset for gene/cistron/factor matching MathGloss's arithmetic factor).

Cross-corpus retrieval — embed all 2,245 in-domain OEWN synsets and all 4,814 MathGloss concepts, then find the top-k nearest MathGloss concepts for every OEWN synset. Recall@k measures how often the known Task 1 match appears in the top k retrieved results.
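Both analyses reduce to cosine similarity plus a ranking step. The sketch below uses plain Python lists and two-dimensional toy vectors in place of real embeddings. The function names and data are hypothetical, and it assumes one known MathGloss QID per synset. The actual script may well use numpy for the same computation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, candidates, k):
    """Return the k candidate QIDs most similar to query_vec, best first."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [qid for qid, _ in ranked[:k]]


def recall_at_k(synset_vecs, candidates, gold, k):
    """Fraction of known (synset -> QID) matches found among the top k results."""
    if not gold:
        return 0.0
    hits = sum(
        1 for sid, qid in gold.items()
        if qid in top_k(synset_vecs[sid], candidates, k)
    )
    return hits / len(gold)


# Toy 2-d "embeddings"; real vectors have hundreds of dimensions.
synset_vecs = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}
candidates = {"QA": [0.9, 0.1], "QB": [0.6, 0.8], "QC": [0.1, 0.9]}
gold = {"s1": "QA", "s2": "QB"}  # known matches from Task 1

r1 = recall_at_k(synset_vecs, candidates, gold, 1)
r2 = recall_at_k(synset_vecs, candidates, gold, 2)
```

In this toy data s1 retrieves its known match at rank 1, while s2 only finds its match at rank 2, so recall rises from 0.5 at k=1 to 1.0 at k=2.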

Use --sim-threshold T to flag pairs with cosine similarity below T as likely false matches in the output TSV. Use --wikidata-threshold to set the threshold automatically as the minimum similarity among Wikidata-confirmed pairs (those where both resources agree on the QID) — this guarantees all confirmed pairs pass while filtering out the worst false positives.
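The automatic threshold can be illustrated with a few lines of Python. This is a sketch of the idea rather than the script's implementation: the dictionary keys, the `confirmed` flag, and the toy scores are all hypothetical.

```python
def wikidata_threshold(pairs):
    """Minimum similarity among Wikidata-confirmed pairs."""
    confirmed = [p["cosine_sim"] for p in pairs if p["confirmed"]]
    if not confirmed:
        raise ValueError("no Wikidata-confirmed pairs to derive a threshold from")
    return min(confirmed)


def flag_false_matches(pairs, threshold):
    """Copy each pair, adding likely_false_match=True when sim < threshold."""
    return [
        dict(p, likely_false_match=p["cosine_sim"] < threshold)
        for p in pairs
    ]


# Toy pairs; 'confirmed' marks pairs where both resources agree on the QID.
pairs = [
    {"lemma": "a", "cosine_sim": 0.91, "confirmed": True},
    {"lemma": "b", "cosine_sim": 0.53, "confirmed": True},
    {"lemma": "c", "cosine_sim": 0.32, "confirmed": False},
    {"lemma": "d", "cosine_sim": 0.70, "confirmed": False},
]
threshold = wikidata_threshold(pairs)
flagged = flag_false_matches(pairs, threshold)
```

Because the comparison is strict (`<`), the confirmed pair that defines the threshold is itself never flagged, which is what guarantees that all confirmed pairs pass.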

Results (oewn:2024, Wikidata descriptions, 473 matched pairs)

Metric	embeddinggemma	qwen3-embedding
Mean cosine similarity	0.521	0.637
Median	0.504	0.624
Min	0.177	0.310
Pairs ≥ 0.8	35 (7%)	82 (17%)
Mean sim — "both" matches	0.649	0.765
Mean sim — lemma-only matches	0.513	0.629
Recall@1	19.2%	27.3%
Recall@3	29.4%	36.2%
Recall@5	32.3%	40.4%

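Pairs matched by both lemma and Wikidata ID score markedly higher than lemma-only matches (mean similarity 0.765 vs 0.629 with qwen3-embedding, 0.649 vs 0.513 with embeddinggemma), which suggests that embedding similarity tracks match quality. A Recall@5 of 40.4% (qwen3-embedding) means embeddings alone recover about 2 in 5 known matches without any lexical signal.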

Wikidata-threshold filtering (--wikidata-threshold, qwen3-embedding):

The minimum similarity among the 27 Wikidata-confirmed pairs is 0.534 (the infinitesimal pair, where WN defines it as "a variable with zero as its limit" and MathGloss as "a nonzero positive number smaller than any positive real number"). Using this as a cut-off:

	Count
Wikidata-confirmed passing	27/27 (100%)
Lemma-only passing	300/446 (67%)
Total accepted links	327/473 (69%)
Flagged as likely false matches	146 (31%)

The 146 rejected pairs include cases like gene/factor (WN: genetics; MG: arithmetic, sim=0.32), crystal (WN: mineral; MG: fibered categories, sim=0.33), and triangulation (WN: surveying; MG: chess tactic, sim=0.33).

Highest-similarity pairs (qwen3-embedding):

OEWN lemma(s)	OEWN definition	MathGloss label	Sim
square matrix	a matrix with the same number of rows and columns	square matrix	0.977
equation	a mathematical statement that two expressions are equal	equation	0.962
hexagon	a six-sided polygon	hexagon	0.957
electron	an elementary particle with negative charge	electron	0.953
irrational number	a real number that cannot be expressed as a rational number	irrational number	0.930

Lowest-similarity pairs — likely false positives (qwen3-embedding):

OEWN lemma(s)	OEWN definition	MathGloss label	Sim
gene/cistron/factor	(genetics) a segment of DNA…	factor (multiplication operand)	0.321
crystal	a rock formed by solidification…	crystal (Cartesian sections of fibered categories)	0.326
triangulation	a method of surveying…	triangulation (chess tactic)	0.327

Output files

build/wikidata_descriptions.json — cached Wikidata label and description per QID.

build/task2_pair_scores_{model}.tsv — one row per matched (OEWN synset, MathGloss QID) pair.

Column	Description
`ili`	OEWN ILI
`oewn_lemma`	OEWN synset lemmas, `|`-separated
`oewn_def`	OEWN definition text
`mg_qid`	MathGloss Wikidata QID
`mg_label`	Wikidata label for the QID
`mg_def`	Wikidata description used for embedding
`cosine_sim`	Cosine similarity (0–1)
`in_mathgloss`	Match type from Task 1: `lemma`, `wikidata`, or `both`
`likely_false_match`	`True` if sim < `--sim-threshold` (column absent if flag not set)
`model`	Ollama model name

build/task2_topk_retrieval_{model}.tsv — top-k MathGloss candidates for every in-domain OEWN synset.

Column	Description
`ili`	OEWN ILI
`oewn_lemma`	OEWN synset lemmas
`oewn_def`	OEWN definition
`rank`	1-indexed retrieval rank
`mg_qid`	Retrieved MathGloss QID
`mg_label`	Wikidata label
`mg_def`	Wikidata description
`cosine_sim`	Similarity score
`is_task1_match`	`True` if this QID matches the Task 1 `mg-wd` for this synset
`model`	Ollama model name

build/task2_results_{model}.txt — human-readable summary.
