This is a Python package focussing on the Structural Topic Model and machine-assisted reading of large text corpora. The Python implementation aims for computational efficiency as well as ease of use.
The Structural Topic Model (Roberts et al. 2014) extends classical topic modelling approaches by including document-level text metadata. This metadata can enter the estimation procedure in two ways, via:
- topical content covariates that shape the word usage within topics
- topical prevalence covariates that shape the frequency of topic occurrences.
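
To make the two roles concrete, the following is a minimal, standard-library sketch of the idea as described by Roberts et al., not this package's API. All function names, covariate values and coefficients are hypothetical. It evaluates topic proportions at the prior mean only (the logistic-normal noise is omitted) and leaves out the topic-covariate interaction term of the content model.

```python
import math


def softmax(scores):
    """Map real-valued scores to probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_proportions(covariates, gamma):
    """Topic proportions of one document at the prior mean.

    covariates: P document-level values (first entry is an intercept).
    gamma: P x (K-1) coefficients; the K-th topic is the reference
    category and is fixed at a score of zero.
    """
    n_free = len(gamma[0])
    mean = [
        sum(x * row[k] for x, row in zip(covariates, gamma))
        for k in range(n_free)
    ]
    return softmax(mean + [0.0])


def word_distribution(baseline, topic_dev, covariate_dev):
    """Word probabilities of one topic for one content-covariate level.

    Each argument holds one log-scale value per vocabulary word:
    a corpus-wide baseline, a topic deviation and a covariate deviation.
    """
    return softmax(
        [m + t + c for m, t, c in zip(baseline, topic_dev, covariate_dev)]
    )


# Prevalence: a binary covariate shifts how often topics occur (K = 3).
gamma = [[0.0, 0.0],    # intercept
         [2.0, -1.0]]   # effect of the covariate on topics 1 and 2
theta_a = topic_proportions([1.0, 0.0], gamma)  # covariate switched off
theta_b = topic_proportions([1.0, 1.0], gamma)  # covariate switched on

# Content: the same topic uses different words depending on the group.
baseline = [math.log(p) for p in (0.4, 0.3, 0.2, 0.1)]
topic_dev = [1.0, 0.0, 0.0, 0.0]
beta_group0 = word_distribution(baseline, topic_dev, [0.0, 0.0, 0.0, 0.0])
beta_group1 = word_distribution(baseline, topic_dev, [0.0, 0.0, 0.0, 2.0])
```

In this toy setup the prevalence covariate raises the share of the first topic in `theta_b` relative to `theta_a`, while the content covariate raises the probability of the last vocabulary word in `beta_group1` without changing how often the topic itself occurs.
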
The package consists of the following main parts:
- Text Reading for various filetypes (*.csv, *.json)
- Text Preparation
  - Pre-processing
    - Stopword-Removal
    - Stemming
    - Dropping Documents
    - Removing Punctuation
    - n-gram algorithm
  - Corpus creation
    - list of documents containing word indices and their count
    - vector of words associated with the indices
    - metadata matrix with document covariates
- Model Estimation
  - Spectral Initialisation
  - Topical Prevalence Model
    - interaction terms, standard transforms and non-linear relations, such as splines
  - Topical Content Model
- Model Evaluation
  - Semantic Coherence: a topic scores well when its most probable words frequently co-occur in the same documents
  - Exclusivity: how exclusive the top words of a topic are to that topic
  - FREX: weighted harmonic mean of a word's frequency and exclusivity within a topic
- Visualisation
  - Corpus visualisations:
    - wordclouds
    - word frequencies
    - tf-idf -> t-SNE visualisation
  - Estimate visualisation:
    - Metadata estimates can be visualised w.r.t. their effect on the expected topic proportions as well as on the topical content
    - Visualisation of identified topics and their distances in a topic graph
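
As an illustration of the corpus layout and of the semantic coherence measure listed above, here is a small standard-library sketch. It is not the package's API: `build_corpus`, `semantic_coherence`, the toy documents and the metadata values are all hypothetical, and the coherence score follows the document co-occurrence formulation of Mimno et al. (2011), which is assumed here to be the intended measure.

```python
import math
from collections import Counter


def build_corpus(tokenised_docs):
    """Return (documents, vocab) in a bag-of-words layout.

    documents: one list per document of (word index, count) pairs.
    vocab: list of words; position i holds the word with index i.
    """
    vocab = sorted({word for doc in tokenised_docs for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    documents = [
        sorted(Counter(index[word] for word in doc).items())
        for doc in tokenised_docs
    ]
    return documents, vocab


def semantic_coherence(top_word_ids, documents):
    """Sum of log((D(v_m, v_l) + 1) / D(v_l)) over ordered top-word pairs.

    D counts the documents that contain all given words. Every top word
    is assumed to occur in at least one document.
    """
    doc_sets = [{idx for idx, _ in doc} for doc in documents]

    def doc_freq(*ids):
        return sum(all(i in seen for i in ids) for seen in doc_sets)

    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            joint = doc_freq(top_word_ids[m], top_word_ids[l])
            score += math.log((joint + 1) / doc_freq(top_word_ids[l]))
    return score


# Toy input: already tokenised documents plus one made-up covariate each.
tokenised_docs = [
    ["tax", "budget", "tax"],
    ["tax", "budget", "vote"],
    ["goal", "match", "goal"],
    ["match", "vote"],
]
metadata = [[0], [0], [1], [1]]  # one row of covariates per document

documents, vocab = build_corpus(tokenised_docs)
word_id = {word: i for i, word in enumerate(vocab)}

# Top words that share documents score higher than words that never do.
coherent = semantic_coherence([word_id["tax"], word_id["budget"]], documents)
mixed = semantic_coherence([word_id["tax"], word_id["goal"]], documents)
```

The three objects `documents`, `vocab` and `metadata` mirror the corpus components named in the list. A topic whose top words appear together ("tax", "budget") receives a higher coherence score than one whose top words never share a document ("tax", "goal").
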