GLOOM is a Python package for finding important LUAD genes from transcriptomic data.
In simple words, GLOOM does this:
- Reads raw tumor and normal gene-expression files.
- Cleans the data.
- Keeps genes that can be fairly compared.
- Builds many clues (features) about each gene.
- Trains machine-learning models on known cancer genes.
- Scores every gene.
- Builds and annotates a gene network.
- Exports tables, graphs, reports, and interactive dashboards.
Short version:
raw data → cleaned data → features → machine learning → gene ranking → network annotation → interactive visuals → final report
The reference pipeline run completed all 18 steps successfully in about 480.2 seconds (~8.0 minutes).
- Raw tumor expression after duplicate cleanup: 20,512 genes × 510 samples
- Raw normal expression after duplicate cleanup: 73,321 genes × 604 samples
- After preprocessing:
  - tumor: 16,570 genes × 510 samples
  - normal: 21,159 genes × 604 samples
- Shared genes after harmonization: 10,986
- Positive labeled genes retained from Cancer Gene Census: 488
- Integrated features: 10,986 genes × 42 features
- Final network size: 10,986 nodes, 36,943 edges
- Novel candidates found by ranking: 3,754
- Best model selected during training: logistic_regression
Why these numbers matter:
- they show that the pipeline ran to completion,
- they tell you the scale of the data,
- and they show what the package really produced.
The data/ folder stores the data used by the package.
This is where GLOOM gets its input and where it saves cleaned data that later steps need.
- data/raw/ = original files from outside sources
- data/processed/ = cleaned and transformed files created by GLOOM
- raw = ingredients from the market
- processed = ingredients already washed and cut
The data/raw/ folder stores the original input files.
These are the files the package starts from. They should not be edited by hand.
- cBioPortal LUAD tumor RNA-seq matrix
- cBioPortal tumor metadata
- GTEx lung normal expression matrix
- GTEx normal metadata
- Cancer Gene Census file for known cancer genes
If these files are wrong or missing, the whole pipeline cannot start.
Use this folder when:
- you want to run the full pipeline from scratch,
- you want to replace the data with a new cohort,
- you want to check data provenance.
The data/processed/ folder stores cleaned and prepared files.
This folder makes the rest of the pipeline easier and safer. Instead of using messy raw files again and again, later steps use standardized processed files.
- cleaned tumor expression
- cleaned normal expression
- harmonized tumor and normal matrices
- expression feature tables
- network feature tables
- integrated feature matrix
- labels
- train/validation split files
This folder is the bridge between raw data and machine learning.
Use this folder when:
- you want to inspect how the data changed after cleaning,
- you want to rerun only later steps,
- you want to debug feature creation or splitting.
This folder stores the main answers from GLOOM.
This is where the package gives you the final useful outputs.
- gene_rankings.csv
- novel_candidates.csv
- feature_importance.csv
- model_metrics.csv
If someone asks, “What did the pipeline discover?”, the answer is usually in this folder.
Use this folder when:
- you want the final ranked genes,
- you want model performance,
- you want interpretation of important features.
This folder stores network outputs.
GLOOM does not only rank genes. It also shows how genes are connected.
- annotated_network.graphml
- annotated_nodes.csv
- annotated_edges.csv
Some genes may be important not only because of their expression, but also because of where they sit inside a gene network.
Use this folder when:
- you want to view the network in Cytoscape or Gephi,
- you want to study hub genes,
- you want to inspect which top genes connect together.
This folder stores the network in many file formats.
Different tools like different formats. This folder lets you reuse the network anywhere.
- network_full.graphml
- network_full.gml
- network_full_edgelist.tsv
- network_nodes.tsv
- network_cytoscape.json
- small subnetworks such as top100, cgc, novel, candidates
- network statistics report files
It makes sharing and reuse much easier.
Use this folder when:
- you need a format for another program,
- you want a smaller subnetwork,
- you want a statistics summary of the graph.
This folder stores final summaries.
This folder gives you the easy-to-read final outputs.
- pipeline_summary_table.csv
- pipeline_report.txt
- pipeline_summary_figure.png
- pipeline_summary_figure.pdf
Not everybody wants raw tables. Some people want one summary they can read, share, or present.
Use this folder when:
- you need a quick report,
- you are preparing a poster or slide deck,
- you want a compact summary of the whole analysis.
This folder stores visual outputs.
It helps humans understand results faster.
- static plots
- interactive HTML pages
- dashboard files
- interactive_volcano.html
- interactive_ranking.html
- interactive_network.html
- interactive_feature_importance.html
- interactive_dashboard.html
Use this folder when:
- you want to inspect results visually,
- you want to present findings,
- you want to browse results without looking at raw tables.
This folder stores trained machine-learning models.
Training can take time. This folder keeps the trained models so you can reuse them.
- .joblib model files
- best_model.joblib
- best_model_name.txt
- cross-validation results
Use this folder when:
- you want to score new data,
- you want to reload a trained model,
- you want to compare models or deploy one later.
This folder stores execution logs.
This folder is the diary of the pipeline.
- pipeline.log
Use this folder when:
- something failed,
- you want runtimes,
- you want proof that the full run completed,
- you want reference values for documentation.
model_metrics.csv is the report card of the models.
Use it to compare models.
Simple idea:
- higher AUROC / AUPRC / MCC / F1 = better
- lower Brier score = better
- optimal_threshold = decision cut-off used later in ranking
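The metrics listed above can all be computed with scikit-learn. The sketch below uses made-up labels and probabilities. Choosing the cut-off that maximizes F1 is only an assumption for illustration; this document does not say how GLOOM picks optimal_threshold.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    f1_score,
    matthews_corrcoef,
    roc_auc_score,
)

# Toy labels and predicted probabilities (made up for illustration).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])


def f1_at(threshold):
    """F1 score after turning probabilities into 0/1 calls at a cut-off."""
    return f1_score(y_true, (y_prob >= threshold).astype(int))


# Assumption: scan a few candidate cut-offs and keep the one with the best F1.
candidate_thresholds = [0.15, 0.30, 0.45, 0.55, 0.70, 0.85]
best_threshold = max(candidate_thresholds, key=f1_at)
y_pred = (y_prob >= best_threshold).astype(int)

metrics = {
    "auroc": roc_auc_score(y_true, y_prob),            # higher is better
    "auprc": average_precision_score(y_true, y_prob),  # higher is better
    "mcc": matthews_corrcoef(y_true, y_pred),          # higher is better
    "f1": f1_score(y_true, y_pred),                    # higher is better
    "brier": brier_score_loss(y_true, y_prob),         # lower is better
}
print(best_threshold, metrics)
```

AUROC, AUPRC, and the Brier score use the probabilities directly, while MCC and F1 depend on the chosen cut-off.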
feature_importance.csv tells you which clues the model trusted most.
Use it to explain the model.
Simple idea:
- rank = 1 means strongest clue
- mean_importance = average usefulness of that clue
- feature_group tells whether the clue came from tumor stats, normal stats, differential expression, network, etc.
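A table like this can be summarized with pandas. The sketch below builds a small stand-in table in memory. The columns rank, mean_importance, and feature_group come from the description above; the feature column name, the feature names, and all values are made up.

```python
import pandas as pd

# Toy stand-in for a feature-importance table (all values are made up).
importance = pd.DataFrame(
    {
        "feature": ["log2_fold_change", "degree", "betweenness", "tumor_mean", "normal_mean"],
        "mean_importance": [0.35, 0.25, 0.20, 0.15, 0.05],
        "feature_group": ["differential", "network", "network", "tumor", "normal"],
    }
)

# rank = 1 is the strongest clue.
importance["rank"] = (
    importance["mean_importance"].rank(ascending=False, method="first").astype(int)
)

# The three strongest individual features.
top_features = importance.sort_values("rank").head(3)["feature"].tolist()

# Total importance per feature group, largest first.
by_group = (
    importance.groupby("feature_group")["mean_importance"]
    .sum()
    .sort_values(ascending=False)
)
print(top_features)
print(by_group)
```

Summing by group shows whether the model leaned more on expression-based or network-based clues overall.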
gene_rankings.csv is the big leaderboard of genes.
Use it to pick genes to study first.
Simple idea:
- each row = one gene
- higher predicted_prob = more suspicious / more interesting
- smaller rank = closer to the top
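The relationship between predicted_prob and rank can be shown with a few lines of pandas. This is a sketch on a made-up table; the gene column name and the gene names are hypothetical.

```python
import pandas as pd

# Toy scores for four hypothetical genes.
scored = pd.DataFrame(
    {
        "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "predicted_prob": [0.35, 0.92, 0.10, 0.78],
    }
)

# Sort so the highest probability comes first, then number the rows from 1.
rankings = scored.sort_values("predicted_prob", ascending=False).reset_index(drop=True)
rankings["rank"] = rankings.index + 1

# Genes to study first: the two best-ranked rows.
top_genes = rankings.loc[rankings["rank"] <= 2, "gene"].tolist()
print(rankings)
```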
novel_candidates.csv is the shortlist of new interesting genes.
Use it for discovery.
Simple idea:
- genes here are strongly scored by the model
- but they are not already known cancer genes in the label list
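The two conditions above amount to a simple filter. The sketch below assumes a table with predicted_prob and is_cgc_gene columns and a hypothetical cut-off of 0.5; the real cut-off and column layout used by GLOOM may differ.

```python
import pandas as pd

# Toy ranking table (gene names and values are made up).
rankings = pd.DataFrame(
    {
        "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "predicted_prob": [0.95, 0.90, 0.60, 0.20],
        "is_cgc_gene": [True, False, False, False],
    }
)

threshold = 0.5  # hypothetical decision cut-off

# Keep genes the model scores highly that are NOT already in the label list.
is_high = rankings["predicted_prob"] >= threshold
is_unlabeled = ~rankings["is_cgc_gene"]
novel = rankings[is_high & is_unlabeled]
novel_genes = novel["gene"].tolist()
print(novel_genes)
```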
A .graphml file is a network file.
It stores genes as nodes and links between genes as edges, plus extra attributes.
Use Cytoscape when you want a desktop biological network viewer.
Typical workflow:
- Open Cytoscape.
- Go to File → Import → Network → File.
- Select the .graphml file.
- Map node color to node_color.
- Map node size to node_size or predicted_prob.
- Use rank, novel_candidate, or is_cgc_gene to filter/highlight nodes.
Use Gephi when you want fast graph layout and visual exploration.
Typical workflow:
- Open Gephi.
- Import the .graphml file.
- Run a layout such as ForceAtlas2.
- Color nodes by category.
- Resize nodes by degree or predicted probability.
Use Python with NetworkX when you want programmatic work.
import networkx as nx
G = nx.read_graphml("annotated_network.graphml")
print(G.number_of_nodes(), G.number_of_edges())
If the full graph is too big, use files such as:
- subnetwork_top100.graphml
- subnetwork_cgc.graphml
- subnetwork_novel.graphml
- subnetwork_candidates.graphml
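A smaller subnetwork can also be cut out of a loaded graph by filtering on node attributes. The sketch below runs on a toy graph written to a temporary file, so it does not need the real outputs. The attribute names predicted_prob and is_cgc_gene follow the names used in this document; the gene names, values, and the 0.8 cut-off are made up.

```python
import os
import tempfile

import networkx as nx

# Build a toy annotated graph (hypothetical genes and scores).
G = nx.Graph()
G.add_node("GENE_A", predicted_prob=0.95, is_cgc_gene=True)
G.add_node("GENE_B", predicted_prob=0.85, is_cgc_gene=False)
G.add_node("GENE_C", predicted_prob=0.30, is_cgc_gene=False)
G.add_node("GENE_D", predicted_prob=0.10, is_cgc_gene=False)
G.add_edges_from(
    [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C"), ("GENE_A", "GENE_D"), ("GENE_B", "GENE_C")]
)

# Round-trip through GraphML, the same format the pipeline exports.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_network.graphml")
    nx.write_graphml(G, path)
    H = nx.read_graphml(path)

# Hub genes: nodes sorted by number of neighbors, largest first.
hubs = sorted(H.degree(), key=lambda pair: pair[1], reverse=True)

# Subnetwork of highly scored genes only.
keep = [node for node, data in H.nodes(data=True) if data["predicted_prob"] >= 0.8]
sub = H.subgraph(keep).copy()
print(hubs[0], sub.number_of_nodes(), sub.number_of_edges())
```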
- network biologists
- Cytoscape users
- graph analysts
- anyone building visual biology stories
A .joblib file stores a trained machine-learning model.
It saves time because you do not need to train again.
import joblib
model = joblib.load("best_model.joblib")
If you build features for another compatible dataset, you can use the saved model to predict on that new table.
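Scoring a new table with a reloaded model could look like the sketch below. It trains a tiny stand-in model on random data so the example runs on its own; with the real file, only the load and predict steps apply. It assumes the saved object exposes scikit-learn's predict_proba and that the new table has the same feature columns, in the same order, as the training table.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the trained model: fit on random toy data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Save the model and load it back, as you would with best_model.joblib.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# New feature table: same number and order of columns as in training.
X_new = rng.normal(size=(5, 3))
scores = reloaded.predict_proba(X_new)[:, 1]  # probability of the positive class
print(scores)
```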
The package itself reloads the model for evaluation, feature importance, and gene ranking.
It is a frozen copy of the trained model used in the reference analysis.
Only load .joblib files from a trusted source.
- ML developers
- bioinformaticians doing repeated scoring
- people deploying or reusing the best model
GLOOM is strongest when you need:
- one pipeline instead of many disconnected tools,
- reproducibility,
- gene ranking + network annotation together,
- final tables + final visuals + reports in one place.
You can do it manually, but it takes more work.
Tools:
- Python (pandas)
- or R
You would need to:
- remove duplicates
- replace invalid values
- log-transform
- filter low-expression genes
- filter low-variance genes
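The cleaning steps above can be sketched in pandas as follows. Every concrete choice here is an assumption for illustration (averaging duplicate genes, treating negative values as invalid, filling gaps with 0, log2(x + 1), and both filter cut-offs); the document does not state which rules GLOOM applies.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: genes in rows, samples in columns (made-up values).
raw = pd.DataFrame(
    {
        "s1": [10.0, 12.0, 0.0, 5.0, -1.0],
        "s2": [20.0, 18.0, 0.0, 5.0, 3.0],
        "s3": [40.0, np.nan, 1.0, 5.0, 8.0],
    },
    index=["GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# 1. Remove duplicates: average rows that share a gene name.
expr = raw.groupby(level=0).mean()

# 2. Replace invalid values: negatives become missing, missing becomes 0.
expr = expr.where(expr >= 0).fillna(0.0)

# 3. Log-transform.
expr = np.log2(expr + 1.0)

# 4. Filter low-expression genes (mean log expression below 1.0).
expr = expr[expr.mean(axis=1) >= 1.0]

# 5. Filter low-variance genes (variance across samples at or below 0.01).
expr = expr[expr.var(axis=1) > 0.01]

kept_genes = list(expr.index)
print(kept_genes)
```

In this toy run GENE_B is dropped for low expression and GENE_C for zero variance.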
You must make sure tumor and normal use the same gene names.
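One way to do this by hand is to normalize the gene names and keep only the genes present in both matrices. In this sketch the name cleanup (trimming spaces and upper-casing) is an assumption that suits simple symbols but not every naming scheme, and the values are made up.

```python
import pandas as pd

# Toy matrices with slightly inconsistent gene names.
tumor = pd.DataFrame({"t1": [1.0, 2.0, 3.0]}, index=["TP53 ", "egfr", "KRAS"])
normal = pd.DataFrame({"n1": [4.0, 5.0, 6.0]}, index=["EGFR", "TP53", "MYC"])


def clean_names(df):
    """Return a copy whose gene names are stripped and upper-cased."""
    out = df.copy()
    out.index = out.index.str.strip().str.upper()
    return out


tumor = clean_names(tumor)
normal = clean_names(normal)

# Keep only genes found in both tables, in the same order.
shared = tumor.index.intersection(normal.index).sort_values()
tumor_h = tumor.loc[shared]
normal_h = normal.loc[shared]
print(list(shared))
```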
You can do this with:
- Python statistics
- statsmodels
- or R packages like limma / DESeq2 / edgeR (if you switch to an R workflow)
You would manually calculate means, medians, variability, ranks, contrasts, etc.
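A few of these per-gene statistics can be sketched with pandas. The feature names below are hypothetical, the values are made up, and the matrices are assumed to be log-transformed already, so the difference of means acts as a log fold change.

```python
import pandas as pd

# Toy log-scale matrices with matching gene order.
tumor = pd.DataFrame({"t1": [5.0, 2.0], "t2": [7.0, 2.0]}, index=["GENE_A", "GENE_B"])
normal = pd.DataFrame({"n1": [1.0, 2.0], "n2": [3.0, 4.0]}, index=["GENE_A", "GENE_B"])

features = pd.DataFrame(
    {
        "tumor_mean": tumor.mean(axis=1),
        "tumor_median": tumor.median(axis=1),
        "tumor_std": tumor.std(axis=1),
        "normal_mean": normal.mean(axis=1),
    }
)

# Contrast: tumor minus normal.
features["mean_diff"] = features["tumor_mean"] - features["normal_mean"]

# Rank genes by the size of the contrast (1 = largest absolute change).
features["diff_rank"] = features["mean_diff"].abs().rank(ascending=False).astype(int)
print(features)
```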
You would manually calculate correlations and keep strong edges.
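A minimal sketch of that step: compute gene-gene Pearson correlations across samples and add an edge when the absolute correlation passes a cut-off. The 0.8 cut-off and the choice of Pearson correlation are assumptions, and the expression values are made up.

```python
import networkx as nx
import pandas as pd

# Toy expression matrix: genes in rows, samples in columns.
expr = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5], [4.0, 1.0, 3.0, 2.0]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["s1", "s2", "s3", "s4"],
)

# corr() works column-wise, so transpose to correlate genes with genes.
corr = expr.T.corr(method="pearson")

threshold = 0.8  # hypothetical cut-off for a "strong" edge
genes = list(corr.index)

G = nx.Graph()
G.add_nodes_from(genes)
for i, gene_a in enumerate(genes):
    for gene_b in genes[i + 1:]:  # each pair once, no self-loops
        r = corr.loc[gene_a, gene_b]
        if abs(r) >= threshold:
            G.add_edge(gene_a, gene_b, weight=float(r))

print(list(G.edges(data=True)))
```

The double loop is fine for a toy example. For thousands of genes, a vectorized filter on the correlation matrix would be the more practical route.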
You would use NetworkX or igraph to compute:
- degree
- betweenness
- clustering
- eigenvector centrality
- component size
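With NetworkX, the measures listed above could be collected into one table as sketched below. The graph is a toy example. Computing eigenvector centrality only on the largest connected component is a choice made here because the measure is not well defined across disconnected parts; it is not necessarily what GLOOM does.

```python
import networkx as nx
import pandas as pd

# Toy graph with two components.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "F")])

# Size of the connected component each node belongs to.
component_size = {}
for component in nx.connected_components(G):
    for node in component:
        component_size[node] = len(component)

# Eigenvector centrality on the largest component only (assumption, see above).
largest = G.subgraph(max(nx.connected_components(G), key=len))
eigenvector = nx.eigenvector_centrality(largest, max_iter=1000)

# One row per node; nodes outside the largest component get NaN for eigenvector.
features = pd.DataFrame(
    {
        "degree": dict(G.degree()),
        "betweenness": nx.betweenness_centrality(G),
        "clustering": nx.clustering(G),
        "eigenvector": eigenvector,
        "component_size": component_size,
    }
)
print(features)
```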
You would manually match known cancer genes from CGC to your gene list.
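Label matching comes down to a membership test. In the sketch below the gene names are hypothetical placeholders, not real Cancer Gene Census entries. A known gene that is absent from the feature matrix simply contributes no label, which is why the number of retained positives can be smaller than the full list.

```python
import pandas as pd

# Genes present in the feature matrix (hypothetical names).
feature_genes = pd.Index(["GENE_A", "GENE_B", "GENE_C", "GENE_D"])

# Stand-in for the known cancer gene list; GENE_Z is not in the matrix.
known_cancer_genes = {"GENE_A", "GENE_D", "GENE_Z"}

# 1 = known cancer gene, 0 = unlabeled.
labels = pd.Series(
    feature_genes.isin(known_cancer_genes).astype(int),
    index=feature_genes,
    name="label",
)
n_positive = int(labels.sum())
print(labels)
```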
You would use scikit-learn by hand.
You would manually define models, parameters, cross-validation, and model persistence.
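A hand-written version of that step might look like the sketch below. The data is random, and the second model, the hyperparameters, the 5-fold split, and the average-precision scoring are all assumptions; this document only names logistic_regression as the model selected in the reference run.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with an imbalanced label, as is typical for gene labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

# Candidate models (hypothetical choices and settings).
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the positive/negative ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="average_precision").mean()
    for name, model in models.items()
}

# Pick the model with the best mean score and refit it on all the data.
best_name = max(cv_scores, key=cv_scores.get)
best_model = models[best_name].fit(X, y)
print(cv_scores, best_name)
```

The refit model is what you would then save with joblib for later scoring.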
You would manually compute AUROC, AUPRC, MCC, F1, threshold selection, plots.
You would load the best model, score the full feature matrix, and build the candidate tables yourself.
You would manually attach node attributes and save GraphML/JSON/TSV.
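Attaching scores to nodes and exporting could be done roughly as below. The attribute names follow those used in this document; the graph, the values, and the output file names are made up. Values are converted to plain Python types first, because the GraphML writer may not accept NumPy scalar types.

```python
import json
import os
import tempfile

import networkx as nx
import pandas as pd

# Toy graph and per-gene scores.
G = nx.Graph([("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")])
scores = pd.DataFrame(
    {"predicted_prob": [0.9, 0.2, 0.7], "is_cgc_gene": [True, False, False]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
scores["rank"] = scores["predicted_prob"].rank(ascending=False).astype(int)

# Build {node: {attribute: value}} with plain Python types.
attrs = {
    gene: {
        "predicted_prob": float(row["predicted_prob"]),
        "is_cgc_gene": bool(row["is_cgc_gene"]),
        "rank": int(row["rank"]),
    }
    for gene, row in scores.iterrows()
}
nx.set_node_attributes(G, attrs)

with tempfile.TemporaryDirectory() as tmp:
    graphml_path = os.path.join(tmp, "toy_annotated.graphml")
    nx.write_graphml(G, graphml_path)  # GraphML with node attributes
    nx.write_edgelist(
        G, os.path.join(tmp, "toy_edgelist.tsv"), delimiter="\t", data=False
    )  # plain TSV edge list
    cyto = nx.cytoscape_data(G)  # Cytoscape.js-style dictionary
    with open(os.path.join(tmp, "toy_cytoscape.json"), "w") as handle:
        json.dump(cyto, handle)
    reloaded = nx.read_graphml(graphml_path)  # check the round trip

print(reloaded.nodes["GENE_A"])
```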
You would manually write reporting code and interactive HTML pages.
Without GLOOM, the work is possible, but it is more fragmented, more error-prone, and harder to reproduce.
Because it reduces many steps into one reproducible workflow.
Because it links transcriptomics, networks, and machine learning in one place.
Because it helps find candidate genes and see how they connect in the network.
Because it gives a full example of how an omics ML pipeline is structured.
Because they can use final outputs like gene_rankings.csv, novel_candidates.csv, and the dashboard without understanding every technical step.
Because the package produces final figures, dashboards, and summary reports automatically.
- Python
- pandas
- scikit-learn
- matplotlib
- plotly
- NetworkX
- Cytoscape
- Gephi
- Excel
- LibreOffice Calc
- Google Sheets (for smaller files)
- Jupyter
- VS Code notebooks
- PowerPoint
- Word
- PDF viewers
- browser for HTML dashboards
- Run the full pipeline.
- Open gene_rankings.csv.
- Inspect novel_candidates.csv.
- Open interactive_dashboard.html.
- Use Cytoscape on annotated_network.graphml.
- Check model_metrics.csv.
- Check feature_importance.csv.
- Load the .joblib best model.
- Reuse the model on compatible new feature matrices.
- Use annotated_network.graphml.
- Use network_full.graphml or subnetwork_*.graphml.
- Open in Cytoscape/Gephi.
- Explore hubs and subnetworks.
- Use pipeline_report.txt.
- Use pipeline_summary_figure.png.
- Use interactive_dashboard.html for screenshots.
- Use novel_candidates.csv to highlight discoveries.