CausaLens is a causal inference project designed to move beyond correlation-based analysis by explicitly modeling causal structure, identifying confounders, and estimating interventional effects using principled causal reasoning.
The project focuses on:
- Representing causal assumptions as directed graphs
- Identifying valid adjustment sets using the backdoor criterion
- Preventing common causal errors such as conditioning on colliders or post-treatment variables
- Estimating causal effects under explicit assumptions
This repository is centered around an exploratory, notebook-based workflow that emphasizes interpretability, correctness, and causal validity.
In many applied ML and data science problems (e.g., churn prediction, delivery delays, product quality analysis), models answer:
"What is correlated with the outcome?"
But decision-making requires a stronger question:
"What causes the outcome?"
Without causal reasoning:
- Confounders bias estimates
- Interventions fail in the real world
- Models break under distribution shifts
CausaLens provides a structured way to:
- Explicitly encode causal assumptions
- Diagnose confounding paths
- Select correct controls
- Estimate effects that correspond to real-world interventions
This project relies on foundational concepts from causal inference:
- A directed acyclic graph (DAG) represents causal relationships as directed edges (A → B means "A causes B")
- Nodes are variables; edges encode causal assumptions
- DAGs are used to visualize confounding paths and causal flows
- The backdoor criterion identifies which variables to control for to block confounding
- A set of variables Z satisfies the backdoor criterion if:
  - No node in Z is a descendant of treatment
  - Z blocks all backdoor paths from treatment to outcome (paths that start with an arrow into the treatment)
- The adjustment set is the minimal set of variables needed to obtain unbiased causal estimates
- It is derived from the DAG structure using graphical criteria
- It controls for confounders while avoiding colliders and mediators
- Confounder: Affects both treatment and outcome → must adjust for it
- Collider: Caused by both treatment and outcome → must NOT adjust for it
- Mediator: On the causal path (treatment → mediator → outcome) → adjusting blocks the effect
- Causes must precede effects in time
- Post-treatment variables cannot be confounders
- Enforcing temporal order ensures that each assumed causal direction is valid
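The backdoor criterion described above can be checked mechanically on a small graph. The following is a minimal sketch using `networkx`. The toy DAG, the `refund_requested` collider node, and the helper names `d_separated` and `satisfies_backdoor` are illustrative assumptions, not part of the project code. d-separation is tested with the standard construction: take the ancestral subgraph, moralize it, remove the conditioning set, and check connectivity.

```python
import itertools
import networkx as nx


def d_separated(dag, x, y, z):
    """Return True if x and y are d-separated given the set z."""
    z = set(z)
    # 1. Keep only x, y, z and all of their ancestors.
    keep = {x, y} | z
    for node in list(keep):
        keep |= nx.ancestors(dag, node)
    sub = dag.subgraph(keep)
    # 2. Moralize: drop edge directions and connect parents that share a child.
    moral = nx.Graph()
    moral.add_nodes_from(sub.nodes)
    moral.add_edges_from(sub.edges)
    for child in sub.nodes:
        for a, b in itertools.combinations(sub.predecessors(child), 2):
            moral.add_edge(a, b)
    # 3. Remove the conditioning set and test whether x can still reach y.
    moral.remove_nodes_from(z)
    return not nx.has_path(moral, x, y)


def satisfies_backdoor(dag, treatment, outcome, z):
    """Check both backdoor conditions for the candidate adjustment set z."""
    z = set(z)
    # Condition 1: no node in z is a descendant of the treatment.
    if z & nx.descendants(dag, treatment):
        return False
    # Condition 2: z blocks every backdoor path. Removing the edges that
    # leave the treatment keeps only paths entering it through an arrow.
    backdoor_graph = dag.copy()
    backdoor_graph.remove_edges_from(list(dag.out_edges(treatment)))
    return d_separated(backdoor_graph, treatment, outcome, z)


# Toy DAG (hypothetical): one confounder and one collider.
dag = nx.DiGraph([
    ("order_price", "delayed_delivery"),
    ("order_price", "review_score"),
    ("delayed_delivery", "review_score"),
    ("delayed_delivery", "refund_requested"),
    ("review_score", "refund_requested"),
])

print(satisfies_backdoor(dag, "delayed_delivery", "review_score", {"order_price"}))
print(satisfies_backdoor(dag, "delayed_delivery", "review_score", set()))
print(satisfies_backdoor(dag, "delayed_delivery", "review_score",
                         {"order_price", "refund_requested"}))
```

In this toy graph, adjusting for the confounder alone passes the check, the empty set leaves the backdoor path open, and any set containing the collider fails because the collider is a descendant of the treatment.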
Goal: Transform raw e-commerce data into a causally valid analysis-ready dataset.
- Import Multi-Table Data from Kaggle
  - Load four tables: `orders`, `reviews`, `order_items`, `customers`
  - Focus on order-level analysis (one row per order)
- Parse Timestamps
  - Convert all date columns to a consistent datetime format using `pd.to_datetime()`
  - Ensures temporal ordering can be validated
- Create Intervention Variable (Treatment)
  - `delayed_delivery_days` = `order_delivered_customer_date` - `order_estimated_delivery_date`
  - `delayed_delivery` = binary indicator (1 if delay > 0 days, else 0)
  - This is the treatment whose effect we want to estimate
- Merge Outcome Variable
  - Join the `reviews` table to get `review_score` (1-5 stars)
  - This is the outcome we want to explain causally
  - Keep only one review per order (drop duplicates)
- Derive Order-Level Features (Pre-Treatment Confounders)
  - `order_price`: Total cost of all items in the order
  - `freight_value`: Shipping cost
  - `num_items`: Number of products ordered
  - These affect both likelihood of delay AND customer satisfaction
- Derive Customer History Features (Pre-Treatment Confounders)
  - `customer_tenure_days`: Days since first purchase
  - `prior_orders`: Number of previous orders (using `.cumcount()` for temporal validity)
  - `is_first_order`: Binary flag for first-time customers
  - These capture customer loyalty and experience effects
- Apply Validity Filters
  - Keep only delivered orders with valid timestamps and reviews
  - Remove incomplete or invalid observations
- Save Preprocessed Dataset
  - Export clean CSV with treatment, outcome, and confounders
  - Ready for causal discovery and estimation
Output: olist_preprocessed_causal.csv (~96k orders, 12 columns)
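The treatment and customer-history steps above can be sketched on a tiny hand-made table. This is an illustration only: the four rows are invented, and the `customer_unique_id` and `order_purchase_timestamp` column names are assumptions about the raw tables rather than something stated in this README. The sketch shows why `.cumcount()` on time-sorted orders is temporally valid: each order only counts the orders that came before it.

```python
import pandas as pd

# Invented sample rows; the real pipeline loads the Kaggle tables instead.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_unique_id": ["c1", "c1", "c2", "c1"],  # assumed column name
    "order_purchase_timestamp": ["2018-01-01", "2018-02-01", "2018-01-15", "2018-03-01"],
    "order_delivered_customer_date": ["2018-01-10", "2018-02-05", "2018-01-30", "2018-03-04"],
    "order_estimated_delivery_date": ["2018-01-08", "2018-02-10", "2018-01-25", "2018-03-10"],
})

# Parse every date column into a consistent datetime dtype.
date_cols = [
    "order_purchase_timestamp",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]
for col in date_cols:
    orders[col] = pd.to_datetime(orders[col])

# Treatment: days past the estimated delivery date, then a binary flag.
orders["delayed_delivery_days"] = (
    orders["order_delivered_customer_date"] - orders["order_estimated_delivery_date"]
).dt.days
orders["delayed_delivery"] = (orders["delayed_delivery_days"] > 0).astype(int)

# Customer history: sort by time so cumcount() only sees earlier orders.
orders = orders.sort_values(["customer_unique_id", "order_purchase_timestamp"])
by_customer = orders.groupby("customer_unique_id")
orders["prior_orders"] = by_customer.cumcount()
orders["is_first_order"] = (orders["prior_orders"] == 0).astype(int)
first_purchase = by_customer["order_purchase_timestamp"].transform("min")
orders["customer_tenure_days"] = (
    orders["order_purchase_timestamp"] - first_purchase
).dt.days

print(orders[["order_id", "delayed_delivery", "prior_orders",
              "is_first_order", "customer_tenure_days"]])
```

Because every history feature is computed from orders at or before the current purchase, none of them can be affected by the delivery delay of that order.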
Goal: Discover causal structure from data and identify which variables confound the treatment-outcome relationship.
- Run Three Causal Discovery Algorithms
  - PC (Constraint-Based): Tests conditional independence using Fisher's Z test
  - GES (Score-Based): Optimizes Bayesian Information Criterion (BIC)
  - NOTEARS (Continuous Optimization): Uses gradient descent with acyclicity constraint
  - Each algorithm proposes edges based on different statistical principles
- Sample Data for Efficiency
  - Use 10,000 random orders (from ~96k) to reduce computational cost
  - A random subsample of this size is usually enough for structure learning to detect the main relationships
  - Standardize variables (mean=0, std=1) for numerical stability
- Build Consensus DAG (Conservative Approach)
  - Keep only edges agreed upon by ≥2 out of 3 algorithms
  - This reduces false positives and improves robustness
  - Purely data-driven structure discovery
- Enforce Temporal Constraints
  - Remove edges that violate time ordering:
    - `review_score → delayed_delivery` (future cannot cause past)
    - `delayed_delivery → order_price` (treatment cannot cause pre-treatment)
  - Keep only valid temporal flows:
    - `order_price → delayed_delivery` (past → present)
    - `delayed_delivery → review_score` (present → future)
    - `order_price → review_score` (past → future)
- Incorporate Domain Knowledge
  - Algorithms may miss important confounders due to sample size or weak signals
  - Manually add theoretically justified edges:
    - `order_price → delayed_delivery` (expensive orders → special handling → delays)
    - `prior_orders → delayed_delivery` (loyal customers → priority shipping)
    - `is_first_order → delayed_delivery` (first-timers → default slower tier)
  - Hybrid approach: data + domain expertise
- Identify Adjustment Set (Backdoor Criterion)
  - Find variables that create backdoor paths from treatment to outcome
  - Backdoor path example: `delayed_delivery ← order_price → review_score` (order price affects both delay likelihood AND review score → confounding)
  - Adjustment set: Variables we must control for to block these paths
  - In this project: `[order_price, prior_orders, is_first_order, num_items]`
- Visualize Final DAG
  - Create graph with:
    - Red node: Treatment (`delayed_delivery`)
    - Teal node: Outcome (`review_score`)
    - Green nodes: Pre-treatment confounders
  - Save as high-resolution PNG for reporting
Output:
- `causal_dag_edges.csv`: List of directed edges in final DAG
- `adjustment_set.txt`: Variables to condition on in Step 3
- `causal_dag_visual.png`: DAG visualization
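The consensus vote, temporal filter, and domain-knowledge steps can be sketched in plain Python. The three edge sets below are hypothetical stand-ins for the algorithm outputs, not real results from this project, and the `tier` mapping (pre-treatment = 0, treatment = 1, outcome = 2) is an assumed encoding of the time ordering. Edges between variables in the same tier are allowed here, which is also an assumption.

```python
from collections import Counter

# Hypothetical edge sets standing in for PC, GES, and NOTEARS output.
pc_edges = {
    ("num_items", "delayed_delivery"),
    ("delayed_delivery", "review_score"),
    ("review_score", "delayed_delivery"),
}
ges_edges = {
    ("num_items", "delayed_delivery"),
    ("delayed_delivery", "review_score"),
    ("delayed_delivery", "order_price"),
}
notears_edges = {
    ("delayed_delivery", "review_score"),
    ("review_score", "delayed_delivery"),
    ("delayed_delivery", "order_price"),
    ("order_price", "review_score"),
}

# Consensus: keep edges proposed by at least 2 of the 3 algorithms.
votes = Counter(
    edge for edges in (pc_edges, ges_edges, notears_edges) for edge in edges
)
consensus = {edge for edge, count in votes.items() if count >= 2}

# Temporal filter: an edge may not point from a later tier to an earlier one.
tier = {
    "order_price": 0, "num_items": 0, "prior_orders": 0, "is_first_order": 0,
    "delayed_delivery": 1,
    "review_score": 2,
}
temporally_valid = {(src, dst) for src, dst in consensus if tier[src] <= tier[dst]}

# Domain knowledge: add the theoretically justified edges listed above.
domain_edges = {
    ("order_price", "delayed_delivery"),
    ("prior_orders", "delayed_delivery"),
    ("is_first_order", "delayed_delivery"),
}
final_edges = temporally_valid | domain_edges

for src, dst in sorted(final_edges):
    print(f"{src} -> {dst}")
```

Applying the temporal filter after the vote means an edge that two algorithms agree on can still be rejected when it points backwards in time.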
Why This Matters:
- Without this step, you'd either:
  - Control for nothing → biased by confounding
  - Control for everything → introduce collider bias or overadjustment
- The adjustment set tells you exactly which variables to include in regression
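To make the "control for everything" risk concrete, here is a small simulation sketch using only `numpy`. All data are synthetic: the true effect of -1.7 is a simulation setting chosen for illustration, and `collider` is a hypothetical variable caused by both treatment and outcome. There is no confounding in this simulation, so the unadjusted regression is already correct and adding the collider is what breaks it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000
true_effect = -1.7  # simulation setting, not a project result

# Treatment is randomly assigned here, so there is no confounding.
t = rng.binomial(1, 0.3, size=n).astype(float)
y = 4.0 + true_effect * t + rng.normal(0, 1, size=n)
# The collider is caused by BOTH the treatment and the outcome.
collider = t + y + rng.normal(0, 1, size=n)


def ols_treatment_coef(outcome, *covariates):
    """Fit OLS with an intercept and return the first covariate's coefficient."""
    design = np.column_stack([np.ones(len(outcome)), *covariates])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1]


unadjusted = ols_treatment_coef(y, t)
with_collider = ols_treatment_coef(y, t, collider)

print(f"true effect:            {true_effect:.2f}")
print(f"unadjusted estimate:    {unadjusted:.2f}")
print(f"adjusting for collider: {with_collider:.2f}")
```

Under these settings the collider-adjusted coefficient converges to roughly -1.35 rather than -1.7, which is why the adjustment set must be read off the DAG instead of including every available column.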
Goal: Estimate the unbiased causal effect of delivery delays on customer satisfaction.
- Baseline: Naive Analysis
  - Simple comparison of delayed vs on-time orders without adjustments
  - Purpose: Show how much confounding bias remains when causal methods aren't used
  - Any gap between this estimate and the adjusted estimates reflects bias from confounders left unadjusted
- Causal Estimation Methods
  - We apply three complementary approaches (Regression Adjustment, Inverse Propensity Weighting, and Matching) to check robustness.
- Triangulation
  - Compare estimates across all three methods
  - Agreement across methods → higher confidence in the causal estimate
  - Disagreement → signals potential assumption violations
Output:
- `causal_effect_estimates.csv`: Effect estimates from all methods
- Typical finding: Naive effect ≈ -1.73 stars, Causal estimates ≈ -1.74 stars (minimal confounding detected in this dataset)
Interpretation:
- "Delivery delays cause a reduction of approximately 1.7 stars in customer reviews, after accounting for order characteristics and customer history"
- This represents the interventional effect (expected impact if delays were eliminated)
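The difference between the naive comparison and two of the adjusted estimators can be sketched on synthetic data with `numpy` alone. This is not the project's estimation code: the data-generating process, the single binary confounder `x` (standing in for something like `is_first_order`), and the true effect of -1.7 are all simulation assumptions. Matching is omitted to keep the sketch short, and the propensity score is estimated as the treatment rate within each level of `x` instead of with a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = -1.7  # simulation setting, not a project result

# One binary confounder raises both the delay probability and lowers scores.
x = rng.binomial(1, 0.4, size=n)
t = rng.binomial(1, np.where(x == 1, 0.6, 0.2))
y = 4.0 + true_effect * t - 0.8 * x + rng.normal(0, 1, size=n)

# Naive: raw difference in means, ignoring the confounder.
naive = y[t == 1].mean() - y[t == 0].mean()

# Regression adjustment: OLS of y on treatment and the confounder.
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
regression_adjusted = coef[1]

# Inverse propensity weighting with normalized (Hajek) weights.
propensity = np.where(x == 1, t[x == 1].mean(), t[x == 0].mean())
w_treated = t / propensity
w_control = (1 - t) / (1 - propensity)
ipw = (
    np.sum(w_treated * y) / np.sum(w_treated)
    - np.sum(w_control * y) / np.sum(w_control)
)

print(f"naive:               {naive:.2f}")
print(f"regression adjusted: {regression_adjusted:.2f}")
print(f"IPW:                 {ipw:.2f}")
```

In this simulation the naive estimate is pulled away from -1.7 by the confounder while both adjusted estimates land close to it. When the estimators agree with each other, as in the triangulation step above, that is some evidence the adjustment is doing its job.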
Goal: Determine if the causal effect varies across different customer segments.
- Stratified Analysis by Customer Type
  - Split data into subgroups: first-time vs repeat customers
  - Estimate delay effect within each group separately
  - Compare: Do first-timers suffer more from delays than loyal customers?
- Analysis by Order Value
  - Create categories: low, medium, high value orders (using quantiles)
  - Estimate effect within each value tier
  - Check if expensive orders are more sensitive to delays
- Analysis by Customer Tenure
  - Categories: new, established, loyal customers
  - Test if relationship with company affects delay tolerance
- Statistical Interaction Test
  - Regression with interaction term: `delayed_delivery × is_first_order`
  - Tests if effect modification is statistically significant
  - Coefficient = difference in delay effect between subgroups
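The quantile-based value tiers can be sketched with `pd.qcut`. The data below are synthetic, with a constant simulated effect of -1.7 and randomly assigned delays, so a plain difference in means is enough inside each tier. On the real data each within-tier estimate would still need the adjustment set. The `value_tier` column and `delay_effect` helper are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 9_000

# Synthetic orders: skewed prices, randomly assigned delays.
df = pd.DataFrame({
    "order_price": rng.lognormal(mean=4.0, sigma=0.8, size=n),
    "delayed_delivery": rng.binomial(1, 0.3, size=n),
})
df["review_score"] = 4.2 - 1.7 * df["delayed_delivery"] + rng.normal(0, 1, size=n)

# Three equal-sized tiers based on price terciles.
df["value_tier"] = pd.qcut(df["order_price"], q=3, labels=["low", "medium", "high"])


def delay_effect(group):
    """Difference in mean review score: delayed minus on-time."""
    delayed = group.loc[group["delayed_delivery"] == 1, "review_score"].mean()
    on_time = group.loc[group["delayed_delivery"] == 0, "review_score"].mean()
    return delayed - on_time


tier_effects = {
    tier: delay_effect(group)
    for tier, group in df.groupby("value_tier", observed=True)
}

for tier, effect in tier_effects.items():
    print(f"{tier:>6}: {effect:.2f}")
```

Using quantiles rather than fixed price cut-offs keeps the tiers similar in size, so the per-tier estimates have comparable precision.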
Typical Finding:
- Interaction coefficient ≈ -0.15 (not significant, p > 0.05)
- Conclusion: No statistically significant evidence that delay effects differ across customer types
- Business implication: No need for segment-specific interventions; universal on-time delivery improvement helps all customers equally
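The reading of the interaction coefficient as "difference in delay effect between subgroups" can be verified on synthetic data. In the sketch below the simulated data deliberately contain heterogeneity (an extra -0.3 stars for first-time customers), which differs from the project's finding. The numbers are simulation settings only. With two binary variables and their product in the model, the OLS interaction coefficient equals the difference between the two subgroup effects exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

t = rng.binomial(1, 0.3, size=n)  # delayed_delivery
f = rng.binomial(1, 0.5, size=n)  # is_first_order
# Simulated truth: -1.7 for repeat customers, -2.0 for first-timers.
y = 4.2 - 1.7 * t - 0.2 * f - 0.3 * t * f + rng.normal(0, 1, size=n)

# OLS with intercept, both main effects, and the interaction term.
design = np.column_stack([np.ones(n), t, f, t * f])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
interaction = coef[3]

# Classical standard error for the interaction coefficient.
residuals = y - design @ coef
sigma2 = residuals @ residuals / (n - design.shape[1])
covariance = sigma2 * np.linalg.inv(design.T @ design)
z_stat = interaction / np.sqrt(covariance[3, 3])


def subgroup_effect(mask):
    """Delayed minus on-time mean score within the masked subgroup."""
    return y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()


effect_first = subgroup_effect(f == 1)
effect_repeat = subgroup_effect(f == 0)

print(f"effect, first-time: {effect_first:.2f}")
print(f"effect, repeat:     {effect_repeat:.2f}")
print(f"interaction coef:   {interaction:.2f} (z = {z_stat:.1f})")
```

A large absolute z-statistic would indicate effect modification. A value near zero, as in the typical finding above, means the data do not show a detectable difference between the subgroups.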
Why This Matters:
- Identifies high-value intervention targets (e.g., "prioritize first-timers")
- Or confirms effect is stable (one-size-fits-all strategy works)
- Prevents wasted resources on ineffective targeting
- Treatment: Delivery delay (binary: on-time vs delayed)
- Outcome: Customer review score (1-5 stars)
- Estimated Effect: -1.73 stars (95% CI: [-1.76, -1.70])
- Interpretation: Delays cause customers to rate ~1.7 stars lower on average
Variables that affect BOTH delay likelihood AND review scores:
- `order_price`: Expensive orders get delayed more + rated differently
- `prior_orders`: Loyal customers get priority + are more forgiving
- `is_first_order`: First-timers get slower shipping + rate harshly
- `num_items`: More items → longer processing + affects satisfaction
- No significant heterogeneity detected (interaction p > 0.05)
- Delay effects are consistent across:
  - First-time vs repeat customers
  - Low vs high value orders
  - New vs loyal customers
- Implication: Universal improvement strategy is optimal
- Causal Discovery: `causal-learn`, `gcastle` (PC, GES, NOTEARS algorithms)
- Graph Analysis: `networkx` (DAG manipulation and visualization)
- Statistical Modeling: `scikit-learn`, `statsmodels` (regression, propensity scores)
- Data Processing: `pandas`, `numpy`