E-commerce platforms often suffer from "Review Fatigue." While a product may have thousands of reviews, many are low-quality: gibberish, one-word sentiments ("Good"), or spam. Chronological sorting forces users to dig for helpful information, directly hurting conversion rates (CVR).
The Goal: Automatically filter out noise and rank reviews based on their informational density and feature coverage, surfacing the most helpful content for prospective buyers.
I developed a production-ready pipeline that transforms raw, messy text into a prioritized list of reviews using a hybrid approach of NLP feature engineering and machine learning.
- Pre-Processing Filter: Immediate removal of non-string data, language errors, profanity, and gibberish.
-
Feature Engineering: Calculation of custom metrics like Noun-Density (
$R_n$ ) and Product Coverage ($R_c$ ) to prioritize detailed descriptions. - Pairwise ML Ranking: A Random Forest model acts as a "judge" in a tournament-style comparison to determine the final helpfulness score.
Implementing this engine yields four primary commercial benefits:
- Increased Conversion Rate (CVR): By surfacing high-quality, feature-rich reviews, we reduce the "Information Search Cost" for customers, typically yielding a 5-10% uplift in conversion.
- Automated Brand Protection: Multi-lingual filters act as a "Brand Safety" layer, reducing manual moderation overhead by ~80%.
-
Enhanced Product Insights: Metrics like Noun-Density (
$R_n$ ) identify which specific product attributes (e.g., "battery life") customers actually care about. -
Scalability: An
$O(n)$ optimized solution capable of processing an entire day’s worth of global feedback in seconds.
-
Challenge: Comparing every review against every other review using nested loops created
$O(n^2)$ complexity, making the pipeline extremely slow. - Resolution: Refactored the logic to use Vectorized Cross-Joins in Pandas. By creating a comparison matrix of all matchups at once, I leveraged NumPy's underlying speed to reduce ranking time by over 90%.
- Challenge: Simple word counts are a poor proxy for quality.
-
Resolution: Engineered a Noun-Density Metric (
$R_n$ ). Using Spacy's POS tagging, I prioritized reviews that mentioned specific objects (features) like "camera" or "battery" over those filled purely with subjective adjectives.
- Challenge: Reviews often contain keyboard-mash gibberish or multi-lingual profanity (English/non-English).
- Resolution: Implemented a multi-stage
ReviewProcessorclass inutils.pythat uses Markov-chain probability for gibberish detection and localized swear filters before the ML model is triggered.
datapipeline.py: The production execution script with automated logging.utils.py: Modularized NLP logic and theReviewProcessorclass.notebooks/: Exploratory Data Analysis, Feature Engineering, and Model Training.models/: Serializedrandomforest.joblibjudge model.
- Install Dependencies:
pip install -r requirements.txt python -m spacy download en_core_web_sm
