Competition Result: SMAPE 58.66 | Rank: 2232 / 6000 (Top 42%)
An ensemble machine learning pipeline for predicting product prices from catalog text descriptions. This solution combines advanced feature engineering, multiple text vectorization techniques, and stacked ensemble modeling to achieve competitive performance.
| Metric | Value | Notes |
|---|---|---|
| SMAPE | 58.6571% | Top 42% |
| Rank | 2232 | of 6000 participants |
| Models Used | 3 + Meta | Stacked ensemble |
| Runtime | ~12 min | CPU-optimized |
- Advanced Feature Engineering: extracts 20+ features from text (quantities, quality indicators, materials, sizes)
- Multi-Strategy Text Vectorization: TF-IDF with word/character n-grams plus SVD dimensionality reduction
- Ensemble Learning: XGBoost + LightGBM + Extra Trees with a Ridge meta-model
- Competition-Safe: no external LLMs or APIs; a pure ML approach
- Efficient: CPU-optimized, runs in ~12 minutes
- Well-Documented: extensive comments explaining every concept
```
Python 3.8+
pandas
numpy
scikit-learn
xgboost
lightgbm
```

```bash
# Clone the repository
git clone https://github.com/ramoware/Amazon-ML-Challenge-2025.git
cd price-prediction

# Install dependencies
pip install -r requirements.txt
```

```bash
# Ensure your data files are in the same directory
# Required files: train.csv, test.csv

# Run the pipeline
python model.py

# Output: output.csv
```

```
price-prediction/
│
├── model.py            # Main pipeline script
├── requirements.txt    # Python dependencies
├── README.md           # This file
│
├── train.csv (.xlsx)   # Training data (not included)
├── test.csv (.xlsx)    # Test data (not included)
└── output.csv (.xlsx)  # Generated predictions
```
The pipeline extracts comprehensive features from product descriptions:
```python
# Quantity Detection
"Pack of 12" → quantity=12, is_multi_pack=1

# Quality Indicators
"Premium Leather" → premium_count=1, net_quality=+1

# Size Scoring
"XXL Jumbo" → max_size_score=3

# Material Analysis
"Stainless Steel" → premium_material_count=2
```

Feature Categories:
- Quantity & Pack Size
- Quality Level (Premium vs Economy)
- Size Indicators
- Material Composition
- Numeric Patterns
- Text Complexity
- Brand Indicators
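The full extraction logic lives in model.py; the following is a minimal sketch of regex-style rules like those above. The keyword tables and the helper name `extract_features` are illustrative assumptions, not the exact ones used.

```python
import re

# Illustrative keyword tables; the real lists in model.py are larger.
PREMIUM_WORDS = {"premium", "luxury", "deluxe", "professional"}
ECONOMY_WORDS = {"basic", "economy", "budget", "value"}
SIZE_SCORES = {"small": 1, "medium": 1, "large": 2, "xl": 3, "xxl": 3, "jumbo": 3}

def extract_features(text: str) -> dict:
    t = text.lower()
    words = re.findall(r"[a-z]+", t)
    # Quantity detection: "pack of 12", "set of 6", ...
    m = re.search(r"(?:pack|set)\s+of\s+(\d+)", t)
    quantity = int(m.group(1)) if m else 1
    premium = sum(w in PREMIUM_WORDS for w in words)
    economy = sum(w in ECONOMY_WORDS for w in words)
    return {
        "quantity": quantity,
        "is_multi_pack": int(quantity > 1),
        "premium_count": premium,
        "net_quality": premium - economy,
        "max_size_score": max((SIZE_SCORES.get(w, 0) for w in words), default=0),
        "num_count": len(re.findall(r"\d+", t)),  # numeric patterns
        "text_len": len(t),                       # text-complexity proxy
    }

print(extract_features("Premium Leather Wallet, Pack of 12, XXL"))
# {'quantity': 12, 'is_multi_pack': 1, 'premium_count': 1, 'net_quality': 1, ...}
```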
Multiple complementary approaches capture different aspects of text:
| Method | N-grams | Features | Purpose |
|---|---|---|---|
| TF-IDF (Word 1-2) | 1-2 | 200 → 50 | Semantic meaning |
| TF-IDF (Word 1-3) | 1-3 | 150 → 50 | Contextual phrases |
| TF-IDF (Char 3-5) | 3-5 | 100 → 50 | Spelling patterns |
| Count (Char 3-6) | 3-6 | 100 | Robust to typos |
Total Text Features: 250 dimensions
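A condensed sketch of how these four views could be fitted and concatenated. The function name, the `char` analyzer choice, and the assumption that `train_texts` is the full corpus (so each vocabulary exceeds the SVD component count) are all mine; the real wiring is in model.py.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_text_features(train_texts, test_texts):
    """Fit each vectorizer on train text only, then stack all views: 50+50+50+100 = 250 dims."""
    configs = [
        (TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200), 50),
        (TfidfVectorizer(analyzer="word", ngram_range=(1, 3), max_features=150), 50),
        (TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=100), 50),
        (CountVectorizer(analyzer="char", ngram_range=(3, 6), max_features=100), None),
    ]
    train_blocks, test_blocks = [], []
    for vec, n_svd in configs:
        tr, te = vec.fit_transform(train_texts), vec.transform(test_texts)
        if n_svd is not None:
            # Compress the sparse TF-IDF matrix to dense SVD components.
            svd = TruncatedSVD(n_components=n_svd, random_state=42)
            tr, te = svd.fit_transform(tr), svd.transform(te)
        else:
            tr, te = tr.toarray(), te.toarray()
        train_blocks.append(tr)
        test_blocks.append(te)
    return np.hstack(train_blocks), np.hstack(test_blocks)
```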
```
┌─────────────┐
│   XGBoost   │────┐
└─────────────┘    │
                   │     ┌──────────────┐
┌─────────────┐    ├────▶│  Ridge Meta  │────▶ Final Prediction
│  LightGBM   │────┤     │    Model     │
└─────────────┘    │     └──────────────┘
                   │
┌─────────────┐    │
│ Extra Trees │────┘
└─────────────┘
```
Why This Works:
- Diversity: different algorithms capture different patterns
- Robustness: ensembling reduces overfitting
- Performance: stacking often beats the best individual model
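In sketch form, assuming feature matrices `X`, `X_test` and target `y` are already built (the CV fold count and the helper name `fit_stack` are assumptions; estimator settings follow the configuration section below):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from xgboost import XGBRegressor

base_models = [
    XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=8, subsample=0.8),
    LGBMRegressor(n_estimators=500, learning_rate=0.05, max_depth=8, subsample=0.8),
    ExtraTreesRegressor(n_estimators=100, max_depth=20, min_samples_split=10),
]

def fit_stack(X, y, X_test):
    # Level 0: out-of-fold predictions, so the meta-model never sees a
    # base model's prediction on a row that model was trained on.
    oof = np.column_stack([cross_val_predict(m, X, y, cv=3) for m in base_models])
    # Refit each base model on all data for test-time predictions.
    test_cols = []
    for m in base_models:
        m.fit(X, y)
        test_cols.append(m.predict(X_test))
    # Level 1: Ridge learns how to weight the three base predictions.
    meta = Ridge(alpha=1.0).fit(oof, y)
    return meta.predict(np.column_stack(test_cols))
```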
Ensures realistic predictions:
- Minimum price: $0.01
- Maximum cap: 120% of the training set's 98th percentile
- Log-space smoothing (98% factor)
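One plausible reading of these three steps, as a sketch; the exact smoothing formula in model.py may differ.

```python
import numpy as np

def postprocess(pred, train_prices):
    # Shrink predictions 2% toward the training mean in log space
    # (one interpretation of the "98% factor" smoothing).
    log_pred = np.log1p(np.clip(pred, 0, None))
    log_pred = 0.98 * log_pred + 0.02 * np.log1p(train_prices).mean()
    pred = np.expm1(log_pred)
    # Floor at $0.01, cap at 120% of the training 98th percentile.
    cap = 1.2 * np.percentile(train_prices, 98)
    return np.clip(pred, 0.01, cap)
```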
```
┌──────────────────────────────────┐
│  SMAPE:       58.6571%           │
│  Rank:        < 2500 / 6000      │
│  Percentile:  Top 42%            │
└──────────────────────────────────┘
```
| Model | Validation SMAPE | Training Time |
|---|---|---|
| XGBoost | ~60% | ~2 min |
| LightGBM | ~61% | ~1.5 min |
| Extra Trees | ~63% | ~1 min |
| Stacked Ensemble | ~59% | ~5 min total |
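For reference, the metric behind both tables: a standard SMAPE implementation (the competition's exact variant may differ in edge-case handling).

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error on a 0-200% scale."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    denom = np.where(denom == 0, 1.0, denom)  # guard zero/zero pairs
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

print(smape([100, 50], [120, 40]))  # ≈ 20.20
```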
- Feature Engineering Impact: Engineered features provide 40% of predictive power
- Text Embeddings: Character-level n-grams surprisingly effective
- Ensemble Benefit: ~2% improvement over best single model
- Outlier Removal: Removing top 2% prices improved stability
XGBoost:
```python
n_estimators=500     # More trees = better fit
learning_rate=0.05   # Conservative learning
max_depth=8          # Moderate complexity
subsample=0.8        # 80% of data per tree
```

LightGBM:
```python
n_estimators=500
learning_rate=0.05
max_depth=8
subsample=0.8
```

Extra Trees:
```python
n_estimators=100       # Fewer trees (faster)
max_depth=20           # Deeper trees
min_samples_split=10   # Regularization
```

Ridge meta-model:
```python
alpha=1.0   # L2 regularization strength
```

- Advanced NLP: add sentiment analysis, POS tagging
- Deep Learning: BERT/transformer embeddings
- Feature Selection: Recursive feature elimination
- Hyperparameter Tuning: Bayesian optimization
- Cross-Validation: K-fold for robust validation
- Neural Network: Add deep learning to ensemble
- Category-Specific Models: Separate models per product category
- Price Bucketing: Classification + regression hybrid
Expected Improvements: 5-10% SMAPE reduction possible
```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Plot XGBoost feature importance
xgb.plot_importance(xgb_model, max_num_features=20)
plt.title("Top 20 Most Important Features")
plt.show()
```

```python
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(y_clean, bins=50, edgecolor='black')
plt.title("Actual Price Distribution")

plt.subplot(1, 2, 2)
plt.hist(final_pred, bins=50, edgecolor='black')
plt.title("Predicted Price Distribution")

plt.tight_layout()
plt.show()
```

Issue 1: Memory Error
```python
# Reduce feature dimensions
max_features=100   # Instead of 200
n_components=30    # Instead of 50
```

Issue 2: Slow Training

```python
# Reduce ensemble size
n_estimators=200   # Instead of 500
```

Issue 3: Poor Performance

```python
# Check data quality
print(train['price'].describe())
print(train['catalog_content'].isnull().sum())
```

```
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
xgboost>=1.5.0
lightgbm>=3.3.0
```

Install all at once:

```bash
pip install pandas numpy scikit-learn xgboost lightgbm
```

Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Additional feature engineering ideas
- Alternative text vectorization methods
- Hyperparameter optimization experiments
- Documentation improvements
- Bug fixes
Want to understand the concepts better?
- Ensemble Learning: Sklearn Ensemble Guide
- TF-IDF: Understanding TF-IDF
- XGBoost: Official XGBoost Tutorial
- SMAPE: Forecast Error Metrics
- Stacking: Stacked Generalization
This project is licensed under the MIT License - see the LICENSE file for details.
Ramdev Chaudhary [Team Leader]
- GitHub: @ramoware
- LinkedIn: ramdevchaudhary
- Email: ramoware@gmail.com
Pranita Jagtap [Co-Leader]
- GitHub: @PranitaJagtap
- LinkedIn: PranitaJagtap
- Email: jagtappranita2003@gmail.com
Vedant Wadekar [Associate]
- GitHub: @Vedantwadekar2112
- LinkedIn: vedant-wadekar-394948378
- Email: vedantwadekar49@gmail.com
Sony Yadav [Associate]
- GitHub: @ramoware
- LinkedIn: sony-yadav-17393232a
- Email: soniy11265@gmail.com
- Competition organizers for the dataset and challenge
- scikit-learn, XGBoost, and LightGBM communities
- Open source ML community for inspiration
If you found this project helpful, please consider giving it a star!
- Email: ramoware@gmail.com
- Issues: GitHub Issues
- Documentation: Wiki
Made with ❤️ and ☕ for the ML community