
πŸ† Advanced Price Prediction Model


🎯 **Competition Result:** SMAPE 58.66 | Rank 2232 / 6000 (Top 42%)

An ensemble machine learning pipeline for predicting product prices from catalog text descriptions. This solution combines advanced feature engineering, multiple text vectorization techniques, and stacked ensemble modeling to achieve competitive performance.


## 📊 Performance Metrics

| Metric | Value | Notes |
|--------|-------|-------|
| SMAPE | 58.6571% | Top 42% |
| Rank | 2232 / 6000 | ~6000 participants |
| Models | 3 base + 1 meta | Stacked ensemble |
| Runtime | ~12 min | CPU-optimized |
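
For reference, SMAPE can be computed as below. This is a minimal sketch assuming the standard symmetric-MAPE definition; the competition's exact variant may differ in scaling:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent: 100 * mean(2|pred - true| / (|true| + |pred|))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    denom[denom == 0] = 1.0  # 0/0 terms contribute 0 by convention
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true) / denom)
```

A 2x under-prediction (true 100, predicted 50) scores 66.67%, which gives a feel for how far 58.66% is from perfect (0%).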

## ✨ Key Features

- 🔧 **Advanced Feature Engineering**: extracts 20+ features from text (quantities, quality indicators, materials, sizes)
- 📝 **Multi-Strategy Text Vectorization**: TF-IDF with word/character n-grams plus SVD dimensionality reduction
- 🤖 **Ensemble Learning**: XGBoost + LightGBM + Extra Trees with a Ridge meta-model
- 🎯 **Competition-Safe**: no external LLMs or APIs; a pure classical-ML approach
- ⚡ **Efficient**: optimized for CPU, runs in ~12 minutes
- 📚 **Well-Documented**: extensive comments explaining every concept

## 🚀 Quick Start

### Prerequisites

- Python 3.8+
- pandas
- numpy
- scikit-learn
- xgboost
- lightgbm

### Installation

```bash
# Clone the repository
git clone https://github.com/ramoware/Amazon-ML-Challenge-2025.git
cd Amazon-ML-Challenge-2025

# Install dependencies
pip install -r requirements.txt
```

### Usage

```bash
# Ensure your data files are in the same directory
# Required files: train.csv, test.csv

# Run the pipeline
python model.py

# Output: output.csv
```

## 📁 Project Structure

```text
Amazon-ML-Challenge-2025/
│
├── model.py             # Main pipeline script
├── requirements.txt     # Python dependencies
├── README.md            # This file
│
├── train.csv (.xlsx)    # Training data (not included)
├── test.csv (.xlsx)     # Test data (not included)
└── output.csv (.xlsx)   # Generated predictions
```

## 🧠 Methodology

### 1. Feature Engineering (20+ Features)

The pipeline extracts comprehensive features from product descriptions:

```text
# Quantity Detection
"Pack of 12"        → quantity=12, is_multi_pack=1

# Quality Indicators
"Premium Leather"   → premium_count=1, net_quality=+1

# Size Scoring
"XXL Jumbo"         → max_size_score=3

# Material Analysis
"Stainless Steel"   → premium_material_count=2
```

Feature Categories:

- 📦 Quantity & Pack Size
- ⭐ Quality Level (Premium vs Economy)
- 📏 Size Indicators
- 🛠️ Material Composition
- 🔢 Numeric Patterns
- 📝 Text Complexity
- 🏷️ Brand Indicators
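
Extraction rules of this kind can be sketched with simple regular expressions. The word lists, patterns, and feature names below are illustrative; the actual patterns in model.py may differ:

```python
import re

# Illustrative vocabularies; the real pipeline's lists are larger
PREMIUM_WORDS = {"premium", "luxury", "deluxe", "professional"}
ECONOMY_WORDS = {"economy", "budget", "basic", "cheap"}
SIZE_SCORES = {"small": 1, "medium": 1, "large": 2, "xl": 2, "xxl": 3, "jumbo": 3}

def extract_features(text: str) -> dict:
    """Turn one catalog description into a dict of numeric features."""
    t = text.lower()
    words = set(re.findall(r"[a-z]+", t))
    # Quantity detection: "pack of 12", "set of 6", "12-pack", ...
    m = re.search(r"(?:pack|set)\s+of\s+(\d+)|(\d+)\s*-?\s*pack", t)
    quantity = int(next(g for g in m.groups() if g)) if m else 1
    premium = len(words & PREMIUM_WORDS)
    economy = len(words & ECONOMY_WORDS)
    sizes = [score for w, score in SIZE_SCORES.items() if w in words]
    return {
        "quantity": quantity,
        "is_multi_pack": int(quantity > 1),
        "premium_count": premium,
        "net_quality": premium - economy,
        "max_size_score": max(sizes, default=0),
        "num_count": len(re.findall(r"\d+", t)),
        "text_len": len(text),
    }
```

For example, `extract_features("Premium Leather Wallet, Pack of 12, XXL")` yields `quantity=12`, `is_multi_pack=1`, `premium_count=1`, and `max_size_score=3`.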

### 2. Text Vectorization

Multiple complementary approaches capture different aspects of the text:

| Method | N-gram range | Features | Purpose |
|--------|--------------|----------|---------|
| TF-IDF (word) | 1-2 | 200 → 50 (SVD) | Semantic meaning |
| TF-IDF (word) | 1-3 | 150 → 50 (SVD) | Contextual phrases |
| TF-IDF (char) | 3-5 | 100 → 50 (SVD) | Spelling patterns |
| Count (char) | 3-6 | 100 | Robust to typos |

Total text features: 250 dimensions
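
The TF-IDF + SVD stages can be sketched with scikit-learn. The feature caps follow the table above; the SVD here uses 4 components instead of 50 only so the toy corpus has enough features, and the exact parameters in model.py may differ:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [
    "Premium Leather Wallet Pack of 12",
    "Basic Plastic Cup",
    "Stainless Steel Bottle XXL",
    "Budget Cotton T-Shirt 3-Pack",
]

# Word 1-2 grams capped at 200 features, then SVD to a dense low-rank space
word_vec = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=200),
    TruncatedSVD(n_components=4, random_state=42),
)
# Character 3-5 grams capture spelling patterns and are robust to typos
char_vec = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=100),
    TruncatedSVD(n_components=4, random_state=42),
)

# Concatenate the reduced blocks into one dense text-feature matrix
X_text = np.hstack([word_vec.fit_transform(texts), char_vec.fit_transform(texts)])
```

Stacking several such blocks side by side is how the pipeline reaches its 250-dimensional text representation.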

### 3. Ensemble Architecture

```text
┌─────────────┐
│   XGBoost   │──┐
└─────────────┘  │
┌─────────────┐  │    ┌──────────────┐
│  LightGBM   │──┼───▶│  Ridge Meta  │──▶ Final Prediction
└─────────────┘  │    │    Model     │
┌─────────────┐  │    └──────────────┘
│ Extra Trees │──┘
└─────────────┘
```

Why This Works:

- 🎯 **Diversity**: different algorithms capture different patterns
- 🛡️ **Robustness**: the ensemble reduces overfitting
- 📈 **Performance**: stacking often beats individual models
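
The stacking mechanics can be sketched with scikit-learn's `StackingRegressor`. To keep the snippet dependency-light, `GradientBoostingRegressor` stands in for XGBoost/LightGBM here (all expose the same `fit`/`predict` interface and would slot in identically); the toy data and log-target handling are illustrative, not verbatim model.py:

```python
import numpy as np
from sklearn.ensemble import (
    StackingRegressor, ExtraTreesRegressor, GradientBoostingRegressor,
)
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.exp(X[:, 0] + 0.5 * X[:, 1])  # skewed, price-like positive target

# Base learners produce out-of-fold predictions; Ridge learns how to blend them
stack = StackingRegressor(
    estimators=[
        ("et", ExtraTreesRegressor(n_estimators=100, max_depth=20,
                                   min_samples_split=10, random_state=42)),
        ("gb", GradientBoostingRegressor(n_estimators=100, random_state=42)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=3,
)

# Fit on log-prices (common for skewed price targets), invert at prediction time
stack.fit(X, np.log1p(y))
pred = np.expm1(stack.predict(X))
```

The `cv=3` argument is what makes stacking honest: the meta-model is trained on out-of-fold base predictions rather than in-sample ones.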

### 4. Post-Processing

Ensures realistic predictions:

- ✅ Minimum price: $0.01
- 📊 Maximum cap: 120% of the training set's 98th percentile
- 🔧 Log-space smoothing (98% factor)
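
A sketch of how these steps might look in NumPy. The floor and cap follow the bullets above; the "98% factor" is read here as shrinking log-predictions 2% toward the training mean, which is an assumption rather than a verbatim transcription of model.py:

```python
import numpy as np

def postprocess(pred, train_prices, floor=0.01, cap_mult=1.2, smooth=0.98):
    """Clamp and smooth raw price predictions (illustrative reading of the steps above)."""
    pred = np.asarray(pred, dtype=float)
    train_prices = np.asarray(train_prices, dtype=float)
    # Maximum cap: 120% of the training set's 98th percentile
    cap = cap_mult * np.percentile(train_prices, 98)
    pred = np.clip(pred, floor, cap)
    # Log-space smoothing: shrink log-predictions 2% toward the training mean
    mu = np.log(train_prices).mean()
    pred = np.exp(mu + smooth * (np.log(pred) - mu))
    # Minimum price floor, re-applied after smoothing
    return np.maximum(pred, floor)
```

Pulling extreme predictions toward the bulk of the training distribution like this tends to help with SMAPE, which punishes large relative errors on cheap items hard.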

## 📈 Results Analysis

### Competition Performance

```text
╔══════════════════════════════════════╗
║  SMAPE: 58.6571%                     ║
║  Rank: 2232 / 6000                   ║
║  Percentile: Top 42%                 ║
╚══════════════════════════════════════╝
```

### Individual Model Performance

| Model | Validation SMAPE | Training Time |
|-------|------------------|---------------|
| XGBoost | ~60% | ~2 min |
| LightGBM | ~61% | ~1.5 min |
| Extra Trees | ~63% | ~1 min |
| Stacked Ensemble | ~59% | ~5 min total |

### Key Insights

1. **Feature Engineering Impact**: engineered features account for roughly 40% of the model's predictive power
2. **Character n-grams**: character-level n-grams proved surprisingly effective
3. **Ensemble Benefit**: ~2% SMAPE improvement over the best single model
4. **Outlier Removal**: dropping the top 2% of prices improved stability

## 🔧 Hyperparameters

### XGBoost

```python
n_estimators=500      # More trees = better fit
learning_rate=0.05    # Conservative learning rate
max_depth=8           # Moderate complexity
subsample=0.8         # 80% of data per tree
```

### LightGBM

```python
n_estimators=500
learning_rate=0.05
max_depth=8
subsample=0.8
```

### Extra Trees

```python
n_estimators=100      # Fewer trees (faster)
max_depth=20          # Deeper trees
min_samples_split=10  # Regularization
```

### Ridge Meta-Model

```python
alpha=1.0             # L2 regularization strength
```
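
For reference, the settings above collected as plain Python dicts, ready to splat into each constructor. This is an organizational sketch; model.py may well pass them inline:

```python
# Gradient-boosting settings shared by XGBRegressor and LGBMRegressor
GBM_PARAMS = dict(n_estimators=500, learning_rate=0.05, max_depth=8, subsample=0.8)

# Extra Trees: fewer but deeper trees, regularized via minimum split size
ET_PARAMS = dict(n_estimators=100, max_depth=20, min_samples_split=10)

# Ridge meta-model: L2 regularization strength
META_PARAMS = dict(alpha=1.0)

# Hypothetical usage: xgb.XGBRegressor(**GBM_PARAMS), lgb.LGBMRegressor(**GBM_PARAMS),
# ExtraTreesRegressor(**ET_PARAMS), Ridge(**META_PARAMS)
```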

## 🎯 Future Improvements

### Potential Enhancements (Not Implemented)

- **Advanced NLP**: sentiment analysis, POS tagging
- **Deep Learning**: BERT/transformer embeddings
- **Feature Selection**: recursive feature elimination
- **Hyperparameter Tuning**: Bayesian optimization
- **Cross-Validation**: k-fold for more robust validation
- **Neural Network**: add a deep model to the ensemble
- **Category-Specific Models**: separate models per product category
- **Price Bucketing**: classification + regression hybrid

Expected improvement: a further 5-10% SMAPE reduction is likely possible.


## 📊 Visualization Ideas

### Feature Importance

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Plot XGBoost feature importance (xgb_model is the trained booster)
xgb.plot_importance(xgb_model, max_num_features=20)
plt.title("Top 20 Most Important Features")
plt.show()
```

### Prediction Distribution

```python
# y_clean and final_pred come from the training pipeline
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(y_clean, bins=50, edgecolor='black')
plt.title("Actual Price Distribution")

plt.subplot(1, 2, 2)
plt.hist(final_pred, bins=50, edgecolor='black')
plt.title("Predicted Price Distribution")

plt.tight_layout()
plt.show()
```

## 🛠️ Troubleshooting

### Common Issues

**Issue 1: Memory Error**

```python
# Reduce feature dimensions
max_features=100  # Instead of 200
n_components=30   # Instead of 50
```

**Issue 2: Slow Training**

```python
# Reduce ensemble size
n_estimators=200  # Instead of 500
```

**Issue 3: Poor Performance**

```python
# Check data quality
print(train['price'].describe())
print(train['catalog_content'].isnull().sum())
```

## 📚 Dependencies

```text
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
xgboost>=1.5.0
lightgbm>=3.3.0
```

Install all at once:

```bash
pip install pandas numpy scikit-learn xgboost lightgbm
```

## 🤝 Contributing

Contributions are welcome! Here's how you can help:

1. 🍴 Fork the repository
2. 🌿 Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. 💾 Commit your changes (`git commit -m 'Add AmazingFeature'`)
4. 📤 Push to the branch (`git push origin feature/AmazingFeature`)
5. 🔃 Open a Pull Request

### Areas for Contribution

- Additional feature engineering ideas
- Alternative text vectorization methods
- Hyperparameter optimization experiments
- Documentation improvements
- Bug fixes

## 📖 Learning Resources

Want to understand the concepts better?


## 📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


## 👨‍👩‍👧‍👦 Team

- Ramdev Chaudhary [Team Leader]
- Pranita Jagtap [Co-Leader]
- Vedant Wadekar [Associate]
- Sony Yadav [Associate]


πŸ™ Acknowledgments

  • Competition organizers for the dataset and challenge
  • scikit-learn, XGBoost, and LightGBM communities
  • Open source ML community for inspiration

## ⭐ Star History

If you found this project helpful, please consider giving it a star!



## 📞 Support


Made with ❤️ and ☕ for the ML community
