Competition Result: SMAPE 58.66 | Rank: 2232 / 6000 (Top 42%)
An ensemble machine learning pipeline for predicting product prices from catalog text descriptions. This solution combines advanced feature engineering, multiple text vectorization techniques, and stacked ensemble modeling to achieve competitive performance.
| Metric | Value | Notes |
|---|---|---|
| SMAPE | 58.6571% | Top 42% |
| Rank | 2232 | of 6000 participants |
| Models Used | 3 + Meta | Stacked ensemble |
| Runtime | ~12 min | CPU-optimized |
- Advanced Feature Engineering: extracts 20+ features from text (quantities, quality indicators, materials, sizes)
- Multi-Strategy Text Vectorization: TF-IDF with word/character n-grams plus SVD dimensionality reduction
- Ensemble Learning: XGBoost + LightGBM + Extra Trees with a Ridge meta-model
- Competition-Safe: no external LLMs or APIs; a pure ML approach
- Efficient: CPU-optimized, runs in ~12 minutes
- Well-Documented: extensive comments explaining every concept
```
Python 3.8+
pandas
numpy
scikit-learn
xgboost
lightgbm
```

```bash
# Clone the repository
git clone https://github.com/ramoware/Amazon-ML-Challenge-2025.git
cd price-prediction

# Install dependencies
pip install -r requirements.txt
```

```bash
# Ensure your data files are in the same directory
# Required files: train.csv, test.csv

# Run the pipeline
python model.py

# Output: output.csv
```

```
price-prediction/
│
├── model.py            # Main pipeline script
├── requirements.txt    # Python dependencies
├── README.md           # This file
│
├── train.csv (.xlsx)   # Training data (not included)
├── test.csv (.xlsx)    # Test data (not included)
└── output.csv (.xlsx)  # Generated predictions
```
The pipeline extracts comprehensive features from product descriptions:
```python
# Quantity Detection
"Pack of 12" → quantity=12, is_multi_pack=1

# Quality Indicators
"Premium Leather" → premium_count=1, net_quality=+1

# Size Scoring
"XXL Jumbo" → max_size_score=3

# Material Analysis
"Stainless Steel" → premium_material_count=2
```

Feature Categories:
- Quantity & Pack Size
- Quality Level (Premium vs Economy)
- Size Indicators
- Material Composition
- Numeric Patterns
- Text Complexity
- Brand Indicators
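The full extraction logic lives in model.py; the following is a minimal sketch of regex-style rules like those above. The keyword tables and the helper name `extract_features` are illustrative assumptions, not the exact ones used.

```python
import re

# Illustrative keyword tables; the real lists in model.py are larger.
PREMIUM_WORDS = {"premium", "luxury", "deluxe", "professional"}
ECONOMY_WORDS = {"basic", "economy", "budget", "value"}
SIZE_SCORES = {"small": 1, "medium": 1, "large": 2, "xl": 3, "xxl": 3, "jumbo": 3}

def extract_features(text: str) -> dict:
    t = text.lower()
    words = re.findall(r"[a-z]+", t)
    # Quantity detection: "pack of 12", "set of 6", ...
    m = re.search(r"(?:pack|set)\s+of\s+(\d+)", t)
    quantity = int(m.group(1)) if m else 1
    premium = sum(w in PREMIUM_WORDS for w in words)
    economy = sum(w in ECONOMY_WORDS for w in words)
    return {
        "quantity": quantity,
        "is_multi_pack": int(quantity > 1),
        "premium_count": premium,
        "net_quality": premium - economy,
        "max_size_score": max((SIZE_SCORES.get(w, 0) for w in words), default=0),
        "num_count": len(re.findall(r"\d+", t)),  # numeric patterns
        "text_len": len(t),                       # text-complexity proxy
    }

print(extract_features("Premium Leather Wallet, Pack of 12, XXL"))
# {'quantity': 12, 'is_multi_pack': 1, 'premium_count': 1, 'net_quality': 1, ...}
```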
Multiple complementary approaches capture different aspects of text:
| Method | N-grams | Features | Purpose |
|---|---|---|---|
| TF-IDF (Word 1-2) | 1-2 | 200 → 50 | Semantic meaning |
| TF-IDF (Word 1-3) | 1-3 | 150 → 50 | Contextual phrases |
| TF-IDF (Char 3-5) | 3-5 | 100 → 50 | Spelling patterns |
| Count (Char 3-6) | 3-6 | 100 | Robust to typos |
Total Text Features: 250 dimensions
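A condensed sketch of how these four views could be fitted and concatenated. The function name, the `char` analyzer choice, and the assumption that `train_texts` is the full corpus (so each vocabulary exceeds the SVD component count) are all mine; the real wiring is in model.py.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_text_features(train_texts, test_texts):
    """Fit each vectorizer on train text only, then stack all views: 50+50+50+100 = 250 dims."""
    configs = [
        (TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200), 50),
        (TfidfVectorizer(analyzer="word", ngram_range=(1, 3), max_features=150), 50),
        (TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=100), 50),
        (CountVectorizer(analyzer="char", ngram_range=(3, 6), max_features=100), None),
    ]
    train_blocks, test_blocks = [], []
    for vec, n_svd in configs:
        tr, te = vec.fit_transform(train_texts), vec.transform(test_texts)
        if n_svd is not None:
            # Compress the sparse TF-IDF matrix to dense SVD components.
            svd = TruncatedSVD(n_components=n_svd, random_state=42)
            tr, te = svd.fit_transform(tr), svd.transform(te)
        else:
            tr, te = tr.toarray(), te.toarray()
        train_blocks.append(tr)
        test_blocks.append(te)
    return np.hstack(train_blocks), np.hstack(test_blocks)
```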
```
┌─────────────┐
│   XGBoost   │────┐
└─────────────┘    │
                   │     ┌──────────────┐
┌─────────────┐    ├────▶│  Ridge Meta  │────▶ Final Prediction
│  LightGBM   │────┤     │    Model     │
└─────────────┘    │     └──────────────┘
                   │
┌─────────────┐    │
│ Extra Trees │────┘
└─────────────┘
```
Why This Works:
- Diversity: different algorithms capture different patterns
- Robustness: ensembling reduces overfitting
- Performance: stacking often beats the best individual model
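In sketch form, assuming feature matrices `X`, `X_test` and target `y` are already built (the CV fold count and the helper name `fit_stack` are assumptions; estimator settings follow the configuration section below):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from xgboost import XGBRegressor

base_models = [
    XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=8, subsample=0.8),
    LGBMRegressor(n_estimators=500, learning_rate=0.05, max_depth=8, subsample=0.8),
    ExtraTreesRegressor(n_estimators=100, max_depth=20, min_samples_split=10),
]

def fit_stack(X, y, X_test):
    # Level 0: out-of-fold predictions, so the meta-model never sees a
    # base model's prediction on a row that model was trained on.
    oof = np.column_stack([cross_val_predict(m, X, y, cv=3) for m in base_models])
    # Refit each base model on all data for test-time predictions.
    test_cols = []
    for m in base_models:
        m.fit(X, y)
        test_cols.append(m.predict(X_test))
    # Level 1: Ridge learns how to weight the three base predictions.
    meta = Ridge(alpha=1.0).fit(oof, y)
    return meta.predict(np.column_stack(test_cols))
```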
Ensures realistic predictions:
- Minimum price: $0.01
- Maximum cap: 120% of the training set's 98th percentile
- Log-space smoothing (98% factor)
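One plausible reading of these three steps, as a sketch; the exact smoothing formula in model.py may differ.

```python
import numpy as np

def postprocess(pred, train_prices):
    # Shrink predictions 2% toward the training mean in log space
    # (one interpretation of the "98% factor" smoothing).
    log_pred = np.log1p(np.clip(pred, 0, None))
    log_pred = 0.98 * log_pred + 0.02 * np.log1p(train_prices).mean()
    pred = np.expm1(log_pred)
    # Floor at $0.01, cap at 120% of the training 98th percentile.
    cap = 1.2 * np.percentile(train_prices, 98)
    return np.clip(pred, 0.01, cap)
```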
```
┌──────────────────────────────────┐
│  SMAPE:       58.6571%           │
│  Rank:        < 2500 / 6000      │
│  Percentile:  Top 42%            │
└──────────────────────────────────┘
```
| Model | Validation SMAPE | Training Time |
|---|---|---|
| XGBoost | ~60% | ~2 min |
| LightGBM | ~61% | ~1.5 min |
| Extra Trees | ~63% | ~1 min |
| Stacked Ensemble | ~59% | ~5 min total |
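For reference, the metric behind both tables: a standard SMAPE implementation (the competition's exact variant may differ in edge-case handling).

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error on a 0-200% scale."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    denom = np.where(denom == 0, 1.0, denom)  # guard zero/zero pairs
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

print(smape([100, 50], [120, 40]))  # ≈ 20.20
```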
- Feature Engineering Impact: Engineered features provide 40% of predictive power
- Text Embeddings: Character-level n-grams surprisingly effective
- Ensemble Benefit: ~2% improvement over best single model
- Outlier Removal: Removing top 2% prices improved stability
XGBoost:
```python
n_estimators=500     # More trees = better fit
learning_rate=0.05   # Conservative learning
max_depth=8          # Moderate complexity
subsample=0.8        # 80% of data per tree
```

LightGBM:
```python
n_estimators=500
learning_rate=0.05
max_depth=8
subsample=0.8
```

Extra Trees:
```python
n_estimators=100       # Fewer trees (faster)
max_depth=20           # Deeper trees
min_samples_split=10   # Regularization
```

Ridge meta-model:
```python
alpha=1.0   # L2 regularization strength
```

- Advanced NLP: add sentiment analysis, POS tagging
- Deep Learning: BERT/transformer embeddings
- Feature Selection: Recursive feature elimination
- Hyperparameter Tuning: Bayesian optimization
- Cross-Validation: K-fold for robust validation
- Neural Network: Add deep learning to ensemble
- Category-Specific Models: Separate models per product category
- Price Bucketing: Classification + regression hybrid
Expected Improvements: 5-10% SMAPE reduction possible
```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Plot XGBoost feature importance
xgb.plot_importance(xgb_model, max_num_features=20)
plt.title("Top 20 Most Important Features")
plt.show()
```

```python
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(y_clean, bins=50, edgecolor='black')
plt.title("Actual Price Distribution")

plt.subplot(1, 2, 2)
plt.hist(final_pred, bins=50, edgecolor='black')
plt.title("Predicted Price Distribution")

plt.tight_layout()
plt.show()
```

Issue 1: Memory Error
```python
# Reduce feature dimensions
max_features=100   # Instead of 200
n_components=30    # Instead of 50
```

Issue 2: Slow Training

```python
# Reduce ensemble size
n_estimators=200   # Instead of 500
```

Issue 3: Poor Performance

```python
# Check data quality
print(train['price'].describe())
print(train['catalog_content'].isnull().sum())
```

```
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
xgboost>=1.5.0
lightgbm>=3.3.0
```

Install all at once:

```bash
pip install pandas numpy scikit-learn xgboost lightgbm
```

Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Additional feature engineering ideas
- Alternative text vectorization methods
- Hyperparameter optimization experiments
- Documentation improvements
- Bug fixes
Want to understand the concepts better?
- Ensemble Learning: Sklearn Ensemble Guide
- TF-IDF: Understanding TF-IDF
- XGBoost: Official XGBoost Tutorial
- SMAPE: Forecast Error Metrics
- Stacking: Stacked Generalization
This project is licensed under the MIT License - see the LICENSE file for details.
Ramdev Chaudhary [Team Leader]
- GitHub: @ramoware
- LinkedIn: ramdevchaudhary
- Email: ramoware@gmail.com
Pranita Jagtap [Co-Leader]
- GitHub: @PranitaJagtap
- LinkedIn: PranitaJagtap
- Email: jagtappranita2003@gmail.com
Vedant Wadekar [Associate]
- GitHub: @Vedantwadekar2112
- LinkedIn: vedant-wadekar-394948378
- Email: vedantwadekar49@gmail.com
Sony Yadav [Associate]
- GitHub: @ramoware
- LinkedIn: sony-yadav-17393232a
- Email: soniy11265@gmail.com
- Competition organizers for the dataset and challenge
- scikit-learn, XGBoost, and LightGBM communities
- Open source ML community for inspiration
If you found this project helpful, please consider giving it a star!
- Email: ramoware@gmail.com
- Issues: GitHub Issues
- Documentation: Wiki
Made with ❤️ and ☕ for the ML community