This project is a web-based scraping task manager built with Flask, Celery, BeautifulSoup4, Selenium, and Redis. It provides a simple RESTful API and dashboard for managing asynchronous scraping tasks and viewing results in real time.
## Features

- Web interface and API built with Flask
- Asynchronous task queue using Celery
- Task broker and result backend powered by Redis
- Static scraping with BeautifulSoup and dynamic scraping with Selenium (sketches of both appear below)
- Scalable architecture for handling multiple scraping jobs concurrently
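To show how these pieces fit together, here is a minimal sketch of a Celery task that scrapes a static page with BeautifulSoup and uses Redis as both broker and result backend. The module name, task name, and connection URLs are illustrative assumptions, not the project's actual code.

```python
# tasks.py (illustrative sketch, not the project's actual task module)
from celery import Celery
import requests
from bs4 import BeautifulSoup

# Redis serves as both the Celery broker and the result backend.
# The connection URL below is an assumption for a local setup.
celery_app = Celery(
    "scraper",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery_app.task
def scrape_static(url: str) -> dict:
    """Fetch a static page and extract its title and top-level headlines."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "headlines": [h.get_text(strip=True) for h in soup.find_all("h1")],
    }
```

A Flask route would typically enqueue this with `scrape_static.delay(url)` and return the resulting task ID so the client can poll for the result later.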
## Tech Stack

- Flask — Web framework for Python
- Celery — Distributed task queue
- Redis — In-memory data store used as the Celery broker and result backend
- BeautifulSoup4 — HTML/XML parser for scraping static content
- Selenium — Browser automation tool for scraping dynamic content (see the sketch after this list)
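For pages that require JavaScript rendering, a dynamic-scraping task might drive a headless browser instead. The sketch below assumes headless Chrome and the same hypothetical Celery app as above; the real project may configure its WebDriver differently.

```python
# Illustrative sketch of a dynamic-scraping task using headless Selenium.
from celery import Celery
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

celery_app = Celery(
    "scraper",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery_app.task
def scrape_dynamic(url: str) -> dict:
    """Render a JavaScript-heavy page in headless Chrome, then parse the HTML."""
    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        links = [a["href"] for a in soup.find_all("a", href=True)]
    finally:
        driver.quit()  # always release the browser, even if the scrape fails
    return {"url": url, "link_count": len(links), "links": links[:50]}
```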
## Prerequisites

- Python 3.8+
- Redis server running locally or remotely
## Installation

1. Clone the repository:
```bash
git clone https://github.com/daddyauden/crawler.git
cd crawler
```
2. Create a virtual environment and activate it:
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Start Redis (if not already running):
```bash
redis-server
```
5. Start the project:
```bash
docker-compose up
```
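If you want to run the services outside Docker (for example during development), the Flask app and a Celery worker can be started in separate terminals. The module paths below are assumptions; adjust them to the repository's actual package layout.

```bash
# Terminal 1: start a Celery worker (assumes the Celery app lives in app.py)
celery -A app worker --loglevel=info

# Terminal 2: start the Flask development server (Flask 2.2+ CLI syntax)
flask --app app run
```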
## Usage
- Submit a new scrape task via the API or web interface.
- Choose between static scraping (BeautifulSoup) and dynamic scraping (Selenium).
- Monitor task status and retrieve results using task IDs.
### API Endpoints
| Method | Endpoint | Description |
| ------ | ------------------- | -------------------------- |
| GET    | `/crawl/ap-politics` | Start a new AP politics crawl task |
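For example, you could trigger the AP politics crawl from the command line once the app is running. The port and the exact response shape (such as whether a task ID is returned) are assumptions about a default local setup:

```bash
curl http://localhost:5000/crawl/ap-politics
```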
## License
This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details.