This project is a web-based scraping task manager built with Flask, Celery, BeautifulSoup4, Selenium, and Redis. It provides a simple RESTful API and dashboard for managing asynchronous scraping tasks and viewing results in real time.
## Features

- Web interface and API built with Flask
- Asynchronous task queue using Celery
- Task broker and result backend powered by Redis
- Static scraping with BeautifulSoup and dynamic scraping with Selenium (sketches of both appear below)
- Scalable architecture for handling multiple scraping jobs concurrently
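To show how these pieces fit together, here is a minimal sketch of a Celery task that scrapes a static page with BeautifulSoup and uses Redis as both broker and result backend. The module name, task name, and connection URLs are illustrative assumptions, not the project's actual code.

```python
# tasks.py (illustrative sketch, not the project's actual task module)
from celery import Celery
import requests
from bs4 import BeautifulSoup

# Redis serves as both the Celery broker and the result backend.
# The connection URL below is an assumption for a local setup.
celery_app = Celery(
    "scraper",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery_app.task
def scrape_static(url: str) -> dict:
    """Fetch a static page and extract its title and top-level headlines."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "headlines": [h.get_text(strip=True) for h in soup.find_all("h1")],
    }
```

A Flask route would typically enqueue this with `scrape_static.delay(url)` and return the resulting task ID so the client can poll for the result later.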
## Tech Stack

- Flask — Web framework for Python
- Celery — Distributed task queue
- Redis — In-memory data store used as the Celery broker and result backend
- BeautifulSoup4 — HTML/XML parser for scraping static content
- Selenium — Browser automation tool for scraping dynamic content (see the sketch after this list)
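For pages that require JavaScript rendering, a dynamic-scraping task might drive a headless browser instead. The sketch below assumes headless Chrome and the same hypothetical Celery app as above; the real project may configure its WebDriver differently.

```python
# Illustrative sketch of a dynamic-scraping task using headless Selenium.
from celery import Celery
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

celery_app = Celery(
    "scraper",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery_app.task
def scrape_dynamic(url: str) -> dict:
    """Render a JavaScript-heavy page in headless Chrome, then parse the HTML."""
    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        links = [a["href"] for a in soup.find_all("a", href=True)]
    finally:
        driver.quit()  # always release the browser, even if the scrape fails
    return {"url": url, "link_count": len(links), "links": links[:50]}
```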
## Prerequisites

- Python 3.8+
- Redis server running locally or remotely
## Installation

1. Clone the repository:
```bash
git clone https://github.com/daddyauden/crawler.git
cd crawler
```
2. Create a virtual environment and activate it:
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Start Redis (if not already running):
```bash
redis-server
```
5. Start the project:
```bash
docker-compose up
```
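If you want to run the services outside Docker (for example during development), the Flask app and a Celery worker can be started in separate terminals. The module paths below are assumptions; adjust them to the repository's actual package layout.

```bash
# Terminal 1: start a Celery worker (assumes the Celery app lives in app.py)
celery -A app worker --loglevel=info

# Terminal 2: start the Flask development server (Flask 2.2+ CLI syntax)
flask --app app run
```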
## Usage
- Submit a new scrape task via the API or web interface.
- Choose between static scraping (BeautifulSoup) and dynamic scraping (Selenium).
- Monitor task status and retrieve results using task IDs.
### API Endpoints
| Method | Endpoint | Description |
| ------ | ------------------- | -------------------------- |
| GET    | `/crawl/ap-politics` | Start a new AP politics crawl task |
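For example, you could trigger the AP politics crawl from the command line once the app is running. The port and the exact response shape (such as whether a task ID is returned) are assumptions about a default local setup:

```bash
curl http://localhost:5000/crawl/ap-politics
```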
## License
This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details.