# Web Scraping Task Manager

This project is a web-based scraping task manager built with Flask, Celery, BeautifulSoup4, Selenium, and Redis. It provides a simple RESTful API and dashboard to manage asynchronous scraping tasks and view results in real-time.

## Features

* Web interface and API built with Flask
* Asynchronous task queue using Celery (sketched below)
* Task broker and result backend powered by Redis
* Dynamic web scraping using BeautifulSoup and Selenium
* Scalable architecture for handling multiple scraping jobs concurrently
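
A minimal sketch of how these pieces fit together, assuming a single-module layout; the module, route, and task names below are illustrative and not taken from the project's code:

```python
# Illustrative wiring of Flask, Celery, and Redis; names here are assumptions.
from celery import Celery
from flask import Flask, jsonify

app = Flask(__name__)

# Redis acts as both the message broker and the result backend.
celery = Celery(
    "crawler",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery.task
def scrape(url):
    # Placeholder for the real scraping logic (BeautifulSoup or Selenium).
    return f"scraped {url}"

@app.route("/crawl")
def crawl():
    # Enqueue the job and return its id immediately; a Celery worker runs it later.
    result = scrape.delay("https://example.com")
    return jsonify({"task_id": result.id}), 202
```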

## Technologies Used

* Flask — Web framework for Python
* Celery — Distributed task queue
* Redis — In-memory data store used as Celery broker and result backend
* BeautifulSoup4 — HTML/XML parser for scraping static content
* Selenium — Browser automation tool for scraping dynamic content (see the sketch below)
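
To illustrate the two scraping styles these libraries cover, here is a rough sketch; the URLs, tags, and driver setup are placeholders rather than the project's actual scrapers:

```python
# Hypothetical examples of static vs. dynamic scraping; not the project's real code.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_static(url):
    # Static pages: fetch the HTML once, then parse it with BeautifulSoup.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all("h2")]

def scrape_dynamic(url):
    # Dynamic pages: drive a real browser so JavaScript-rendered content is loaded.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return [e.text for e in driver.find_elements(By.TAG_NAME, "h2")]
    finally:
        driver.quit()
```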

## Setup

### Prerequisites

* Python 3.8+
* Redis server running locally or remotely

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/daddyauden/crawler.git
   cd crawler
   ```

2. Create a virtual environment and activate it:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Start Redis (if not already running):

   ```bash
   redis-server
   ```

5. Start the project:

   ```bash
   docker-compose up
   ```

## Usage

* Submit a new scrape task via the API or web interface.
* Choose between static scraping (BeautifulSoup) or dynamic scraping (Selenium).
* Monitor task status and retrieve results using task IDs (see the sketch below).
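
A sketch of looking up a task by its ID from Python; it assumes the Celery instance is importable as `celery` from an `app` module, which is an assumption about the project layout:

```python
# Hypothetical status lookup; assumes the Celery app is exposed as `celery` in app.py.
from celery.result import AsyncResult

from app import celery  # assumed module and attribute names

def task_status(task_id):
    # The task state and result are read from the Redis result backend.
    result = AsyncResult(task_id, app=celery)
    if result.ready():
        return {"state": result.state, "result": result.result}
    return {"state": result.state}
```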

### API Endpoints

| Method | Endpoint            | Description                |
| ------ | ------------------- | -------------------------- |
| GET    | `/crawl/ap-politics` | Start a new AP politics crawl task |
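
For example, the endpoint can be called from Python; the host and port below are assumptions about where the Flask app is served:

```python
# Hypothetical client call; assumes the app is listening on localhost:5000.
import requests

response = requests.get("http://localhost:5000/crawl/ap-politics", timeout=10)
print(response.status_code, response.text)
```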


## License

This project is licensed under the GNU License. See the [LICENSE](LICENSE) file for details.
