MarlontheWizard/MarketNormalizationEngine

MarketNormalizationEngine

Author Note

Please star this repository if you find it useful! It means a lot to me.
Email marlon.dominguez307@gmail.com with implementation requests, contribution requests, or bug reports.

Overview

A high-performance, parallelized market data ingestion and normalization engine designed to download, parse, organize, and resample raw Dukascopy forex tick data into a clean ML-ready format.

This project focuses on data normalization infrastructure, not trading logic.

Dukascopy (Forex Data)

Features

  • Parallelized hourly tick downloads
  • Automatic retry queue with exponential backoff
  • Corrupted/empty response detection
  • Structured parquet-based storage
  • BI5 → Parquet normalization pipeline
  • Multithreaded parsing and resampling
  • ML-ready dataframe generation
  • CLI and code-driven execution
  • Hierarchical dataset organization
  • Timeframe aggregation using Pandas resampling
  • Detailed logging and ingestion diagnostics
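
The retry and empty-response features listed above could look roughly like the following. This is a minimal sketch, not the engine's actual code: `fetch_with_backoff`, its parameters, and the choice to treat an empty body as a failed attempt are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], bytes],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch() until it returns a non-empty payload or attempts run out.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            payload = fetch()
            if payload:  # assumption: an empty body counts as a failed attempt
                return payload
        except OSError:
            pass  # network-level errors are retried as well
        if attempt < max_attempts - 1:
            # Delay doubles on every retry: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A production version would likely add jitter and distinguish hours that are legitimately empty (for example weekends) from failed downloads.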

Architecture

The system is designed around a clean separation of concerns:

  • Downloader → fetch raw .bi5 tick data
  • Storage Layer → organize files by symbol/date/hour
  • Parser → decode bid/ask/mid and normalize
  • Resampler → produce a dataframe at the requested timeframe

The output is ready for feature extraction and ML workflows.

Components

Downloader

Fetches raw .bi5 tick data directly from Dukascopy servers.
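
As an illustration of what an hourly request might target, the sketch below builds a URL for one hour of ticks. The endpoint, the path layout, and the zero-based month are assumptions based on the commonly described Dukascopy datafeed layout, not something this README states; `hourly_tick_url` is a hypothetical helper.

```python
from datetime import datetime

# Assumed endpoint; verify against the downloader module before relying on it.
BASE_URL = "https://datafeed.dukascopy.com/datafeed"


def hourly_tick_url(symbol: str, hour: datetime) -> str:
    """Build the URL for one hour of ticks (hypothetical helper)."""
    # Assumption: the feed numbers months from 00 (January) to 11 (December).
    return (
        f"{BASE_URL}/{symbol.upper()}/{hour.year:04d}/{hour.month - 1:02d}/"
        f"{hour.day:02d}/{hour.hour:02d}h_ticks.bi5"
    )


print(hourly_tick_url("EURUSD", datetime(2024, 1, 2, 0)))
```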

Parser

Decodes compressed binary tick data into normalized parquet datasets.
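
To make the decoding step concrete, here is a minimal sketch of reading one .bi5 payload. It assumes the commonly described layout: an LZMA-compressed stream of 20-byte big-endian records holding a millisecond offset, ask, bid, ask volume, and bid volume, with prices stored as integers scaled by a point value (1e-5 here; JPY pairs are usually described as using 1e-3). `decode_bi5` and its return shape are hypothetical and may differ from the project's parser.

```python
import lzma
import struct
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Assumed record layout: ms offset, ask, bid (uint32), ask/bid volume (float32).
TICK = struct.Struct(">IIIff")


def decode_bi5(blob: bytes, hour_start: datetime, point: float = 1e-5) -> List[Dict]:
    """Decode one hour of ticks into a list of dicts (hypothetical helper)."""
    raw = lzma.decompress(blob) if blob else b""  # empty hour -> no ticks
    ticks = []
    # iter_unpack raises struct.error if the payload is not a multiple of 20 bytes.
    for ms, ask, bid, ask_vol, bid_vol in TICK.iter_unpack(raw):
        ask_px, bid_px = ask * point, bid * point
        ticks.append({
            "timestamp": hour_start + timedelta(milliseconds=ms),
            "ask": ask_px,
            "bid": bid_px,
            "mid": (ask_px + bid_px) / 2,
            "ask_volume": ask_vol,
            "bid_volume": bid_vol,
        })
    return ticks
```

A list of dicts like this converts directly with `pandas.DataFrame(ticks)` before being written to parquet.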

Storage Layer

Organizes data hierarchically by:

symbol/date/hour

Resampler

Aggregates tick data into configurable timeframes such as:

1min
5min
15min
1h
4h
1d
1w

Less common sub-minute timeframes are also supported:

1s
5s
15s
30s
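
The aggregation itself can be sketched with Pandas resampling. The example below assumes a tick dataframe with a DatetimeIndex and `mid`, `bid_volume`, and `ask_volume` columns; `resample_ticks` is a hypothetical helper, and the project's resampler may build its bars differently (for instance from bid or ask rather than mid).

```python
import pandas as pd


def resample_ticks(ticks: pd.DataFrame, timeframe: str = "1min") -> pd.DataFrame:
    """Aggregate ticks into OHLC bars with summed volumes (hypothetical helper).

    Assumes `ticks` has a DatetimeIndex and mid/bid_volume/ask_volume columns.
    """
    grouped = ticks.resample(timeframe)
    bars = grouped["mid"].ohlc()  # open, high, low, close per bin
    bars["bid_volume"] = grouped["bid_volume"].sum()
    bars["ask_volume"] = grouped["ask_volume"].sum()
    # Bins with no ticks have NaN prices; drop them rather than emit empty bars.
    bars = bars.dropna(subset=["open"])
    return bars.rename_axis("timestamp").reset_index()
```

Dropping empty bins is a design choice: forward-filling would produce a regular grid, at the cost of inventing bars for periods with no trading.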

Storage Structure

Raw BI5 Data

raw_data/
    EURUSD/
        2024-01-02/
            EURUSD_20240102_00h.bi5
            EURUSD_20240102_01h.bi5

Parsed Parquet Data

parsed_data/
    EURUSD/
        2024-01-02/
            EURUSD_20240102_00h.parquet
            EURUSD_20240102_01h.parquet

Resampled Data

resampled_data/
    EURUSD/
        1min/
            20240102.parquet

        5min/
            20240102.parquet

        1h/
            20240102.parquet
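
The directory layouts above can be reproduced with a couple of small path helpers. This is a sketch that mirrors the file names shown in this section; `hourly_file` and `resampled_file` are hypothetical names, not functions exported by the project.

```python
from datetime import datetime
from pathlib import Path


def hourly_file(root: str, symbol: str, hour: datetime, suffix: str) -> Path:
    """Path for one hour of raw (.bi5) or parsed (.parquet) data."""
    day_dir = Path(root) / symbol / hour.strftime("%Y-%m-%d")
    return day_dir / f"{symbol}_{hour:%Y%m%d}_{hour:%H}h.{suffix}"


def resampled_file(root: str, symbol: str, timeframe: str, day: datetime) -> Path:
    """Path for one day of resampled bars at a given timeframe."""
    return Path(root) / symbol / timeframe / f"{day:%Y%m%d}.parquet"


hour = datetime(2024, 1, 2, 1)
print(hourly_file("raw_data", "EURUSD", hour, "bi5").as_posix())
print(resampled_file("resampled_data", "EURUSD", "1min", hour).as_posix())
```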

CLI Usage

If no operation is specified, the engine runs both the downloader and the parser.

Single Day Download

python dukascopy_data_engine.py --symbol EURUSD --start-date 2024-01-02

Range Download

python dukascopy_data_engine.py --symbol EURUSD --start-date 2024-01-01 --end-date 2024-01-10

Custom Output Directories

python dukascopy_data_engine.py \
    --symbol EURUSD \
    --start-date 2024-01-02 \
    --raw-data-dir custom_raw_data \
    --parsed-data-dir custom_parsed_data

Resampling Through CLI

python dukascopy_data_engine.py --operation resample --symbol EURUSD --parsed-data-dir parsed_data --timeframe 1min
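
For readers curious how the flags in this section might fit together, here is a minimal argparse sketch. The flag names come from the commands above; everything else is an assumption, including the default directories, the `planned_steps` helper, and the operation names "download" and "parse" (only `resample` appears in this README).

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the flags used in the CLI examples."""
    parser = argparse.ArgumentParser(prog="dukascopy_data_engine.py")
    parser.add_argument("--symbol", required=True)
    parser.add_argument("--start-date")
    parser.add_argument("--end-date")
    parser.add_argument("--operation", default=None)  # None -> download + parse
    parser.add_argument("--raw-data-dir", default="raw_data")  # assumed default
    parser.add_argument("--parsed-data-dir", default="parsed_data")  # assumed default
    parser.add_argument("--timeframe")
    return parser


def planned_steps(args: argparse.Namespace) -> List[str]:
    """Return the steps to run; the step names here are illustrative."""
    if args.operation is None:
        return ["download", "parse"]
    return [args.operation]


args = build_parser().parse_args(["--symbol", "EURUSD", "--start-date", "2024-01-02"])
print(planned_steps(args))
```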

Code Usage

The engine can also be used programmatically.

Imports

from dukascopy_data_downloader import begin_downloader_process
from dukascopy_bi5_data_parser import begin_parser_process
import resampler

Using the Downloader

Function
begin_downloader_process(
    symbol,
    start_date,
    end_date=None,
    location="raw_data"
)
Example
from dukascopy_data_downloader import begin_downloader_process

begin_downloader_process(
    symbol="EURUSD",
    start_date="2024-01-02",
    end_date=None,
    location="raw_data"
)

Using the Parser

Example
from dukascopy_bi5_data_parser import begin_parser_process

begin_parser_process(
    "raw_data",
    "parsed_data"
)

Using the Resampler

Example
import resampler

results = resampler.invoke_resampler(
    parquet_dir="parsed_data",
    symbol="EURUSD",
    timeframe="1d"
)

The resampler returns:

dict[date] -> pandas.DataFrame

where each dataframe contains normalized OHLCV-style bars.

Example Columns
timestamp
open
high
low
close
bid_volume
ask_volume
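
Because the resampler returns one dataframe per date, a common next step is to stitch them into a single frame. The sketch below assumes the dictionary keys sort chronologically and that `timestamp` is a regular column as listed above; `combine_bars` is a hypothetical helper.

```python
from typing import Dict

import pandas as pd


def combine_bars(results: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Concatenate per-date bar frames into one time-ordered frame."""
    frames = [results[day] for day in sorted(results)]
    if not frames:
        return pd.DataFrame()  # nothing to combine
    combined = pd.concat(frames, ignore_index=True)
    # Sort explicitly so the result does not depend on key ordering.
    return combined.sort_values("timestamp").reset_index(drop=True)
```

With the call shown earlier, this would be used as `bars = combine_bars(results)`.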

License

This project is licensed under the Apache 2.0 License.

About

Market data pipeline for downloading, parsing, and normalizing high-frequency data. Converts raw .bi5 files into structured parquet datasets, with support for resampling and feature-ready data preparation for quantitative, research, and machine learning workflows. Currently only supports Forex data.
