rayshrey/osb-distribution-workloads

OSB Distribution Workloads

Synthetic OpenSearch Benchmark (OSB) workloads with varying text/numeric field distributions for benchmarking pluggable dataformat (Parquet/Lucene composite) ingestion performance.

Overview

Each workload has 101 fields (1 mandatory @timestamp + 100 distribution-specific fields) and 10 million documents. The field distributions vary from 100% numeric to 100% text, allowing you to measure how different data compositions affect ingestion throughput under various storage backends.

All field types are compatible with OpenSearch's pluggable dataformat feature — no geo_point, geo_shape, nested, join, or rank_feature fields are used.

Workloads

| Workload | Text Fields | Numeric Fields | Total Fields | Compressed Size |
|---|---|---|---|---|
| dist-000-100 | 0 | 100 | 101 | ~3.7 GB |
| dist-010-090 | 10 | 90 | 101 | ~5.2 GB |
| dist-025-075 | 25 | 75 | 101 | ~8.5 GB |
| dist-050-050 | 50 | 50 | 101 | ~8.9 GB |
| dist-075-025 | 75 | 25 | 101 | ~9.3 GB |
| dist-090-010 | 90 | 10 | 101 | ~9.4 GB |
| dist-100-000 | 100 | 0 | 101 | ~9.6 GB |
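
The workload names appear to encode the split directly. A minimal sketch of reading it back, assuming the pattern is `dist-<text>-<numeric>` as the table suggests (`parse_workload_name` is a hypothetical helper, not part of the repository):

```python
import re

# Assumption: the two numbers in the name are the text and numeric field
# counts, in that order, each zero-padded to three digits.
_NAME = re.compile(r"^dist-(\d{3})-(\d{3})$")


def parse_workload_name(name: str) -> tuple[int, int]:
    """Return (text_fields, numeric_fields) for a name like 'dist-025-075'."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a distribution workload name: {name!r}")
    text, numeric = int(match.group(1)), int(match.group(2))
    if text + numeric != 100:
        raise ValueError(f"field counts must sum to 100, got {text + numeric}")
    return text, numeric


print(parse_workload_name("dist-025-075"))
```

The mandatory @timestamp field is not counted in either number, which is why every workload totals 101 fields.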

Test Procedures

Each workload includes 6 test procedures covering all combinations of sorting and pluggable dataformat:

| Procedure | Index Sort | Pluggable Dataformat | Secondary Format |
|---|---|---|---|
| append-no-sort (default) | ❌ | ❌ | N/A |
| append-sorted | ✅ (@timestamp asc) | ❌ | N/A |
| append-parquet-only-sorted | ✅ | ✅ (composite/parquet) | none |
| append-parquet-only-unsorted | ❌ | ✅ (composite/parquet) | none |
| append-parquet-lucene-sorted | ✅ | ✅ (composite/parquet) | lucene |
| append-parquet-lucene-unsorted | ❌ | ✅ (composite/parquet) | lucene |

All procedures execute the same schedule: delete-index → create-index → cluster-health-check → bulk-ingest → refresh → force-merge → refresh → wait-for-merges.
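
When scripting a sweep over all six procedures, a small lookup keeps the irregular names in one place. This is only a sketch: the `dataformat` labels (`"parquet"`, `"parquet+lucene"`) are illustrative keys chosen here, not OSB or OpenSearch identifiers.

```python
from typing import Optional

# Keys are (index_sorted, dataformat). dataformat is None when the pluggable
# dataformat is not used. The string labels are assumptions made for this
# sketch; only the procedure names on the right come from the workloads.
PROCEDURES = {
    (False, None): "append-no-sort",
    (True, None): "append-sorted",
    (True, "parquet"): "append-parquet-only-sorted",
    (False, "parquet"): "append-parquet-only-unsorted",
    (True, "parquet+lucene"): "append-parquet-lucene-sorted",
    (False, "parquet+lucene"): "append-parquet-lucene-unsorted",
}


def procedure_name(index_sorted: bool, dataformat: Optional[str] = None) -> str:
    """Map a (sorting, dataformat) combination to its test procedure name."""
    return PROCEDURES[(index_sorted, dataformat)]


for key, name in PROCEDURES.items():
    print(key, "->", name)
```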

Prerequisites

  • Python 3.9+ with numpy installed
  • OpenSearch Benchmark (opensearch-benchmark) installed
  • A running OpenSearch cluster (local or remote)

# Install OSB
pip3 install opensearch-benchmark

# Verify
opensearch-benchmark --version

Step 1: Generate Data

The data files (documents.json.gz) are not included in the repository due to their size. You must generate them locally.

cd /path/to/osb-workloads

# Generate a single workload (~25-40 minutes depending on text ratio)
python3 generate.py dist-050-050

# Generate ALL workloads (~3.5 hours total)
python3 generate.py

Each workload generates a documents.json.gz file in its directory:

dist-050-050/
├── documents.json.gz   ← generated (10M docs, gzip compressed)
├── index.json          ← Jinja2 template for index mappings
└── workload.json       ← Jinja2 template for workload definition
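
To sanity-check a generated file without decompressing all of it, the first few documents can be read as a stream. The sketch below assumes the corpus holds one JSON object per line, which is the layout OSB corpora conventionally use; `peek_documents` is a hypothetical helper.

```python
import gzip
import json
from itertools import islice


def peek_documents(path, count=3):
    """Return the first `count` documents from a gzip-compressed corpus.

    Assumes newline-delimited JSON (one document per line).
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, count)]
```

For example, `peek_documents("dist-050-050/documents.json.gz", 1)[0].keys()` should list 101 field names if the generator behaved as described.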

Generation time estimates

| Workload | Time | Compressed Output |
|---|---|---|
| dist-000-100 (all numeric) | ~25 min | ~3.7 GB |
| dist-050-050 (balanced) | ~37 min | ~8.9 GB |
| dist-100-000 (all text) | ~28 min | ~9.6 GB |

Disk space required

  • Single workload: 4–10 GB
  • All 7 workloads: ~55 GB
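
The ~55 GB figure follows from the compressed sizes in the Workloads table. A quick check, with a free-space lookup added as a convenience (`free_gb` is a hypothetical helper, and the sizes are the approximate values quoted above):

```python
import shutil

# Approximate compressed sizes in GB, copied from the Workloads table.
COMPRESSED_GB = {
    "dist-000-100": 3.7,
    "dist-010-090": 5.2,
    "dist-025-075": 8.5,
    "dist-050-050": 8.9,
    "dist-075-025": 9.3,
    "dist-090-010": 9.4,
    "dist-100-000": 9.6,
}

total_gb = sum(COMPRESSED_GB.values())


def free_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


print(f"all workloads: ~{total_gb:.1f} GB, free here: {free_gb():.1f} GB")
```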

Step 2: Link Data for OSB

OSB looks for data files in its local cache directory (~/.osb/benchmarks/data/). You need to symlink the generated data there:

# For a single workload (paths are quoted so directories with spaces work)
mkdir -p ~/.osb/benchmarks/data/dist-050-050
ln -sf "$(pwd)/dist-050-050/documents.json.gz" ~/.osb/benchmarks/data/dist-050-050/documents.json.gz

# For ALL workloads (run from repo root)
for d in dist-*; do
  mkdir -p ~/.osb/benchmarks/data/"$d"
  ln -sf "$(pwd)/$d/documents.json.gz" ~/.osb/benchmarks/data/"$d"/documents.json.gz
done
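
The same linking step can be sketched in Python for setups where a shell loop is inconvenient. `link_workload_data` is a hypothetical helper, not part of the repository. Unlike the shell loop, it skips workloads whose data has not been generated yet instead of creating a dangling link.

```python
from pathlib import Path


def link_workload_data(repo_root, cache_root=None):
    """Symlink each dist-*/documents.json.gz into the OSB data cache."""
    repo_root = Path(repo_root).resolve()
    if cache_root is None:
        # Cache location as described above.
        cache_root = Path.home() / ".osb" / "benchmarks" / "data"
    cache_root = Path(cache_root)

    linked = []
    for workload_dir in sorted(repo_root.glob("dist-*")):
        source = workload_dir / "documents.json.gz"
        if not source.is_file():
            continue  # data not generated yet, nothing to link
        target_dir = cache_root / workload_dir.name
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / source.name
        if target.is_symlink() or target.exists():
            target.unlink()  # replace an existing link, like `ln -sf`
        target.symlink_to(source)
        linked.append(target)
    return linked
```

Calling `link_workload_data(".")` from the repo root returns the list of links it created.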

Step 3: Run a Benchmark

Basic run (default procedure, full ingestion)

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-no-sort

Quick test with 1% ingestion

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-no-sort \
  --workload-params='{"ingest_percentage": 1}'

Run with pluggable dataformat (parquet + lucene, sorted)

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-parquet-lucene-sorted

Run with custom parameters

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-sorted \
  --workload-params='{"ingest_percentage": 50, "bulk_size": 10000, "bulk_indexing_clients": 16, "number_of_shards": 3}'
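
Hand-writing the JSON inside shell quotes is easy to get wrong, so when runs are scripted it can help to build the argument list programmatically. A minimal sketch using only the flags shown above (`build_run_command` is a hypothetical helper):

```python
import json
import shlex


def build_run_command(workload_dir, procedure, params=None,
                      host="http://localhost:9200"):
    """Assemble an opensearch-benchmark invocation as an argument list."""
    cmd = [
        "opensearch-benchmark", "run",
        f"--workload-path={workload_dir}",
        f"--target-hosts={host}",
        "--pipeline=benchmark-only",
        f"--test-procedure={procedure}",
    ]
    if params:
        # json.dumps handles quoting and escaping of the parameter values.
        cmd.append("--workload-params=" + json.dumps(params))
    return cmd


cmd = build_run_command("./dist-050-050", "append-sorted",
                        {"ingest_percentage": 50, "bulk_size": 10000})
print(shlex.join(cmd))  # shell-safe rendering, useful for logging
```

The returned list can be passed straight to `subprocess.run(cmd, check=True)`, which avoids shell quoting entirely.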

View workload info (no cluster needed)

opensearch-benchmark info --workload-path=./dist-050-050

Workload Parameters

Pass these via --workload-params as JSON:

| Parameter | Default | Description |
|---|---|---|
| ingest_percentage | 100 | Percentage of the 10M document corpus to ingest (1-100) |
| bulk_size | 5000 | Number of documents per bulk request |
| bulk_indexing_clients | 8 | Number of concurrent bulk indexing clients |
| number_of_shards | 1 | Number of primary shards for the index |
| number_of_replicas | 0 | Number of replica shards |
| warmup_time_period | 120 | Warmup period in seconds before measurement starts |
| cluster_health | green | Required cluster health before ingestion begins |
| error_level | non-fatal | Error handling for bulk responses |
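
To get a feel for how `ingest_percentage` and `bulk_size` interact, the rough number of documents and bulk requests can be estimated up front. This sketch assumes the percentage scales the document count linearly; OSB's own rounding may differ slightly, and `ingestion_plan` is a hypothetical helper.

```python
import math

CORPUS_DOCS = 10_000_000  # documents per workload, as stated in the Overview


def ingestion_plan(ingest_percentage=100, bulk_size=5000):
    """Estimate (documents ingested, bulk requests issued) for a run."""
    if not 1 <= ingest_percentage <= 100:
        raise ValueError("ingest_percentage must be between 1 and 100")
    if bulk_size < 1:
        raise ValueError("bulk_size must be positive")
    docs = int(CORPUS_DOCS * ingest_percentage / 100)
    requests = math.ceil(docs / bulk_size)
    return docs, requests


print(ingestion_plan(1))          # quick test with the default bulk size
print(ingestion_plan(50, 10000))  # the custom-parameters example
```

With the defaults, a 1% run covers 100,000 documents in 20 bulk requests, which is short relative to the 120-second default warmup.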

Field Types

Numeric bucket (used for the "numeric" portion of each distribution)

| Type | Count (in 100-field pool) | Examples |
|---|---|---|
| date | 9 | event_time, pickup_datetime, session_start_time |
| integer | 22 | http_status_code, received_bytes, connect_timing_ms |
| short | 28 | elb_status_code, resolution_width, age, adv_engine_id |
| long | 10 | user_id, watch_id, referer_hash, metrics_size |
| float | 5 | request_processing_time_ms, memory_used_mb, gc_pause_ms |
| scaled_float | 6 | trip_distance, fare_amount, total_amount |
| half_float | 2 | tip_amount, cpu_percent |
| boolean | 14 | is_mobile, is_download, cookie_enable, javascript_enable |

Text bucket (used for the "text" portion of each distribution)

| Type | Count (in 100-field pool) | Examples |
|---|---|---|
| text | 10 | message, request_path, referrer_url, user_agent_string |
| match_only_text | 4 | log_message, stack_trace, request_body, response_body |
| keyword | 75 | http_method, country_name, trace_id, service_name |
| ip | 5 | client_ip, target_ip, source_ip |
| wildcard | 2 | request_url, referer_path |
| constant_keyword | 4 | data_stream_dataset, environment |

Realistic data characteristics

  • message: ~250 chars, structured log line with timestamp, level, service, trace ID, and metrics
  • log_message: ~500-700 chars, multi-line with headers, method, path, status, duration, and optional error details
  • stack_trace: ~1000-2000 chars when present (25% of docs), realistic Java stack traces with caused-by chains
  • request_body/response_body: ~200-400 chars JSON payloads (30-40% of docs)
  • user_agent_string: ~120-140 chars, full browser UA strings
  • referrer_url: ~60-110 chars, realistic Google/Facebook/GitHub referrer URLs

Data Sources & Inspiration

Field names and data patterns are derived from these real-world workloads:

  • http_logs — HTTP server access logs
  • clickbench — Web analytics (ClickHouse hits dataset)
  • nyc_taxis — NYC taxi trip records
  • big5 — ECS-formatted log data
  • AWS ALB access logs — Load balancer request/response metrics

Regenerating Templates

If you modify the field definitions or test procedures, regenerate the index.json and workload.json files:

python3 create_templates.py

This overwrites all index.json and workload.json files in the workload directories. It does NOT regenerate the data files.

Repository Structure

osb-workloads/
├── .gitignore              ← excludes *.json.gz from git
├── README.md
├── generate.py             ← main data generator (run this to create documents.json.gz)
├── field_generators.py     ← field definitions, vocabularies, and generator functions
├── create_templates.py     ← generates index.json + workload.json for all distributions
└── dist-{000-100,...}/     ← one directory per workload
    ├── index.json          ← Jinja2 template (index mappings + settings)
    ├── workload.json       ← Jinja2 template (operations + 6 test procedures)
    └── documents.json.gz   ← generated data (NOT in git, create via generate.py)

Notes

  • Data generation is deterministic (seeded with numpy.random.default_rng(42) and random.seed(42)). Running the generator twice produces identical output.
  • The pluggable dataformat test procedures require OpenSearch with the experimental pluggable_dataformat feature flag enabled. Without it, the index settings are accepted but the feature won't activate.
  • @timestamp values span ~5.8 days starting from 2024-01-01 00:00:00 UTC (epoch millis), with ~50ms between consecutive documents.
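
The ~5.8 day span follows from the document count and spacing. A small check, assuming a fixed 50 ms step (the note above says the spacing is approximate, so the real values may deviate):

```python
from datetime import datetime, timedelta, timezone

START = datetime(2024, 1, 1, tzinfo=timezone.utc)
STEP_MS = 50            # assumed constant spacing between documents
DOC_COUNT = 10_000_000


def timestamp_millis(doc_index):
    """Epoch-millisecond @timestamp for the document at `doc_index`."""
    return int(START.timestamp() * 1000) + doc_index * STEP_MS


span = timedelta(milliseconds=STEP_MS * (DOC_COUNT - 1))
span_days = span.total_seconds() / 86_400

print(timestamp_millis(0))                 # first document
print(timestamp_millis(DOC_COUNT - 1))     # last document
print(f"span: {span_days:.2f} days")
```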
