OSB Distribution Workloads

Synthetic OpenSearch Benchmark (OSB) workloads with varying ratios of text to numeric fields, for benchmarking ingestion performance with the pluggable dataformat feature (Parquet/Lucene composite).

Overview

Each workload has 101 fields (1 mandatory @timestamp + 100 distribution-specific fields) and 10 million documents. The field distributions vary from 100% numeric to 100% text, allowing you to measure how different data compositions affect ingestion throughput under various storage backends.

All field types are compatible with OpenSearch's pluggable dataformat feature — no geo_point, geo_shape, nested, join, or rank_feature fields are used.

Workloads

Workload	Text Fields	Numeric Fields	Total Fields	Compressed Size
`dist-000-100`	0	100	101	~3.7 GB
`dist-010-090`	10	90	101	~5.2 GB
`dist-025-075`	25	75	101	~8.5 GB
`dist-050-050`	50	50	101	~8.9 GB
`dist-075-025`	75	25	101	~9.3 GB
`dist-090-010`	90	10	101	~9.4 GB
`dist-100-000`	100	0	101	~9.6 GB
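
The workload names in the table follow the pattern `dist-TTT-NNN`, where TTT is the text field count and NNN the numeric field count. A minimal sketch of decoding a name, assuming that convention holds for every workload directory; `parse_workload_name` is a hypothetical helper and is not part of the repository.

```python
import re

# Names look like "dist-025-075": text count first, numeric count second.
_NAME_RE = re.compile(r"^dist-(\d{3})-(\d{3})$")


def parse_workload_name(name: str) -> tuple[int, int]:
    """Return (text_fields, numeric_fields) for a workload directory name."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a distribution workload name: {name!r}")
    text_fields, numeric_fields = int(match.group(1)), int(match.group(2))
    if text_fields + numeric_fields != 100:
        raise ValueError(f"field counts in {name!r} do not sum to 100")
    return text_fields, numeric_fields


if __name__ == "__main__":
    text, numeric = parse_workload_name("dist-025-075")
    # The mandatory @timestamp field brings the total to 101.
    print(text, numeric, text + numeric + 1)
```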

Test Procedures

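Each workload includes 6 test procedures covering every combination of index sorting (on or off) and storage format (pluggable dataformat disabled, Parquet only, or Parquet with Lucene as the secondary format):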

Procedure	Index Sort	Pluggable Dataformat	Secondary Format
`append-no-sort` (default)	❌	❌	N/A
`append-sorted`	✅ (`@timestamp` asc)	❌	N/A
`append-parquet-only-sorted`	✅	✅ (composite/parquet)	none
`append-parquet-only-unsorted`	❌	✅ (composite/parquet)	none
`append-parquet-lucene-sorted`	✅	✅ (composite/parquet)	lucene
`append-parquet-lucene-unsorted`	❌	✅ (composite/parquet)	lucene

All procedures execute the same schedule: delete-index → create-index → cluster-health-check → bulk-ingest → refresh → force-merge → refresh → wait-for-merges.
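
The six procedures are the cross product of three storage variants and two sort settings. The sketch below enumerates that matrix and reproduces the procedure names from the table. It is illustrative only: `STORAGE_VARIANTS` and `procedure_matrix` are hypothetical names, and the repository's own template generator may build the procedures differently.

```python
from itertools import product

# (name fragment, pluggable dataformat enabled, secondary format), as listed in the table.
STORAGE_VARIANTS = [
    (None, False, "N/A"),
    ("parquet-only", True, "none"),
    ("parquet-lucene", True, "lucene"),
]


def procedure_matrix() -> list[dict]:
    """Enumerate every (storage variant, index sort) pair with its procedure name."""
    rows = []
    for (fragment, pluggable, secondary), index_sort in product(
        STORAGE_VARIANTS, (False, True)
    ):
        if fragment is None:
            # The two non-pluggable procedures use their own naming scheme.
            name = "append-sorted" if index_sort else "append-no-sort"
        else:
            suffix = "sorted" if index_sort else "unsorted"
            name = f"append-{fragment}-{suffix}"
        rows.append(
            {
                "name": name,
                "index_sort": index_sort,
                "pluggable_dataformat": pluggable,
                "secondary_format": secondary,
            }
        )
    return rows


if __name__ == "__main__":
    for row in procedure_matrix():
        print(row)
```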

Prerequisites

Python 3.9+ with numpy installed
OpenSearch Benchmark (opensearch-benchmark) installed
A running OpenSearch cluster (local or remote)

# Install OSB
pip3 install opensearch-benchmark

# Verify
opensearch-benchmark --version
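
The local prerequisites can also be checked from Python before starting a long generation run. This is a rough sketch: `check_prerequisites` is a hypothetical helper, and it only tests that numpy is importable and that an `opensearch-benchmark` executable is on `PATH`. It does not check that a cluster is reachable.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report which local prerequisites appear to be satisfied."""
    return {
        "python>=3.9": sys.version_info >= (3, 9),
        "numpy": importlib.util.find_spec("numpy") is not None,
        "opensearch-benchmark": shutil.which("opensearch-benchmark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```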

Step 1: Generate Data

The data files (documents.json.gz) are not included in the repository due to their size. You must generate them locally.

cd /path/to/osb-workloads

# Generate a single workload (~25-40 minutes depending on text ratio)
python3 generate.py dist-050-050

# Generate ALL workloads (~3.5 hours total)
python3 generate.py

Each workload generates a documents.json.gz file in its directory:

dist-050-050/
├── documents.json.gz   ← generated (10M docs, gzip compressed)
├── index.json          ← Jinja2 template for index mappings
└── workload.json       ← Jinja2 template for workload definition
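
After generation, a quick sanity check is to count the documents in the archive and look at the first one. The sketch below assumes the file holds one flat JSON object per line; `summarize_corpus` is a hypothetical helper, not something the repository provides. Reading all 10M documents takes a while, so the optional `limit` stops early.

```python
import gzip
import json


def summarize_corpus(path, limit=None):
    """Return (documents_read, top_level_field_count_of_first_document)."""
    count = 0
    first_fields = None
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate a trailing blank line
            if first_fields is None:
                first_fields = len(json.loads(line))
            count += 1
            if limit is not None and count >= limit:
                break
    return count, first_fields


if __name__ == "__main__":
    # Sample only the first 1,000 documents for a fast check.
    print(summarize_corpus("dist-050-050/documents.json.gz", limit=1000))
```

If the assumption about the line format holds, the second value should be 101 for any workload.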

Generation time estimates

Workload	Time	Compressed Output
dist-000-100 (all numeric)	~25 min	~3.7 GB
dist-050-050 (balanced)	~37 min	~8.9 GB
dist-100-000 (all text)	~28 min	~9.6 GB

Disk space required

Single workload: 4–10 GB
All 7 workloads: ~55 GB
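
The ~55 GB total is the sum of the compressed sizes from the Workloads table. The sketch below recomputes it and compares it with the free space on a target path. `COMPRESSED_GB` copies the approximate sizes from that table; `required_gb`, `has_room`, and the 10% headroom factor are assumptions added for illustration.

```python
import shutil

# Approximate compressed sizes in GB, copied from the Workloads table.
COMPRESSED_GB = {
    "dist-000-100": 3.7,
    "dist-010-090": 5.2,
    "dist-025-075": 8.5,
    "dist-050-050": 8.9,
    "dist-075-025": 9.3,
    "dist-090-010": 9.4,
    "dist-100-000": 9.6,
}


def required_gb(workloads) -> float:
    """Sum the approximate compressed size of the given workloads."""
    return sum(COMPRESSED_GB[name] for name in workloads)


def has_room(path, workloads, headroom: float = 1.1) -> bool:
    """True if the filesystem holding `path` has enough free space (with headroom)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(workloads) * headroom


if __name__ == "__main__":
    print(f"all workloads: ~{required_gb(COMPRESSED_GB):.1f} GB")
    print("enough space here:", has_room(".", COMPRESSED_GB))
```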

Step 2: Link Data for OSB

OSB looks for data files in its local cache directory (~/.osb/benchmarks/data/). You need to symlink the generated data there:

# For a single workload (run from repo root; quotes keep paths with spaces intact)
mkdir -p ~/.osb/benchmarks/data/dist-050-050
ln -sf "$(pwd)/dist-050-050/documents.json.gz" ~/.osb/benchmarks/data/dist-050-050/documents.json.gz

# For ALL workloads (run from repo root)
for d in dist-*; do
  mkdir -p ~/.osb/benchmarks/data/"$d"
  ln -sf "$(pwd)/$d/documents.json.gz" ~/.osb/benchmarks/data/"$d"/documents.json.gz
done
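
The same linking step can be done in Python, which may be convenient when it is part of a larger automation script. This is a sketch under the same assumptions as the shell commands (the repository root contains the `dist-*` directories, and the cache directory is `~/.osb/benchmarks/data/`); `link_corpora` is a hypothetical helper. Unlike the shell loop, it skips workloads whose data file has not been generated yet instead of creating a dangling link.

```python
from pathlib import Path


def link_corpora(repo_root: Path, data_root: Path) -> list[Path]:
    """Symlink each generated documents.json.gz into the OSB data directory."""
    links = []
    for workload_dir in sorted(repo_root.glob("dist-*")):
        source = workload_dir / "documents.json.gz"
        if not source.is_file():
            continue  # not generated yet
        target_dir = data_root / workload_dir.name
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / "documents.json.gz"
        if link.is_symlink() or link.exists():
            link.unlink()  # mirror the overwrite behaviour of `ln -sf`
        link.symlink_to(source.resolve())
        links.append(link)
    return links


if __name__ == "__main__":
    created = link_corpora(Path.cwd(), Path.home() / ".osb" / "benchmarks" / "data")
    for link in created:
        print(link)
```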

Step 3: Run a Benchmark

Basic run (default procedure, full ingestion)

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-no-sort

Quick test with 1% ingestion

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-no-sort \
  --workload-params='{"ingest_percentage": 1}'

Run with pluggable dataformat (parquet + lucene, sorted)

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-parquet-lucene-sorted

Run with custom parameters

opensearch-benchmark run \
  --workload-path=./dist-050-050 \
  --target-hosts=http://localhost:9200 \
  --pipeline=benchmark-only \
  --test-procedure=append-sorted \
  --workload-params='{"ingest_percentage": 50, "bulk_size": 10000, "bulk_indexing_clients": 16, "number_of_shards": 3}'
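
When scripting many runs, hand-writing the JSON for `--workload-params` is error-prone. The sketch below assembles the same command line as the examples above from a Python dict. `build_run_command` is a hypothetical helper; it only uses the flags already shown in this section and does not execute anything.

```python
import json
import shlex


def build_run_command(
    workload_path: str,
    test_procedure: str,
    target_hosts: str = "http://localhost:9200",
    params=None,
) -> list[str]:
    """Build the argv list for an `opensearch-benchmark run` invocation."""
    argv = [
        "opensearch-benchmark",
        "run",
        f"--workload-path={workload_path}",
        f"--target-hosts={target_hosts}",
        "--pipeline=benchmark-only",
        f"--test-procedure={test_procedure}",
    ]
    if params:
        # json.dumps guarantees valid JSON regardless of shell quoting.
        argv.append(f"--workload-params={json.dumps(params)}")
    return argv


if __name__ == "__main__":
    argv = build_run_command(
        "./dist-050-050",
        "append-sorted",
        params={"ingest_percentage": 50, "bulk_size": 10000},
    )
    # Print a copy-pasteable shell command; pass `argv` to subprocess.run to execute it.
    print(shlex.join(argv))
```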

View workload info (no cluster needed)

opensearch-benchmark info --workload-path=./dist-050-050

Workload Parameters

Pass these via --workload-params as JSON:

Parameter	Default	Description
`ingest_percentage`	100	Percentage of the 10M document corpus to ingest (1-100)
`bulk_size`	5000	Number of documents per bulk request
`bulk_indexing_clients`	8	Number of concurrent bulk indexing clients
`number_of_shards`	1	Number of primary shards for the index
`number_of_replicas`	0	Number of replica shards
`warmup_time_period`	120	Warmup period in seconds before measurement starts
`cluster_health`	green	Required cluster health before ingestion begins
`error_level`	non-fatal	Error handling for bulk responses
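
To get a feel for how `ingest_percentage` and `bulk_size` interact, the following sketch estimates the number of documents and bulk requests for a run. It is back-of-the-envelope arithmetic only: `ingest_plan` is a hypothetical helper, and how OSB itself rounds the percentage is not covered here.

```python
import math

CORPUS_DOCS = 10_000_000  # documents per workload


def ingest_plan(ingest_percentage: int = 100, bulk_size: int = 5000) -> tuple[int, int]:
    """Estimate (documents ingested, bulk requests issued) for a run."""
    if not 1 <= ingest_percentage <= 100:
        raise ValueError("ingest_percentage must be between 1 and 100")
    if bulk_size < 1:
        raise ValueError("bulk_size must be positive")
    docs = CORPUS_DOCS * ingest_percentage // 100
    requests = math.ceil(docs / bulk_size)
    return docs, requests


if __name__ == "__main__":
    print(ingest_plan(ingest_percentage=1))   # quick test with defaults
    print(ingest_plan(50, bulk_size=10000))   # custom-parameters example
```

With the defaults, a 1% run covers about 100,000 documents in roughly 20 bulk requests, which is why it works well as a smoke test.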

Field Types

Numeric bucket (used for the "numeric" portion of each distribution)

Type	Count (in 100-field pool)	Examples
`date`	9	event_time, pickup_datetime, session_start_time
`integer`	22	http_status_code, received_bytes, connect_timing_ms
`short`	28	elb_status_code, resolution_width, age, adv_engine_id
`long`	10	user_id, watch_id, referer_hash, metrics_size
`float`	5	request_processing_time_ms, memory_used_mb, gc_pause_ms
`scaled_float`	6	trip_distance, fare_amount, total_amount
`half_float`	2	tip_amount, cpu_percent
`boolean`	14	is_mobile, is_download, cookie_enable, javascript_enable

Text bucket (used for the "text" portion of each distribution)

Type	Count (in 100-field pool)	Examples
`text`	10	message, request_path, referrer_url, user_agent_string
`match_only_text`	4	log_message, stack_trace, request_body, response_body
`keyword`	75	http_method, country_name, trace_id, service_name
`ip`	5	client_ip, target_ip, source_ip
`wildcard`	2	request_url, referer_path
`constant_keyword`	4	data_stream_dataset, environment

Realistic data characteristics

message: ~250 chars, structured log line with timestamp, level, service, trace ID, and metrics
log_message: ~500-700 chars, multi-line with headers, method, path, status, duration, and optional error details
stack_trace: ~1000-2000 chars when present (25% of docs), realistic Java stack traces with caused-by chains
request_body/response_body: ~200-400 chars JSON payloads (30-40% of docs)
user_agent_string: ~120-140 chars, full browser UA strings
referrer_url: ~60-110 chars, realistic Google/Facebook/GitHub referrer URLs

Data Sources & Inspiration

Field names and data patterns are derived from these real-world workloads:

http_logs — HTTP server access logs
clickbench — Web analytics (ClickHouse hits dataset)
nyc_taxis — NYC taxi trip records
big5 — ECS-formatted log data
AWS ALB access logs — Load balancer request/response metrics

Regenerating Templates

If you modify the field definitions or test procedures, regenerate the index.json and workload.json files:

python3 create_templates.py

This overwrites all index.json and workload.json files in the workload directories. It does NOT regenerate the data files.

Repository Structure

osb-workloads/
├── .gitignore              ← excludes *.json.gz from git
├── README.md
├── generate.py             ← main data generator (run this to create documents.json.gz)
├── field_generators.py     ← field definitions, vocabularies, and generator functions
├── create_templates.py     ← generates index.json + workload.json for all distributions
└── dist-{000-100,...}/     ← one directory per workload
    ├── index.json          ← Jinja2 template (index mappings + settings)
    ├── workload.json       ← Jinja2 template (operations + 6 test procedures)
    └── documents.json.gz   ← generated data (NOT in git, create via generate.py)

Notes

Data generation is deterministic (seeded with numpy.random.default_rng(42) and random.seed(42)). Running the generator twice produces identical output.
The pluggable dataformat test procedures require OpenSearch with the experimental pluggable_dataformat feature flag enabled. Without it, the index settings are accepted but the feature won't activate.
@timestamp values span ~5.8 days starting from 2024-01-01 00:00:00 UTC (epoch millis), with ~50ms between consecutive documents.
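
The ~5.8 day span follows from the document count and the spacing: 10 million documents at about 50 ms apart cover roughly 500,000 seconds. The sketch below works through that arithmetic, assuming a fixed step of exactly 50 ms (the generator's actual spacing is only stated as approximate); `timestamp_millis` and `span_days` are hypothetical names.

```python
from datetime import datetime, timezone

START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # first @timestamp
STEP_MS = 50                # assumed fixed spacing between documents
DOC_COUNT = 10_000_000


def timestamp_millis(doc_index: int) -> int:
    """Epoch-millisecond @timestamp of the document at `doc_index` (0-based)."""
    start_ms = int(START.timestamp() * 1000)
    return start_ms + doc_index * STEP_MS


def span_days() -> float:
    """Days between the first and last document."""
    return (DOC_COUNT - 1) * STEP_MS / 86_400_000


if __name__ == "__main__":
    print(timestamp_millis(0))
    print(round(span_days(), 2))
```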
