Synthetic OpenSearch Benchmark (OSB) workloads with varying text/numeric field distributions for benchmarking pluggable dataformat (Parquet/Lucene composite) ingestion performance.
Each workload has 101 fields (1 mandatory @timestamp + 100 distribution-specific fields) and 10 million documents. The field distributions vary from 100% numeric to 100% text, allowing you to measure how different data compositions affect ingestion throughput under various storage backends.
All field types are compatible with OpenSearch's pluggable dataformat feature — no geo_point, geo_shape, nested, join, or rank_feature fields are used.
| Workload | Text Fields | Numeric Fields | Total Fields | Compressed Size |
|---|---|---|---|---|
| dist-000-100 | 0 | 100 | 101 | ~3.7 GB |
| dist-010-090 | 10 | 90 | 101 | ~5.2 GB |
| dist-025-075 | 25 | 75 | 101 | ~8.5 GB |
| dist-050-050 | 50 | 50 | 101 | ~8.9 GB |
| dist-075-025 | 75 | 25 | 101 | ~9.3 GB |
| dist-090-010 | 90 | 10 | 101 | ~9.4 GB |
| dist-100-000 | 100 | 0 | 101 | ~9.6 GB |
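The workload names in the table encode the field split as `dist-<text>-<numeric>`. As a minimal sketch of that convention (the helper `parse_distribution` is hypothetical and not part of the repository), a name can be decoded like this:

```python
import re

# Hypothetical helper, not part of the repository: decode "dist-TTT-NNN"
# into (text_fields, numeric_fields) following the table above.
_NAME = re.compile(r"^dist-(\d{3})-(\d{3})$")


def parse_distribution(name: str) -> tuple[int, int]:
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a distribution workload name: {name!r}")
    text, numeric = int(match.group(1)), int(match.group(2))
    if text + numeric != 100:
        raise ValueError(f"field counts must sum to 100, got {text + numeric}")
    return text, numeric


if __name__ == "__main__":
    for name in ("dist-000-100", "dist-025-075", "dist-100-000"):
        text, numeric = parse_distribution(name)
        # +1 accounts for the mandatory @timestamp field.
        print(f"{name}: {text} text, {numeric} numeric, {text + numeric + 1} total")
```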
Each workload includes 6 test procedures covering all combinations of sorting and pluggable dataformat:
| Procedure | Index Sort | Pluggable Dataformat | Secondary Format |
|---|---|---|---|
| append-no-sort (default) | ❌ | ❌ | N/A |
| append-sorted | ✅ (@timestamp asc) | ❌ | N/A |
| append-parquet-only-sorted | ✅ | ✅ (composite/parquet) | none |
| append-parquet-only-unsorted | ❌ | ✅ (composite/parquet) | none |
| append-parquet-lucene-sorted | ✅ | ✅ (composite/parquet) | lucene |
| append-parquet-lucene-unsorted | ❌ | ✅ (composite/parquet) | lucene |
All procedures execute the same schedule: delete-index → create-index → cluster-health-check → bulk-ingest → refresh → force-merge → refresh → wait-for-merges.
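For the sorted procedures, index sort has to be configured when the index is created. The sketch below shows roughly what the sort-related settings could look like, using the standard `index.sort.field` and `index.sort.order` settings. The function name is hypothetical, and the pluggable dataformat settings are deliberately left out because this README does not spell them out.

```python
import json


def sort_settings(sorted_index: bool) -> dict:
    """Build index settings with or without an ascending @timestamp sort.

    Illustrative only: the real index.json templates may structure this
    differently and also carry the pluggable dataformat settings.
    """
    settings = {"index": {"number_of_shards": 1, "number_of_replicas": 0}}
    if sorted_index:
        # Index sort is a static setting: it must be set at index creation.
        settings["index"]["sort.field"] = "@timestamp"
        settings["index"]["sort.order"] = "asc"
    return settings


if __name__ == "__main__":
    print(json.dumps({"settings": sort_settings(True)}, indent=2))
```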
- Python 3.9+ with `numpy` installed
- OpenSearch Benchmark (`opensearch-benchmark`) installed
- A running OpenSearch cluster (local or remote)

    # Install OSB
    pip3 install opensearch-benchmark

    # Verify
    opensearch-benchmark --version

The data files (documents.json.gz) are not included in the repository due to their size. You must generate them locally.

    cd /path/to/osb-workloads

    # Generate a single workload (~25-40 minutes depending on text ratio)
    python3 generate.py dist-050-050

    # Generate ALL workloads (~3.5 hours total)
    python3 generate.py

Each workload generates a documents.json.gz file in its directory:

    dist-050-050/
    ├── documents.json.gz   ← generated (10M docs, gzip compressed)
    ├── index.json          ← Jinja2 template for index mappings
    └── workload.json       ← Jinja2 template for workload definition

| Workload | Generation Time | Compressed Output |
|---|---|---|
| dist-000-100 (all numeric) | ~25 min | ~3.7 GB |
| dist-050-050 (balanced) | ~37 min | ~8.9 GB |
| dist-100-000 (all text) | ~28 min | ~9.6 GB |

Disk space needed for the generated data:

- Single workload: 4–10 GB
- All 7 workloads: ~55 GB
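Because generation takes a long time, it can be worth checking free disk space before starting. This is a minimal sketch using the standard library. The helper name, the 10% headroom, and the use of decimal gigabytes are assumptions made for illustration.

```python
import shutil

GB = 1000 ** 3  # assumption: the sizes above are decimal gigabytes


def has_room(free_bytes: int, needed_gb: float, headroom: float = 1.1) -> bool:
    """Return True if free_bytes covers needed_gb plus a safety margin."""
    return free_bytes >= needed_gb * GB * headroom


if __name__ == "__main__":
    free = shutil.disk_usage(".").free
    for label, needed in (("single workload", 10), ("all 7 workloads", 55)):
        status = "ok" if has_room(free, needed) else "NOT enough space"
        print(f"{label}: need ~{needed} GB, free {free / GB:.1f} GB -> {status}")
```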
OSB looks for data files in its local cache directory (~/.osb/benchmarks/data/). You need to symlink the generated data there:

    # For a single workload
    mkdir -p ~/.osb/benchmarks/data/dist-050-050
    ln -sf "$(pwd)/dist-050-050/documents.json.gz" ~/.osb/benchmarks/data/dist-050-050/documents.json.gz

    # For ALL workloads (run from repo root)
    for d in dist-*; do
      mkdir -p ~/.osb/benchmarks/data/"$d"
      ln -sf "$(pwd)/$d/documents.json.gz" ~/.osb/benchmarks/data/"$d"/documents.json.gz
    done

Then run a test procedure against the cluster. The examples below use dist-050-050:

    # Default test procedure (append-no-sort), full corpus
    opensearch-benchmark run \
      --workload-path=./dist-050-050 \
      --target-hosts=http://localhost:9200 \
      --pipeline=benchmark-only \
      --test-procedure=append-no-sort

    # Quick check: ingest only 1% of the corpus
    opensearch-benchmark run \
      --workload-path=./dist-050-050 \
      --target-hosts=http://localhost:9200 \
      --pipeline=benchmark-only \
      --test-procedure=append-no-sort \
      --workload-params='{"ingest_percentage": 1}'

    # Parquet + Lucene composite format with index sort
    opensearch-benchmark run \
      --workload-path=./dist-050-050 \
      --target-hosts=http://localhost:9200 \
      --pipeline=benchmark-only \
      --test-procedure=append-parquet-lucene-sorted

    # Sorted ingest with custom workload parameters
    opensearch-benchmark run \
      --workload-path=./dist-050-050 \
      --target-hosts=http://localhost:9200 \
      --pipeline=benchmark-only \
      --test-procedure=append-sorted \
      --workload-params='{"ingest_percentage": 50, "bulk_size": 10000, "bulk_indexing_clients": 16, "number_of_shards": 3}'

    # Show details of the workload
    opensearch-benchmark info --workload-path=./dist-050-050

The following parameters can be passed via --workload-params as JSON:

| Parameter | Default | Description |
|---|---|---|
| ingest_percentage | 100 | Percentage of the 10M document corpus to ingest (1-100) |
| bulk_size | 5000 | Number of documents per bulk request |
| bulk_indexing_clients | 8 | Number of concurrent bulk indexing clients |
| number_of_shards | 1 | Number of primary shards for the index |
| number_of_replicas | 0 | Number of replica shards |
| warmup_time_period | 120 | Warmup period in seconds before measurement starts |
| cluster_health | green | Required cluster health before ingestion begins |
| error_level | non-fatal | Error handling for bulk responses |
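Hand-written JSON on the command line is easy to get wrong. The following sketch builds the `--workload-params` string from keyword arguments and estimates how many bulk requests a run would issue. `build_params` and `bulk_requests` are hypothetical helpers. The defaults are copied from the table above, and the request estimate assumes the ingested document count is simply the stated percentage of 10M.

```python
import json

CORPUS_DOCS = 10_000_000

# Defaults copied from the parameter table above.
DEFAULTS = {
    "ingest_percentage": 100,
    "bulk_size": 5000,
    "bulk_indexing_clients": 8,
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "warmup_time_period": 120,
    "cluster_health": "green",
    "error_level": "non-fatal",
}


def build_params(**overrides) -> str:
    """Validate overrides and serialize them for --workload-params."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown workload parameters: {sorted(unknown)}")
    pct = overrides.get("ingest_percentage", DEFAULTS["ingest_percentage"])
    if not 1 <= pct <= 100:
        raise ValueError("ingest_percentage must be between 1 and 100")
    return json.dumps(overrides)


def bulk_requests(ingest_percentage: int = 100, bulk_size: int = 5000) -> int:
    """Estimate the number of bulk requests (ceiling division)."""
    docs = CORPUS_DOCS * ingest_percentage // 100
    return -(-docs // bulk_size)


if __name__ == "__main__":
    params = build_params(ingest_percentage=50, bulk_size=10000)
    print(f"--workload-params='{params}'")
    print("estimated bulk requests:", bulk_requests(50, 10000))
```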
| Numeric Type | Count (in 100-field pool) | Examples |
|---|---|---|
| date | 9 | event_time, pickup_datetime, session_start_time |
| integer | 22 | http_status_code, received_bytes, connect_timing_ms |
| short | 28 | elb_status_code, resolution_width, age, adv_engine_id |
| long | 10 | user_id, watch_id, referer_hash, metrics_size |
| float | 5 | request_processing_time_ms, memory_used_mb, gc_pause_ms |
| scaled_float | 6 | trip_distance, fare_amount, total_amount |
| half_float | 2 | tip_amount, cpu_percent |
| boolean | 14 | is_mobile, is_download, cookie_enable, javascript_enable |

| Text Type | Count (in 100-field pool) | Examples |
|---|---|---|
| text | 10 | message, request_path, referrer_url, user_agent_string |
| match_only_text | 4 | log_message, stack_trace, request_body, response_body |
| keyword | 75 | http_method, country_name, trace_id, service_name |
| ip | 5 | client_ip, target_ip, source_ip |
| wildcard | 2 | request_url, referer_path |
| constant_keyword | 4 | data_stream_dataset, environment |

- message: ~250 chars, structured log line with timestamp, level, service, trace ID, and metrics
- log_message: ~500-700 chars, multi-line with headers, method, path, status, duration, and optional error details
- stack_trace: ~1000-2000 chars when present (25% of docs), realistic Java stack traces with caused-by chains
- request_body/response_body: ~200-400 chars JSON payloads (30-40% of docs)
- user_agent_string: ~120-140 chars, full browser UA strings
- referrer_url: ~60-110 chars, realistic Google/Facebook/GitHub referrer URLs
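Some of the fields above appear only in a fraction of documents. As a rough illustration of how such optional presence can be sampled reproducibly, the sketch below draws a presence flag per document from a seeded generator. It is not the repository's generator: the function name is made up, and it uses the standard-library `random` module with a fixed 25% rate for `stack_trace`.

```python
import random

STACK_TRACE_RATE = 0.25  # stack_trace is present in about 25% of documents


def sample_presence(num_docs: int, seed: int = 42) -> list[bool]:
    """Return one presence flag per document, reproducible for a given seed."""
    rng = random.Random(seed)  # private RNG, leaves the global state untouched
    return [rng.random() < STACK_TRACE_RATE for _ in range(num_docs)]


if __name__ == "__main__":
    flags = sample_presence(100_000)
    print(f"documents with stack_trace: {sum(flags) / len(flags):.1%}")
    # The same seed yields the same sequence on every run.
    print("deterministic:", flags == sample_presence(100_000))
```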
Field names and data patterns are derived from these real-world workloads:
- http_logs — HTTP server access logs
- clickbench — Web analytics (ClickHouse hits dataset)
- nyc_taxis — NYC taxi trip records
- big5 — ECS-formatted log data
- AWS ALB access logs — Load balancer request/response metrics
If you modify the field definitions or test procedures, regenerate the index.json and workload.json files:

    python3 create_templates.py

This overwrites all index.json and workload.json files in the workload directories. It does NOT regenerate the data files.
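To give a sense of what a generated mapping contains, here is a simplified sketch that turns a field-name-to-type dictionary into an OpenSearch `mappings` body. It does not reproduce `create_templates.py` and ignores the Jinja2 templating. The function name and the `scaling_factor` value of 100 are assumptions. A `scaled_float` field does require some `scaling_factor`, though.

```python
import json


def build_mappings(fields: dict[str, str]) -> dict:
    """Build a mappings body from {field_name: field_type}."""
    # Every workload carries the mandatory @timestamp field.
    properties = {"@timestamp": {"type": "date"}}
    for name, field_type in fields.items():
        spec = {"type": field_type}
        if field_type == "scaled_float":
            # scaled_float needs a scaling_factor; 100 is an assumed value.
            spec["scaling_factor"] = 100
        properties[name] = spec
    return {"mappings": {"properties": properties}}


if __name__ == "__main__":
    sample = {
        "http_status_code": "integer",
        "fare_amount": "scaled_float",
        "message": "text",
        "client_ip": "ip",
    }
    print(json.dumps(build_mappings(sample), indent=2))
```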

    osb-workloads/
    ├── .gitignore              ← excludes *.json.gz from git
    ├── README.md
    ├── generate.py             ← main data generator (run this to create documents.json.gz)
    ├── field_generators.py     ← field definitions, vocabularies, and generator functions
    ├── create_templates.py     ← generates index.json + workload.json for all distributions
    └── dist-{000-100,...}/     ← one directory per workload
        ├── index.json          ← Jinja2 template (index mappings + settings)
        ├── workload.json       ← Jinja2 template (operations + 6 test procedures)
        └── documents.json.gz   ← generated data (NOT in git, create via generate.py)

- Data generation is deterministic (seeded with `numpy.random.default_rng(42)` and `random.seed(42)`). Running the generator twice produces identical output.
- The pluggable dataformat test procedures require OpenSearch with the experimental `pluggable_dataformat` feature flag enabled. Without it, the index settings are accepted but the feature won't activate.
- `@timestamp` values span ~5.8 days starting from 2024-01-01 00:00:00 UTC (epoch millis), with ~50ms between consecutive documents.
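
The timestamp span follows from the document count and spacing. The sketch below reproduces the arithmetic under the simplifying assumption of an exact, fixed 50 ms step. The real generator may add jitter, and the names here are illustrative.

```python
from datetime import datetime, timezone

# 2024-01-01 00:00:00 UTC expressed in epoch milliseconds.
START_MS = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
STEP_MS = 50            # assumed exact spacing between consecutive documents
NUM_DOCS = 10_000_000
MS_PER_DAY = 86_400_000


def timestamp_ms(doc_index: int) -> int:
    """Epoch-millisecond timestamp of the document at doc_index (0-based)."""
    return START_MS + doc_index * STEP_MS


def span_days() -> float:
    """Time between the first and last document, in days."""
    return (timestamp_ms(NUM_DOCS - 1) - timestamp_ms(0)) / MS_PER_DAY


if __name__ == "__main__":
    print("first timestamp (ms):", timestamp_ms(0))
    print(f"span: {span_days():.2f} days")
```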