The AgentMemory benchmarking system provides a comprehensive framework for measuring, analyzing, and visualizing the performance of the AgentMemory system across various dimensions. It enables systematic evaluation of key performance metrics such as throughput, latency, memory efficiency, and scalability.
This benchmarking system allows developers to:
- Run standardized benchmarks to evaluate system performance
- Compare results across different configurations and implementations
- Generate detailed reports and visualizations
- Establish performance baselines and detect regressions
The overall architecture is summarized in the diagram below:

```mermaid
flowchart TD
    Config[BenchmarkConfig] --> Runner[BenchmarkRunner]
    Runner --> Benchmarks[Individual Benchmarks]
    Benchmarks --> Results[BenchmarkResults]
    Results --> Reports[HTML Reports]
    Results --> Plots[Visualization Plots]
    Results --> Comparison[Baseline Comparison]
```
The benchmarking system includes several categories of benchmarks, each focusing on a different aspect of the AgentMemory system:
- Storage Performance: Evaluates raw storage capabilities (write throughput, read latency, memory efficiency)
- Compression Effectiveness: Measures neural compression system capabilities (embedding quality, compression ratio)
- Memory Transition: Tests effectiveness of memory tier transitions (transition accuracy, importance scoring)
- Retrieval Performance: Evaluates memory retrieval capabilities (search latency, cross-tier retrieval)
- Scalability: Measures system performance under increasing load (agent count scaling, memory size scaling)
- Integration Tests: Evaluates integration with existing agent systems (API overhead, real-world integration)
Note: Currently, only the Storage Performance benchmark category is fully implemented. Other categories are planned for future development.
The benchmarking system consists of several components:
- BenchmarkConfig: Configuration classes for customizing benchmark parameters
- BenchmarkRunner: Core engine for discovering and executing benchmarks
- BenchmarkResults: Storage and analysis of benchmark results
- Benchmark Implementations: Individual benchmark modules for each category
- CLI Interface: Command-line tools for running benchmarks and analyzing results
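Taken together, these components are typically combined along the following lines. This is a minimal sketch for orientation only: the `BenchmarkRunner` constructor and `run_category()` call are assumptions made for illustration, not the confirmed API, while the `BenchmarkConfig` and `BenchmarkResults` usage mirrors the examples shown later in this document.

```python
from memory.benchmarking.config import BenchmarkConfig
from memory.benchmarking.results import BenchmarkResults
from memory.benchmarking.runner import BenchmarkRunner

# Build a configuration (attributes shown later in this document)
config = BenchmarkConfig()
config.storage.batch_sizes = [10, 100]

# Hypothetical: hand the config to the runner and execute one category.
# The exact constructor and method names here are assumptions.
runner = BenchmarkRunner(config)
runner.run_category("storage")

# Inspect what was produced using the results manager
results = BenchmarkResults(config.output_dir)
print(results.list_results(category="storage"))
```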
The benchmarking system requires the following dependencies:

- pandas
- matplotlib

These are used for data analysis and visualization of benchmark results.
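If they are not already available in your environment, both can be installed with pip, for example:

```bash
pip install pandas matplotlib
```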
The benchmarking package is organized as follows:

```
memory/benchmarking/
├── __init__.py        # Package initialization
├── cli.py             # Command-line interface
├── config.py          # Configuration classes
├── runner.py          # Benchmark runner
├── results.py         # Results handling
├── README.md          # Quick start documentation
└── benchmarks/        # Individual benchmark implementations
    ├── __init__.py    # Benchmark package initialization
    └── storage.py     # Storage benchmarks
```
The benchmarking system provides a command-line interface for running benchmarks:
```bash
# Run all benchmarks
python -m memory.benchmarking.cli run

# Run a specific category
python -m memory.benchmarking.cli run --category storage

# Run a specific benchmark
python -m memory.benchmarking.cli run --category storage --benchmark write_throughput

# Run benchmarks in parallel
python -m memory.benchmarking.cli run --parallel --workers 8

# Run with custom configuration
python -m memory.benchmarking.cli run --config custom_config.json
```

The same commands are also available through the convenience script:

```bash
# Run all benchmarks
python scripts/benchmark.py run

# Run a specific category
python scripts/benchmark.py run --category storage

# Run a specific benchmark
python scripts/benchmark.py run --category storage --benchmark write_throughput

# Run benchmarks in parallel
python scripts/benchmark.py run --parallel --workers 8

# Run with custom configuration
python scripts/benchmark.py run --config custom_config.json
```

Existing results can be listed, reported on, and compared against a baseline:

```bash
# List available results
python -m memory.benchmarking.cli results list

# Generate a report from results
python -m memory.benchmarking.cli results report --category storage

# Compare results with a baseline
python -m memory.benchmarking.cli compare --current-dir results/run1 --baseline-dir results/baseline
```

The benchmarking system is designed to be easily extensible. You can create custom benchmarks by adding new benchmark functions to the appropriate module.
Benchmark functions should follow this structure:
```python
def benchmark_my_custom_benchmark(param1: int = 100, param2: float = 0.5, **kwargs) -> Dict[str, Any]:
    """Description of what this benchmark measures.

    Args:
        param1: Description of parameter 1
        param2: Description of parameter 2

    Returns:
        Dictionary with benchmark results
    """
    results = {}

    # Benchmark implementation
    # ...

    return results
```

Key points:
- Function name must start with `benchmark_`
- Accept configuration parameters with sensible defaults
- Include `**kwargs` for flexibility with configuration
- Return a dictionary with structured results
- Include a proper docstring with description and parameter documentation
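Once a function following these conventions is added to a category module, it can be run through the same CLI shown earlier. The benchmark name is the function name with the `benchmark_` prefix stripped (mirroring `write_throughput` above), so the hypothetical function above would be invoked as:

```bash
python -m memory.benchmarking.cli run --category storage --benchmark my_custom_benchmark
```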
Here's an example of a storage benchmark implementation:
```python
import os
import tempfile
import time
from typing import Any, Dict, List, Optional

# MemoryConfig, AgentMemorySystem and generate_test_state are provided by the
# AgentMemory package and the benchmark module's own helpers (imports omitted here).


def benchmark_write_throughput(batch_sizes: Optional[List[int]] = None,
                               data_complexity_levels: Optional[List[int]] = None,
                               **kwargs) -> Dict[str, Any]:
    """Benchmark write throughput across memory tiers.

    Args:
        batch_sizes: List of batch sizes to test
        data_complexity_levels: List of data complexity levels to test

    Returns:
        Dictionary with benchmark results
    """
    batch_sizes = batch_sizes or [10, 100, 1000, 10000]
    data_complexity_levels = data_complexity_levels or [1, 5, 10]

    results = {}

    for complexity in data_complexity_levels:
        complexity_results = {}

        for batch_size in batch_sizes:
            print(f"Testing write throughput with batch_size={batch_size}, complexity={complexity}")

            # Create a fresh memory system for each test
            config = MemoryConfig()
            temp_db_path = os.path.join(tempfile.gettempdir(), f"benchmark_ltm_{complexity}_{batch_size}.db")
            config.ltm_config.db_path = temp_db_path
            memory_system = AgentMemorySystem(config)

            # Generate test data
            test_states = [generate_test_state(complexity) for _ in range(batch_size)]

            # Test STM write throughput
            start_time = time.time()
            for i, state in enumerate(test_states):
                memory_system.store_agent_state(
                    agent_id="benchmark_agent",
                    state_data=state,
                    step_number=i
                )
            end_time = time.time()

            execution_time = end_time - start_time
            stm_throughput = batch_size / execution_time if execution_time > 0 else 0

            batch_results = {
                "throughput_ops_per_second": stm_throughput,
                "total_time_seconds": execution_time,
                "items_processed": batch_size
            }

            complexity_results[f"batch_{batch_size}"] = batch_results

            # Clean up temp file
            if os.path.exists(temp_db_path):
                try:
                    os.remove(temp_db_path)
                except OSError:
                    pass

        results[f"complexity_{complexity}"] = complexity_results

    return results
```

Benchmark results are stored in a structured JSON format:
```json
{
  "benchmark": "write_throughput",
  "category": "storage",
  "timestamp": "20250323_145623",
  "results": {
    "complexity_1": {
      "batch_100": {
        "throughput_ops_per_second": 1250.5,
        "total_time_seconds": 0.08,
        "items_processed": 100
      }
    }
  },
  "metadata": {
    "execution_time": 15.2,
    "params": {
      "batch_sizes": [10, 100, 1000, 10000],
      "data_complexity_levels": [1, 5, 10]
    }
  }
}
```
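Because each result file is plain JSON, it can also be inspected outside the built-in tooling, for example with pandas. The file path below is purely illustrative; the actual layout under the results directory may differ:

```python
import json

import pandas as pd

# Illustrative path; adjust to wherever your results were written
with open("benchmark_results/storage/write_throughput_20250323_145623.json") as f:
    result = json.load(f)

# Flatten the nested results into one row per (complexity, batch size) combination
rows = [
    {"complexity": complexity, "batch": batch, **metrics}
    for complexity, batches in result["results"].items()
    for batch, metrics in batches.items()
]
print(pd.DataFrame(rows))
```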
The benchmarking system uses a configuration system to customize benchmark parameters:

```python
from memory.benchmarking.config import BenchmarkConfig

# Create default config
config = BenchmarkConfig()

# Customize configuration
config.storage.batch_sizes = [10, 50, 100]
config.retrieval.similarity_thresholds = [0.6, 0.7, 0.8]
config.output_dir = "custom_benchmark_results"

# Save configuration for reuse
config.save("custom_config.json")
```
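The saved file is what the `--config` flag shown earlier expects. Assuming a straightforward serialization of the configuration object's attributes, it might look roughly like the following (the exact structure is an assumption, not the confirmed on-disk format):

```json
{
  "output_dir": "custom_benchmark_results",
  "storage": {
    "batch_sizes": [10, 50, 100]
  },
  "retrieval": {
    "similarity_thresholds": [0.6, 0.7, 0.8]
  }
}
```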
The BenchmarkResults class provides methods for analyzing benchmark results:

```python
from memory.benchmarking.results import BenchmarkResults

# Initialize results manager
results = BenchmarkResults("benchmark_results")

# List results
storage_results = results.list_results(category="storage")

# Compare results
comparison_df = results.compare_results(storage_results)

# Generate a report
report_path = results.generate_report(category="storage")

# Create visualization
fig = results.plot_comparison(
    storage_results,
    metric="throughput_ops_per_second",
    x_axis="batch_size"
)
results.save_plot(fig, "storage_throughput.png")
```

The benchmarking system can be integrated with CI/CD pipelines to automatically detect performance regressions:
```powershell
# Run benchmarks and compare with baseline
python -m memory.benchmarking.cli run --output-dir results/current
python -m memory.benchmarking.cli compare --current-dir results/current --baseline-dir results/baseline --threshold 0.1

# Exit code 2 indicates performance regression
if ($LASTEXITCODE -eq 2) {
    Write-Host "Performance regression detected!"
    exit 1
}
```
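The check above uses PowerShell; in a POSIX shell the same exit-code convention (2 indicating a regression) could be handled like this:

```bash
# Run benchmarks and compare with baseline
python -m memory.benchmarking.cli run --output-dir results/current
python -m memory.benchmarking.cli compare --current-dir results/current --baseline-dir results/baseline --threshold 0.1

# Exit code 2 indicates performance regression
if [ $? -eq 2 ]; then
    echo "Performance regression detected!"
    exit 1
fi
```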
When writing benchmarks, keep the following guidelines in mind:

- Isolation: Each benchmark should create a fresh instance of AgentMemorySystem (see the sketch after this list)
- Cleanup: Properly clean up temporary files and resources
- Parameterization: Make benchmarks configurable through parameters
- Documentation: Include detailed documentation about what each benchmark measures
- Consistency: Use consistent naming and result structures
- Error Handling: Implement proper error handling to prevent benchmark failures
- Resource Management: Be mindful of memory and CPU usage during benchmarks
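To illustrate the isolation and cleanup points, a benchmark body can wrap its measurement in try/finally so temporary resources are released even if the run fails. This is a minimal sketch following the patterns of the storage example above, not code from the package; MemoryConfig and AgentMemorySystem are referenced as in that example, with their imports omitted:

```python
import os
import tempfile
import time
from typing import Any, Dict


def benchmark_isolated_example(num_items: int = 100, **kwargs) -> Dict[str, Any]:
    """Sketch: fresh resources per run, with cleanup guaranteed by try/finally."""
    temp_db_path = os.path.join(tempfile.gettempdir(), "benchmark_isolated_example.db")
    try:
        # Isolation: build a fresh memory system for this run (names follow the storage example)
        config = MemoryConfig()
        config.ltm_config.db_path = temp_db_path
        memory_system = AgentMemorySystem(config)

        start = time.time()
        for i in range(num_items):
            memory_system.store_agent_state(
                agent_id="benchmark_agent",
                state_data={"step": i},
                step_number=i,
            )
        elapsed = time.time() - start

        return {
            "items_processed": num_items,
            "total_time_seconds": elapsed,
            "throughput_ops_per_second": num_items / elapsed if elapsed > 0 else 0,
        }
    finally:
        # Cleanup: always remove the temporary database, even when the benchmark raises
        if os.path.exists(temp_db_path):
            try:
                os.remove(temp_db_path)
            except OSError:
                pass
```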
By following these guidelines, you can create reliable and informative benchmarks that help monitor and improve the performance of the AgentMemory system.