A Python-based data analysis tool for visualizing and analyzing CPU and memory usage patterns from Parquet-formatted telemetry data. This tool provides comprehensive statistical analysis and visualization of resource utilization across different availability zones and overcommit configurations.
- Parquet Data Processing: Efficient reading and processing of large Parquet files with batch processing
- Multi-dimensional Analysis: Split and analyze data by availability zone and OS overcommit settings
- Time-series Resampling: Configurable time interval aggregation (1H, 1D, 1W, etc.)
- Comprehensive Visualizations: Six-panel analysis dashboard including:
  - VCPU utilization over time
  - Memory utilization over time
  - Pearson correlation scatter plots
  - Distance correlation analysis
  - Memory/VCPU ratio trends
  - Absolute ratio values with statistics
- Statistical Metrics:
  - Pearson correlation coefficient
  - Distance correlation
  - Descriptive statistics for all metrics
  - Standard deviation analysis
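As a quick illustration of the correlation metrics listed above, Pearson's coefficient can be computed with `scipy.stats.pearsonr`. The sample values below are hypothetical; in the tool they come from the Parquet telemetry data.

```python
import numpy as np
from scipy import stats

# Hypothetical VCPU (%) and memory (MB) samples.
vcpu = np.array([15.2, 30.1, 45.7, 60.3, 89.7])
mem_mb = np.array([2048.3, 4100.0, 6300.5, 8200.1, 14336.8])

# Linear correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(vcpu, mem_mb)
```

Pearson captures only linear association, which is why the tool also reports distance correlation for non-linear relationships.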
- Python 3.8+
- Required packages: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scipy`, `pyarrow`
- Clone the repository:

  ```shell
  git clone https://github.com/eugenezimin/data_utilization_analysis.git
  cd data_utilization_analysis
  ```

- Install dependencies:

  ```shell
  pip install pandas numpy matplotlib seaborn scipy pyarrow
  ```
```
├── main.py       # Main entry point and data processing
├── analysis.py   # Analysis and visualization functions
├── data/         # Generated CSV data files (created automatically)
├── plots/        # Generated visualization plots (created automatically)
└── README.md
```
```python
from main import read_parquet_file, split_dataframe_by_host_overcommit
import analysis

# Read Parquet file
interval = '1H'  # Options: '1H', '1D', '1W', etc.
data_frame = read_parquet_file("path/to/data.parquet", rows_to_read=100_000_000)

# Split data by availability zone and overcommit settings
dataframes = split_dataframe_by_host_overcommit(data_frame, interval)

# Run analysis on each subset
for (vm_zone, os_overcommit), df_subset in dataframes.items():
    analysis.analyzer(
        dataframe=df_subset,
        interval=interval,
        dataframe_name=f"{vm_zone}_{os_overcommit}"
    )
```

Modify the `interval` parameter in `main.py` to change the time aggregation:
- `'1H'` - Hourly aggregation
- `'1D'` - Daily aggregation
- `'1W'` - Weekly aggregation
- Other pandas-compatible frequency strings
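The interval strings are standard pandas offset aliases, so the aggregation behaves like an ordinary `resample` call. A minimal sketch with made-up data (the column name and sampling rate are illustrative):

```python
import numpy as np
import pandas as pd

# Twelve samples, one every 10 minutes, spanning two hours.
idx = pd.to_datetime(np.arange(0, 7200, 600), unit="s")
raw = pd.DataFrame({"cpu_usage": np.linspace(10.0, 21.0, len(idx))}, index=idx)

# interval = '1H' collapses the raw points into one mean per hour.
hourly = raw.resample("1H").mean()
```

Swapping `"1H"` for `"1D"` or `"1W"` changes only the bin width, not the aggregation logic.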
The Parquet file should contain the following columns:
- `timestamp` - Unix timestamp (seconds)
- `vm` - VM identifier with zone prefix (e.g., "zone1/vm123")
- `cpu_usage` - CPU usage percentage
- `mem_usage` - Memory usage in MB
- `os_overcommit` - OS overcommit configuration value
For each availability zone and overcommit combination, the tool generates:
File: `plots/complete_plots_{interval}_{dataframe_name}.png`
The visualization includes 6 panels:
- VCPU Utilization Over Time - Shows raw data points and interval means
- Memory Utilization Over Time - Shows raw data points and interval means
- Pearson Correlation Scatter - Linear relationship analysis
- Distance Correlation Scatter - Non-linear relationship detection
- Memory/VCPU Ratio Trends - Ratio analysis with mean and standard deviation
- Absolute Ratio Values - Detailed ratio statistics by time interval
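The panel arrangement can be sketched as a standard matplotlib 3x2 grid; the actual plotting calls in `analysis.py` are omitted here, and the figure size is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Six panels, laid out three rows by two columns.
fig, axes = plt.subplots(3, 2, figsize=(14, 12))
titles = [
    "VCPU Utilization Over Time", "Memory Utilization Over Time",
    "Pearson Correlation Scatter", "Distance Correlation Scatter",
    "Memory/VCPU Ratio Trends", "Absolute Ratio Values",
]
for ax, title in zip(axes.flat, titles):
    ax.set_title(title)
fig.tight_layout()
```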
Two CSV files are generated per analysis:

- Original intervals: `data/analysis_data_original_{interval}_{dataframe_name}.csv`
  - Columns: Time_Offset_Seconds, VCPU_Usage, Memory_MB, Mem_VCPU_Ratio
- Aggregated intervals: `data/analysis_data_{interval}_{dataframe_name}.csv`
  - Columns: Time_Bin_Seconds, VCPU_Usage_Mean, Memory_MB_Mean, Mem_VCPU_Ratio_Mean
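The ratio column is simply memory divided by VCPU usage per row. A minimal sketch with hypothetical values, using the documented column names:

```python
import pandas as pd

# Illustrative rows in the "original intervals" layout.
original = pd.DataFrame({
    "Time_Offset_Seconds": [0, 3600],
    "VCPU_Usage": [20.0, 40.0],
    "Memory_MB": [4000.0, 6000.0],
})

# Mem_VCPU_Ratio = Memory_MB / VCPU_Usage, computed row-wise.
original["Mem_VCPU_Ratio"] = original["Memory_MB"] / original["VCPU_Usage"]
```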
The tool prints comprehensive statistics including:
- Record counts (original and aggregated)
- Value ranges for VCPU and memory
- Correlation coefficients (Pearson and Distance)
- Detailed descriptive statistics
- File paths for generated outputs
`read_parquet_file` reads a Parquet file with an optional row limit, using batch processing.
`split_dataframe_by_host_overcommit` splits the dataframe by VM zone and OS overcommit setting, resampling to the specified interval.
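Since the zone is encoded as the prefix of the `vm` column (e.g., "zone1/vm123"), the split reduces to extracting that prefix and grouping. A minimal sketch with made-up rows (column and key names follow the documented schema; the `vm_zone` helper column is an assumption):

```python
import pandas as pd

# Hypothetical telemetry rows.
df = pd.DataFrame({
    "vm": ["zone1/vm1", "zone1/vm2", "zone2/vm3"],
    "os_overcommit": [1, 1, 2],
    "cpu_usage": [10.0, 20.0, 30.0],
})

# Extract the zone prefix, then group by (zone, overcommit) pairs.
df["vm_zone"] = df["vm"].str.split("/").str[0]
subsets = {key: group for key, group in df.groupby(["vm_zone", "os_overcommit"])}
```

Each value in `subsets` is the per-combination dataframe that the analyzer then processes independently.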
A helper function calculates the distance correlation coefficient for detecting non-linear relationships.
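The sample distance correlation from Székely et al. (2007) can be written directly in NumPy; this is a minimal sketch, not necessarily the project's implementation:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation (Székely et al., 2007), in [0, 1]."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    # Pairwise distance matrices.
    a = np.abs(x - x.T)
    b = np.abs(y - y.T)
    # Double-center each distance matrix.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2, 0.0) / denom))
```

Unlike Pearson's coefficient, this is zero only for (near-)independent samples, so it also flags non-linear dependence such as a quadratic relationship.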
A helper generates synthetic test data for development purposes.
`analysis.analyzer` is the main analysis function, generating all visualizations and statistics.
- Batch processing is used for efficient handling of large Parquet files
- Default batch size: 100,000 rows
- Memory usage scales with the number of rows read
- Visualization generation may take time for large datasets
```
DataFrame Statistics:
Total records (original intervals): 8760
Total records (1Hx10 intervals): 876
VCPU range: 15.2 - 89.7
Memory range: 2048.3 - 14336.8
Pearson correlation: 0.6234
Distance correlation: 0.5891
Complete 6-plot analysis with absolute values saved!
```
Contributions are welcome! Please feel free to submit a Pull Request.
MIT
Dr. Eugene Zimin
- Distance correlation implementation based on Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007)
- Visualization design inspired by best practices in resource monitoring and analysis