Resource Utilization Analysis Tool

A Python-based data analysis tool for visualizing and analyzing CPU and memory usage patterns from Parquet-formatted telemetry data. This tool provides comprehensive statistical analysis and visualization of resource utilization across different availability zones and overcommit configurations.

Features

  • Parquet Data Processing: Efficient reading and processing of large Parquet files with batch processing
  • Multi-dimensional Analysis: Split and analyze data by availability zone and OS overcommit settings
  • Time-series Resampling: Configurable time interval aggregation (1H, 1D, 1W, etc.)
  • Comprehensive Visualizations: Six-panel analysis dashboard including:
    • VCPU utilization over time
    • Memory utilization over time
    • Pearson correlation scatter plots
    • Distance correlation analysis
    • Memory/VCPU ratio trends
    • Absolute ratio values with statistics
  • Statistical Metrics:
    • Pearson correlation coefficient
    • Distance correlation
    • Descriptive statistics for all metrics
    • Standard deviation analysis

Prerequisites

  • Python 3.8+
  • Required packages: pandas, numpy, matplotlib, seaborn, scipy, pyarrow

Installation

  1. Clone the repository:
git clone https://github.com/eugenezimin/data_utilization_analysis.git
cd data_utilization_analysis
  2. Install dependencies:
pip install pandas numpy matplotlib seaborn scipy pyarrow

Project Structure

.
├── main.py              # Main entry point and data processing
├── analysis.py          # Analysis and visualization functions
├── data/               # Generated CSV data files (created automatically)
├── plots/              # Generated visualization plots (created automatically)
└── README.md

Usage

Basic Usage

from main import read_parquet_file, split_dataframe_by_host_overcommit
import analysis

# Read Parquet file
interval = '1H'  # Options: '1H', '1D', '1W', etc.
data_frame = read_parquet_file("path/to/data.parquet", rows_to_read=100_000_000)

# Split data by availability zone and overcommit settings
dataframes = split_dataframe_by_host_overcommit(data_frame, interval)

# Run analysis on each subset
for (vm_zone, os_overcommit), df_subset in dataframes.items():
    analysis.analyzer(
        dataframe=df_subset, 
        interval=interval, 
        dataframe_name=f"{vm_zone}_{os_overcommit}"
    )

Configuration

Modify the interval parameter in main.py to change the time aggregation:

  • '1H' - Hourly aggregation
  • '1D' - Daily aggregation
  • '1W' - Weekly aggregation
  • Other pandas-compatible frequency strings
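The interval string is passed straight to pandas time-series resampling, so any pandas offset alias works. A minimal sketch of what such an aggregation does (the variable names here are illustrative, not from main.py):

```python
import pandas as pd
import numpy as np

# Two days of hourly samples with values 0..47.
idx = pd.date_range("2024-01-01", periods=48, freq="h")
series = pd.Series(np.arange(48, dtype=float), index=idx)

# '1D' collapses each calendar day to its mean, mirroring how the
# tool aggregates raw telemetry into interval means.
daily_mean = series.resample("1D").mean()
# Two bins: mean(0..23) = 11.5 and mean(24..47) = 35.5
```

Note that pandas 2.2+ deprecates the uppercase hour alias 'H' in favor of lowercase 'h'; on recent pandas versions prefer '1h' over '1H'.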

Input Data Format

The Parquet file should contain the following columns:

  • timestamp - Unix timestamp (seconds)
  • vm - VM identifier with zone prefix (e.g., "zone1/vm123")
  • cpu_usage - CPU usage percentage
  • mem_usage - Memory usage in MB
  • os_overcommit - OS overcommit configuration value
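A tiny synthetic frame matching this schema can be handy for smoke-testing the pipeline without real telemetry (the column names come from this README; the values are made up):

```python
import pandas as pd

# Minimal example of the expected input schema.
df = pd.DataFrame({
    "timestamp": [1700000000, 1700003600],   # Unix seconds
    "vm": ["zone1/vm123", "zone2/vm456"],    # zone prefix + VM identifier
    "cpu_usage": [42.5, 17.0],               # percent
    "mem_usage": [4096.0, 2048.0],           # MB
    "os_overcommit": [1, 2],
})

# Writing it out as Parquet requires pyarrow:
# df.to_parquet("data.parquet")
```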

Output

Visualizations

For each availability zone and overcommit combination, the tool generates:

File: plots/complete_plots_{interval}_{dataframe_name}.png

The visualization includes 6 panels:

  1. VCPU Utilization Over Time - Shows raw data points and interval means
  2. Memory Utilization Over Time - Shows raw data points and interval means
  3. Pearson Correlation Scatter - Linear relationship analysis
  4. Distance Correlation Scatter - Non-linear relationship detection
  5. Memory/VCPU Ratio Trends - Ratio analysis with mean and standard deviation
  6. Absolute Ratio Values - Detailed ratio statistics by time interval

Data Files

Two CSV files are generated per analysis:

  1. Original intervals: data/analysis_data_original_{interval}_{dataframe_name}.csv

    • Columns: Time_Offset_Seconds, VCPU_Usage, Memory_MB, Mem_VCPU_Ratio
  2. Aggregated intervals: data/analysis_data_{interval}_{dataframe_name}.csv

    • Columns: Time_Bin_Seconds, VCPU_Usage_Mean, Memory_MB_Mean, Mem_VCPU_Ratio_Mean

Console Output

The tool prints comprehensive statistics including:

  • Record counts (original and aggregated)
  • Value ranges for VCPU and memory
  • Correlation coefficients (Pearson and Distance)
  • Detailed descriptive statistics
  • File paths for generated outputs

Functions Reference

main.py

read_parquet_file(file: str, rows_to_read: int=0) -> pd.DataFrame

Reads a Parquet file with optional row limit using batch processing.

split_dataframe_by_host_overcommit(df: pd.DataFrame, interval='1H') -> dict

Splits dataframe by VM zone and OS overcommit, resampling to specified interval.
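The zone is encoded as the prefix of the vm column, so the split amounts to a groupby on that prefix plus os_overcommit. A simplified sketch (without the resampling step, and with a hypothetical function name; the real implementation in main.py may differ):

```python
import pandas as pd

def split_by_zone_and_overcommit(df: pd.DataFrame) -> dict:
    """Group rows by the zone prefix of `vm` and by `os_overcommit`.

    Returns a dict keyed by (zone, overcommit) tuples, matching the
    keys iterated over in the Basic Usage example.
    """
    zone = df["vm"].str.split("/").str[0]  # "zone1/vm123" -> "zone1"
    return {key: sub for key, sub in df.groupby([zone, df["os_overcommit"]])}
```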

analysis.py

distance_correlation(x, y) -> float

Calculates distance correlation coefficient for detecting non-linear relationships.
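The standard formulation from Székely et al. (2007) double-centers the pairwise distance matrices and normalizes the distance covariance by the distance variances. A self-contained numpy sketch of that definition (the actual code in analysis.py may differ in detail):

```python
import numpy as np

def distance_correlation(x, y) -> float:
    """Distance correlation of two 1-D samples (Székely et al., 2007)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Pairwise absolute distance matrices.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double-center: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()

    dcov2 = max((A * B).mean(), 0.0)  # clamp tiny negative float error
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0
```

For any strictly monotone linear relationship the result is exactly 1, while it is 0 only for independent samples, which is what makes it useful for catching non-linear dependence that Pearson misses.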

generate_dataframe(length: int) -> pd.DataFrame

Generates synthetic test data for development purposes.

analyzer(dataframe: pd.DataFrame, interval: str, dataframe_name: str)

Main analysis function that generates all visualizations and statistics.

Performance Considerations

  • Batch processing is used for efficient handling of large Parquet files
  • Default batch size: 100,000 rows
  • Memory usage scales with the number of rows read
  • Visualization generation may take time for large datasets

Example Output

DataFrame Statistics:
Total records (original intervals): 8760
Total records (1Hx10 intervals): 876
VCPU range: 15.2 - 89.7
Memory range: 2048.3 - 14336.8
Pearson correlation: 0.6234
Distance correlation: 0.5891

Complete 6-plot analysis with absolute values saved!

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

Authors

Dr. Eugene Zimin

Acknowledgments

  • Distance correlation implementation based on Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007)
  • Visualization design inspired by best practices in resource monitoring and analysis

About

This script analyzes the correlation between vCPU and memory utilization for each availability zone and overcommit setting.
