
Updates/20251204-01 #605

Open
stewartshea wants to merge 4 commits into runwhen-contrib:main from stewartshea:updates/20251204-01


@stewartshea stewartshea commented Dec 4, 2025

  • Added comprehensive analysis for server errors, throttling, user errors, and storage capacity in the service bus metrics script, providing detailed metrics and recommendations for each issue.
  • Improved queue and topic health scripts to include analysis of disabled queues/topics and message backlog, enhancing visibility into operational issues.
  • Introduced context, investigation steps, and recommendations for each identified issue, aiding in troubleshooting and resolution.
  • Updated runbook to reflect new timeout settings for improved reliability during execution of cost health analysis tasks.

Note

Enhances Service Bus diagnostics, introduces data-driven storage savings (tiering/redundancy) and improved VM underutilization logic, and raises runbook timeouts for cost analyses.

  • Service Bus Health:
    • Namespace Metrics (service_bus_metrics.sh): Adds detailed calculations and rich context for ServerErrors, ThrottledRequests, UserErrors, and Size (totals/max/averages, error rates, message imbalance) with actionable recommendations and updated issue titles.
    • Queue Health (service_bus_queue_health.sh): Adds disabled-queue detection with counts/timestamps; expands backlog and size analyses with additional metrics (scheduled/transfer counts, config context) and structured guidance.
    • Topic & Subscriptions (service_bus_topic_health.sh): Adds disabled-topic analysis; augments topic size checks (subscriptions, partitioning, status) and subscription checks (dead-letter, backlogs, disabled state) with detailed remediation steps.
  • Storage Cost Optimization (analyze_storage_optimization.sh):
    • Introduces blob tier pricing helpers and capacity retrieval via Azure Monitor; computes per-account potential savings for missing lifecycle policies (tiering to Cool/Archive) and geo-redundancy downgrades, rolling up monthly/annual estimates and severity; enriches issue titles with savings.
  • VM Optimization (analyze_vm_optimization.sh):
    • Refines underutilization detection when memory metrics are unavailable (average-CPU–based with peak guard), improves logging/recommendations, and surfaces savings/context in outputs.
  • Runbook (runbook.robot):
    • Increases timeouts to 900s across tasks for more reliable long-running analyses.

Written by Cursor Bugbot for commit 63566f7. This will update automatically on new commits.

- Added logging for VM names during optimization analysis to improve traceability.
- Updated CPU-only analysis logic to use average CPU as the primary metric when memory metrics are unavailable, refining underutilization detection.
- Enhanced reporting for underutilized VMs, providing clearer recommendations based on average and peak CPU metrics.
- Improved documentation within the script to clarify the analysis approach and thresholds used for identifying underutilization.
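
The CPU-only rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `analyze_vm_optimization.sh`: the function name `is_underutilized_cpu_only`, the variable names, and both threshold values are assumptions chosen for the example, and CPU values are taken to be whole-number percentages.

```shell
# Placeholder thresholds for illustration only; not the values used in the PR.
AVG_CPU_THRESHOLD=${AVG_CPU_THRESHOLD:-20}
PEAK_CPU_GUARD=${PEAK_CPU_GUARD:-80}

# Hypothetical helper: succeeds (exit 0) when a VM looks underutilized
# based on CPU alone, i.e. when memory metrics are unavailable.
#   $1 = average CPU percent (integer), $2 = peak CPU percent (integer)
is_underutilized_cpu_only() {
  avg_cpu=$1
  peak_cpu=$2
  # Average CPU is the primary signal; the peak check guards against
  # flagging VMs that are mostly idle but have heavy bursts.
  [ "$avg_cpu" -lt "$AVG_CPU_THRESHOLD" ] && [ "$peak_cpu" -lt "$PEAK_CPU_GUARD" ]
}

if is_underutilized_cpu_only 5 30; then
  echo "underutilized (avg 5%, peak 30%)"
fi
```

Keeping the thresholds in overridable variables lets the guard be tuned per environment without editing the check itself.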
…nalysis

- Added comprehensive analysis for server errors, throttling, user errors, and storage capacity in the service bus metrics script, providing detailed metrics and recommendations for each issue.
- Improved queue and topic health scripts to include analysis of disabled queues/topics and message backlog, enhancing visibility into operational issues.
- Introduced context, investigation steps, and recommendations for each identified issue, aiding in troubleshooting and resolution.
- Updated runbook to reflect new timeout settings for improved reliability during execution of cost health analysis tasks.
@stewartshea stewartshea requested a review from a team as a code owner December 4, 2025 22:04

@cursor cursor bot left a comment


Bug: Undefined variables used in subscription backlog analysis

The MESSAGE BACKLOG ANALYSIS section references the `$status` and `$max_delivery_count` variables, which are defined only inside the dead-letter check block (lines 272-273). When a subscription has a high active message count but not a high dead-letter count, these variables are unset. Because the script uses `set -u`, expanding them aborts the script. `service_bus_queue_health.sh` correctly fetches these variables within the active message count block, but the topic health script does not follow that pattern.

codebundles/azure-servicebus-health/service_bus_topic_health.sh#L332-L334

- Topic: $topic_name
- Subscription Status: $status
- Max Delivery Count: $max_delivery_count
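
A minimal sketch of the failure mode and one defensive guard, assuming a hypothetical helper `backlog_context` that is not part of the PR. Under `set -u`, a bare `$status` aborts the script when the variable was never assigned, whereas `${status:-unknown}` substitutes a default. The fix the review points to (fetching the values inside the backlog block, as the queue script does) remains the more informative option; the default only prevents the crash.

```shell
set -u

# Simulate the path where the dead-letter block never ran,
# so neither variable was assigned.
unset status max_delivery_count 2>/dev/null || true

# Hypothetical helper: prints the context lines for a backlog issue.
#   $1 = topic name
backlog_context() {
  # ${var:-default} expands safely under set -u even when var is unset.
  printf '%s\n' \
    "- Topic: $1" \
    "- Subscription Status: ${status:-unknown}" \
    "- Max Delivery Count: ${max_delivery_count:-unknown}"
}

backlog_context "orders"
```

With both variables unset, the status and delivery-count lines fall back to `unknown` instead of terminating the script.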



- Updated the message imbalance calculation to use `bc` for float-safe arithmetic, enhancing accuracy in metrics analysis.
- Improved comments for clarity on the calculation process, ensuring better understanding of the script's functionality.
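
For context, shell integer arithmetic (`$(( ))`) rejects decimal values, which is why a float-capable tool is needed when metrics come back as numbers such as `1500.5`. The sketch below shows one way this could look. The helper names `float_calc` and `imbalance_pct`, the percentage formula, and the `awk` fallback are assumptions of this example and are not taken from `service_bus_metrics.sh`.

```shell
set -u

# Evaluate an arithmetic expression to two decimal places.
# Uses bc, as the commit does; the awk fallback is an addition of this
# sketch for hosts where bc is not installed.
float_calc() {
  if command -v bc >/dev/null 2>&1; then
    echo "scale=2; $1" | bc -l
  else
    awk "BEGIN { printf \"%.2f\\n\", ($1) }"
  fi
}

# Hypothetical helper: difference between incoming and outgoing messages
# as a percentage of incoming. Prints 0 when there is no incoming traffic,
# which also avoids a division by zero.
imbalance_pct() {
  incoming=${1:-0}
  outgoing=${2:-0}
  case $incoming in
    ''|0|0.0|0.00) echo 0; return ;;
  esac
  float_calc "($incoming - $outgoing) * 100 / $incoming"
}

imbalance_pct 200 100
```

Note that `bc` truncates at the chosen scale while `awk`'s `printf` rounds, so the two paths can differ in the last digit for inputs that do not divide evenly.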
local savings_note="N/A"

# Calculate potential savings if we have capacity data
if [[ "$access_tier" == "Hot" ]] && (( $(echo "$capacity_gb > 0" | bc -l) )); then

Bug: Hot tier accounts miscounted when capacity data unavailable

The `hot_tier_accounts` counter is incremented only when the access tier is "Hot" and `capacity_gb > 0`. Hot tier storage accounts whose capacity metrics are unavailable (returning 0) are therefore not counted. Later, at lines 601-608, when `hot_tier_accounts` is 0, the message states "No Hot tier accounts found" even though Hot tier accounts may exist and merely lack capacity data. This produces misleading output and an incorrect severity assignment.

Additional Locations (1)
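
One possible shape for a fix is to count Hot tier accounts independently of whether capacity data exists, and to track the subset with usable capacity separately. The sketch below is illustrative only: `summarize_hot_tier`, `hot_tier_with_capacity`, the whitespace-separated input format, and the message wording are assumptions, not code from `analyze_storage_optimization.sh`.

```shell
# Hypothetical helper: reads "name access_tier capacity_gb" lines from stdin
# and reports on Hot tier accounts.
summarize_hot_tier() {
  hot_tier_accounts=0
  hot_tier_with_capacity=0
  while read -r name access_tier capacity_gb; do
    [ "$access_tier" = "Hot" ] || continue
    # Count every Hot account, regardless of capacity data.
    hot_tier_accounts=$((hot_tier_accounts + 1))
    # In this sketch, an empty or zero capacity means metrics were unavailable.
    case ${capacity_gb:-0} in
      0|0.0|0.00) ;;
      *) hot_tier_with_capacity=$((hot_tier_with_capacity + 1)) ;;
    esac
  done
  if [ "$hot_tier_accounts" -eq 0 ]; then
    echo "No Hot tier accounts found"
  elif [ "$hot_tier_with_capacity" -eq 0 ]; then
    echo "$hot_tier_accounts Hot tier account(s) found, but capacity data is unavailable"
  else
    echo "$hot_tier_with_capacity of $hot_tier_accounts Hot tier account(s) have capacity data for savings estimates"
  fi
}

printf 'acct1 Hot 0\nacct2 Cool 120.5\n' | summarize_hot_tier
```

Separating the two counters lets the "none found" message fire only when no Hot tier account exists at all, while the savings estimate can still be limited to accounts with capacity data.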


- Updated the service bus metrics script to ensure that calculations for total errors, throttled requests, incoming messages, outgoing messages, and user errors default to zero when no data is available, improving reliability and preventing potential errors during execution.
- Enhanced the analysis of storage metrics to include similar default handling, ensuring consistent behavior across the script and better handling of edge cases.
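
A rough illustration of the default-to-zero pattern this commit describes, assuming a hypothetical helper `sum_or_zero` and metric values supplied one per line as plain non-negative integers without leading zeros. The real script's data source and parsing are not shown in this conversation, so this only demonstrates the idea: an empty result, or one containing only `null` entries, yields `0` rather than an empty string that would break later arithmetic.

```shell
set -u

# Hypothetical helper: sums integer values read from stdin, one per line.
# Empty lines, "null", and anything non-numeric are skipped, so the
# result is always a number, and 0 when no usable data is present.
sum_or_zero() {
  total=0
  while read -r value; do
    case $value in
      ''|null|*[!0-9]*) continue ;;
    esac
    total=$((total + value))
  done
  echo "$total"
}

# No data points at all: prints 0.
printf '' | sum_or_zero

# Mixed data with a null entry: the null is ignored.
printf '3\nnull\n4\n' | sum_or_zero
```

For a value that is already in a variable, the parameter expansion `${total_errors:-0}` gives the same protection in a single step.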