This guide provides practical techniques for debugging compute kernels, validating correctness, and troubleshooting common issues.
Enable comprehensive debugging during development using logging and performance monitoring:
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using DotCompute.Runtime;
var host = Host.CreateApplicationBuilder(args);
// Configure detailed logging for debugging
host.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Debug);
logging.AddFilter("DotCompute", LogLevel.Trace); // Verbose DotCompute logging
});
// Add DotCompute services with performance monitoring
host.Services.AddDotComputeRuntime();
host.Services.AddPerformanceMonitoring(); // Enable metrics collection
var app = host.Build();
Behavior:
- Detailed logging of kernel execution
- Performance metrics collection
- Memory usage tracking
- Helps identify issues during development
For CI/CD environments, use appropriate logging levels:
host.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Information); // Less verbose for CI
});
host.Services.AddDotComputeRuntime();
Behavior:
- Standard logging level for test runs
- Captures important events and errors
- Suitable for automated testing
Use minimal logging in production:
host.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Warning); // Only warnings and errors
});
host.Services.AddDotComputeRuntime();
Behavior:
- Only logs warnings and errors
- Minimal overhead
- Safe for production use
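The three configurations above can also be driven by the host environment instead of being edited by hand. The following is a minimal sketch, assuming the standard Microsoft.Extensions.Hosting environment names (set through DOTNET_ENVIRONMENT) and treating any environment other than Development or Production as a CI-style run; that mapping is an assumption, not something DotCompute prescribes.

```csharp
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var builder = Host.CreateApplicationBuilder(args);

// Assumed mapping: Development -> Debug, Production -> Warning,
// anything else (for example a CI environment) -> Information.
LogLevel minimumLevel =
    builder.Environment.IsDevelopment() ? LogLevel.Debug :
    builder.Environment.IsProduction() ? LogLevel.Warning :
    LogLevel.Information;

builder.Logging.SetMinimumLevel(minimumLevel);

// Verbose DotCompute tracing only while developing.
if (builder.Environment.IsDevelopment())
{
    builder.Logging.AddFilter("DotCompute", LogLevel.Trace);
}

// Register the DotCompute services here, as in the snippets above.
var app = builder.Build();
```

When DOTNET_ENVIRONMENT is unset, the host defaults to Production, so the quietest configuration is the fallback.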
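Cross-backend validation is the most powerful debugging technique: it runs the same kernel on two backends and compares the outputs element by element.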
var debugService = services.GetRequiredService<IKernelDebugService>();
var validation = await debugService.ValidateCrossBackendAsync(
kernelName: "MyKernel",
parameters: new { input, output },
primaryBackend: AcceleratorType.CUDA, // GPU implementation
referenceBackend: AcceleratorType.CPU // Trusted reference
);
if (!validation.IsValid)
{
Console.WriteLine($"❌ Validation FAILED");
Console.WriteLine($"Found {validation.Differences.Count} differences");
Console.WriteLine($"Severity: {validation.Severity}");
Console.WriteLine($"Recommendation: {validation.Recommendation}");
// Print first 10 differences
foreach (var diff in validation.Differences.Take(10))
{
Console.WriteLine(
$" Index {diff.Index}: " +
$"GPU={diff.PrimaryValue:F6}, " +
$"CPU={diff.ReferenceValue:F6}, " +
$"Error={diff.RelativeError:E2}"
);
}
}
else
{
Console.WriteLine($"✅ Validation PASSED");
Console.WriteLine($"GPU speedup: {validation.Speedup:F2}x");
}
Valid Result (all differences within tolerance):
✅ Validation PASSED
GPU speedup: 47.32x
No differences found (tolerance: 1e-5)
Invalid Result (differences exceed tolerance):
❌ Validation FAILED
Found 127 differences
Severity: Medium
Recommendation: Check for race conditions in parallel sections
First 10 differences:
Index 42: GPU=3.141593, CPU=3.141592, Error=3.18e-07
Index 108: GPU=2.718282, CPU=2.718281, Error=3.68e-07
...
// Strict (default for testing)
options.ToleranceThreshold = 1e-5; // 0.001% relative error
// Lenient (for accumulating operations)
options.ToleranceThreshold = 1e-3; // 0.1% relative error
// Very lenient (for known precision issues)
options.ToleranceThreshold = 1e-2; // 1% relative error
Rule of Thumb:
- Simple operations (add, multiply): 1e-5
- Accumulating operations (sum, dot product): 1e-3
- Transcendental functions (sin, exp, log): 1e-4
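To make the thresholds concrete, here is a minimal sketch of a relative-error check. It is an illustration only: the ToleranceCheck class and its methods are hypothetical, and the guide does not state how DotCompute computes the comparison internally (for example, how it treats a reference value of zero).

```csharp
using System;

public static class ToleranceCheck
{
    // Relative error |actual - expected| / |expected|.
    // Assumption: fall back to the absolute error when the reference is zero.
    public static double RelativeError(double actual, double expected)
    {
        double diff = Math.Abs(actual - expected);
        double scale = Math.Abs(expected);
        return scale > 0 ? diff / scale : diff;
    }

    // NaN never compares <= tolerance, so a NaN result always fails the check.
    public static bool WithinTolerance(double actual, double expected, double tolerance)
        => RelativeError(actual, expected) <= tolerance;
}
```

With this definition, a result of 1.01 against a reference of 1.0 has a relative error of about 1e-2, so it fails the strict and lenient thresholds and sits right at the edge of the very lenient one.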
var determinism = await debugService.TestDeterminismAsync(
kernelName: "MyKernel",
parameters: new { input, output },
backend: AcceleratorType.CUDA,
runs: 100 // Run 100 times with same input
);
if (!determinism.IsDeterministic)
{
Console.WriteLine($"⚠️ Kernel is NON-DETERMINISTIC!");
Console.WriteLine($"Found {determinism.Violations.Count} violations");
Console.WriteLine($"Likely cause: {determinism.Cause}");
// Show some violations
foreach (var violation in determinism.Violations.Take(5))
{
Console.WriteLine(
$" Run {violation.RunIndex}, " +
$"Index {violation.ElementIndex}: " +
$"Expected {violation.ExpectedValue}, " +
$"Got {violation.ActualValue}"
);
}
}
else
{
Console.WriteLine("✅ Kernel is deterministic");
}
1. Race Conditions:
// ❌ Race condition: Multiple threads writing same location
[Kernel]
public static void HasRaceCondition(Span<float> output)
{
int idx = Kernel.ThreadId.X;
output[0] += idx; // Race! All threads write to output[0]
}
// ✅ Fixed: Each thread writes unique location
[Kernel]
public static void NoRaceCondition(Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] += idx; // Each thread has unique index
}
}
2. Unordered Reduction:
// ❌ Non-deterministic: Floating-point addition is not associative
[Kernel]
public static void UnorderedSum(ReadOnlySpan<float> input, Span<float> partialSums)
{
int idx = Kernel.ThreadId.X;
float sum = 0;
// Different thread scheduling = different accumulation order = different result
for (int i = idx; i < input.Length; i += Kernel.GridDim.X)
{
sum += input[i];
}
partialSums[Kernel.BlockId.X] = sum;
}
Solution: Accumulate in a fixed order, use Kahan summation to shrink the rounding error, or accept small non-determinism
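Kahan (compensated) summation is sketched below as plain host-side C#. It is not a DotCompute kernel, and the CompensatedSum class name is hypothetical. The technique carries a running correction term for the low-order bits lost in each addition. It greatly reduces the order-dependent rounding error, but it does not by itself guarantee bit-identical results across different accumulation orders.

```csharp
using System;

public static class CompensatedSum
{
    // Kahan summation: tracks the rounding error of each addition
    // and feeds it back into the next one.
    public static float KahanSum(ReadOnlySpan<float> values)
    {
        float sum = 0f;
        float compensation = 0f; // running estimate of lost low-order bits

        foreach (float value in values)
        {
            float corrected = value - compensation;    // apply the previous correction
            float next = sum + corrected;              // low-order bits may be lost here
            compensation = (next - sum) - corrected;   // recover what was lost
            sum = next;
        }

        return sum;
    }
}
```

A case where the difference is visible: adding one hundred values of 1f to 1e8f with a plain loop leaves the total at 1e8f, because each 1f is smaller than half the spacing between adjacent floats at that magnitude, whereas the compensated version stays close to the true total.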
Symptoms:
- GPU produces different results than expected
- Cross-backend validation fails
- Results are NaN or Inf
Debug Steps:
Step 1: Validate against CPU
var validation = await debugService.ValidateCrossBackendAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
AcceleratorType.CPU
);
Step 2: Check for common issues
// Check for NaN/Inf
if (result.Any(float.IsNaN))
{
Console.WriteLine("❌ Result contains NaN");
// Causes: Division by zero, sqrt of negative, log of negative
}
if (result.Any(float.IsInfinity))
{
Console.WriteLine("❌ Result contains Infinity");
// Causes: Overflow, division by zero
}
Step 3: Validate numerical stability
var stability = await debugService.ValidateNumericalStabilityAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA
);
if (!stability.IsStable)
{
Console.WriteLine($"⚠️ Numerical instability detected");
Console.WriteLine($"NaN count: {stability.NaNCount}");
Console.WriteLine($"Inf count: {stability.InfCount}");
Console.WriteLine($"Overflow count: {stability.OverflowCount}");
}
Common Causes:
- Missing bounds check
- Race condition
- Uninitialized memory
- Integer overflow
- Division by zero
Symptoms:
- Kernel is slower than expected
- GPU slower than CPU
- Performance varies widely
Debug Steps:
Step 1: Profile the kernel
var profile = await debugService.ProfileKernelAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
iterations: 1000
);
Console.WriteLine($"Average: {profile.AverageTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds:F2}μs");
Console.WriteLine($"Min/Max: {profile.MinTime.TotalMicroseconds:F2}μs / {profile.MaxTime.TotalMicroseconds:F2}μs");
// High std dev indicates variable performance
if (profile.StandardDeviation.TotalMilliseconds > profile.AverageTime.TotalMilliseconds * 0.1)
{
Console.WriteLine("⚠️ High variability in execution time");
}
Step 2: Analyze memory patterns
var memoryReport = await debugService.AnalyzeMemoryPatternsAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA
);
Console.WriteLine($"Sequential access: {memoryReport.SequentialAccessRate:P1}");
Console.WriteLine($"Cache hit rate: {memoryReport.CacheHitRate:P1}");
Console.WriteLine($"Bandwidth utilization: {memoryReport.BandwidthUtilization:P1}");
foreach (var suggestion in memoryReport.Suggestions)
{
Console.WriteLine($"💡 {suggestion}");
}
Step 3: Compare backends
var cpuTime = await BenchmarkBackend(AcceleratorType.CPU);
var gpuTime = await BenchmarkBackend(AcceleratorType.CUDA);
Console.WriteLine($"CPU: {cpuTime:F2}ms");
Console.WriteLine($"GPU: {gpuTime:F2}ms");
if (gpuTime > cpuTime)
{
Console.WriteLine("⚠️ GPU is slower than CPU!");
Console.WriteLine("Possible causes:");
Console.WriteLine(" - Data too small (< 10,000 elements)");
Console.WriteLine(" - Memory-bound operation");
Console.WriteLine(" - Transfer overhead dominates");
}
Common Causes:
- Poor memory access pattern
- Too many branches
- Low parallelism
- Small data size
- Transfer overhead
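Step 3 calls a BenchmarkBackend helper that this guide does not define. One possible shape for such a helper is sketched below, using only the .NET base class library. TimingHelper and MeasureAsync are hypothetical names, and the action passed in is assumed to launch the kernel on the backend under test and wait for it to finish.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class TimingHelper
{
    // Times an async action and returns the mean and standard deviation in milliseconds.
    // Warm-up runs are discarded so one-time costs (such as compilation) do not skew the mean.
    public static async Task<(double MeanMs, double StdDevMs)> MeasureAsync(
        Func<Task> action, int warmup = 3, int iterations = 20)
    {
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        for (int i = 0; i < warmup; i++)
        {
            await action();
        }

        var samples = new double[iterations];
        for (int i = 0; i < iterations; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            await action();
            stopwatch.Stop();
            samples[i] = stopwatch.Elapsed.TotalMilliseconds;
        }

        double mean = samples.Average();
        double variance = samples.Sum(s => (s - mean) * (s - mean)) / iterations;
        return (mean, Math.Sqrt(variance));
    }
}
```

Timing the action end to end means host-to-device transfers are included whenever the action performs them, which is the quantity that matters when deciding whether the GPU is worth using for a given data size.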
Symptoms:
- Kernel passes sometimes, fails other times
- Non-deterministic results
- Hard to reproduce
Debug Steps:
Step 1: Test determinism
var determinism = await debugService.TestDeterminismAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
runs: 100
);
if (!determinism.IsDeterministic)
{
Console.WriteLine($"❌ Non-deterministic (cause: {determinism.Cause})");
}
Step 2: Stress test
var stressTest = await debugService.StressTestKernelAsync(
"MyKernel",
inputGenerator: new RandomInputGenerator(),
backend: AcceleratorType.CUDA,
iterations: 10_000
);
Console.WriteLine($"Success rate: {stressTest.SuccessRate:P1}");
Console.WriteLine($"Failures: {stressTest.FailureCount}");
if (stressTest.FailureCount > 0)
{
Console.WriteLine("Sample failures:");
foreach (var failure in stressTest.Failures.Take(5))
{
Console.WriteLine($" Input: {failure.Input}");
Console.WriteLine($" Error: {failure.Error}");
}
}
Step 3: Detect race conditions
var raceReport = await debugService.DetectRaceConditionsAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
concurrentExecutions: 100
);
if (raceReport.HasRaceConditions)
{
Console.WriteLine($"❌ Race conditions detected");
Console.WriteLine($"Conflicts: {raceReport.ConflictCount}");
foreach (var conflict in raceReport.Conflicts.Take(5))
{
Console.WriteLine($" Location: {conflict.MemoryLocation}");
Console.WriteLine($" Threads: {string.Join(", ", conflict.ConflictingThreads)}");
}
}
Common Causes:
- Race conditions
- Unordered reduction
- Thread-unsafe operations
- Shared memory conflicts
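The first and third causes share one mechanism: a read-modify-write on shared state that is not atomic. The host-side sketch below reproduces it with Parallel.For. It is an analogy rather than DotCompute kernel code, the RaceDemo class is hypothetical, and this guide does not say whether atomic operations are available inside kernels.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class RaceDemo
{
    // Unsynchronized increment: updates from different threads can overwrite
    // each other, so the result may be less than n and can vary between runs.
    public static int UnsafeCount(int n)
    {
        int counter = 0;
        Parallel.For(0, n, _ => { counter++; });
        return counter;
    }

    // Atomic increment: every update is applied, so the result is always n.
    public static int SafeCount(int n)
    {
        int counter = 0;
        Parallel.For(0, n, _ => { Interlocked.Increment(ref counter); });
        return counter;
    }
}
```

The unsafe version often returns the correct value for small n and only fails under load, which is why intermittent failures of this kind tend to surface in stress tests rather than in unit tests.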
Symptoms:
- OutOfMemoryException thrown
- Kernel fails to allocate buffers
- System becomes unresponsive
Debug Steps:
Step 1: Check memory usage
var memoryStats = memoryManager.GetStatistics();
// Divide by 1024.0 so integer byte counts are not truncated to whole megabytes
Console.WriteLine($"Total allocated: {memoryStats.TotalAllocated / 1024.0 / 1024.0:F2} MB");
Console.WriteLine($"Total pooled: {memoryStats.TotalPooled / 1024.0 / 1024.0:F2} MB");
Console.WriteLine($"Active buffers: {memoryStats.ActiveBuffers}");
Console.WriteLine($"Peak usage: {memoryStats.PeakUsage / 1024.0 / 1024.0:F2} MB");
Console.WriteLine($"Pool hit rate: {memoryStats.HitRate:P1}");
Step 2: Check GPU memory
var accelerator = await acceleratorManager.GetOrCreateAcceleratorAsync(AcceleratorType.CUDA);
var deviceStats = accelerator.GetMemoryStatistics();
Console.WriteLine($"Total GPU memory: {deviceStats.TotalMemory / 1024.0 / 1024.0:F2} MB");
Console.WriteLine($"Used GPU memory: {deviceStats.UsedMemory / 1024.0 / 1024.0:F2} MB");
Console.WriteLine($"Free GPU memory: {deviceStats.FreeMemory / 1024.0 / 1024.0:F2} MB");
if (deviceStats.FreeMemory < 100 * 1024 * 1024) // < 100 MB
{
Console.WriteLine("⚠️ Low GPU memory!");
}
Step 3: Find memory leaks
// Track allocations: take a fresh snapshot immediately before the run
var initialActiveBuffers = memoryManager.GetStatistics().ActiveBuffers;
// Run kernel
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
// Force GC
GC.Collect();
GC.WaitForPendingFinalizers();
var finalActiveBuffers = memoryManager.GetStatistics().ActiveBuffers;
if (finalActiveBuffers > initialActiveBuffers)
{
Console.WriteLine($"⚠️ Memory leak detected!");
Console.WriteLine($"Leaked buffers: {finalActiveBuffers - initialActiveBuffers}");
}
Solutions:
- Use using statements for buffers
- Return buffers to pool
- Reduce batch size
- Use streaming for large data
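The pooling and streaming suggestions can be combined, as in the sketch below. It uses the .NET ArrayPool<T> type as a stand-in, because DotCompute's own buffer and pool types are not shown in this guide. The ChunkedProcessing class and the doubling operation are hypothetical placeholders for real kernel work.

```csharp
using System;
using System.Buffers;

public static class ChunkedProcessing
{
    // Doubles every element of input into output, one chunk at a time,
    // reusing a single pooled scratch buffer instead of allocating per chunk.
    public static void DoubleInChunks(ReadOnlySpan<float> input, Span<float> output, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (output.Length < input.Length) throw new ArgumentException("Output is too small.", nameof(output));

        // Rent may return an array larger than requested; only the first 'count' items are used.
        float[] scratch = ArrayPool<float>.Shared.Rent(chunkSize);
        try
        {
            for (int offset = 0; offset < input.Length; offset += chunkSize)
            {
                int count = Math.Min(chunkSize, input.Length - offset);
                input.Slice(offset, count).CopyTo(scratch);

                for (int i = 0; i < count; i++)
                {
                    scratch[i] *= 2f;
                }

                scratch.AsSpan(0, count).CopyTo(output.Slice(offset, count));
            }
        }
        finally
        {
            // Always return the buffer, even if processing throws.
            ArrayPool<float>.Shared.Return(scratch);
        }
    }
}
```

The try/finally plays the same role as a using statement on a disposable buffer: the memory goes back to the pool on every exit path, and peak usage is bounded by the chunk size rather than the total data size.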
[Kernel]
public static void DebugPrint(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
// Only works on CPU backend
if (idx < 10) // Print first 10 threads
{
Console.WriteLine($"Thread {idx}: input={input[idx]}");
}
if (idx < output.Length)
{
output[idx] = input[idx] * 2;
}
}
// Force CPU execution for debugging
await orchestrator.ExecuteKernelAsync(
"DebugPrint",
parameters,
forceBackend: AcceleratorType.CPU
);
Note: Console.WriteLine only works on CPU backend
// Create known-good output
var goldenOutput = ComputeExpectedOutput(input);
// Test kernel against golden reference
var validation = await debugService.ValidateAgainstGoldenAsync(
"MyKernel",
parameters: new { input },
expectedOutput: goldenOutput,
backend: AcceleratorType.CUDA
);
if (!validation.IsValid)
{
Console.WriteLine($"❌ Failed to match golden reference");
Console.WriteLine($"Differences: {validation.Differences.Count}");
}
[Fact]
public async Task MyKernel_ProducesSameResultsAsPreviousVersion()
{
// Load results from previous version
var previousResults = LoadPreviousResults("v0.1.0");
// Execute current version
var currentResults = await orchestrator.ExecuteKernelAsync(
"MyKernel",
parameters
);
// Compare
Assert.Equal(previousResults, currentResults);
}
Diagnostic Warnings:
- DC001-DC012 diagnostics show as error squiggles
- Hover for quick explanation
- Click lightbulb for automated fixes
Debugging:
- Set breakpoints in kernel code (CPU only)
- Step through execution
- Watch variables
- Call stack shows kernel invocation
C# Dev Kit Extension:
code --install-extension ms-dotnettools.csdevkit
Features:
- Same diagnostics as Visual Studio
- Quick fixes via lightbulb
- IntelliSense for generated code
services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Debug);
// Filter to DotCompute only
logging.AddFilter("DotCompute", LogLevel.Trace);
});
[Trace] DotCompute.Core.KernelExecutionService: Discovering kernel 'VectorAdd'
[Debug] DotCompute.Core.KernelExecutionService: Backend selection: DataSize=4000000, Intensity=Low
[Debug] DotCompute.Core.KernelExecutionService: Selected backend: CPU (rule: small data)
[Trace] DotCompute.Backends.CPU.CpuAccelerator: Compiling kernel 'VectorAdd' (SIMD=AVX2)
[Debug] DotCompute.Memory.UnifiedMemoryManager: Allocated 4.00 MB from pool (hit rate: 92.3%)
[Info] DotCompute.Core.KernelExecutionService: Executed 'VectorAdd' in 2.34ms
public class CustomDiagnostics
{
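private readonly ILogger<CustomDiagnostics> _logger;
// The logger is supplied through constructor injection.
// GetService, ProfileKernel, and ValidateKernel are helper methods not shown here.
public CustomDiagnostics(ILogger<CustomDiagnostics> logger) => _logger = logger;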
public async Task DiagnoseKernel(string kernelName, object parameters)
{
_logger.LogInformation("=== Diagnostics for {Kernel} ===", kernelName);
// 1. Check kernel exists
var registry = GetService<IKernelRegistry>();
var metadata = registry.GetKernel(kernelName);
if (metadata == null)
{
_logger.LogError("❌ Kernel not found: {Kernel}", kernelName);
return;
}
_logger.LogInformation("✅ Kernel found: {Namespace}.{Type}.{Method}",
metadata.Namespace, metadata.DeclaringType, metadata.Name);
// 2. Check backend availability
var manager = GetService<IAcceleratorManager>();
var availableBackends = manager.GetAvailableBackends();
_logger.LogInformation("Available backends: {Backends}",
string.Join(", ", availableBackends));
// 3. Profile execution
var profile = await ProfileKernel(kernelName, parameters);
_logger.LogInformation("Average time: {Time:F2}μs", profile.AverageTime.TotalMicroseconds);
// 4. Validate correctness
var validation = await ValidateKernel(kernelName, parameters);
if (validation.IsValid)
{
_logger.LogInformation("✅ Validation passed");
}
else
{
_logger.LogWarning("⚠️ Validation failed: {Count} differences",
validation.Differences.Count);
}
_logger.LogInformation("=== Diagnostics complete ===");
}
}
- Enable debug validation in development - Catches issues early
- Use cross-backend validation - Most reliable correctness check
- Test determinism for critical kernels - Avoid subtle bugs
- Profile before and after optimization - Verify improvements
- Use golden reference tests - Prevent regressions
- Log diagnostic information - Helps troubleshoot production issues
- Don't disable validation in tests - May miss correctness issues
- Don't ignore analyzer warnings - DC001-DC012 catch real problems
- Don't assume GPU is correct - Validate against CPU
- Don't skip stress testing - Catches intermittent issues
- Don't forget to dispose buffers - Causes memory leaks
When a kernel misbehaves:
- Enable debug validation
- Run cross-backend validation
- Check for NaN/Inf in results
- Test determinism (run 100 times)
- Profile performance (check for anomalies)
- Analyze memory access patterns
- Check for race conditions
- Verify bounds checking
- Test with small, known inputs
- Review analyzer warnings (DC001-DC012)
- Check memory usage (no leaks)
- Compare CPU vs GPU results
- Kernel Development Guide - Writing correct kernels
- Performance Tuning Guide - Optimization techniques
- Architecture: Debugging System - Technical details
- Diagnostic Rules Reference - DC001-DC012 reference
- Troubleshooting Guide - Common issues and solutions
Debug Early • Validate Often • Trust But Verify