Analyzing Performance Reports with Intel VTune Profiler: A Detailed Example
Intel VTune Profiler is a powerful performance analysis tool that helps identify bottlenecks in your applications. Let me walk you through analyzing a concrete performance report.
Example Scenario: Matrix Multiplication Performance Analysis
Let’s assume we’ve profiled a matrix multiplication program using VTune and want to analyze the results.
1. Understanding the Report Structure
A typical VTune report contains several key sections:
- Summary View: High-level metrics
- Bottom-up View: Detailed breakdown by function
- Caller/Callee View: Function call relationships
- Platform View: Hardware utilization
- Top-down Tree View: Hierarchical performance breakdown
2. Starting with the Summary View
The summary might show:
Elapsed Time: 12.345 seconds
CPU Time: 48.276 seconds (4 logical cores)
CPU Utilization: 97.8%
Clockticks: 98,765,432,100
Instructions Retired: 87,654,321,000
CPI Rate: 1.13 (clockticks per instruction)
Key Observations:
- Good CPU utilization (97.8%) suggests the workload is CPU-bound
- CPI of 1.13 is decent (lower is better, ideal is <1)
- The large number of retired instructions, combined with a CPI near 1, points to a compute-intensive workload
3. Bottom-up Analysis
Drilling into the bottom-up view, we might see:
Function           CPU Time %   Instructions     CPI    Thread Time
----------------------------------------------------------------------------
multiply_matrices  92.3%        80,123,456,789   1.05   44.567s
matrix_initialize   5.1%         4,987,654,321   1.42    2.462s
memory_alloc        2.6%         2,543,210,987   1.85    1.256s
Analysis:
- multiply_matrices is our hotspot (92.3% of CPU time)
- Its CPI (1.05) is better than the overall average (1.13), suggesting it's relatively efficient
- Initialization and allocation take non-trivial time but aren’t the main bottlenecks
4. Hotspot Function Drill-down
Looking specifically at multiply_matrices:
Event                 Count             % of Parent
----------------------------------------------------------------
Instructions Retired  80,123,456,789    100.0%
Clockticks            84,129,629,628    100.0%
L1 Misses              1,234,567,890      1.5%
L2 Misses                345,678,901      0.4%
LLC Misses               123,456,789      0.1%
DRAM Accesses             12,345,678      0.015%
Branch Mispredicts         1,234,567      0.0015%
Memory Access Analysis:
- Low cache miss rates (especially LLC and DRAM) suggest good cache utilization
- The 1.5% L1 miss rate might be worth investigating for potential optimization
5. Platform View Analysis
The platform view might show:
CPU Utilization: 97.8%
FPU Utilization: 85.2%
Vectorization: 65.4%
Memory Bandwidth Used: 12.3 GB/s (of 38.4 GB/s available)
Observations:
- High CPU but not max FPU utilization suggests some floating-point inefficiency
- 65.4% vectorization indicates room for SIMD optimization
- Memory bandwidth is underutilized, confirming this is a compute-bound problem
6. Top-down Microarchitecture Analysis
The top-down tree might reveal:
Category            % of Pipeline Slots
-----------------------------------------------
Retiring                70.1% (good)
Bad Speculation          0.3% (good)
Front-End Bound          5.2% (moderate)
Back-End Bound          24.4% (high)
  - Memory Bound        15.1%
  - Core Bound           9.3%
Interpretation:
- 70.1% retiring is good (actual work being done)
- 24.4% back-end bound suggests execution ports are sometimes stalled
- 15.1% memory bound aligns with our earlier cache miss observations
7. Source Code Correlation
VTune can show performance metrics alongside source code. For our matrix multiplication:
```cpp
// Hotspot identified - inner loop
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {      // CPI: 1.08, 45% of total time
            C[i][j] += A[i][k] * B[k][j];  // L1 misses: 1.2%
        }
    }
}
```
Optimization Opportunities:
- Loop Order: With the k-loop innermost, B[k][j] is accessed with stride N, which is bad for cache locality on the B matrix
- Vectorization: The compiler may not have vectorized optimally (65.4% vectorization)
- Blocking: Could add cache blocking/tiling to improve locality
8. Suggested Optimizations
Based on this analysis:
- Change loop order to i-k-j for better cache locality:

```cpp
for (int i = 0; i < N; i++) {
    for (int k = 0; k < N; k++) {
        for (int j = 0; j < N; j++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```
- Add compiler directives to request vectorization:

```cpp
// Inside the i and k loops:
#pragma omp simd
for (int j = 0; j < N; j++) {
    C[i][j] += A[i][k] * B[k][j];
}
```
- Implement cache blocking:

```cpp
const int blockSize = 64;  // tune to the target's cache sizes
for (int ii = 0; ii < N; ii += blockSize) {
    for (int kk = 0; kk < N; kk += blockSize) {
        for (int jj = 0; jj < N; jj += blockSize) {
            for (int i = ii; i < std::min(ii + blockSize, N); i++) {      // std::min from <algorithm>
                for (int k = kk; k < std::min(kk + blockSize, N); k++) {
                    for (int j = jj; j < std::min(jj + blockSize, N); j++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
```
9. Post-Optimization Analysis
After applying these changes, a new VTune report might show:
Elapsed Time: 6.543 seconds (47% reduction)
CPI Rate: 0.89 (from 1.13)
Vectorization: 92.1% (from 65.4%)
L1 Miss Rate: 0.6% (from 1.5%)
Key VTune Metrics to Monitor
- CPI (Clocks Per Instruction): Lower is better. <1 is excellent, >2 may indicate problems.
- Cache Miss Rates: L1 (expect <5%), L2 (<3%), LLC (<1%).
- Vectorization: Percentage of code using SIMD instructions.
- Memory Bandwidth: Compare to system maximum.
- Top-down Categories: Aim for high Retiring, low Bad Speculation and Bound categories.
This example demonstrates how to systematically analyze a VTune report to identify and address performance bottlenecks in a compute-intensive application.