perf Performance Analysis (6) -- perf in Practice (1) -- Branch Prediction
Build the sort and unsort programs in Debug mode from the following code:
#include <algorithm>
#include <cstdlib>
#include <iostream>

int main(int argc, char* argv[]) {
    constexpr int kArrLen = 1024 * 1024;
    int* data = new int[kArrLen];
    // Fill the array with pseudo-random values in [0, 255].
    for (int c = 0; c < kArrLen; c++)
        data[c] = std::rand() % 256;

    // std::sort(data, data + kArrLen);  // uncomment to build sort; leave commented out for unsort

    long long sum = 0;
    for (int i = 0; i < 30000; i++) {
        for (int c = 0; c < kArrLen; c++) {
            if (data[c] >= 128) {  // threshold deliberately set to half of 256
                sum += data[c];
            }
        }
    }
    delete[] data;
    std::cout << "sum = " << sum << std::endl;
}
1. Benchmarking and comparing sort/unsort: perf stat
The test machine's CPU is an Intel(R) Core(TM) i5-8260U CPU @ 1.60GHz.
Results of perf stat ./sort:
$ perf stat ./sort
sum = 3016043160000
Performance counter stats for './sort':
63,995.28 msec task-clock # 1.000 CPUs utilized
1,039 context-switches # 16.236 /sec
7 cpu-migrations # 0.109 /sec
1,151 page-faults # 17.986 /sec
240,342,649,070 cycles # 3.756 GHz
473,524,864,500 instructions # 1.97 insn per cycle
63,163,134,842 branches # 986.997 M/sec
7,402,185 branch-misses # 0.01% of all branches
64.008866998 seconds time elapsed
63.983827000 seconds user
0.010999000 seconds sys
Results of perf stat ./unsort:
$ perf stat ./unsort
sum = 3016043160000
Performance counter stats for './unsort':
206,441.99 msec task-clock # 1.000 CPUs utilized
484 context-switches # 2.344 /sec
44 cpu-migrations # 0.213 /sec
1,150 page-faults # 5.571 /sec
771,936,412,315 cycles # 3.739 GHz
473,410,337,438 instructions # 0.61 insn per cycle
63,144,272,701 branches # 305.869 M/sec
15,732,825,641 branch-misses # 24.92% of all branches
206.439026771 seconds time elapsed
206.440760000 seconds user
0.002000000 seconds sys
1.1. Result comparison
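unsort has a branch misprediction rate of 24.92% and a CPU IPC of 0.61, which is below 1.0.
sort has a branch misprediction rate of 0.01% and a CPU IPC of 1.97.
Both programs execute almost the same number of instructions and branches, yet unsort takes about 206.4 seconds against 64.0 seconds for sort.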
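The percentages and IPC values above can be re-derived from the raw counters. The following is a minimal sketch of that arithmetic, not part of the original benchmark; the helper names Ipc and MissRatePercent are made up for illustration, and the numbers are copied from the two perf stat outputs.

```cpp
#include <cstdint>

// Instructions retired per cycle, from the "instructions" and "cycles" counters.
constexpr double Ipc(std::uint64_t instructions, std::uint64_t cycles) {
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Share of branches that were mispredicted, in percent.
constexpr double MissRatePercent(std::uint64_t misses, std::uint64_t branches) {
    return 100.0 * static_cast<double>(misses) / static_cast<double>(branches);
}

// Counter values copied from the perf stat outputs in this article.
constexpr double kSortIpc = Ipc(473524864500ULL, 240342649070ULL);
constexpr double kUnsortIpc = Ipc(473410337438ULL, 771936412315ULL);
constexpr double kSortMissRate = MissRatePercent(7402185ULL, 63163134842ULL);
constexpr double kUnsortMissRate = MissRatePercent(15732825641ULL, 63144272701ULL);

// Elapsed-time ratio of unsort to sort (seconds, from "time elapsed").
constexpr double kSlowdown = 206.439026771 / 64.008866998;
```

Printing these constants from a main function should reproduce the figures perf reports (IPC near 1.97 and 0.61, miss rates near 0.01% and 24.92%) and shows that unsort is roughly 3.2 times slower in wall-clock time.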
2. Analysis using perf record and perf annotate
2.1. Program unsort
Run perf record and perf annotate:
perf record ./unsort # writes a perf.data file when the run finishes
perf annotate # jumps to the hottest spot automatically
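When the if comparison fails, execution jumps into the loop body, which accounts for 24.44% of the CPU time. The cause is branch misprediction: the branch generated for the if statement is mispredicted frequently inside the for loop.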
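One rough way to see why the unsorted input is hard to predict is to count how often the outcome of the comparison flips from one element to the next. The sketch below is only a proxy for predictability, not a model of the real branch predictor; the function name CountDirectionChanges and the two 8-element samples are hypothetical stand-ins for the 1M-element array, and the code assumes C++14 or later.

```cpp
// Counts how often the outcome of `data[c] >= 128` differs from the outcome
// for the previous element. Few changes means long runs of the same branch
// direction, which are easy to predict.
constexpr int CountDirectionChanges(const int* data, int n) {
    int changes = 0;
    for (int c = 1; c < n; c++) {
        const bool prev = data[c - 1] >= 128;
        const bool curr = data[c] >= 128;
        if (prev != curr) {
            changes++;
        }
    }
    return changes;
}

// Hypothetical samples: the same eight values, unsorted and sorted.
constexpr int kUnsortedSample[] = {200, 3, 130, 90, 255, 17, 128, 64};
constexpr int kSortedSample[] = {3, 17, 64, 90, 128, 130, 200, 255};
```

For the unsorted sample the direction changes at every step (7 changes), while the sorted sample changes direction exactly once, at the point where the values cross 128. On the real array the same effect is what lets the predictor succeed almost always after sorting.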
2.2. Program sort
The share attributed to branch misprediction is 0.051% + 4.31%, and the CPU IPC is 1.97.
- Memory load operations: 41.68%. add instruction: 10.54%.
3. When built in Release mode
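When compiled with -O2 or -O3, the compiler uses SIMD instructions and eliminates the if statement. As a result, the performance gap between sort and unsort is small, and sort may even perform better: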
- pcmpgtd: SIMD compare instruction; it produces a bitmask result that is stored in the result register.
- pand: SIMD bitwise AND.
This is roughly equivalent to:
sum += data[c] & -(data[c] >= 128);
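To check that the masked expression computes the same result as the original if statement, the two loops can be written side by side. This is a scalar sketch of the idea only, not the SIMD code the compiler actually emits; the function names and the six-element sample are hypothetical, the code assumes C++14 or later, and it relies on two's-complement integers so that -1 has all bits set.

```cpp
// Sum of elements >= 128 using the conditional branch from the original loop.
constexpr long long SumWithBranch(const int* data, int n) {
    long long sum = 0;
    for (int c = 0; c < n; c++) {
        if (data[c] >= 128) {
            sum += data[c];
        }
    }
    return sum;
}

// Same sum without a branch: the comparison yields 0 or 1, negating it gives
// 0 or -1 (all bits set), and the AND either clears or keeps the value.
constexpr long long SumBranchless(const int* data, int n) {
    long long sum = 0;
    for (int c = 0; c < n; c++) {
        sum += data[c] & -(data[c] >= 128);
    }
    return sum;
}

// Hypothetical sample covering values on both sides of the threshold.
constexpr int kMaskSample[] = {0, 127, 128, 255, 200, 5};
```

Both functions return 583 (128 + 255 + 200) for the sample. Whether a given compiler turns either form into branch-free machine code still depends on the compiler and optimization level, so the generated assembly is worth checking with perf annotate.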
This article is licensed by the author under CC BY 4.0.