
perf Performance Analysis (6) -- perf in Practice (1) -- Branch Prediction

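Compile the sort and unsort programs in Debug mode. Both are built from the same source; the only difference is whether the std::sort call is enabled: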

#include <algorithm>
#include <cstdlib>  // std::rand
#include <iostream>

int main(int argc, char* argv[]) {
  constexpr int kArrLen = 1024 * 1024;
  int* data = new int[kArrLen];

  for (int c = 0; c < kArrLen; c++)
    data[c] = std::rand() % 256;

  // std::sort(data, data + kArrLen);  // uncomment for the sort build; leave commented for unsort

  long long sum = 0;
  for (int i = 0; i < 30000; i++) {
    for (int c = 0; c < kArrLen; c++) {
      if (data[c] >= 128) {  // 128 is deliberately half of 256
        sum += data[c];
      }
    }
  }

  delete[] data;
  std::cout << "sum = " << sum << std::endl;
}

1. sort/unsort performance test and comparison: perf stat

The CPU is Intel(R) Core(TM) i5-8260U CPU @ 1.60GHz.

Results of perf stat ./sort:

$ perf stat ./sort

sum = 3016043160000

 Performance counter stats for './sort':

         63,995.28 msec task-clock                       #    1.000 CPUs utilized
             1,039      context-switches                 #   16.236 /sec
                 7      cpu-migrations                   #    0.109 /sec
             1,151      page-faults                      #   17.986 /sec
   240,342,649,070      cycles                           #    3.756 GHz
   473,524,864,500      instructions                     #    1.97  insn per cycle
    63,163,134,842      branches                         #  986.997 M/sec
         7,402,185      branch-misses                    #    0.01% of all branches

      64.008866998 seconds time elapsed

      63.983827000 seconds user
       0.010999000 seconds sys

Results of perf stat ./unsort:

$ perf stat ./unsort

sum = 3016043160000

 Performance counter stats for './unsort':

        206,441.99 msec task-clock                       #    1.000 CPUs utilized
               484      context-switches                 #    2.344 /sec
                44      cpu-migrations                   #    0.213 /sec
             1,150      page-faults                      #    5.571 /sec
   771,936,412,315      cycles                           #    3.739 GHz
   473,410,337,438      instructions                     #    0.61  insn per cycle
    63,144,272,701      branches                         #  305.869 M/sec
    15,732,825,641      branch-misses                    #   24.92% of all branches

     206.439026771 seconds time elapsed

     206.440760000 seconds user
       0.002000000 seconds sys

1.1. Result comparison

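unsort has a branch misprediction rate of 24.92% and an IPC of 0.61, which is below 1.0. It takes about 206.4 seconds, roughly 3.2 times as long as sort, even though both programs execute almost the same number of instructions and branches.

sort has a branch misprediction rate of 0.01% and an IPC of 1.97, and finishes in about 64.0 seconds.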

2. Analysis using perf record and perf annotate

2.1. Program unsort

Run the perf record and perf annotate commands:

perf record ./unsort # writes a perf.data file when the run finishes
perf annotate        # jumps to the most time-consuming spot automatically

[Figure: perf annotate output for unsort]

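When the comparison in the if statement fails, execution jumps into the loop body, and the loop body accounts for 24.44% of the CPU time. The cause is branch misprediction: on unsorted data the if statement inside the for loop is mispredicted very often.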

2.2. Program sort

[Figure: perf annotate output for sort]

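In the annotated output, the cost attributed to branching is 0.051% + 4.31%, and the IPC is 1.97. With mispredictions out of the way, the time goes to the actual work:

  • Memory load operations: 41.68%.
  • add instruction: 10.54%.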

3. When built in Release mode

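When the code is compiled with -O2 or -O3, the compiler uses SIMD instructions and eliminates the if statement. As a result, the performance gap between sort and unsort is small, although sort may still come out slightly ahead. The generated code relies on two instructions:

  • pcmpgtd: a SIMD compare instruction that produces a bitmask and stores it in the result register;
  • pand: a SIMD bitwise AND.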

Together, the two instructions are roughly equivalent to the following scalar expression:

sum += data[c] & -(data[c] >= 128);


This article is licensed by the author under CC BY 4.0.