AutoFDO Survey

Posted on 2020-10-23 in cpp-devel

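Background

Traditional FDO (Feedback-Directed Optimization) takes two compilation passes: build an instrumented binary, run the workload on it to generate a profile, then recompile with that profile. The instrumented build runs with high overhead, and two sets of binaries have to be maintained.

AutoFDO removes the instrumented build: it samples the CPU and its branch records with hardware performance counters (via perf) and converts the data offline into a profile format that GCC can consume. It relies mainly on two Intel features, LBR (Last Branch Record) and PEBS (Precise Event-Based Sampling).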

Installation

1. Install the autofdo tools

git clone https://github.com/google/autofdo.git
cd autofdo
./configure
make -j$(nproc)
make install

Dependencies:

autoconf / automake
libunwind (stack unwinding)
libpfm4 (perf event encoding)
gflags / glog (command-line flags and logging)
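Before building, it can help to confirm that the command-line tools are on PATH. The sketch below is a small helper and not part of autofdo: the function name `missing_tools` and the tool list passed to it are illustrative assumptions. Library dependencies such as libunwind or gflags still have to be checked through the package manager.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
# Hypothetical helper for illustration; not part of the autofdo repository.
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command exists.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: the build tools from the dependency list, plus git, make and perf.
missing=$(missing_tools git autoconf automake make perf)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing
else
    echo "all build tools found"
fi
```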

2. Make sure perf supports LBR and PEBS

perf record -e cycles:pp -b -c 100003 -o /tmp/raw.perf -- ./your_binary

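Key parameters:

-e cycles:pp: a precise (PEBS) event that records the instruction address of each sample
-b: enable LBR branch sampling
-c 100003: sampling period, one sample every 100003 events (a prime avoids lockstep with periodic code); tune it per workload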
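To get a feel for what a period of 100003 means, the arithmetic can be sketched in shell. For a cycles event the sample rate on a busy core is roughly the clock frequency divided by the period. The 3 GHz clock below is an assumed figure, and `samples_per_sec` and `is_prime` are hypothetical helper names, not perf features.

```shell
#!/bin/sh
# Approximate samples per second on one busy core for a cycles-based period:
# rate = clock_hz / period (integer division).
samples_per_sec() {
    clock_hz=$1
    period=$2
    echo $(( clock_hz / period ))
}

# Trial-division primality test, adequate for sampling periods of this size.
is_prime() {
    n=$1
    [ "$n" -ge 2 ] || return 1
    d=2
    while [ $(( d * d )) -le "$n" ]; do
        [ $(( n % d )) -ne 0 ] || return 1
        d=$(( d + 1 ))
    done
    return 0
}

# Assumed 3 GHz clock; substitute the real frequency of the target machine.
samples_per_sec 3000000000 100003    # prints 29999
is_prime 100003 && echo "100003 is prime"
```

Raising the period lowers both the sample rate and the collection overhead in proportion, which is the trade-off to tune per workload.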

3. Make sure the GCC version is >= 5.0

gcc --version   # needs 5.0+, which supports -fauto-profile
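In a build script the version check can be automated. This is a minimal sketch that only compares the major version number; the function name `gcc_supports_auto_profile` is made up for illustration, and it does not detect a distribution build with the feature disabled.

```shell
#!/bin/sh
# Succeed when the given GCC version string has a major version of 5 or more.
# Accepts strings such as "9", "5.1" or "4.9.2".
gcc_supports_auto_profile() {
    major=${1%%.*}                 # drop everything from the first dot onward
    case $major in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$major" -ge 5 ]
}

# gcc -dumpversion prints only the version number, which is easier to parse
# than the banner printed by gcc --version.
if command -v gcc >/dev/null 2>&1; then
    version=$(gcc -dumpversion)
    if gcc_supports_auto_profile "$version"; then
        echo "gcc $version: new enough for -fauto-profile"
    else
        echo "gcc $version: too old for -fauto-profile"
    fi
fi
```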

A simple example

Take a microbenchmark as an example:

Compile (stage 1: build from source with debuginfo)

gcc -O2 -g -fno-omit-frame-pointer -o bench bench.c

Collect the profile

# Collect LBR + PEBS data with perf
perf record -e cycles:pp -b -c 100003 -o bench.perf.data -- ./bench

# Convert to the GCC auto-profile format
create_gcov --binary=./bench --profile=bench.perf.data --gcov=bench.afdo

Compile (stage 2: recompile with the profile)

gcc -O2 -g -fauto-profile=bench.afdo -o bench.optimized bench.c

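GCC now uses the hot spots and branch probabilities in bench.afdo to redo its decisions on inlining, basic-block reordering, loop unrolling, and similar optimizations.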
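The steps above can be wrapped in one script so that they always run in the same order. The sketch below reuses the commands from this example unchanged. The `run` and `afdo_build` functions and the `DRY_RUN` variable are assumptions introduced here for illustration. With `DRY_RUN=1` the commands are only printed, which makes the plan easy to review on a machine without perf or autofdo installed.

```shell
#!/bin/sh
# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build, profile, convert and rebuild one C source file.
# Each step runs only if the previous one succeeded.
afdo_build() {
    src=$1                 # e.g. bench.c
    name=${src%.c}         # e.g. bench
    run gcc -O2 -g -fno-omit-frame-pointer -o "$name" "$src" &&
    run perf record -e cycles:pp -b -c 100003 -o "$name.perf.data" -- "./$name" &&
    run create_gcov --binary="./$name" --profile="$name.perf.data" --gcov="$name.afdo" &&
    run gcc -O2 -g -fauto-profile="$name.afdo" -o "$name.optimized" "$src"
}

# Print the four commands without executing them.
DRY_RUN=1 afdo_build bench.c
```

Dropping `DRY_RUN=1` from the last line runs the pipeline for real.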

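A more complex example: merging multiple profiles and tuning sampling

Production services usually need profiles collected on several machines and under several workloads, then merged. A common pipeline:

1. Collect each workload separately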

perf record -e cycles:pp -b -c 100003 -o wl1.perf.data -- ./svr --workload=a
perf record -e cycles:pp -b -c 100003 -o wl2.perf.data -- ./svr --workload=b
perf record -e cycles:pp -b -c 100003 -o wl3.perf.data -- ./svr --workload=c

2. Convert each to a gcov-format profile, then merge

create_gcov --binary=./svr --profile=wl1.perf.data --gcov=wl1.afdo --gcov_version=1
create_gcov --binary=./svr --profile=wl2.perf.data --gcov=wl2.afdo --gcov_version=1
create_gcov --binary=./svr --profile=wl3.perf.data --gcov=wl3.afdo --gcov_version=1

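# Merge the profiles with profile_merger, weighting the workloads 3:2:1.
# Flag names vary between autofdo releases; confirm them with profile_merger --help.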
profile_merger --input=wl1.afdo,wl2.afdo,wl3.afdo \
               --weight=3,2,1 \
               --output=merged.afdo

3. Recompile

gcc -O2 -g -fauto-profile=merged.afdo -o svr.optimized svr.c
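With more than a handful of workloads, the per-workload commands are easier to generate in a loop. The sketch below only prints the commands and then the comma-separated profile list for the merge step. It names the files after the workload (a.perf.data rather than wl1.perf.data), which is a naming assumption of this sketch, and `join_comma` and `plan_workloads` are hypothetical helpers.

```shell
#!/bin/sh
# Join arguments with commas: join_comma a b c prints a,b,c
join_comma() {
    out=
    for item in "$@"; do
        out=${out:+$out,}$item
    done
    printf '%s\n' "$out"
}

# Print the record and convert commands for each workload name, followed by
# the comma-separated list of resulting profiles.
plan_workloads() {
    profiles=
    for wl in "$@"; do
        echo "perf record -e cycles:pp -b -c 100003 -o $wl.perf.data -- ./svr --workload=$wl"
        echo "create_gcov --binary=./svr --profile=$wl.perf.data --gcov=$wl.afdo --gcov_version=1"
        profiles="$profiles $wl.afdo"
    done
    # Word splitting of $profiles is intended here.
    join_comma $profiles
}

plan_workloads a b c
```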

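Sampling parameter tuning

Parameter	Effect	Tuning guidance
`perf record -c <N>`	Sampling period	Too small an N means high overhead; too large gives a sparse profile. A common target is sampling overhead of `<= 2%` CPU
`-e cycles:pp`	Precise (PEBS) event	Use `perf list` to see which events the hardware supports; prefer the `:pp` suffix
LBR stack depth	Hardware limit	32 entries on Intel Skylake and later, 16 on Haswell/Broadwell. A shallow stack weakens devirtualization of indirect calls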

Profile quality checks

# Total sample counts (first 20 lines of the dump)
dump_gcov --gcov=merged.afdo | head -20

# Hot-function coverage: top 30 functions by sample count
dump_gcov --gcov=merged.afdo | sort -t, -k2,2nr | head -30

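dump_gcov prints one line per function in the form name,count,entry_count,type. Check that:

the hot functions have enough samples (the top function should have a count > 10000)
all key functions are covered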
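Both checks can be scripted. The sketch below assumes the four-column comma-separated dump described above arrives on stdin. `check_top_count` and `check_coverage` are hypothetical helper names, and function names are matched as plain prefixes, so names containing regex metacharacters would need escaping.

```shell
#!/bin/sh
# Read "name,count,entry_count,type" lines on stdin and report whether the
# hottest function exceeds the minimum sample count (default 10000).
check_top_count() {
    min=${1:-10000}
    sort -t, -k2,2nr | awk -F, -v min="$min" '
        NR == 1 {
            status = ($2 > min) ? "ok" : "low"
            printf "%s %d %s\n", $1, $2, status
        }'
}

# Succeed only when every named function appears in the dump on stdin;
# print the first missing name otherwise.
check_coverage() {
    dump=$(cat)
    for fn in "$@"; do
        printf '%s\n' "$dump" | grep -q "^$fn," || { echo "missing: $fn"; return 1; }
    done
}

# Usage, following the dump command shown earlier:
#   dump_gcov --gcov=merged.afdo | check_top_count 10000
#   dump_gcov --gcov=merged.afdo | check_coverage main handle_request
```

`handle_request` in the usage comment is a placeholder for whichever functions matter in the service being profiled.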

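Caveats

debuginfo quality: compile with -g and do not strip the binary that is profiled. For release builds, -g1 or Debug Fission (-gsplit-dwarf) reduces the size.
LBR limits: LBR may be unavailable inside virtual machines (the host has to pass it through); check with dmesg | grep -i lbr.
Profile freshness: a profile goes stale as the source changes, so CI needs to re-collect it regularly.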
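One simple way to enforce regular re-collection in CI is an age check on the profile file. The sketch below uses file age as a rough proxy for staleness. The 7-day threshold, the file name merged.afdo, and the function name `profile_is_stale` are all assumptions to adjust to how quickly the code base changes.

```shell
#!/bin/sh
# Succeed when the profile file is missing or was last modified more than
# max_days days ago (default 7).
profile_is_stale() {
    file=$1
    max_days=${2:-7}
    [ -f "$file" ] || return 0
    # find prints the path only when the file is older than max_days days.
    [ -n "$(find "$file" -mtime +"$max_days" -print)" ]
}

if profile_is_stale merged.afdo 7; then
    echo "merged.afdo is missing or stale: re-collect before the optimized build"
fi
```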
