Laurenfrost's Blog.

Drink today while there is wine; let tomorrow's worries wait for tomorrow.

Some topics about CPU Cache

A few questions about the CPU cache

Apr 28, 2020 4 min read
Computer Architecture


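My understanding of CPU caches more or less stops at the undergraduate course "Computer Architecture". Now and then I run into questions I cannot answer at all, so I am writing them down here.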

Question 1: How to get cache information on Linux

1. Query cache information with the `getconf` command

The getconf command reports many system configuration values, cache parameters among them, so the following command lists the cache-related entries:

$ getconf -a | grep CACHE

On my machine it prints:

// CPU: AMD Ryzen 7 2700
LEVEL1_ICACHE_SIZE                 65536
LEVEL1_ICACHE_ASSOC                4
LEVEL1_ICACHE_LINESIZE             64
LEVEL1_DCACHE_SIZE                 32768
LEVEL1_DCACHE_ASSOC                8
LEVEL1_DCACHE_LINESIZE             64
LEVEL2_CACHE_SIZE                  524288
LEVEL2_CACHE_ASSOC                 8
LEVEL2_CACHE_LINESIZE              64
LEVEL3_CACHE_SIZE                  16777216
LEVEL3_CACHE_ASSOC                 16
LEVEL3_CACHE_LINESIZE              64
LEVEL4_CACHE_SIZE                  0
LEVEL4_CACHE_ASSOC                 0
LEVEL4_CACHE_LINESIZE              0

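SIZE and LINESIZE are given in bytes; ASSOC is a plain count.
The level-1 cache is split into an instruction cache and a data cache, labeled ICACHE and DCACHE.
SIZE is the total capacity of that cache level.
LINESIZE is the size of a cache line at that level, i.e. the amount of data moved in one transfer between this level and the next level down the memory hierarchy (toward main memory).
ASSOC is the associativity of that level, i.e. the number of ways (lines) per set, not the number of sets. A value of 0, as in the LEVEL4 entries, means that level is absent or not reported.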
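
The three fields are tied together by a simple relation: number of sets = SIZE / (ASSOC × LINESIZE). Below is a minimal sketch of that arithmetic, assuming ASSOC is the number of ways per set. The helper name `cache_num_sets` is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: number of sets in a set-associative cache.
 * size_bytes = SIZE, ways = ASSOC, line_bytes = LINESIZE.
 * Returns 0 when the level is absent (all fields 0, as for LEVEL4 above). */
size_t cache_num_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    if (ways == 0 || line_bytes == 0)
        return 0;
    return size_bytes / (ways * line_bytes);
}
```

Taking the reported values at face value, `cache_num_sets(32768, 8, 64)` gives 64 sets for the L1 data cache, and `cache_num_sets(16777216, 16, 64)` gives 16384 sets for L3.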

2. Query cache information through the system library

The header unistd.h wraps a large number of system-call-related APIs, including sysconf(), which can return the same information:

#include <stdio.h>
#include <unistd.h>
 
int main (void)
{
  long l1_cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  long l2_cache_line_size = sysconf(_SC_LEVEL2_CACHE_LINESIZE); 
  long l3_cache_line_size = sysconf(_SC_LEVEL3_CACHE_LINESIZE);
 
  printf("L1 Cache Line Size is %ld bytes.\n", l1_cache_line_size); 
  printf("L2 Cache Line Size is %ld bytes.\n", l2_cache_line_size); 
  printf("L3 Cache Line Size is %ld bytes.\n", l3_cache_line_size); 
 
  return (0);
}

Compiling with gcc and running it prints:

$ vim cache-info.c
$ gcc cache-info.c
$ ./a.out
L1 Cache Line Size is 64 bytes.
L2 Cache Line Size is 64 bytes.
L3 Cache Line Size is 64 bytes.
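
One caveat: the `_SC_LEVEL*` names are glibc extensions rather than POSIX, and `sysconf` can return 0 or -1 when the value is unknown (for example inside some virtual machines). Here is a minimal sketch of a defensive wrapper. The name `l1d_line_size_or` and the fallback parameter are assumptions of this example, not part of any library.

```c
#include <unistd.h>

/* Hypothetical wrapper: return the L1 data cache line size in bytes,
 * or `fallback` when the C library cannot report it. */
long l1d_line_size_or(long fallback)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return n;
#endif
    /* Value unknown: sysconf returned 0 or -1, or the name is not defined. */
    return fallback;
}
```

Calling `l1d_line_size_or(64)` falls back to 64, a common line size on x86, when nothing is reported.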

3. Read cache information through each platform's own interface

Reference: https://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size

#ifndef GET_CACHE_LINE_SIZE_H_INCLUDED
#define GET_CACHE_LINE_SIZE_H_INCLUDED

#include <stddef.h>
size_t cache_line_size();

#if defined(__APPLE__)

#include <sys/sysctl.h>
size_t cache_line_size() {
    size_t line_size = 0;
    size_t sizeof_line_size = sizeof(line_size);
    sysctlbyname("hw.cachelinesize", &line_size, &sizeof_line_size, 0, 0);
    return line_size;
}

#elif defined(_WIN32)

#include <stdlib.h>
#include <windows.h>
size_t cache_line_size() {
    size_t line_size = 0;
    DWORD buffer_size = 0;
    DWORD i = 0;
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION * buffer = 0;

    GetLogicalProcessorInformation(0, &buffer_size);
    buffer = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *)malloc(buffer_size);
    GetLogicalProcessorInformation(&buffer[0], &buffer_size);

    for (i = 0; i != buffer_size / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION); ++i) {
        if (buffer[i].Relationship == RelationCache && buffer[i].Cache.Level == 1) {
            line_size = buffer[i].Cache.LineSize;
            break;
        }
    }

    free(buffer);
    return line_size;
}

#elif defined(__linux__)

#include <stdio.h>
size_t cache_line_size() {
    FILE * p = 0;
    p = fopen("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
    unsigned int i = 0;
    if (p) {
        fscanf(p, "%u", &i);
        fclose(p);
    }
    return i;
}

#else
#error Unrecognized platform
#endif

#endif /* GET_CACHE_LINE_SIZE_H_INCLUDED */
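
On Linux the same sysfs directory holds more than the line size: each `indexN` entry also has `level`, `type` and `size` files. The following is a rough sketch that lists them for cpu0. The function names and the upper bound of 16 entries are assumptions made for this example.

```c
#include <stdio.h>
#include <string.h>

/* Read one attribute file of cpu0's cache indexN into `out`.
 * Returns 1 on success, 0 if the file cannot be read. */
static int read_attr(int idx, const char *name, char *out, size_t out_len)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu0/cache/index%d/%s", idx, name);
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    int ok = fgets(out, (int)out_len, f) != NULL;
    fclose(f);
    if (ok)
        out[strcspn(out, "\n")] = '\0';   /* strip the trailing newline */
    return ok;
}

/* Print level, type and size of every cache sysfs lists for cpu0.
 * Returns the number of entries found (0 if sysfs exposes none). */
int print_cpu0_caches(void)
{
    int count = 0;
    for (int idx = 0; idx < 16; idx++) {   /* 16 is an arbitrary upper bound */
        char level[16], type[32], size[32];
        if (!read_attr(idx, "level", level, sizeof level))
            break;                         /* no more indexN directories */
        if (!read_attr(idx, "type", type, sizeof type))
            strcpy(type, "?");
        if (!read_attr(idx, "size", size, sizeof size))
            strcpy(size, "?");
        printf("L%s %-12s %s\n", level, type, size);
        count++;
    }
    return count;
}
```

Some containers and virtual machines do not expose these files, in which case the function simply returns 0.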

Question 2: How to flush the CPU cache

A friend (三一) recently asked in a group chat: does Linux have a command that flushes the CPU cache?

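As the textbooks say, the CPU cache is supposed to be transparent to the user. In other words, the cache runs on its own and is not something you control directly, so there should be no command that simply flushes it.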

But is there really nothing we can do?

1. Access a lot of memory

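A cache, as everyone knows, exists to speed up the CPU's memory accesses: whatever the CPU needs, the cache fetches from memory. Whatever replacement algorithm it uses, it must have some mechanism for updating its contents, evicting old data (and writing it back to memory if it was modified) to make room for new data. So let's deliberately allocate a region larger than the cache and access every element of it. Wouldn't that "flush" the cache?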

Have the CPU (rather than DMA) repeatedly touch a large amount of data: [2]

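#include <stdlib.h>

int main(void) {
    const size_t size = 20*1024*1024; // Allocate 20M. Set much larger than L2
    volatile char *c = malloc(size);  // volatile keeps the stores from being optimized away
    if (!c)
        return 1;
    for (int i = 0; i < 0xffff; i++)
        for (size_t j = 0; j < size; j++)
            c[j] = (char)(i*j);       // size_t arithmetic avoids signed overflow
    free((void *)c);
    return 0;
}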

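But this raises new problems:

Most modern CPUs have two L1 caches: a data cache and an instruction cache. Accessing lots of memory this way can only clear the L1 data cache, not the L1 instruction cache.
Because we don't know how the CPU is implemented internally, there is no guarantee that it replaces all of the old data in the cache. If the data this program accesses only circulates within one section of the cache, the cache has not really been "cleared" at all.
The most damning point: how is this any different from simply running some arbitrary pile of code?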

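I have one more thought: on today's multi-core CPUs, L1 and L2 are generally private to each core, while L3 is shared by all the cores in the same NUMA node. Other cores keep filling the shared L3 while this program runs, so I don't think this approach can really "clear" the L3 cache.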
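
A small refinement of the idea above, offered only as a sketch: since the cache works in units of lines, one store per line is enough to pull that line in, so the inner loop can step by the line size instead of by one byte. The function name `touch_every_line` is made up for this example, and it shares every caveat just listed.

```c
#include <stddef.h>

/* Write one byte in every cache line of buf[0 .. bytes).
 * Returns the number of lines touched. Eviction of older data is a side
 * effect of the replacement policy, not something this function can verify. */
size_t touch_every_line(volatile char *buf, size_t bytes, size_t line)
{
    size_t touched = 0;
    if (line == 0)
        return 0;
    for (size_t off = 0; off < bytes; off += line) {
        buf[off] = (char)off;   /* volatile store: the compiler must emit it */
        touched++;
    }
    return touched;
}
```

With a 64-byte line, a 1 MiB buffer takes 16384 stores instead of 1048576.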

2. Just wait (run NOPs)

#!/usr/bin/ruby
puts "main:"
200000.times { puts "  nop" }
puts "  xor rax, rax"
puts "  ret"

Running a few times under different names (code produced not the script) should do the work [2]

3. Special CPU instructions

While reading up on this I found something interesting: modern x86 CPUs provide an instruction that acts directly on a cache line, CLFLUSH (Flush Cache Line).

So strictly speaking, a way to touch the CPU cache directly does exist.

There are x86 assembly instructions to force the CPU to flush certain cache lines (such as CLFLUSH), but they are pretty obscure. CLFLUSH in particular only flushes a chosen address from L1 caches.

The CLFLUSH instruction does not flush only the L1 cache. From the Intel x86-64 reference manual: “The CLFLUSH (flush cache line) instruction writes and invalidates the cache line associated with a specified linear address. The invalidation is for all levels of the processor’s cache hierarchy, and it is broadcast throughout the cache coherency domain.” [1]
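
From C, CLFLUSH is reachable without inline assembly through the SSE2 intrinsic `_mm_clflush`. Here is a minimal sketch, assuming an x86 target compiled with SSE2 (the default on x86-64) and a power-of-two line size. The name `flush_range` and its return convention are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#endif

/* Flush every cache line overlapping [p, p + len).
 * `line` is the cache line size in bytes (64 on the machine above)
 * and is assumed to be a power of two.
 * Returns 0 on success, -1 if CLFLUSH is unavailable at compile time. */
int flush_range(const void *p, size_t len, size_t line)
{
#if defined(__SSE2__)
    if (line == 0)
        return -1;
    /* Round the start address down to a line boundary. */
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(line - 1);
    uintptr_t end = (uintptr_t)p + len;
    for (; addr < end; addr += line)
        _mm_clflush((const void *)addr);
    _mm_mfence();   /* order the flushes with later loads and stores */
    return 0;
#else
    (void)p; (void)len; (void)line;
    return -1;
#endif
}
```

Flushing does not change the contents of memory: modified lines are written back before they are invalidated, so the data reads the same afterwards, only more slowly on the first access.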

There also appear to be two more instructions, wbinvd and invd, which invalidate the entire cache rather than a single cache line.

Fortunately, there is more than one way to explicitly flush the caches.

The instruction “wbinvd” writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, it is a privileged instruction. But if it is possible to run the test program under something like DOS, this is the way to go. This has the advantage of keeping the cache footprint of the “OS” very small.

Additionally, there is the “invd” instruction, which invalidates caches without flushing them back to main memory. This violates the coherency of main memory and cache, so you have to take care of that by yourself. Not really recommended.

For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked with WC (write combining) instead of WB. The memory mapped region of the graphics card is a good candidate, or you can mark a region as WC by yourself via the MTRR registers.

You can find some resources about benchmarking short routines at Test programs for measuring clock cycles and performance monitoring.[1]

For a discussion of these instructions, see [4].
For a more complete treatment, see [3].

References

stack overflow: