laike9m's blog

DEC 23RD, 2020

Why Is GIL Worse Than We Thought?

以前每当看到有人抱怨 GIL（Global Interpreter Lock），我总会告诉他们不用慌，各种场景都有对应的解决方案，比如主 IO 操作用 async，主 CPU 操作用多进程。我也一直认为，Python 的慢主要慢在“纯”执行速度，而 GIL 只不过是一个瑕疵。

然而最近我意识到，GIL 是一个比想象中严重得多的问题，因为它阻碍了程序的按需并行。

什么是“按需并行”？这个词是我造的，用来描述编程中的一种常见 pattern，即把最耗时的那部分操作并行化，而程序整体仍保持单线程。通常来讲，耗时的部分往往是在遍历一个巨大的列表，并对列表中的元素做某种操作。而并行化也非常简单，只要开多个线程分别处理列表的一部分就行了。

这里我们只讨论 CPU 密集的情况。由于 GIL 的存在，开多个线程并不会让程序跑得更快（如果不是更慢的话），因此我们必须用到多进程。那么多进程是不是就能解决问题呢？并不总是，有一系列难点：

进程不共享内存，计算的输入必须被传到每个工作进程里，比如列表中的元素；
能被传递的东西必须 picklable，而有相当多的东西是 unpicklable 的；
如果后续程序执行需要并行计算的输出，那么这些输出也得 picklable；
Pickle -> unpickle 操作带来了额外的性能开销。

这样一来，多进程的应用范围就大大减小了。比如我最近在 Cyberbrain 中遇到的一个问题，其中一段代码是这样的：

for event in frame.events:   
  frame_proto.events.append(_transform_event_to_proto(event))
    event_ids = _get_event_sources_uids(event, frame)
    if event_ids:
      frame_proto.tracing_result[event.uid].event_ids[:] = event_ids

这段代码遍历 frame.events，处理之后更新 frame_proto。events 数量很大，导致这部分代码成为了性能瓶颈，因此我想把它并行化。然后我就发现这是一个不可能完成的任务，为什么呢？因为 protocol buffer 对象不 pickable。这意味着，我既不能把 frame_proto 传进每个进程，也不能把 _transform_event_to_proto(event) 的结果传出来，因为它们都是 protocol buffer 对象。如果是 C++ 或者 Java，这里直接多线程就解决了（每个线程分别更新 frame_proto）。

总结一下：

GIL 让在大部分语言里可以用多线程解决的事必须要用多进程解决。
多进程的诸多限制让它无法无缝替代多线程。即使在能替代的场景，也要做很多额外工作，以及承担序列化和反序列化带来的性能开销。

之前我们探讨了 GIL 对“并行”的阻碍，下面聊聊 GIL 对“按需”的阻碍。这是更本质的问题，却极少被人注意到。我们都知道，过早的优化是万恶之源。除了明显需要优化的场景（比如避免数据库 N+1），一般而言都是先实现，再 profile，最后优化。换句话说，类似“一个循环成了性能瓶颈”这种发现，写代码的时候一般是不知道的。假设你的程序写完了，然后需要优化某一部分，你当然希望能够不动其它代码，只修改瓶颈部分即可。这种优化的场景就非常适合多线程——因为变量是共享的，所以程序的整体完全不用动。而一旦涉及多进程，则往往需要对程序进行更大程度的修改，甚至重新设计整个架构。这样一来，“按需”优化就不存在了。这不仅导致优化困难，更给项目管理带来了不确定性，甚至可能导致延期或性能不达标。

那么，PEP 554 - 多解释器是不是救世主呢？显然也不是。多解释器说白了就是 goroutine 的 Python 实现，问题是它限制了 channel 能传递的变量类型，quote:

Along those same lines, we will initially restrict the types that may be passed through channels to the following:

None

bytes

str

int

channels

所以，多解释器虽然是好的，但恐怕还是不能解决“按需并行”的问题。

注：Python 里多进程可以共享内存，然而能共享的变量类型同样有限，具体可参考：multiprocessing.shared_memory

Update：发现一个遇到了类似问题的哥们儿，以及我的回复

NOV 30TH, 2020

Stars Don’t Make Me Happy. Feedbacks Do.

Like many other programmers, I used to get hyped when somebody stars my project on GitHub. However, I find myself becoming less and less excited about it. It's not to say that I hate it. Nobody hates stars, and I'm still happy to see the number grows. Yet, I found something that values more to me: user feedbacks.

So what are user feedbacks? Almost anything you can think of counts: issues, comments, questions, suggestions, PRs, articles, usage, etc. Feedbacks show that people are actually using your project, and it can help you improve. A negative feedback is way better than no feedback, because it tells you what you should work on next.

Feedbacks usually comes as a natural result of exposure. More people know it thus more people use it. But sometimes, you got a bunch of stars, yet nobody gives you any feedback. That's the situation Cyberbrain is currently facing, which bothers me a lot. Tian said he has a similar feeling for VizTracer, also xintao for iRedis. Why is it bothering? Because it leaves you in a clueless state, and you keep doubting yourself: am I doing good? Am I doing bad? Why do people seem to be interested but don't really use it? Do I just sit and wait, or should I reach out to some of the people, but whom? What should I work on next, are the planned features what users need the most?

I don't have an answer, and I hope it's because people are already satisfied or shocked by Cyberbrain. Luckily, I did receive some feedback from my friends, and I want to say thank you to all of you. It matters a lot to me.

JUN 10TH, 2020

Python 真的是靠一个 switch 来执行字节码的吗？

分支预测
优化技巧
实现
- 编译中判断要不要开启 computed goto
- ceval.c 中的实现

如果你读任何 (C)Python 源码分析的书或者文章，里面都会讲 Python 的 ceval.c 中有一个大 switch，会根据不同的 opcode 跳到相应的 case 去执行字节码。甚至 Anthony Shaw 的新书《CPython Internals》也是这么讲的。

然而实际上，CPython （在大部分情况下）早就不这么做了。这篇文章会解释为什么。下文中 Python 均指 CPython。

分支预测

为了解释清楚整个过程，我们必须从底层讲起。分支预测（Branch Prediction）是 CPU 必备的功能。准确的分支预测能大大提高指令流水线的性能，反之则导致性能损耗。

Python 刚发明的时候，每一个 bytecode 的确需要进行一次 switch 跳转。而 C 语言的 switch 语句会导致分支预测不准，因为 CPU 无法预测程序会跳到哪个 case（下一个指令）。如果不进行优化，每执行一条字节码指令都将导致一次分支预测错误，可以想象对性能的损耗有多大。

优化技巧

问题的关键就在于 switch...case 的跳转是个一对多的情况。想象你面对一道有十个选项的选择题，你不会知道哪个是正确答案。然而如果有十道题各一个选项，答案就都确定了。Python 采用的优化业无非也就是这么回事。

Python 用了一种叫做 computed goto 的技巧，依赖于编译器的 label as value 功能。详细原理大家就看链接里的解释吧。总之效果就是，可以把解释器主循环中的 switch...case 替换成 while loop，在每个 Python 的指令处理完之后，显示地用 goto 跳转到下一条指令。大概长这样：

int interp_cgoto(unsigned char* code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes
    */
    static void* dispatch_table[] = {&&do_op1, &&do_op2, &&do_op3};
    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0;
    int val = initval;

    DISPATCH();
    while (1) {
        do_op1:
            return val;
        do_op2:
            val++;
            DISPATCH();
        do_op3:
            val--;
            DISPATCH();
    }
}

每个 goto 都是一次跳转，并且只有一个目标（下一条指令），这就让分支预测变得容易，从而性能也获得了极大提升（根据 Python 的注释，提升可达 15% ~ 20%）。

实现

下面讲一下 Python 是如何实现上面提到的优化的。

编译中判断要不要开启 computed goto

这部分。。非常蛋疼。我花了不少时间试图弄清楚，最后还是放弃了。这个回答里记录了我的一些发现。总之大概的意思就是，Python 允许用户通过 --with-computed-gotos 和 --without-computed-gotos 参数在编译的时候控制要不要开启 computed goto。默认是开启的。同时编译期间还会判断当前使用的编译器支不支持 computed goto（GCC, Clang 支持，MSVC 不支持）。如果两个条件都满足，在 ceval.c 中就会启用 computed goto。

ceval.c 中的实现

首先，label 是在 opcode_targets.h 中定义，并被 include 到 ceval.c 的。

这一段定义了 computed goto enabled 和 disabled 时的跳转行为，我把简化过的代码贴在下面：

#if USE_COMPUTED_GOTOS
/* Import the static jump table */
#include "opcode_targets.h"

#define TARGET(op) op: TARGET_##op

#define DISPATCH() { \
    _Py_CODEUNIT word = *next_instr; \
    opcode = _Py_OPCODE(word); \
    oparg = _Py_OPARG(word); \
    next_instr++; \
    goto *opcode_targets[opcode]; \
}

#else

#define TARGET(op) op
#define DISPATCH() continue

#endif

嗯，简化了相当多的东西，但这能让我们更清晰地看出只有在 USE_COMPUTED_GOTOS 的情况下，DISPATCH() 才会用 goto 去跳到下一个 opcode 的 label。

下面是极简版的主循环代码：

for (;;) {
  switch (opcode) {
      case TARGET(LOAD_CONST): {
        // 处理 LOAD_CONST
        DISPATCH();
      }
      // 其它 opcode 省略了
  }
}

先看 computed goto 未启用时，把代码中的宏展开（替换 TARGET 和 DISPATCH）的结果：

for (;;) {
  switch (opcode) {
      case LOAD_CONST: {
        // 处理 LOAD_CONST
        continue;
      }
  }
}

嗯，就是这样了，很直接了当的 switch...case。

再看 computed goto 启用时，把代码中的宏展开的结果：

for (;;) {
  switch (opcode) {
    case LOAD_CONST:
    TARGET_LOAD_CONST: { // 注意这里多了一个 label
      // 处理 LOAD_CONST
      _Py_CODEUNIT word = *next_instr;
      opcode = _Py_OPCODE(word);
      oparg = _Py_OPARG(word);
      next_instr++;
      goto *opcode_targets[opcode];
    }
  }
}

可以看到，在处理完 LOAD_CONST 之后，通过 next_instr 拿到下一个 opcode，然后执行goto *opcode_targets[opcode]; ，跳到另一个由 opname 命名的 label（参见 opcode_targets.h ），比如 TARGET_STORE_NAME:。这里的关键点在于，代码直接跳转到了下一个指令对应的 label，并不经过 switch。这也就回答了文章一开始的问题：

Python 真的是靠一个 switch 来执行字节码的吗？

答案是：

只要 Python 启用了 computed goto （比如在 Mac 和 Linux 上），字节码的执行就不依赖 switch。

顺便一提，这个功能是在 Python 3.2 中变为默认开启的：

Computed gotos are now enabled by default on supported compilers (which are detected by the configure script). They can still be disabled selectively by specifying --without-computed-gotos.

← Newer Blog Archives Older →