Requests' secret: pool_connections and pool_maxsize

Requests is one of the best-known, if not the best-known, third-party Python libraries. With its simple API and high performance, people tend to use requests rather than the standard library's urllib2 for HTTP requests. However, people who use requests every day may not know its internals, so today I want to explain two concepts: pool_connections and pool_maxsize.

Let's start with Session:

import requests

s = requests.Session()
s.get('https://www.google.com')

It's pretty simple. You probably know that requests' Session persists cookies. Cool. But do you know that Session has a mount method?

mount(prefix, adapter)
Registers a connection adapter to a prefix.
Adapters are sorted in descending order by key length.

No? Well, in fact you've already used this method: it's called whenever you initialize a Session object:

class Session(SessionRedirectMixin):

    def __init__(self):
        ...
        # Default connection adapters.
        self.adapters = OrderedDict()
        self.mount('https://', HTTPAdapter())
        self.mount('http://', HTTPAdapter())

Now comes the interesting part. If you've read Ian Cordasco's article Retries in Requests, you should know that HTTPAdapter can be used to provide retry functionality. But what is an HTTPAdapter, really? Quoting from the documentation:

class requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False)

The built-in HTTP Adapter for urllib3.

Provides a general-case interface for Requests sessions to contact HTTP and HTTPS urls by implementing the Transport Adapter interface. This class will usually be created by the Session class under the covers.

Parameters:

* pool_connections – The number of urllib3 connection pools to cache.
* pool_maxsize – The maximum number of connections to save in the pool.
* max_retries (int) – The maximum number of retries each connection should attempt. Note, this applies only to failed DNS lookups, socket connections and connection timeouts, never to requests where data has made it to the server. By default, Requests does not retry failed connections. If you need granular control over the conditions under which we retry a request, import urllib3's Retry class and pass that instead.
* pool_block – Whether the connection pool should block for connections.

Usage:

>>> import requests
>>> s = requests.Session()
>>> a = requests.adapters.HTTPAdapter(max_retries=3)
>>> s.mount('http://', a)

If the above documentation confuses you, here's my explanation: what HTTPAdapter does is simply provide a different configuration for each request, depending on the target URL. Remember the code above?

self.mount('https://', HTTPAdapter())
self.mount('http://', HTTPAdapter())

It creates two HTTPAdapter objects with the default arguments pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False and mounts them to https:// and http:// respectively, which means the configuration of the first HTTPAdapter() is used for requests to https://xxx, and the second HTTPAdapter() for requests to http://xxx. Though in this case the two configurations are identical, requests to http and https URLs are still handled separately. We'll see what that means later.
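If you want to check which adapter a given URL will be handled by, the session can tell you. Here's a minimal sketch (example.com is just a placeholder host):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
custom = HTTPAdapter(pool_connections=5)
s.mount('https://example.com', custom)

# get_adapter returns the adapter whose prefix matches the URL
print(s.get_adapter('https://example.com/foo') is custom)  # True
print(s.get_adapter('https://www.google.com') is custom)   # False, handled by the default 'https://' adapter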

As I said, the main purpose of this article is to explain pool_connections and pool_maxsize.

First let's look at pool_connections. Yesterday I asked a question on Stack Overflow because I wasn't sure whether my understanding was correct; the answer removed my uncertainty. HTTP, as we all know, is built on top of TCP. An HTTP connection is also a TCP connection, which is identified by a tuple of five values:

(<protocol>, <src addr>, <src port>, <dest addr>, <dest port>)

Say you've established an HTTP (TCP) connection to www.example.com. Assuming the server supports Keep-Alive, the next time you send a request to www.example.com/a or www.example.com/b, you can simply use the same connection, since none of the five values change. In fact, requests' Session does this automatically for you and reuses connections as long as it can.

The question is: what determines whether you can reuse an old connection? Yes, pool_connections!

pool_connections – The number of urllib3 connection pools to cache.

I know, I know, I don't want to introduce yet more terminology either; this is the last one, I promise. For easy understanding: one connection pool corresponds to one host. That's what it is.
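To make "one pool per host" concrete, here's a minimal sketch against urllib3's PoolManager, which is what HTTPAdapter uses under the hood (I import the standalone urllib3 purely for illustration; requests ships its own bundled copy):

from urllib3 import PoolManager

pm = PoolManager(num_pools=10)  # num_pools plays the role of pool_connections
baidu = pm.connection_from_host('www.baidu.com', port=443, scheme='https')
zhihu = pm.connection_from_host('www.zhihu.com', port=443, scheme='https')
print(baidu is zhihu)  # False: different hosts get different pools
print(baidu is pm.connection_from_host('www.baidu.com', port=443, scheme='https'))  # True: same host, same pool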

Here's an example (log output trimmed to the relevant lines):

import logging
import requests
from requests.adapters import HTTPAdapter

logging.basicConfig(level=logging.DEBUG)  # make urllib3's connection logs visible
s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')

"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2621
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""

HTTPAdapter(pool_connections=1) is mounted to https://, which means only one connection pool persists at a time. After calling s.get('https://www.baidu.com'), the cached connection pool is connectionpool('https://www.baidu.com'). Then s.get('https://www.zhihu.com') comes along, and the session finds that it cannot use the previously cached connection because it's not the same host (one connection pool corresponds to one host, remember?). The session therefore had to create a new connection pool, or connection if you like. Since pool_connections=1, the session cannot hold two connection pools at the same time, so it abandoned the old one, connectionpool('https://www.baidu.com'), and kept the new one, connectionpool('https://www.zhihu.com'). The next get is the same. This is why we see three "Starting new HTTPS connection" lines in the log.
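The same eviction can be reproduced at the urllib3 level; a minimal sketch:

from urllib3 import PoolManager

pm = PoolManager(num_pools=1)  # like pool_connections=1
old = pm.connection_from_host('www.baidu.com', port=443, scheme='https')
pm.connection_from_host('www.zhihu.com', port=443, scheme='https')  # evicts the baidu pool
new = pm.connection_from_host('www.baidu.com', port=443, scheme='https')
print(old is new)  # False: a brand-new pool replaced the discarded one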

What if we set pool_connections to 2:

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=2))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""

Great: now we only establish a connection twice, saving the time of one connection setup.

Finally, pool_maxsize.

First and foremost, you only need to care about pool_maxsize if you use a Session in a multithreaded environment, e.g. when making concurrent requests from multiple threads using the same Session.

Actually, pool_maxsize is an argument for initializing urllib3's HTTPConnectionPool, which is exactly the connection pool we mentioned above. HTTPConnectionPool is a container for a collection of connections to a specific host, and pool_maxsize is the number of reusable connections it will save. If you're running your code in a single thread, it is neither possible nor necessary to create multiple connections to the same host, because requests is blocking: HTTP requests are always sent one after another.
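To see the relationship, here's a minimal sketch that uses urllib3's HTTPConnectionPool directly (httpbin.org is just a placeholder host; requests normally creates these pools for you):

from urllib3 import HTTPConnectionPool

# one pool for one host, keeping up to two reusable connections around
pool = HTTPConnectionPool('httpbin.org', maxsize=2)
r = pool.request('GET', '/get')
print(r.status)              # should be 200 if the host is reachable
print(pool.num_connections)  # connections opened so far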

Things are different if there are multiple threads.

from threading import Thread

import requests
from requests.adapters import HTTPAdapter

def thread_get(url):
    s.get(url)

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start(); t2.start()
t1.join(); t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start(); t4.start()
t3.join(); t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
"""

See? It established two connections for the same host www.zhihu.com; as I said, this can only happen in a multithreaded environment. In this case we created a connection pool with pool_maxsize=2, and since there are never more than two connections at the same time, that's enough. We can see that the requests from t3 and t4 did not create new connections; they reused the old ones.

What if the pool size isn't enough?

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start()
t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (3): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
"""

Now, with pool_maxsize=1, the warning came as expected:

Connection pool is full, discarding connection: www.zhihu.com

We can also notice that, since only one connection can be saved in this pool, a new connection is created again for t3 or t4. Obviously this is very inefficient. That's why urllib3's documentation says:

If you’re planning on using such a pool in a multithreaded environment, you should set the maxsize of the pool to a higher number, such as the number of threads.
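For instance, a minimal sketch of that advice, assuming Python 3's concurrent.futures (the question URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor

import requests
from requests.adapters import HTTPAdapter

NUM_THREADS = 10
s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=NUM_THREADS))

# placeholder URLs, all on the same host so they share one pool
urls = ['https://www.zhihu.com/question/%d' % n for n in range(10)]
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(s.get, urls)
    # at most NUM_THREADS connections exist at once, and all of them can be saved for reuse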

Last but not least, HTTPAdapter instances mounted to different prefixes are independent.

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
s.mount('https://baidu.com', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start();t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57669
"""

The above code is easy to understand, so I won't explain it.

I guess that's all. I hope this article helps you understand requests better. BTW, I created a gist here which contains all of the testing code used in this article. Just download it and play with it :)

Appendix

  1. For https, requests uses urllib3's HTTPSConnectionPool, but it's pretty much the same as HTTPConnectionPool, so I didn't differentiate them in this article.
  2. Session's mount method ensures the longest prefix gets matched first. Its implementation is pretty interesting so I posted it here.

    def mount(self, prefix, adapter):
        """Registers a connection adapter to a prefix.

        Adapters are sorted in descending order by key length.
        """
        self.adapters[prefix] = adapter
        keys_to_move = [k for k in self.adapters if len(k) < len(prefix)]
        for key in keys_to_move:
            self.adapters[key] = self.adapters.pop(key)
    

    Note that self.adapters is an OrderedDict.
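    A quick way to observe this ordering (example.com is just a placeholder):

    s = requests.Session()
    s.mount('https://example.com', HTTPAdapter())
    print(list(s.adapters))
    # ['https://example.com', 'https://', 'http://']: longest prefix first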

  3. A more advanced config option pool_block
    Notice that when creating an adapter, we didn't touch the last parameter, pool_block (HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False)). Setting it to True is equivalent to setting block=True when creating a urllib3.connectionpool.HTTPConnectionPool. The effect, quoted from urllib3's doc:

    If set to True, no more than maxsize connections will be used at a time. When no free connections are available, the call will block until a connection has been released. This is a useful side effect for particular multithreaded situations where one does not want to use more than maxsize connections per host to prevent flooding.

    But actually, requests doesn't set a timeout (https://github.com/shazow/urllib3/issues/1101), which means that if there's no free connection, an EmptyPoolError is raised immediately.
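    If you want the blocking behavior with a bound on the wait, one option is to drop down to urllib3 and pass pool_timeout yourself; a minimal sketch (httpbin.org is a placeholder host):

    from urllib3 import HTTPConnectionPool
    from urllib3.exceptions import EmptyPoolError

    pool = HTTPConnectionPool('httpbin.org', maxsize=2, block=True)
    try:
        # wait up to 5 seconds for a free connection before giving up
        r = pool.urlopen('GET', '/get', pool_timeout=5)
    except EmptyPoolError:
        print('no free connection became available in time')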

  4. Just realized why I was feeling woozy all day: it turns out I took the black (nighttime) pill of Bai Jia Hei (a Chinese day/night cold medicine) during the day _(:3」∠)_

Japanese Singers I Like

Today is the first day of 2016. I just realized that I still like to call my posts "journal entries" (日志), a habit left over from years of using Renren. That got me thinking about how Renren declined even further in 2015. I still log in occasionally, but there aren't many active people left to see. If I hadn't added some people I don't know in real life, there would be basically nothing to read. As one of the few true social networks China has had, it's sobering to see Renren end up like this.

Today I'd like to introduce some Japanese singers I personally like. I've barely written anything music-related (apart from that copyright post), and you can't tell from my daily life, but I actually listen to a lot of music.

First up: The Pillows.

The Pillows' music deserves much of the credit for how good FLCL is. I don't know how many times I've listened to Little Busters. I like almost every song of theirs; the style really suits my taste.

Next is nano.RIPE, whom many people probably know.

I got to know them through Hanasaku Iroha. At first I thought the lead singer and Ohana's voice actress were the same person; it was only after watching that joint P.A. Works concert that I found out they weren't. The lead singer Kimiko's (きみコ) voice is simply wonderful. By now they've sung quite a few anime OPs/EDs; I suppose the label thinks this kind of voice suits them.

Then there's Tomoko Kawase (川濑智子).

To be precise, the one above is Tommy heavenly6. She's an alter-ego artist who performs as two separate personas, which is quite interesting. Someone on Xiami commented that she "totally beats Avril Lavigne". 《Hey my friend》 does have a bit of that feel; at the very least I don't think it loses to any of Avril's singles, and her English is so good you can hardly tell she's Japanese. Honestly, though, I feel the number of her songs I really love is still a bit small.

Last is Saori Atsumi (渥美佐织).

A singer whose voice and style I both love; in one word, refreshing. The average quality of her songs is extremely high. My personal favorite is Genshiken's ED, 《びいだま》.

Trying out HTML5 Audio here. I discovered that GitHub's markdown rendering simply drops tags it doesn't recognize... so I had to add them into the HTML manually.

《LITTLE BUSTERS》- The Pillows
《絵空事》- nano.RIPE
《Hey my friend》- Tomoko Kawase
《びいだま》- Saori Atsumi

Finally, happy New Year, everyone!

Fixing a Mac That Can't Compile Libraries with C Extensions

I originally switched to a Mac for painless development, and since then I've run into plenty of strange problems big and small, but none as strange as this one. Since I may hit it again later, I'm writing it down.

The symptom was that I couldn't install any Python library with C extensions; compilation failed with <stdio.h> not found. The system is OS X 10.11 El Capitan.

Searching online, I learned that on a Mac stdio.h should live in /usr/include. The funny thing is, the include directory had disappeared! Searching further, I found many people had hit this problem, presumably a bug introduced by an Xcode upgrade, and basically everyone said to install the command line tools via xcode-select --install. The problem is, I had installed them long ago.

I finally found a page suggesting that you can manually symlink the directory from inside Xcode. I tried running:

sudo ln -s /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include /usr/include

which printed:

ln: /usr/include: Operation not permitted

Searching revealed the cause: Macs enable System Integrity Protection by default, which, simply put, makes certain system directories read-only. The fix, found here, is:

  1. Reboot the machine
  2. Hold cmd + r during startup to enter Recovery Mode
  3. Choose Utilities > Terminal and run:
csrutil disable
reboot
  4. Run the ln command again; this time it succeeds.

Then I tried installing again, taking greenlet as an example:

$ pip install greenlet
...
Building wheels for collected packages: greenlet
  Running setup.py bdist_wheel for greenlet
  Complete output from command /usr/local/opt/python3/bin/python3.5 -c "import setuptools;__file__='/private/var/folders/sy/msgzx60s2_332s1wdb92fqw80000gn/T/pip-build-amdm1ven/greenlet/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /var/folders/sy/msgzx60s2_332s1wdb92fqw80000gn/T/tmp4a1nacfbpip-wheel-:
  running bdist_wheel
  running build
  running build_ext
  building 'greenlet' extension
  creating build
  creating build/temp.macosx-10.10-x86_64-3.5
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c greenlet.c -o build/temp.macosx-10.10-x86_64-3.5/greenlet.o
  Assertion failed: (!contains(New, this) && "this->replaceAllUsesWith(expr(this)) is NOT valid!"), function replaceAllUsesWith, file /Users/laike9m/Dev/C_CPP/Lib/cling/src/lib/IR/Value.cpp, line 343.
 ...

    ********************
    error: command 'clang' failed with exit status 254

    ----------------------------------------
Command "/usr/local/opt/python3/bin/python3.5 -c "import setuptools, tokenize;__file__='/private/var/folders/sy/msgzx60s2_332s1wdb92fqw80000gn/T/pip-build-amdm1ven/greenlet/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/sy/msgzx60s2_332s1wdb92fqw80000gn/T/pip-l6qwwxng-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/sy/msgzx60s2_332s1wdb92fqw80000gn/T/pip-build-amdm1ven/greenlet 

I didn't believe greenlet's code was at fault, so I suspected clang and decided to compile with gcc instead. Trying export CC=gcc didn't help; clang was still being used and reported the same error. Once again, I found the explanation on Stack Overflow:

It has been this way for a long time already. The "GCC" that came with 10.8 was really GCC front-end with LLVM back-end.

The solution is to install gcc via Homebrew, which won't conflict with the system gcc.

$ brew install gcc49

Then:

export CC=/usr/local/Cellar/gcc49/4.9.3/bin/gcc-4.9

With that, the installation finally succeeded.

2017-2-9
Recently I've found that sometimes the installation only succeeds inside a virtualenv...

