HAProxy Performance: x86_64 vs ARM64

Translator: wangxiyuan
Author: Martin Grigorov
Original: https://medium.com/@martin.grigorov/compare-haproxy-performance-on-x86-64-and-arm64-cpu-architectures-bfd55d1d5566

This article is a performance test report on the latest HAProxy release, written by Apache Tomcat PMC member Martin Grigorov.

HAProxy 2.2 was released a few days ago, so I decided to run load tests against it on my x86_64 and aarch64 VMs:

  • x86_64
Architecture:                    x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 42 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 4 MiB
L3 cache: 30.3 MiB
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities
  • aarch64
Architecture:                    aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: 0x48
Model: 0
Stepping: 0x1
CPU max MHz: 2400.0000
CPU min MHz: 2400.0000
BogoMIPS: 200.00
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

Note: I made the VMs as close as possible in their hardware capabilities, i.e. the same type and amount of RAM, the same disks, network cards and bandwidth. The CPUs are also as similar as possible, but some differences remain:

  • CPU frequency: 3000 MHz (x86_64) vs 2400 MHz (aarch64)
  • BogoMIPS: 6000 (x86_64) vs 200 (aarch64)
  • L1 caches: 128 KiB (x86_64) vs 512 KiB (aarch64)

Both VMs run Ubuntu 20.04 with the latest software updates.

HAProxy is built from the source code of the master branch, which at this point differs very little from the HAProxy 2.2 release.
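For reference, here is a rough reconstruction of the build, inferred from the haproxy -vv output below; the exact make invocation is not shown in the article, and the DeviceAtlas/51Degrees/WURFL dummy libraries are omitted from this sketch:

# assumptions: clang-9 and the OpenSSL/PCRE/Lua/zlib development packages are installed
# (Lua may additionally need LUA_INC/LUA_LIB pointing at the Lua 5.3 headers and libs)
git clone https://github.com/haproxy/haproxy.git && cd haproxy
make -j"$(nproc)" TARGET=linux-glibc CC=clang-9 \
     USE_PCRE=1 USE_PCRE_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_SYSTEMD=1
sudo make install
./haproxy -vv   # prints the build information shown below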

HA-Proxy version 2.3-dev0 2020/07/07 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 5.4.0-40-generic #44-Ubuntu SMP Mon Jun 22 23:59:48 UTC 2020 aarch64
Build options :
TARGET = linux-glibc
CPU = generic
CC = clang-9
CFLAGS = -O2 -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-string-plus-int -Wtype-limits -Wshift-negative-value -Wnull-dereference -Werror
OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_DEVICEATLAS=1 USE_51DEGREES=1 USE_WURFL=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT +DEVICEATLAS +51DEGREES +WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1f 31 Mar 2020
Running on OpenSSL version : OpenSSL 1.1.1f 31 Mar 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with DeviceAtlas support (dummy library only).
Built with 51Degrees Pattern support (dummy library).
Built with WURFL support (dummy library version 1.11.2.100)
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with clang compiler version 9.0.1

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
fcgi : mode=HTTP side=BE mux=FCGI
<default> : mode=HTTP side=FE|BE mux=H1
h2 : mode=HTTP side=FE|BE mux=H2
<default> : mode=TCP side=FE|BE mux=PASS

Available services : none

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
[CACHE] cache
[FCGI] fcgi-app

I have tried to tune it as much as possible by following all the best practices I could find in the official documentation and on the web.

The HAProxy configuration is:

global
    log stdout format raw local0 err
    # nbproc 8
    nbthread 32
    cpu-map 1/all 0-7
    tune.ssl.default-dh-param 2048
    tune.ssl.capture-cipherlist-size 1
    ssl-server-verify none
    maxconn 32748
    daemon

defaults
    timeout client 60s
    timeout client-fin 1s
    timeout server 30s
    timeout server-fin 1s
    timeout connect 10s
    timeout http-request 10s
    timeout http-keep-alive 10s
    timeout queue 10m
    timeout check 10s
    mode http
    log global
    option dontlog-normal
    option httplog
    option dontlognull
    option http-use-htx
    option http-server-close
    option http-buffer-request
    option redispatch
    retries 3000

frontend test_fe
    bind :::8080
    #bind :::8080 ssl crt /home/ubuntu/tests/tls/server.pem
    default_backend test_be

backend test_be
    #balance roundrobin
    balance leastconn
    #balance random(2)
    server go1 127.0.0.1:8081 no-check
    server go2 127.0.0.1:8082 no-check
    server go3 127.0.0.1:8083 no-check
    server go4 127.0.0.1:8084 no-check

This way HAProxy is used as a load balancer in front of four HTTP servers.

To also use it as a TLS terminator, just comment out the plain bind :::8080 line in the frontend and uncomment the ssl bind line below it.
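The article does not show how server.pem was created; one possible way to produce a self-signed certificate in the combined format HAProxy expects (certificate followed by private key), good enough for load testing, is:

openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -subj "/CN=haproxy-test" -keyout server.key -out server.crt
cat server.crt server.key > /home/ubuntu/tests/tls/server.pem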

The best results were achieved with the multithreaded setup. As the documentation says, this is the recommended setup anyway, and it also gave me almost twice the throughput. The best results came with 32 threads: throughput kept increasing when going from 8 to 16 and from 16 to 32 threads, but started to drop with 64 threads.
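A sketch of how such a thread-count sweep could be scripted; the config path, the way HAProxy is restarted and the wrk target are my assumptions, not details from the article:

for n in 8 16 32 64; do
    sed -i "s/^\( *nbthread\).*/\1 $n/" haproxy.cfg    # adjust the thread count
    sudo systemctl restart haproxy                     # or restart HAProxy however it is run
    wrk -t8 -c96 -d30s http://192.168.0.206:8080/ | grep 'Requests/sec'
done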

I have also pinned the threads so that each one stays on the same CPU for its lifetime, using cpu-map 1/all 0-7.

The other important setting is the algorithm used to balance between the backends. Just like in Willy Tarreau's tests, leastconn gave me the best performance.

As recommended in the HAProxy Enterprise documentation, I have disabled irqbalance.
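Assuming irqbalance runs as the stock systemd service on Ubuntu 20.04, disabling it amounts to:

sudo systemctl stop irqbalance
sudo systemctl disable irqbalance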

Finally, I applied the following kernel settings:

sudo sysctl -w net.ipv4.ip_local_port_range="1024 65024"
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=100000
sudo sysctl -w net.core.netdev_max_backlog=100000
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
sudo sysctl -w fs.file-max=500000
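These sysctl -w calls only last until the next reboot; a possible way to make them persistent (my addition, not part of the original setup) is to drop them into /etc/sysctl.d and reload:

sudo tee /etc/sysctl.d/99-haproxy-bench.conf <<'EOF'
net.ipv4.ip_local_port_range = 1024 65024
net.ipv4.tcp_max_syn_backlog = 100000
net.core.netdev_max_backlog = 100000
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 500000
EOF
sudo sysctl --system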

fs.file-max also goes together with a change in /etc/security/limits.conf:

root soft nofile 500000
root hard nofile 500000
* soft nofile 500000
* hard nofile 500000

For the backends I used very simple HTTP servers written in Go. They just write "Hello World" back to the client, without reading from or writing to disk or the network:

package main

// run with: env PORT=8081 go run http-server.go

import (
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    port := os.Getenv("PORT")
    if port == "" {
        log.Fatal("Please specify the HTTP port as environment variable, e.g. env PORT=8081 go run http-server.go")
    }

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello World")
    })

    log.Fatal(http.ListenAndServe(":"+port, nil))
}
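To match the four server entries in the backend section above, the instances could be started along these lines (my sketch; the article only shows the single-instance run command):

for p in 8081 8082 8083 8084; do
    env PORT=$p go run http-server.go &
done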

As the load-testing client I used wrk, with the same setup as for testing Apache Tomcat.
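Judging from the output below (8 threads, 96 connections, 30 seconds), the wrk invocation was presumably along these lines; the exact command from the Tomcat article is not repeated here:

wrk -t8 -c96 -d30s http://192.168.0.206:8080/      # HTTP, against the x86_64 VM
wrk -t8 -c96 -d30s https://192.168.0.232:8080/     # HTTPS, against the aarch64 VM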

Here are the results:

  • aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.67ms 8.82ms 196.74ms 89.85%
Req/Sec 2.60k 337.06 5.79k 75.79%
621350 requests in 30.09s, 75.85MB read
Requests/sec: 20651.69
Transfer/sec: 2.52MB
  • x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.32ms 4.46ms 75.42ms 94.58%
Req/Sec 4.71k 538.41 8.84k 82.41%
1127664 requests in 30.10s, 137.65MB read
Requests/sec: 37464.85
Transfer/sec: 4.57MB
  • aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.92ms 12.50ms 248.52ms 91.18%
Req/Sec 2.42k 338.67 4.34k 80.88%
578210 requests in 30.08s, 70.58MB read
Requests/sec: 19220.81
Transfer/sec: 2.35MB
  • x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.56ms 4.83ms 111.51ms 94.25%
Req/Sec 4.46k 609.37 7.23k 85.60%
1066831 requests in 30.07s, 130.23MB read
Requests/sec: 35474.26
Transfer/sec: 4.33MB

What we can see here:

  • HAProxy is almost twice as fast on the x86_64 VM as on the aarch64 VM.
  • TLS offloading reduces the throughput by around 5-8%.

Update 1 (Jul 10 2020): To check whether the Go-based HTTP servers were the bottleneck in the tests above, I decided to run the same wrk load test directly against one of the backends, i.e. skipping HAProxy.

  • aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 615.31us 586.70us 22.44ms 90.61%
Req/Sec 20.05k 1.57k 42.29k 73.62%
4794299 requests in 30.09s, 585.24MB read
Requests/sec: 159319.75
Transfer/sec: 19.45MB
  • x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 774.24us 484.99us 36.43ms 97.04%
Req/Sec 15.28k 413.04 16.89k 73.57%
3658911 requests in 30.10s, 446.64MB read
Requests/sec: 121561.40
Transfer/sec: 14.84MB

Here we see that the HTTP server running on aarch64 is around 30% faster than the one on x86_64!

The more important observation is that the throughput is several times better when no load balancer is used at all! I think the problem is in my setup: HAProxy and the four backend servers all run on the same VM, so they compete for resources. Next I plan to pin the Go servers to their own CPU cores and let HAProxy use only the remaining four cores. Stay tuned for an update!


Update 2 (Jul 10 2020):

To pin the processes to specific CPUs I will use numactl:

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16012 MB
node 0 free: 170 MB
node distances:
node 0
0: 10

I have pinned the Go HTTP servers with:

numactl --cpunodebind=0 --membind=0 --physcpubind=4 env PORT=8081 go run etc/haproxy/load/http-server.go

That is, this backend instance is pinned to CPU node 0 and to physical CPU 4. The other three backend servers are pinned to physical CPUs 5, 6 and 7 respectively.
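Based on that description, the remaining three instances would be started along these lines (my reconstruction; only the first command appears in the article):

numactl --cpunodebind=0 --membind=0 --physcpubind=5 env PORT=8082 go run etc/haproxy/load/http-server.go &
numactl --cpunodebind=0 --membind=0 --physcpubind=6 env PORT=8083 go run etc/haproxy/load/http-server.go &
numactl --cpunodebind=0 --membind=0 --physcpubind=7 env PORT=8084 go run etc/haproxy/load/http-server.go &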

I also changed the HAProxy configuration slightly:

nbthread 4
cpu-map 1/all 0-3

That is, HAProxy spawns 4 threads and they are pinned to physical CPUs 0-3.
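One possible way to double-check the pinning (not from the original article) is to list the CPU affinity of every HAProxy thread:

pid=$(pidof -s haproxy)               # -s: pick a single PID if several are running
for tid in /proc/$pid/task/*; do
    taskset -cp "${tid##*/}"          # affinity mask of each HAProxy thread
done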

With these changes the results stayed the same for aarch64:

Running 30s test @ https://192.168.0.232:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.44ms 2.11ms 36.48ms 88.36%
Req/Sec 4.98k 651.34 6.62k 74.40%
596102 requests in 30.10s, 72.77MB read
Requests/sec: 19804.19
Transfer/sec: 2.42MB

but dropped for x86_64:

Running 30s test @ https://192.168.0.206:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 767.40us 153.24us 19.07ms 97.72%
Req/Sec 5.21k 173.41 5.51k 63.46%
623911 requests in 30.10s, 76.16MB read
Requests/sec: 20727.89
Transfer/sec: 2.53MB

The same holds for HTTP (no TLS):

  • aarch64
Running 30s test @ http://192.168.0.232:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.40ms 2.16ms 36.55ms 88.08%
Req/Sec 5.55k 462.65 6.97k 69.85%
665269 requests in 30.10s, 81.21MB read
Requests/sec: 22102.12
Transfer/sec: 2.70MB
  • x86_64
Running 30s test @ http://192.168.0.206:8080                                                                                                                                                                   
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 726.01us 125.04us 6.42ms 93.95%
Req/Sec 5.51k 165.80 5.80k 57.24%
658777 requests in 30.10s, 80.42MB read
Requests/sec: 21886.50
Transfer/sec: 2.67MB

So now HAProxy is slightly faster on aarch64 than on x86_64, but still far slower than the "no load balancer" approach with its 120,000+ requests per second.


Update 3 (Jul 10 2020): After seeing how well the Go HTTP servers perform (120-160K req/s), and to simplify the setup, I decided to drop the CPU pinning from Update 2 and use the backends on the other VM: when wrk hits HAProxy on the aarch64 VM, it load-balances between the backends running on the x86_64 VM, and when wrk hits HAProxy on the x86_64 VM, it uses the Go HTTP servers running on the aarch64 VM. Here are the new results:

  • aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.33ms 4.93ms 76.85ms 89.14%
Req/Sec 2.10k 316.84 3.52k 74.50%
501840 requests in 30.07s, 61.26MB read
Requests/sec: 16688.53
Transfer/sec: 2.04MB
  • x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.32ms 6.71ms 71.29ms 90.25%
Req/Sec 3.26k 639.12 4.14k 65.52%
779297 requests in 30.08s, 95.13MB read
Requests/sec: 25908.50
Transfer/sec: 3.16MB
  • aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.17ms 5.41ms 292.21ms 91.08%
Req/Sec 2.13k 238.74 3.85k 86.32%
506111 requests in 30.09s, 61.78MB read
Requests/sec: 16821.60
Transfer/sec: 2.05MB
  • x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.40ms 2.54ms 58.66ms 97.27%
Req/Sec 3.82k 385.85 4.55k 92.10%
914329 requests in 30.10s, 111.61MB read
Requests/sec: 30376.95
Transfer/sec: 3.71MB

Happy hacking and stay safe!
