HAProxy Performance: x86_64 vs ARM64

Translator: wangxiyuan
Author: Martin Grigorov
Original: https://medium.com/@martin.grigorov/compare-haproxy-performance-on-x86-64-and-arm64-cpu-architectures-bfd55d1d5566

This article is a performance test report on the latest HAProxy release, written by Apache Tomcat PMC member Martin Grigorov.

HAProxy 2.2 was released a few days ago, so I decided to run load tests against it on my x86_64 and aarch64 VMs:

  • x86_64
Architecture:                    x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 42 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 4 MiB
L3 cache: 30.3 MiB
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities
  • aarch64
Architecture:                    aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: 0x48
Model: 0
Stepping: 0x1
CPU max MHz: 2400.0000
CPU min MHz: 2400.0000
BogoMIPS: 200.00
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

Note: I made the VMs as close as possible in their hardware capabilities, i.e. the same type and amount of RAM, the same disks, network cards and bandwidth. The CPUs are also as similar as possible, but some differences remain:

  • CPU frequency: 3000 MHz (x86_64) vs 2400 MHz (aarch64)
  • BogoMIPS: 6000 (x86_64) vs 200 (aarch64)
  • L1 caches: 128 KiB (x86_64) vs 512 KiB (aarch64)

Both VMs run Ubuntu 20.04 with the latest software updates.

HAProxy is built from the source code of the master branch, which at this point differs very little from the HAProxy 2.2 release.
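For reference, here is a rough reconstruction of the build, inferred from the haproxy -vv output below; the exact make invocation is not shown in the article, and the DeviceAtlas/51Degrees/WURFL dummy libraries are omitted from this sketch:

# assumptions: clang-9 and the OpenSSL/PCRE/Lua/zlib development packages are installed
# (Lua may additionally need LUA_INC/LUA_LIB pointing at the Lua 5.3 headers and libs)
git clone https://github.com/haproxy/haproxy.git && cd haproxy
make -j"$(nproc)" TARGET=linux-glibc CC=clang-9 \
     USE_PCRE=1 USE_PCRE_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_SYSTEMD=1
sudo make install
./haproxy -vv   # prints the build information shown below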

HA-Proxy version 2.3-dev0 2020/07/07 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 5.4.0-40-generic #44-Ubuntu SMP Mon Jun 22 23:59:48 UTC 2020 aarch64
Build options :
TARGET = linux-glibc
CPU = generic
CC = clang-9
CFLAGS = -O2 -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-string-plus-int -Wtype-limits -Wshift-negative-value -Wnull-dereference -Werror
OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_DEVICEATLAS=1 USE_51DEGREES=1 USE_WURFL=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT +DEVICEATLAS +51DEGREES +WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1f 31 Mar 2020
Running on OpenSSL version : OpenSSL 1.1.1f 31 Mar 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with DeviceAtlas support (dummy library only).
Built with 51Degrees Pattern support (dummy library).
Built with WURFL support (dummy library version 1.11.2.100)
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with clang compiler version 9.0.1

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
fcgi : mode=HTTP side=BE mux=FCGI
<default> : mode=HTTP side=FE|BE mux=H1
h2 : mode=HTTP side=FE|BE mux=H2
<default> : mode=TCP side=FE|BE mux=PASS

Available services : none

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
[CACHE] cache
[FCGI] fcgi-app

I have tried to tune it as much as possible by following all the best practices I could find in the official documentation and on the web.

The HAProxy configuration is:

global
    log stdout format raw local0 err
    # nbproc 8
    nbthread 32
    cpu-map 1/all 0-7
    tune.ssl.default-dh-param 2048
    tune.ssl.capture-cipherlist-size 1
    ssl-server-verify none
    maxconn 32748
    daemon

defaults
    timeout client 60s
    timeout client-fin 1s
    timeout server 30s
    timeout server-fin 1s
    timeout connect 10s
    timeout http-request 10s
    timeout http-keep-alive 10s
    timeout queue 10m
    timeout check 10s
    mode http
    log global
    option dontlog-normal
    option httplog
    option dontlognull
    option http-use-htx
    option http-server-close
    option http-buffer-request
    option redispatch
    retries 3000

frontend test_fe
    bind :::8080
    #bind :::8080 ssl crt /home/ubuntu/tests/tls/server.pem
    default_backend test_be

backend test_be
    #balance roundrobin
    balance leastconn
    #balance random(2)
    server go1 127.0.0.1:8081 no-check
    server go2 127.0.0.1:8082 no-check
    server go3 127.0.0.1:8083 no-check
    server go4 127.0.0.1:8084 no-check

This way HAProxy is used as a load balancer in front of four HTTP servers.

To also use it as a TLS terminator, just comment out the plain bind :::8080 line in the frontend and uncomment the ssl bind line below it.
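The article does not show how server.pem was created; one possible way to produce a self-signed certificate in the combined format HAProxy expects (certificate followed by private key), good enough for load testing, is:

openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -subj "/CN=haproxy-test" -keyout server.key -out server.crt
cat server.crt server.key > /home/ubuntu/tests/tls/server.pem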

The best results were achieved with the multithreaded setup. As the documentation says, this is the recommended setup anyway, and it also gave me almost twice the throughput. The best results came with 32 threads: throughput kept increasing when going from 8 to 16 and from 16 to 32 threads, but started to drop with 64 threads.
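A sketch of how such a thread-count sweep could be scripted; the config path, the way HAProxy is restarted and the wrk target are my assumptions, not details from the article:

for n in 8 16 32 64; do
    sed -i "s/^\( *nbthread\).*/\1 $n/" haproxy.cfg    # adjust the thread count
    sudo systemctl restart haproxy                     # or restart HAProxy however it is run
    wrk -t8 -c96 -d30s http://192.168.0.206:8080/ | grep 'Requests/sec'
done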

I have also pinned the threads so that each one stays on the same CPU for its lifetime, using cpu-map 1/all 0-7.

The other important setting is the algorithm used to balance between the backends. Just like in Willy Tarreau's tests, leastconn gave me the best performance.

As recommended in the HAProxy Enterprise documentation, I have disabled irqbalance.
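Assuming irqbalance runs as the stock systemd service on Ubuntu 20.04, disabling it amounts to:

sudo systemctl stop irqbalance
sudo systemctl disable irqbalance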

Finally, I applied the following kernel settings:

sudo sysctl -w net.ipv4.ip_local_port_range="1024 65024"
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=100000
sudo sysctl -w net.core.netdev_max_backlog=100000
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
sudo sysctl -w fs.file-max=500000
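These sysctl -w calls only last until the next reboot; a possible way to make them persistent (my addition, not part of the original setup) is to drop them into /etc/sysctl.d and reload:

sudo tee /etc/sysctl.d/99-haproxy-bench.conf <<'EOF'
net.ipv4.ip_local_port_range = 1024 65024
net.ipv4.tcp_max_syn_backlog = 100000
net.core.netdev_max_backlog = 100000
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 500000
EOF
sudo sysctl --system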

fs.file-max also goes together with a change in /etc/security/limits.conf:

root soft nofile 500000
root hard nofile 500000
* soft nofile 500000
* hard nofile 500000

For the backends I used very simple HTTP servers written in Go. They just write "Hello World" back to the client, without reading from or writing to disk or the network:

package main

// run with: env PORT=8081 go run http-server.go

import (
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    port := os.Getenv("PORT")
    if port == "" {
        log.Fatal("Please specify the HTTP port as environment variable, e.g. env PORT=8081 go run http-server.go")
    }

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello World")
    })

    log.Fatal(http.ListenAndServe(":"+port, nil))
}
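To match the four server entries in the backend section above, the instances could be started along these lines (my sketch; the article only shows the single-instance run command):

for p in 8081 8082 8083 8084; do
    env PORT=$p go run http-server.go &
done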

As the load-testing client I used wrk, with the same setup as for testing Apache Tomcat.
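Judging from the output below (8 threads, 96 connections, 30 seconds), the wrk invocation was presumably along these lines; the exact command from the Tomcat article is not repeated here:

wrk -t8 -c96 -d30s http://192.168.0.206:8080/      # HTTP, against the x86_64 VM
wrk -t8 -c96 -d30s https://192.168.0.232:8080/     # HTTPS, against the aarch64 VM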

Here are the results:

  • aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.67ms 8.82ms 196.74ms 89.85%
Req/Sec 2.60k 337.06 5.79k 75.79%
621350 requests in 30.09s, 75.85MB read
Requests/sec: 20651.69
Transfer/sec: 2.52MB
  • x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.32ms 4.46ms 75.42ms 94.58%
Req/Sec 4.71k 538.41 8.84k 82.41%
1127664 requests in 30.10s, 137.65MB read
Requests/sec: 37464.85
Transfer/sec: 4.57MB
  • aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.92ms 12.50ms 248.52ms 91.18%
Req/Sec 2.42k 338.67 4.34k 80.88%
578210 requests in 30.08s, 70.58MB read
Requests/sec: 19220.81
Transfer/sec: 2.35MB
  • x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.56ms 4.83ms 111.51ms 94.25%
Req/Sec 4.46k 609.37 7.23k 85.60%
1066831 requests in 30.07s, 130.23MB read
Requests/sec: 35474.26
Transfer/sec: 4.33MB

What we can see here:

  • HAProxy is almost twice as fast on the x86_64 VM as on the aarch64 VM.
  • TLS offloading reduces the throughput by around 5-8%.

Update 1 (Jul 10 2020): To check whether the Go-based HTTP servers were the bottleneck in the tests above, I decided to run the same wrk load test directly against one of the backends, i.e. skipping HAProxy.

  • aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 615.31us 586.70us 22.44ms 90.61%
Req/Sec 20.05k 1.57k 42.29k 73.62%
4794299 requests in 30.09s, 585.24MB read
Requests/sec: 159319.75
Transfer/sec: 19.45MB
  • x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 774.24us 484.99us 36.43ms 97.04%
Req/Sec 15.28k 413.04 16.89k 73.57%
3658911 requests in 30.10s, 446.64MB read
Requests/sec: 121561.40
Transfer/sec: 14.84MB

Here we see that the HTTP server running on aarch64 is around 30% faster than the one on x86_64!

The more important observation is that the throughput is several times better when no load balancer is used at all! I think the problem is in my setup: HAProxy and the four backend servers all run on the same VM, so they compete for resources. Next I plan to pin the Go servers to their own CPU cores and let HAProxy use only the remaining four cores. Stay tuned for an update!


Update 2 (Jul 10 2020):

To pin the processes to specific CPUs I will use numactl:

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16012 MB
node 0 free: 170 MB
node distances:
node 0
0: 10

I have pinned the Go HTTP servers with:

numactl --cpunodebind=0 --membind=0 --physcpubind=4 env PORT=8081 go run etc/haproxy/load/http-server.go

That is, this backend instance is pinned to CPU node 0 and to physical CPU 4. The other three backend servers are pinned to physical CPUs 5, 6 and 7 respectively.
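Based on that description, the remaining three instances would be started along these lines (my reconstruction; only the first command appears in the article):

numactl --cpunodebind=0 --membind=0 --physcpubind=5 env PORT=8082 go run etc/haproxy/load/http-server.go &
numactl --cpunodebind=0 --membind=0 --physcpubind=6 env PORT=8083 go run etc/haproxy/load/http-server.go &
numactl --cpunodebind=0 --membind=0 --physcpubind=7 env PORT=8084 go run etc/haproxy/load/http-server.go &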

I also changed the HAProxy configuration slightly:

nbthread 4
cpu-map 1/all 0-3

That is, HAProxy spawns 4 threads and they are pinned to physical CPUs 0-3.
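One possible way to double-check the pinning (not from the original article) is to list the CPU affinity of every HAProxy thread:

pid=$(pidof -s haproxy)               # -s: pick a single PID if several are running
for tid in /proc/$pid/task/*; do
    taskset -cp "${tid##*/}"          # affinity mask of each HAProxy thread
done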

With these changes the results stayed the same for aarch64:

Running 30s test @ https://192.168.0.232:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.44ms 2.11ms 36.48ms 88.36%
Req/Sec 4.98k 651.34 6.62k 74.40%
596102 requests in 30.10s, 72.77MB read
Requests/sec: 19804.19
Transfer/sec: 2.42MB

but dropped for x86_64:

Running 30s test @ https://192.168.0.206:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 767.40us 153.24us 19.07ms 97.72%
Req/Sec 5.21k 173.41 5.51k 63.46%
623911 requests in 30.10s, 76.16MB read
Requests/sec: 20727.89
Transfer/sec: 2.53MB

The same holds for HTTP (no TLS):

  • aarch64
Running 30s test @ http://192.168.0.232:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.40ms 2.16ms 36.55ms 88.08%
Req/Sec 5.55k 462.65 6.97k 69.85%
665269 requests in 30.10s, 81.21MB read
Requests/sec: 22102.12
Transfer/sec: 2.70MB
  • x86_64
Running 30s test @ http://192.168.0.206:8080                                                                                                                                                                   
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 726.01us 125.04us 6.42ms 93.95%
Req/Sec 5.51k 165.80 5.80k 57.24%
658777 requests in 30.10s, 80.42MB read
Requests/sec: 21886.50
Transfer/sec: 2.67MB

So now HAProxy is slightly faster on aarch64 than on x86_64, but still far slower than the "no load balancer" approach with its 120,000+ requests per second.


Update 3 (Jul 10 2020): After seeing how well the Go HTTP servers perform (120-160K req/s), and to simplify the setup, I decided to drop the CPU pinning from Update 2 and use the backends on the other VM: when wrk hits HAProxy on the aarch64 VM, it load-balances between the backends running on the x86_64 VM, and when wrk hits HAProxy on the x86_64 VM, it uses the Go HTTP servers running on the aarch64 VM. Here are the new results:

  • aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.33ms 4.93ms 76.85ms 89.14%
Req/Sec 2.10k 316.84 3.52k 74.50%
501840 requests in 30.07s, 61.26MB read
Requests/sec: 16688.53
Transfer/sec: 2.04MB
  • x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.32ms 6.71ms 71.29ms 90.25%
Req/Sec 3.26k 639.12 4.14k 65.52%
779297 requests in 30.08s, 95.13MB read
Requests/sec: 25908.50
Transfer/sec: 3.16MB
  • aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.17ms 5.41ms 292.21ms 91.08%
Req/Sec 2.13k 238.74 3.85k 86.32%
506111 requests in 30.09s, 61.78MB read
Requests/sec: 16821.60
Transfer/sec: 2.05MB
  • x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.40ms 2.54ms 58.66ms 97.27%
Req/Sec 3.82k 385.85 4.55k 92.10%
914329 requests in 30.10s, 111.61MB read
Requests/sec: 30376.95
Transfer/sec: 3.71MB

Happy hacking and stay safe!
