Here is the lscpu output for the aarch64 VM:

Architecture:        aarch64
CPU op-mode(s):      64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           0x48
Model:               0
Stepping:            0x1
CPU max MHz:         2400.0000
CPU min MHz:         2400.0000
BogoMIPS:            200.00
L1d cache:           512 KiB
L1i cache:           512 KiB
L2 cache:            4 MiB
L3 cache:            32 MiB
NUMA node0 CPU(s):   0-7
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Note: the VMs are as close as possible in their hardware capabilities: same type and amount of RAM, same disks, network cards and bandwidth. The CPUs are also as similar as possible, but there are some differences:
the CPU frequency: 3000 MHz (x86_64) vs 2400 MHz (aarch64)
BogoMIPS: 6000 (x86_64) vs 200 (aarch64)
Level 1 caches: 128 KiB (x86_64) vs 512 KiB (aarch64)
Both VMs run Ubuntu 20.04 with the latest software updates.
HAProxy is built from source from the master branch, so it might include a few changes made after the haproxy-2.2 tag was cut.
Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
Running on OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with DeviceAtlas support (dummy library only).
Built with 51Degrees Pattern support (dummy library).
Built with WURFL support (dummy library version 1.11.2.100)
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with clang compiler version 9.0.1

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
      fcgi : mode=HTTP  side=BE     mux=FCGI
 <default> : mode=HTTP  side=FE|BE  mux=H1
        h2 : mode=HTTP  side=FE|BE  mux=H2
 <default> : mode=TCP   side=FE|BE  mux=PASS
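A build invocation consistent with the feature flags shown above would look roughly like the following sketch; it is not necessarily the exact command used here:

make TARGET=linux-glibc CC=clang USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_ZLIB=1 USE_NS=1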
The backend section of the HAProxy configuration:

backend test_be
    #balance roundrobin
    balance leastconn
    #balance random(2)
    server go1 127.0.0.1:8081 no-check
    server go2 127.0.0.1:8082 no-check
    server go3 127.0.0.1:8083 no-check
    server go4 127.0.0.1:8084 no-check
This way HAProxy is used as a load balancer in front of four HTTP servers.
To also use it as an SSL terminator, one just needs to comment out line 34 and uncomment line 35.
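For reference, the two bind lines in question presumably look something like this sketch; the frontend name and the certificate path are assumptions, since the full configuration isn't shown here:

frontend test_fe
    mode http
    # plain HTTP listener
    bind :8080
    # TLS listener; the certificate path is hypothetical
    #bind :8080 ssl crt /etc/haproxy/certs/server.pem
    default_backend test_be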
I achieved the best results with the multithreaded setup. As the documentation says, this is the recommended setup anyway, but it also gave me almost twice the throughput! The best results came with 32 threads: throughput kept increasing from 8 to 16 and from 16 to 32 threads, but dropped with 64 threads.
I've also pinned the threads so that each one stays on the same CPU for its lifetime with cpu-map 1/all 0-7.
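Putting those two settings together, the relevant part of the global section would look roughly like this (a sketch based on the values mentioned above):

global
    nbthread 32
    cpu-map 1/all 0-7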
The other important setting is the algorithm used to balance the load between the backends. Just like in Willy Tarreau's tests, leastconn gave the best performance for me.
As recommended in the HAProxy Enterprise documentation, I've disabled irqbalance.
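On Ubuntu 20.04 this can be done via systemd, for example like this (one possible way, not necessarily the exact commands used):

sudo systemctl stop irqbalance
sudo systemctl disable irqbalance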
Finally I’ve applied the following kernel settings:
fs.file-max is also related to a change in /etc/security/limits.conf:
root soft nofile 500000
root hard nofile 500000
* soft nofile 500000
* hard nofile 500000
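The corresponding sysctl can be set, for example, as below; the value mirrors the nofile limits above and is an assumption about the exact number used:

# /etc/sysctl.conf (or a drop-in file under /etc/sysctl.d/)
fs.file-max = 500000

# apply without a reboot
sudo sysctl -p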
For the backends I used very simple HTTP servers written in Golang. They just write "Hello World" back to the client, without reading from or writing to disk or the network:
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	port := os.Getenv("PORT")
	if port == "" {
		log.Fatal("Please specify the HTTP port as environment variable, e.g. env PORT=8081 go run http-server.go")
	}

	// Respond with a static body; no disk or network I/O is involved.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello World")
	})

	// The listen call is not part of the original excerpt but is needed to run the server.
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
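The load is generated with wrk. A command line matching the parameters visible in the outputs that follow (30s duration, 8 threads, 96 connections) would be something like:

wrk -t8 -c96 -d30s http://192.168.0.232:8080/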
aarch64, HTTP

Running 30s test @ http://192.168.0.232:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.67ms    8.82ms  196.74ms   89.85%
    Req/Sec     2.60k    337.06     5.79k    75.79%
  621350 requests in 30.09s, 75.85MB read
Requests/sec:  20651.69
Transfer/sec:      2.52MB
x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.32ms    4.46ms   75.42ms   94.58%
    Req/Sec     4.71k    538.41     8.84k    82.41%
  1127664 requests in 30.10s, 137.65MB read
Requests/sec:  37464.85
Transfer/sec:      4.57MB
aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.92ms   12.50ms  248.52ms   91.18%
    Req/Sec     2.42k    338.67     4.34k    80.88%
  578210 requests in 30.08s, 70.58MB read
Requests/sec:  19220.81
Transfer/sec:      2.35MB
x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.56ms    4.83ms  111.51ms   94.25%
    Req/Sec     4.46k    609.37     7.23k    85.60%
  1066831 requests in 30.07s, 130.23MB read
Requests/sec:  35474.26
Transfer/sec:      4.33MB
What we see here is:
that HAProxy is almost twice as fast on the x86_64 VM as on the aarch64 VM!
and also that TLS offloading decreases the throughput by around 5–8%
Update 1 (Jul 10 2020): To check whether the Golang-based HTTP servers are the bottleneck in the above tests, I've decided to run the same wrk load tests directly against one of the backends, i.e. skipping HAProxy.
aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   615.31us  586.70us   22.44ms   90.61%
    Req/Sec    20.05k     1.57k    42.29k    73.62%
  4794299 requests in 30.09s, 585.24MB read
Requests/sec: 159319.75
Transfer/sec:     19.45MB
x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   774.24us  484.99us   36.43ms   97.04%
    Req/Sec    15.28k    413.04    16.89k    73.57%
  3658911 requests in 30.10s, 446.64MB read
Requests/sec: 121561.40
Transfer/sec:     14.84MB
Here we see that the HTTP server running on aarch64 is around 30% faster than on x86_64!
And the more important observation is that the throughput is several times better when no load balancer is used at all! I think the problem here is in my setup: both HAProxy and the four backend servers run on the same VM, so they compete for resources! I will pin the Golang servers to their own CPU cores and let HAProxy use only the other four CPU cores! Stay tuned for an update!
Update 2 (Jul 10 2020):
To pin the processes to specific CPUs I will use numactl.
numactl --cpunodebind=0 --membind=0 --physcpubind=4 env PORT=8081 go run etc/haproxy/load/http-server.go
i.e. this backend instance is bound to NUMA node 0 and pinned to physical CPU 4. The other three backend servers are pinned to physical CPUs 5, 6 and 7, respectively.
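For completeness, the four backend instances would then be started roughly like this; the exact pairing of ports to CPUs is an assumption:

numactl --cpunodebind=0 --membind=0 --physcpubind=4 env PORT=8081 go run etc/haproxy/load/http-server.go
numactl --cpunodebind=0 --membind=0 --physcpubind=5 env PORT=8082 go run etc/haproxy/load/http-server.go
numactl --cpunodebind=0 --membind=0 --physcpubind=6 env PORT=8083 go run etc/haproxy/load/http-server.go
numactl --cpunodebind=0 --membind=0 --physcpubind=7 env PORT=8084 go run etc/haproxy/load/http-server.go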
I've also slightly changed the HAProxy configuration:
nbthread 4
cpu-map 1/all 0-3
i.e. HAProxy will spawn 4 threads and they will be pinned to physical CPUs 0–3.
With these changes the results stayed the same for aarch64:
Running 30s test @ https://192.168.0.232:8080
  4 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.44ms    2.11ms   36.48ms   88.36%
    Req/Sec     4.98k    651.34     6.62k    74.40%
  596102 requests in 30.10s, 72.77MB read
Requests/sec:  19804.19
Transfer/sec:      2.42MB
but dropped for x86_64:
Running 30s test @ https://192.168.0.206:8080
  4 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   767.40us  153.24us   19.07ms   97.72%
    Req/Sec     5.21k    173.41     5.51k    63.46%
  623911 requests in 30.10s, 76.16MB read
Requests/sec:  20727.89
Transfer/sec:      2.53MB
and same for HTTP (no TLS):
aarch64
Running 30s test @ http://192.168.0.232:8080
  4 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.40ms    2.16ms   36.55ms   88.08%
    Req/Sec     5.55k    462.65     6.97k    69.85%
  665269 requests in 30.10s, 81.21MB read
Requests/sec:  22102.12
Transfer/sec:      2.70MB
x86_64
Running 30s test @ http://192.168.0.206:8080
  4 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   726.01us  125.04us    6.42ms   93.95%
    Req/Sec     5.51k    165.80     5.80k    57.24%
  658777 requests in 30.10s, 80.42MB read
Requests/sec:  21886.50
Transfer/sec:      2.67MB
So now HAProxy is a bit faster on aarch64 than on x86_64, but still far slower than the “no load balancer” approach with its 120,000+ requests per second.
Update 3 (Jul 10 2020): After seeing that the performance of the Golang HTTP server is so good (120–160K reqs/sec), and to simplify the setup, I've decided to remove the CPU pinning from Update 2 and to use the backends from the other VM: when wrk hits HAProxy on the aarch64 VM, it load balances between the backends running on the x86_64 VM, and when wrk hits HAProxy on the x86_64 VM, it uses the Golang HTTP servers running on the aarch64 VM. Below is a sketch of the reworked backend section, followed by the new results.
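A sketch of the reworked backend section on the aarch64 VM (the address comes from the test outputs; the ports are assumed to be unchanged):

backend test_be
    balance leastconn
    server go1 192.168.0.206:8081 no-check
    server go2 192.168.0.206:8082 no-check
    server go3 192.168.0.206:8083 no-check
    server go4 192.168.0.206:8084 no-check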
aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.33ms    4.93ms   76.85ms   89.14%
    Req/Sec     2.10k    316.84     3.52k    74.50%
  501840 requests in 30.07s, 61.26MB read
Requests/sec:  16688.53
Transfer/sec:      2.04MB
x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.32ms    6.71ms   71.29ms   90.25%
    Req/Sec     3.26k    639.12     4.14k    65.52%
  779297 requests in 30.08s, 95.13MB read
Requests/sec:  25908.50
Transfer/sec:      3.16MB
aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.17ms    5.41ms  292.21ms   91.08%
    Req/Sec     2.13k    238.74     3.85k    86.32%
  506111 requests in 30.09s, 61.78MB read
Requests/sec:  16821.60
Transfer/sec:      2.05MB
x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.40ms    2.54ms   58.66ms   97.27%
    Req/Sec     3.82k    385.85     4.55k    92.10%
  914329 requests in 30.10s, 111.61MB read
Requests/sec:  30376.95
Transfer/sec:      3.71MB