Translator: wangxiyuan
Author: Martin Grigorov
Original article: https://medium.com/@martin.grigorov/compare-apache-tomcat-performance-on-x86-64-and-arm64-cpu-architectures-aacfbb0b5bb6
A Tomcat x86 vs. ARM64 performance test from Tomcat PMC member Martin Grigorov.
Most software developers do not think about the CPU architecture their software will run on. I do not have official statistics, but in my experience most desktop and backend software runs on the x86_64 architecture (Intel and AMD processors), while most mobile and IoT devices run on ARM. Developers write their software in a high-level programming language and do not think about which assembly instructions are executed at runtime. That is the purpose of high-level programming languages: to let the compiler deal with the low-level hardware instructions so that we can focus on the high-level, business-related problems.
Life is simple and beautiful, but there are times when a big player in laptop and desktop hardware and software comes along and says that our software will have to run on a different architecture: first from PowerPC to Intel, and now from Intel to ARM64 (sources: Bloomberg & AppleInsider). Because of their lower power consumption, several of the bigger cloud providers have also started offering ARM64 virtual machines (e.g. Amazon AWS, HuaweiCloud, Linaro). And here comes the uncertainty:
- Will my software run on the new CPU architecture?
- What changes will I have to make to get it working?
- Will it perform as well as before?
To answer these questions you will have to roll up your sleeves and test!
You can deploy your software on any of these cloud providers. Some of them offer a free trial period! Or, if you are on a low budget, you can experiment on a Raspberry Pi.
Depending on the programming language you use, you might need to make some changes or none at all! If you use an interpreted or VM-based language (e.g. Python, Perl, Ruby, the JVM, …) then chances are high that the interpreter or runtime already supports ARM64 and you are good to go without any changes! But if your software needs to be compiled, you will need to adapt your toolchain and make sure that there are ARM64 binaries for all your dependencies! Depending on your software development stack, your mileage may vary!
Once our software runs fine on the new architecture, we can check whether it performs as well as before. Recently some users asked on the Apache Tomcat mailing lists whether the ARM64 architecture is supported. Since Apache Tomcat is written mostly in Java, it “Just Works”. If you need to use libtcnative and/or mod_jk, you will need to build them yourself on ARM64. The Apache Tomcat team uses TravisCI to test both the Java and the C code on ARM64, and there are no known issues at the moment!
To compare the performance of two versions of some software you would normally run both on the same hardware, but since we are comparing different CPU architectures that is impossible here. For my tests I used two VMs with similar specifications:
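A quick way to confirm which architecture a JVM is actually running on is to read the standard `os.arch` system property. The sketch below is not part of the original test setup: the class name `ArchCheck` and its normalization table are illustrative assumptions, and the table only covers the values commonly reported by OpenJDK builds (`amd64` or `x86_64`, and `aarch64`).

```java
import java.util.Locale;

// Hypothetical helper (not from the original post): reports the CPU
// architecture the current JVM runs on, using the names used in this post.
final class ArchCheck {

    // Maps a raw os.arch value to "x86_64" or "arm64". Any other value is
    // returned lower-cased and otherwise unchanged. The mapping is an
    // assumption that covers common OpenJDK values only.
    static String normalize(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT);
        switch (arch) {
            case "amd64":
            case "x86_64":
                return "x86_64";
            case "aarch64":
            case "arm64":
                return "arm64";
            default:
                return arch;
        }
    }

    public static void main(String[] args) {
        String raw = System.getProperty("os.arch");
        System.out.println("os.arch=" + raw + " -> " + normalize(raw));
    }
}
```

Running the same class on both VMs is a cheap sanity check that the pure-Java part of the stack needs no changes. Native pieces such as libtcnative still have to be built per architecture.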
The x86_64 processor is:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities
The ARM64 processor is:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: 0x48
Model: 0
Stepping: 0x1
BogoMIPS: 200.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Both VMs have the same amount of RAM, disk and network connectivity.
The test application is based on Spring Boot (2.2.7) running nightly builds of embedded Apache Tomcat 9.0.x, together with OpenSSL 1.1.1h-dev and Apache APR 1.7.x. It has a single REST controller that exposes a PUT endpoint for creating an entity, a GET endpoint for reading it, a POST endpoint for updating it and a DELETE endpoint for removing it. It uses Memcached as a database.
package info.mgsolutions.testbed.rest;
For load testing I used Apache JMeter 5.2.1 and wrk built from its master branch. JMeter drives a realistic scenario with 1000 simultaneous users, a ramp-up period and think time between the HTTP requests, while wrk is used to measure the maximum throughput.
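To get a feel for what such a scenario means in requests per second, a back-of-the-envelope model helps. The sketch below applies Little's law to a closed system of users and assumes a linear ramp-up. The class `LoadModel` is hypothetical, and the think time, response time and ramp-up values in `main` are assumptions chosen for illustration, since the post does not state them; only the 1000 users come from the test plan described above.

```java
// Hypothetical back-of-the-envelope model of a closed-loop load test.
// Not taken from the original JMeter test plan.
final class LoadModel {

    // Little's law for a closed system: every user completes one request per
    // (responseTime + thinkTime) seconds, so N users yield N / (R + T) req/s.
    static double expectedThroughput(int users, double responseTimeSec, double thinkTimeSec) {
        double cycleSec = responseTimeSec + thinkTimeSec;
        if (users <= 0 || cycleSec <= 0) {
            throw new IllegalArgumentException("users and cycle time must be positive");
        }
        return users / cycleSec;
    }

    // With a linear ramp-up, one new user is started every rampUp / users seconds.
    static double secondsBetweenUserStarts(int users, double rampUpSec) {
        if (users <= 0 || rampUpSec < 0) {
            throw new IllegalArgumentException("users must be positive, ramp-up non-negative");
        }
        return rampUpSec / users;
    }

    public static void main(String[] args) {
        int users = 1000;               // from the post
        double responseTimeSec = 0.05;  // assumed for illustration
        double thinkTimeSec = 1.0;      // assumed for illustration
        double rampUpSec = 100.0;       // assumed for illustration

        System.out.printf("steady-state load: %.1f req/s%n",
                expectedThroughput(users, responseTimeSec, thinkTimeSec));
        System.out.printf("one new user every %.3f s during ramp-up%n",
                secondsBetweenUserStarts(users, rampUpSec));
    }
}
```

Because the think time usually dominates the cycle, this kind of test says more about response times under a steady load than about peak throughput, which is why a separate tool is used for the latter.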
JMeter is executed with these arguments:
jmeter.sh \
The httpclient4.* properties are needed to reuse the HTTPS connections; without them Keep-Alive is not effective.
The results from both JMeter and wrk are parsed with Logstash, stored in Elasticsearch and visualized by Kibana.
JMeter’s response times:
As you can see, the results for HTTPS were not very good before May 8th. The HTTPS connections were not reused and a TLS handshake was performed for each request, despite the request header “Connection: keep-alive”. Since there was no such issue with wrk, I asked on the JMeter mailing list and was given the httpclient4 properties above. (Thank you, Philippe Mouawad!) With or without the HttpClient tweak, the response times are very similar for x86_64 and arm64.
For the throughput test I ran wrk with these parameters:
wrk -c96 -t8 -d30s -s /scripts/wrk-report-to-csv.lua $HOST:$PORT
That is, 8 threads hit the server for 30 seconds using 96 HTTP(S) connections.
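The arithmetic behind these parameters is simple but worth spelling out: wrk divides the total number of connections evenly among its threads, and the throughput is the number of completed requests divided by the test duration. The following sketch is illustrative only. The class `WrkMath` is hypothetical, and the request count in `main` is a made-up number, not a measurement from these tests.

```java
// Hypothetical helper illustrating the arithmetic behind the wrk parameters.
final class WrkMath {

    // wrk assigns connections / threads connections to each thread
    // (integer division), e.g. 96 / 8 = 12.
    static int connectionsPerThread(int connections, int threads) {
        if (threads <= 0 || connections < threads) {
            throw new IllegalArgumentException("need at least one connection per thread");
        }
        return connections / threads;
    }

    // Throughput in requests per second over the whole run.
    static double requestsPerSecond(long completedRequests, double durationSec) {
        if (durationSec <= 0) {
            throw new IllegalArgumentException("duration must be positive");
        }
        return completedRequests / durationSec;
    }

    public static void main(String[] args) {
        // -c96 -t8 -d30s, as in the command above
        System.out.println("connections per thread: " + connectionsPerThread(96, 8));

        long completed = 300_000L; // made-up request count, for illustration only
        System.out.printf("throughput: %.1f req/s%n", requestsPerSecond(completed, 30.0));
    }
}
```

Since both VMs expose 8 CPUs, using 8 wrk threads keeps the client configuration identical across the two architectures.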
To collect the summary in a CSV file I used this custom Lua script:
-- Initialize the pseudo random number generator
The results show that Tomcat on x86_64 is twice as fast as on arm64:
I will try to find out the reason for this difference and share it with you in a follow-up post. If you have any ideas, I would be happy to test them!
Happy hacking and stay safe!