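MySQL on ARM: Summary of Community Optimizations

Translator: bzhaoopenstack
Author: Krunal Bauskar
Original article: https://mysqlonarm.github.io/Community-Contributions-So-Far/

Developers from different organizations in the community have contributed some good patches to MySQL, but most of these patches are still awaiting acceptance by Oracle. The purpose of this blog is to analyze these patches along with their pros and cons. Hopefully this will help MySQL/Oracle accept these long-awaited patches.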

Community Patches

1. Optimizing checksum

  • MySQL uses two types of checksum: crc32c and crc32. They differ because they use different polynomials.
    • crc32c is used in MySQL by InnoDB to calculate page-checksum.
    • crc32 is used in MySQL for table checksum, binlog-checksum, etc…
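To make the "different polynomials" point concrete, here is a minimal sketch of a bit-at-a-time CRC that takes the polynomial as a parameter. The function and constant names are made up for illustration and are not MySQL code; the two constants are the standard reflected forms of the CRC-32 (zlib) and CRC-32C (Castagnoli) polynomials.

```cpp
#include <cstddef>
#include <cstdint>

// Reflected forms of the two generator polynomials.
constexpr std::uint32_t kCrc32Poly = 0xEDB88320u;   // CRC-32 (zlib, IEEE 802.3)
constexpr std::uint32_t kCrc32cPoly = 0x82F63B78u;  // CRC-32C (Castagnoli)

// Bit-at-a-time CRC. Slow, but it makes the role of the polynomial explicit.
// crc_bitwise is a hypothetical helper, not a MySQL or zlib function.
std::uint32_t crc_bitwise(std::uint32_t poly, const unsigned char *data,
                          std::size_t len) {
  std::uint32_t crc = 0xFFFFFFFFu;  // standard initial value
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right by one; fold in the polynomial if the dropped bit was set.
      crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;  // standard final inversion
}
```

Feeding the same input through both polynomials gives unrelated results, which is why an accelerated crc32c routine cannot simply be reused for table checksum or binlog-checksum.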

crc32c

  • Page checksum is calculated on each page read/write, so crc32c can quickly show up as one of the top functions in a perf report. Using an optimized version of it could help improve the overall throughput of the system.
  • crc32c is available as a hardware function on both x86 (SSE4.2) and ARM (ACLE). InnoDB currently uses the hardware-based implementation for x86 but not yet for ARM (ACLE). The patch addresses this gap.
  • The latest patch (bug#85819) also helps optimize it further using the crypto (PMULL) instruction.
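As a rough sketch of what a hardware crc32c path on ARM can look like, the block below uses the ACLE intrinsics __crc32cd and __crc32cb from <arm_acle.h> when the compiler reports CRC support, and a portable bitwise loop otherwise. This is not the code from the patches above: crc32c_update is a hypothetical name, the 8-byte loop assumes a little-endian target, and the PMULL-based folding from bug#85819 is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>  // __crc32cd, __crc32cb
#endif

// Hypothetical helper: extends a running CRC-32C over `len` bytes.
// Start with crc = 0; the result of one call can be fed into the next.
std::uint32_t crc32c_update(std::uint32_t crc, const unsigned char *data,
                            std::size_t len) {
  crc = ~crc;
#if defined(__ARM_FEATURE_CRC32)
  // Hardware path: 8 bytes per instruction (assumes little-endian).
  while (len >= 8) {
    std::uint64_t word;
    std::memcpy(&word, data, sizeof word);  // alignment-safe load
    crc = __crc32cd(crc, word);
    data += 8;
    len -= 8;
  }
  // Remaining tail, one byte per instruction.
  while (len--) crc = __crc32cb(crc, *data++);
#else
  // Portable fallback: bit-at-a-time with the reflected Castagnoli polynomial.
  while (len--) {
    crc ^= *data++;
    for (int bit = 0; bit < 8; ++bit)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
#endif
  return ~crc;
}
```

With GCC or Clang the hardware path is typically enabled by building with something like -march=armv8-a+crc; without it the sketch silently takes the software branch, which is roughly the situation ut_crc32_sw represents in the perf output.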

Open Contributions:
bug#79144 No hardware CRC32 implementation for AArch64
bug#85819 Optimize AARCH64 CRC32c implementation

[example test-case runs update-non-index and gets the crc32 as top mysqld function].

perf analysis (w/o patch)
+ 10.43% 8027 mysqld [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 3.23% 2486 mysqld mysqld [.] ut_crc32_sw
+ 2.33% 1797 mysqld [kernel.kallsyms] [k] finish_task_switch
+ 1.73% 1330 mysqld libc-2.27.so [.] malloc

perf analysis (w/ patch)
Overhead Samples Command Shared Object Symbol
+ 10.60% 8133 mysqld [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 2.37% 1816 mysqld [kernel.kallsyms] [k] finish_task_switch
+ 1.78% 1366 mysqld libc-2.27.so [.] malloc
....
0.44% 338 mysqld mysqld [.] ut_crc32_aarch64

Conclusion: The checksum cost drops from 3.23% of samples (ut_crc32_sw) to 0.44% (ut_crc32_aarch64), a saving of around 3%. As part of wider testing we also see crc32c helping overall throughput for all kinds of test-cases.

crc32

To calculate table checksum, MySQL uses the zlib-based crc32 (a software implementation). To my knowledge, x86 does not offer a hardware-optimized version of the crc32 calculation, but fortunately ARM (ACLE) does. The same code/flow path is also used for binlog-checksum.

Open Contributions:
bug#99118 ARM CRC32 intrinsic call to accelerate table-checksum (not crc32c but crc32)

[example test-case runs checksum on all tables and update-non-index].

perf analysis (w/o patch)

checksum:
+ 49.46% 13480 mysqld mysqld [.] crc32_z

update-non-index:
0.40% 311 mysqld mysqld [.] crc32_z


perf analysis (w/ patch)

checksum:
+ 8.15% 988 mysqld mysqld [.] aarch64_crc32_checksum

update-non-index:
0.07% 56 mysqld mysqld [.] aarch64_crc32_checksum

Conclusion: This patch helps on both fronts. It greatly accelerates table checksum (average improvement of 50%) and also marginally helps binlog-checksum.

2. my_convert (in turn copy_and_convert) routine is suboptimal for ARM:

  • my_convert is used as part of sending results, to convert between charsets.
  • Given the amount of data that is converted, this function shows up in the perf top-list.
  • The existing implementation copies 4 bytes at a time on x86 but falls back to byte-by-byte copying on ARM. This could be improved overall by copying 8 bytes at a time on x86-64 and aarch64 and then falling back to the existing logic for the trailing bytes. The patch for this simple operation helps save significant cycles and improve performance.
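The 8-bytes-at-a-time idea can be sketched as follows. This is not the patch itself: copy_ascii_prefix is a hypothetical helper, and it assumes that the fast path applies only to 7-bit ASCII bytes (detected with a high-bit mask), with everything else left to the slower per-character conversion.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies the leading run of 7-bit ASCII bytes from
// `from` to `to` and returns how many bytes were copied. The caller would
// hand the remainder to the regular charset conversion.
std::size_t copy_ascii_prefix(char *to, const char *from, std::size_t len) {
  constexpr std::uint64_t kHighBits = 0x8080808080808080ull;
  std::size_t done = 0;

  // Fast path: 8 bytes per iteration while no byte has its high bit set.
  while (len - done >= 8) {
    std::uint64_t word;
    std::memcpy(&word, from + done, sizeof word);  // alignment-safe load
    if (word & kHighBits) break;                   // non-ASCII byte in this word
    std::memcpy(to + done, &word, sizeof word);
    done += 8;
  }

  // Tail: byte at a time, stopping at the first non-ASCII byte.
  while (done < len && (static_cast<unsigned char>(from[done]) & 0x80u) == 0) {
    to[done] = from[done];
    ++done;
  }
  return done;
}
```

Using memcpy for the word load keeps the sketch free of alignment and aliasing problems; compilers generally turn it into a single 8-byte load on both x86-64 and aarch64.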

Open Contributions:
bug#98737 my_convert routine is suboptimal in case of non-x86 platforms

[example test-case runs oltp-read-write on all tables].
perf analysis (w/o patch)
+ 0.79% 1114 mysqld mysqld [.] my_convert

perf analysis (w/ patch)
0.22% 299 mysqld mysqld [.] my_convert

Conclusion: The patch can help improve overall throughput.

3. Improving memory barrier for atomic variables:

  • MySQL/InnoDB has a lot of variables for which it uses GCC built-in atomic functions (__sync_add_and_fetch or __atomic_add_fetch).
  • This works fine on x86, which has a strong memory model, but most of these counter functions were implemented with sequentially consistent memory ordering (the default).
  • ARM has a weak memory model, so using sequentially consistent ordering (the default) where it is not needed is not recommended.
  • Multiple patches were submitted to help revamp these snippets. The patches achieve two things:
    • Switch to C++11 atomics (now that MySQL supports them).
    • Switch to relaxed memory order (instead of sequential).
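The two changes in the list above can be sketched with a plain statistics counter. StatCounter is a hypothetical class written for this example, not one of the InnoDB variables touched by the patches.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical statistics counter using C++11 atomics with relaxed ordering.
// The increment is still atomic (no lost updates), but it does not order
// surrounding loads and stores, so weakly ordered CPUs such as ARM are not
// forced to emit the extra barriers that the default seq_cst ordering needs.
class StatCounter {
 public:
  void inc(std::uint64_t n = 1) {
    value_.fetch_add(n, std::memory_order_relaxed);
  }

  std::uint64_t get() const { return value_.load(std::memory_order_relaxed); }

 private:
  std::atomic<std::uint64_t> value_{0};
};
```

Relaxed ordering is only appropriate when nothing else depends on the order in which the counter and other memory are observed. A variable that publishes data to another thread (a ready flag, a lock word) generally needs acquire/release ordering instead, which is presumably why some of the listed fixes are about semantics rather than speed.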

Open Contributions:
bug#97228 rwlock: refine lock->lock_word with C11 atomics
bug#97230 rwlock: refine lock->waiters with C++11 atomics
bug#97703 innobase/dict: refine dict_temp_file_num with c++11 atomics
bug#97704 innobase/srv: refine srv0conc with c++11 atomics
bug#97765 innobase/os: refine os_total_large_mem_allocated with c++11 atomics
bug#97766 innobase/os_once: optimize os_once with c++11 atomics
bug#97767 innobase/dict: refine zip_pad_info->pad with c++11 atomics
bug#99432 Improving memory barrier during rseg allocation

Conclusion: The impact is widespread, so it is difficult to judge using perf. Also, some of the fixes improve semantics and are not necessarily performance fixes as such.

4. Restore core affinity scheduler for global counter:

ARM is known for its large number of cores (and NUMA sockets), and to harvest the maximum throughput from multiple cores it is important that global counters are programmed accordingly. Having a distributed counter and incrementing the part of the counter closest to the thread's core should avoid cross-NUMA latency.

MySQL used to call sched_getcpu to pick the counter slot, but this logic was changed as part of a different bug fix (which surely made sense for that issue), and the change also affected the normal global counters. The patch proposes to correct this and use a sched_getcpu (core affinity) based counter for global counters.

Unfortunately, on ARM this patch runs into overhead from sched_getcpu, which on x86 is optimized through the vDSO.
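A distributed counter of this kind could look roughly like the sketch below. It is not MySQL's get_sched_indexer_t: ShardedCounter, current_slot, the slot count and the 128-byte padding are all assumptions made for the example. sched_getcpu is the real glibc call on Linux; on other platforms the sketch falls back to a single slot.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#if defined(__linux__)
#include <sched.h>  // sched_getcpu
#endif

constexpr std::size_t kSlots = 64;       // hypothetical number of shards
constexpr std::size_t kCacheLine = 128;  // assumed padding, see text

// One shard per cache line so that cores do not bounce a shared line.
struct alignas(kCacheLine) Slot {
  std::atomic<std::uint64_t> value{0};
};

// Picks the shard for the calling thread based on the core it runs on.
inline std::size_t current_slot() {
#if defined(__linux__)
  int cpu = sched_getcpu();
  if (cpu >= 0) return static_cast<std::size_t>(cpu) % kSlots;
#endif
  return 0;  // fallback: everything lands in one slot
}

class ShardedCounter {
 public:
  void inc(std::uint64_t n = 1) {
    slots_[current_slot()].value.fetch_add(n, std::memory_order_relaxed);
  }

  // Reading is the slow path: it walks all shards.
  std::uint64_t sum() const {
    std::uint64_t total = 0;
    for (const Slot &s : slots_) total += s.value.load(std::memory_order_relaxed);
    return total;
  }

 private:
  std::array<Slot, kSlots> slots_;
};
```

Every inc() pays for one current_slot() call, so the scheme only wins when looking up the CPU is cheaper than the cache-line contention it avoids. That is the trade-off the paragraph above describes for ARM.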

Open Contributions:
bug#79455 Restore get_sched_indexer_t in 5.7

5. Scalability issue on ARM platform with the current UT_RELAX_CPU () code:

InnoDB uses a home-grown spin-wait implementation for rw-locks and mutexes. Whenever there is a need to sleep (or, to put it more accurately, to PAUSE), MySQL on x86 uses the PAUSE instruction. ARM has no PAUSE instruction, so the code falls back to a compiler barrier, which fails to introduce the needed delay. The patch suggests using a compare-and-exchange operation, which should introduce a delay comparable to PAUSE.
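The idea can be sketched like this. It is an illustration under assumptions, not the UT_RELAX_CPU() macro or the proposed patch: relax_cpu and spin_until_set are hypothetical names, and the non-x86 branch simply performs a compare-and-exchange on a throwaway local variable to burn a few cycles.

```cpp
#include <atomic>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical stand-in for a "relax the CPU" primitive inside a spin loop.
inline void relax_cpu() {
#if defined(__x86_64__) || defined(__i386__)
  _mm_pause();  // x86 PAUSE instruction
#else
  // No PAUSE available: do a compare-and-exchange on a private variable.
  // The result is ignored; the operation is only there to add delay.
  std::atomic<std::int32_t> dummy{0};
  std::int32_t expected = 0;
  dummy.compare_exchange_strong(expected, 1);
#endif
}

// Spins for at most max_spins rounds waiting for `flag` to become true.
// Returns whether the flag was observed as set.
inline bool spin_until_set(const std::atomic<bool> &flag,
                           std::uint32_t max_spins) {
  for (std::uint32_t i = 0; i < max_spins; ++i) {
    if (flag.load(std::memory_order_acquire)) return true;
    relax_cpu();
  }
  return flag.load(std::memory_order_acquire);
}
```

How much delay the compare-and-exchange really adds depends on the core, so the right spin count would have to be tuned per platform.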

Open Contributions:
bug#87933 Scalibility issue on Arm platform with the current UT_RELAX_CPU () code.

Based on our internal evaluation, we could not get the patch to improve throughput, so we have not included it in our community-patch branch for now.

6. Using wider cacheline padding for ARM:

Most ARM processors are expected to have a wider cache line. The patch proposes wider cache line padding on ARM-based processors to avoid false sharing.
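Padding to the cache line size is usually expressed with alignas. The sketch below is illustrative only: the 128-byte value for aarch64 is an assumption (the real line size depends on the processor), and PaddedCounter and LockStats are made-up types, not InnoDB structures.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line sizes; the aarch64 value is a guess for illustration.
#if defined(__aarch64__)
constexpr std::size_t kCacheLineSize = 128;
#else
constexpr std::size_t kCacheLineSize = 64;
#endif

// alignas rounds both the alignment and the size up to a full cache line,
// so two adjacent counters can never share a line.
struct alignas(kCacheLineSize) PaddedCounter {
  std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "each counter should occupy exactly one cache line");
static_assert(alignof(PaddedCounter) == kCacheLineSize,
              "each counter should start on a cache line boundary");

// Two hot counters updated by different threads. Without the padding they
// would likely sit in the same line and cause false sharing.
struct LockStats {
  PaddedCounter spins;
  PaddedCounter waits;
};
```

If the padding is narrower than the real line, two counters can still end up in one line; if it is wider, the only cost is some wasted memory.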

Open Contributions:
bug#98499 Improvement about CPU cache line size

7. Other open contributions

Besides the 6 main categories listed above, there are more contributions in other areas too. But most of them either have no patch attached, or the idea has already been folded into MySQL as part of another major revamp (not specific to ARM work), or the idea is less likely to have a performance impact. So for now we were not able to consider this set of patches.

Performance impact of Community Patches

Based on the inputs collected above, we have analyzed the performance impact of the community patches, and the table below shows how throughput would improve if these patches were accepted. The results are limited to higher scalability (256 threads), where the effect is largest, but we have run the test-cases across the board and the patches help improve overall throughput (even single-threaded).

                point select   read only   read write   update index   update non index
without-patch         218447      145755         5646          22200              22601
with-patch            224355      149718         5829          23070              23292
gain (%)                 2.7        2.72         3.24           3.92               3.06
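The percentage row follows from the two throughput rows as (with-patch − without-patch) / without-patch × 100. A tiny sketch of that arithmetic, with a helper name invented for the example:

```cpp
// Hypothetical helper: relative throughput gain in percent.
double gain_percent(double without_patch, double with_patch) {
  return (with_patch - without_patch) / without_patch * 100.0;
}
```

For the point select column, gain_percent(218447, 224355) comes out at roughly 2.7, matching the table.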

Evaluated using mysql-8.0.20. For configuration check here. Processor: ARM Kunpeng 920 24vCPU/48GB

Conclusion

Patches contributed by the community surely help in optimizing MySQL on ARM, but the impact is still limited and there is a lot of ground to cover to make MySQL accelerate on ARM. If you have good ideas on how things could be pushed further, then let's connect. The ARM MySQL community can help brainstorm the idea and help turn it into a contribution.

If you have more questions/queries, do let me know. I will try to answer them.
