译者: bzhaoopenstack
作者: Amit Dattatray Khandekar
原文链接: https://amitdkhan-pg.blogspot.com/2020/06/leveraging-simd-vectorization.html
团队大牛Amit利用SIMD向量化对ARM和X86硬件平台在PostgreSQL上的优化,欢迎品鉴,相当硬核,坐稳了吗?
Leveraging SIMD Vectorization
随着列式存储数据库的出现,人们迫切需要使用 SIMD 向量处理数据表。 这种情况显然很符合表格数据的排列方式。 让我们首先简单介绍一下什么是 SIMD。 它代表单指令多数据流(Single Instruction Multiple Data)。 在当前,CPU 指令支持这种机制,在这种机制中,同一条指令可以在多个数据元素上同时执行。 例如,你想把所有的列值元素加倍。 或者删除图像像素RGB 值的红色部分。 对于大数据场景来说,这些操作是 CPU 的瓶颈。 因此,SIMD 根据每个数据元素的大小,对2、4、8、16或32个(或更多)数据元素同时进行操作,从而大大缩短了 CPU 时间。 假设我们想对“ int32 arr []”的每个元素执行“ arr [ i ] * = 2”。 通常,我们会遍历每个元素来执行这个操作。 在生成的汇编代码中,MUL 指令将在每个元素上执行。 使用 SIMD,我们将划分4个(或更多)相邻的数组元素加载到128位(或更大) CPU“向量”寄存器中,然后让这个寄存器调用 MUL 指令的“向量化”版本,并对随后的每个4数组元素重复这一步骤。
我们怎么做才能生成这样的向量化汇编指令? 一种方法是编写这样的汇编代码。 但是在大多数情况下,我们不会这么做,多亏了以下两个方法的出现:
1. 内部函数实现向量化
对于程序员来说,调用内部函数就像调用其他函数一样。 在底层,编译器会用适当的程序集指令替换它。 因此,不必使用 c / c + + 代码中的汇编指令来处理寄存器,而是调用相应的内部函数。 每个 CPU 体系结构都有自己的一组内部函数API 和相应的头文件。 作为一个例子,让我们使用 ARM 架构的 SIMD内部函数对 PostgreSQL 代码片段进行向量化,看看通过向量化代码会产生多大的不同。 在此之前,您可能希望快速浏览NEON架构预览来了解寄存器(registers)、通道(lanes)和向量(vectors)的命名规范。 NEON是 ARM SIMD 架构的品牌名称(The implementation of the Advanced SIMD extension used in ARM processors is called NEON,)。 NEON 单元是 ARMv8芯片的必备部分。
下面是 PostgreSQL 代码片段,mul_var() 函数 用于将两个PostgreSQL NUMERIC 数据类型的值相乘. 就像下面的例子那样:
1 | for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2; |
其中变量声明为:
1 | int32 *dig; |
这个例子,你将可以看到循环迭代 i2 + 1次。 在每次迭代中,i 和 i2都会递减。 这意味着,两个数组中的每个数组都有一个固定的连续区段,我们希望在这个区段中对每个数组元素重复执行相同的算术运算。 这里所做的算法是: 将两个 int16变量相乘,然后将乘积加起来得到一个 int32变量。 有一条汇编指令正是这样做的: VMLA。 相应的 内部函数是: vmlal _ s16()
让我们首先将上面的反向 for-loop 简化为一个等效的正向循环 :
1 | i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3); |
当我们想要对上面的 multiply + accumulate 语句进行向量化时,我们应用下面这个内部函数:
1 | int16x8_t vmlaq_s16(int16x8_t a, int16x8_t b, int16x8_t c); |
这句代码执行 a + (b * c)并返回结果。 a b c 是矢量。 类型 int16x8_t 表示该向量位于一个128位的 NEON 寄存器中,该寄存器有8个通道,每个通道有16位有符号整数。 所以 vmlaq_s16并行地对3个向量的所有8个通道执行相同的multiply + accumulate操作,并在一个int16x8_t 向量中再次返回8个结果值。 每个multiply + accumulate操作都包含在所有3个向量中的一个特定通道中。
如上面 c 代码片段所示,为了避免溢出,将所得的乘法累计值计入一个32位整数。 因此,我们不能使用vmlaq_s16() ,而必须使用一个对16位值进行操作并返回32位值的内部函数:
1 | int32x4_t vmlal_s16(int32x4_t a, int16x4_t b, int16x4_t c); |
由于128位矢量只能容纳4个32位数据元素,因此4个元素可以并行化,而不是8个。
可以看出,所有这些操作都使用128位寄存器,它们不需要完全占用,就像使用 int16x4向量那样。 我们需要首先将 C 数组元素值加载到这些寄存器中,最后将结果值从寄存器取回至结果数组元素中。 我们也有实现这种想法的内部函数。 尽管有混合使用标量和向量的内部函数,然而上面内部函数只使用到了向量。 因此,同样的 var1digit 值可以装载到16x4矢量的所有4个通道中。
结合这些内部函数,最终的代码会是:
1 | #include <arm_neon.h> |
我创建了一个包含高精度的数据的模型,如图所示, 并以多组t1.val 和 t2.val来执行如下查询。在没有向量化时,执行时间为0.874毫秒:
1 | $ psql -c "explain analyze SELECT t1.id, t2.id, t1.val * t2.val FROM num_data t1, num_data t2" |
使用上面的向量化代码,相同的查询执行时间现在是0.360 ms,即超过2倍的加速: :
1 | $ psql -c "explain analyze SELECT t1.id, t2.id, t1.val * t2.val FROM num_data t1, num_data t2" |
由于数字的个别位数必须与另一个数字的位数相乘,对于精度较高的数字来说,效果更好。 我创建的模式精度在200-600之间。 但是当我在 ARM64 VM 上的测试时,从20精度开始,它的好处就显现出来了。
2. 自动向量化
并不总是需要编写使用 内部函数的代码。通常,如果我们组织并简化代码,今天的编译器,使用适当的编译器选项尝试识别代码是否可以被向量化, 并生成适当的汇编指令,以便利用 CPU 体系结构的 SIMD。实际上,在上面的代码中,我将反向 for-loop 简化为使用单个变量递增的正向 for-loop,gcc 编译器能够自动对简化的 for-loop 进行向量化。 以下是一些细节:
1 | diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c |
通过这个修改,在 mul_var()汇编代码中,我可以看到操作 NEON 向量的乘积指令(这些是 arm64指令) :
1 | smlal v1.4s, v2.4h, v3.4h |
gcc 编译器选项启用自动向量化是“-ftree-loop-vectorize”. 当使用 gcc -O3时,它始终是开启的。
虽然有一些例子表明 gcc 能够自动向量化甚至是反向循环,但是在上面的例子中,由于两个递减变量,它不能对原始代码这样做。 这就是为什么我必须将其简化为一个单变量递增的正向循环,这是最简单的方式来规避。
要检查 gcc 是否能够向量化一段代码,请使用 gcc -fopt-info-all 选项。输出信息如下:
1 | numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors |
用这种自动向量化的方法,我观察到的加速比大约是2.7倍。 这种加速比内部函数方式更快快,可能是因为编译器可能比我使用了更好的汇编向量化指令组合。
总结
向量化操作可以在重复操作中获得显著的性能提升。 虽然它很适合柱状数据结构,但是当前 PostgreSQL 代码中的一些代码可能会受益于利用 SIMD 进行这种调整。 尽可能使用编译器的自动向量化。 因为这样的做会使代码更干净,更容易移植。 与方法1相比,我们必须使用特定于 CPU 体系结构的内部函数。 但是选择这个例子是为了解释如何使用内部函数来向量化。 在编译器不能对代码进行向量化的情况下,我们应该使用编译器内部函数。 例如:这个。
With the advent of column store databases, there was an urge to make use of SIMD vector processing. It naturally fits into the way table data is arranged. Let’s first briefly check what is SIMD. It stands for Single Instruction Multiple Data. Today, CPU instructions support this kind of mechanism where the same instruction can be executed simultaneously on multiple data elements. E.g. Say, you want to double all the column values. Or remove the red component of the RGB values of pixels of an image. For large data, these operations are CPU bottlenecks. So SIMD cuts the CPU time significantly by operating simultaneously on 2, 4, 8, 16 or 32 (or more) data elements depending on the size of each data element. So suppose we want to do “arr[i] *= 2” for each element of “int32 arr[]”. Normally we would iterate through each of the elements for doing this operation. In the generated assembly code, MUL instruction will be run on each of the elements. With SIMD, we would arrange for loading 4 (or more) adjacent array elements into a 128-bit (or larger) CPU “vector” register, and then arrange for a “vectorized” version of the MUL instruction to be called using this register, and repeat this for each subsequent 4 element array section.
How do we arrange for generating such vectorized assembly instructions ? Well, one way is to write such an assembly code. But in most of the cases, we won’t need this method, thanks to the below two methods :
1. Vectorization Intrinsics
For a programmer, an intrinsic is just like any other function call. Underneath, the compiler replaces it with an appropriate assembly instruction. So instead of having to deal with registers using assembly instruction inside C/C++ code, call the corresponding intrinsic function. Each CPU architecture has it’s own set of intrinsics API, and corresponding header file. As an example, let’s vectorize a snippet of PostgreSQL code using ARM architecture’s SIMD intrinsics, to see how big a difference it makes by vectorizing things. Before that, you might want to quickly go through the NEON architecture to understand the naming conventions for registers, lanes and vectors. NEON is ARM’s brand name for SIMD architecture. NEON unit is a mandatory part of ARMv8 chip.
Here is a PostgreSQL code snippet from the mul_var() function that is used to multiply two PostgreSQL NUMERIC data types. As of this writing, it looks like this :
1 | for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2; |
where, the variables are declared as :
1 | int32 *dig; |
Here, you can see that the loop iterates i2+1 times. On each iteration, both i and i2 are decremented. That means, there is a fixed contiguous section of each of the two arrays where we want to repeatedly do the same arithmetic operation for every array element in this section. The arithmetic being done here is : multiply two int16 variables, and add up that product into an int32 variable. An assembly instruction is available which exactly does that : VMLA. The corresponding intrinsic is : vmlal_s16()
Let’s first simplify the above backward for-loop into an equivalent forward loop :
1 | i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3); |
So we want to vectorize the above multiply+accumulate statement. We have this intrinsic :
1 | int16x8_t vmlaq_s16(int16x8_t a, int16x8_t b, int16x8_t c); |
This does a+(b*c) and returns the result. a, b and c are vectors. The type int16x8_t signifies that the vector is in a 128-bit NEON register having 8 lanes, each lane having 16-bit signed integers. So vmlaq_s16() does the multiply+accumulate operation on all 8 lanes of the 3 vectors in parallel, and returns the 8 result values again in a int16x8_t vector. Each multiple+accumulate is contained in one particular lane of all the 3 vectors.
To avoid overflow, as can be seen in the above C snippet, the multiplication is accumulated into a 32-bit integer. So instead of vmlaq_s16(), we have to use an intrinsic that operates on 16-bit values and returns 32bit values :
1 | int32x4_t vmlal_s16(int32x4_t a, int16x4_t b, int16x4_t c); |
Since only 4 32-bit data elements can be accommodated in a 128-bit vector, 4 elements could be parallelized rather than 8.
As can be seen, all these operations use the 128-bit registers, even though they need not be fully occupied, as in the case with int16x4 vectors. We need to first load the C array element values into these registers, and in the end, store the resultant values back from the registers into the result array elements. We have intrinsics for that also. Although there are intrinsics that operate on a mix of scalar and vectors, the intrinsic used above uses only vectors. So the same var1digit value can be loaded into all 4 lanes of a 16x4 vector.
With these instrinsics, the final code looks like this :
1 | \#include <arm_neon.h> |
I created a schema that contains numerics with large precisions, as shown here, and ran the following query that multiplies t1.val and t2.val. With the non-vectorized code, the execution time showed .874 milliseconds :
1 | $ psql -c "explain analyze SELECT t1.id, t2.id, t1.val * t2.val FROM num_data t1, num_data t2" |
Since individual digits of the number have to be multiplied by the digits of the other number, the benefit is more for numerics with large precision. The schema I created has values with precisions in the range of 200-600. But the benefit starts showing up from around 20 precision onwards, with my ARM64 VM.
2. Auto-vectorization
It’s not always necessary to write code that uses intrinsics. Often if we arrange/simplify the code, today’s compilers, with appropriate compiler options, try to identify if the code can be vectorized, and generate appropriate assembly instructions that leverage the CPU architecture’s SIMD. In fact, above where I simplified the backward for-loop to a forward for-loop that uses a single variable increment, the gcc compiler is able to auto-vectorize the simplified for-loop. Here are the changes again:
1 | diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c |
With this change, in mul_var() assembly code, I could see the multiply-accumulate instructions that operate on NEON vectors (these are arm64 instructions) :
1 | smlal v1.4s, v2.4h, v3.4h |
gcc compiler option to enable auto-vectorization is “-ftree-loop-vectorize”. With gcc -O3, it is always enabled.
Although there are examples where gcc is able to auto-vectorize even backward loops, in the above case, it could not do so for the original code, seemingly because of two decrementing variables. That’s why I had to simplify it to a forward loop with a single variable increment, which is as simple as it gets.
To check whether gcc has been able to vectorize a particular code, use the gcc -fopt-info-all option. This outputs info such as this :
1 | numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors |
With this auto-vectorization method, the speedup I observed was around 2.7x. This speedup is higher than the intrinsics method, probably because the compiler might have used a better combination of assembly vectorized instructions than I did.
Conclusion
Vectorizing operations gives significant returns in repetitive operations. Although it suits well for columnar data, there could be some regions in current PostgreSQL code that might benefit from such tweaks to leverage SIMD. As far as possible, we should arrange for the compiler’s auto-vectorization. Such change is cleaner and clearly portable. Compare this with method 1 where we had to use intrinsics specific to the CPU architecture. But that example was chosen for the sake of explaining how to make use of intrinsics. In cases where it is not possible for the compiler to vectorize the code, we should use compiler intrinsics. E.g. check this out.