Test MariaDB CONNECT storage engine on arm64 platform
Author:zhaorenhai
MariaDB’s connect storage engine is an exciting storage engine.
The reason why it’s exciting is that the storage engine can do so much. Except for OLTP field, it is almost omnipotent. It can directly use SQL to query external files. It supports various common text files and log files. As long as you tell CONNECT the record format, you can use SQL to query and analyze. It also supports common file formats such as JSON, XML and INI. In addition to supporting local files, remote data can also be queried directly through the rest interface. For local files, in addition to query, it also supports insert, delete and other functions. It can also realize the dblink function in Oracle, that is, it can remotely query the tables of another database, even different types of databases, including Mysql, PostgreSQL, Oracle, etc., and even MongoDB of NoSQL type. Of course, on this basis, it is easier to import external files into the database, and it is more powerful and flexible than mysqlimport, because it can realize various format conversion and filtering. The same is true for exporting data to an external file. It also supports querying compressed files (currently only in ZIP format). Even binary file queries and updates are supported. The engine also supports virtual table which functions similar to Oracle’s dual. It can also support a variety of functions, such as single row to multi row, row column conversion, multi table mapping to a table.
In a word, with the CONNECT storage engine, everything can be transformed into SQL, and SQL can be transformed into everything.
Next, let’s take a look at the performance of the connect storage engine on the arm64 platform.
(note that this article is not a detailed guide to the use of CONNECT. As CONNECT involves too many contents, please refer to the official guidance for detailed guidance, various parameters, and restriction of each usage https://mariadb.com/kb/en/connect/ )
Our test platform uses Kunpeng virtual machine of Huawei cloud. The OS is Ubuntu 18.04 version.
It can be seen that MariaDB has released the CONNECT plugin of arm64 platform, which shows that the basic functions of this plugin should be OK on arm64.
If you compile MariaDB code yourself (please refer to this blog for specific compilation methods), you can run the connect test case, you would find that most of the test cases are successfully on arm64, except some skipped due to environmental reasons .
Log in to MariaDB and execute the following statement to enable the connect storage engine:
1
installsoname'ha_connect';
DOS Table Type
DOS table type actually corresponds to ordinary text file. The data of each column in the file is fixed length, only the last column is variable length.
We have a file in the following format, the size is 659M, and the number of lines is more than 13 million.
MariaDB [test]> pager md5sum;select * from userlist where id = 23456; PAGER set to 'md5sum' 8f3941802b868b2bbb43390ab5a2f1c8 - 256 rows in set (9.221 sec)
MariaDB [test]> pager md5sum;select * from userlist where id = 23456; PAGER set to 'md5sum' 8f3941802b868b2bbb43390ab5a2f1c8 - 256 rows in set (9.221 sec)
MariaDB [test]> pager md5sum;select * from userlist where userid like 'ltccjc%'; PAGER set to 'md5sum' 2bde22161d7a5bad42c377b7290cdff4 - 256 rows in set (19.196 sec)
MariaDB [test]> pager md5sum;select * from userlist where userid like 'ltccjc%'; PAGER set to 'md5sum' 2bde22161d7a5bad42c377b7290cdff4 - 256 rows in set (19.279 sec)
MariaDB [test]> delete from userlist where id=76245 and phonenum = '1812345678901'; Query OK, 1 row affected (9.719 sec)
You can see the query is fast, with 256 records queried from over 13 million records taking only 9 seconds. However, like queries are slightly slower, taking 19 seconds. Insertion is also fast. Inserting is essentially inserting a record at the end of the original file.
Update operations on tables of this DOS type are not recommended and strange things may occur. Because Update does not update the original file by default, it updates a temporary file. There is one parameterconnect_Use_Tempfile can be configured to support updating source files, but is not recommended for files with variable length records. If you are sure you want to use this feature, it is recommended that you test it yourself before using it.
FIX Table Type
The difference between FIX table type and DOS is whether the width of the last column is fixed or variable, FIX if it is fixed, or DOS if it is variable.
Now let’s test the FIX type. We have the following file, 667M in size, 13.43 million lines, and the last column is the same length.
MariaDB [test]> select count(*) from userlist_fix; +----------+ | count(*) | +----------+ | 13432193 | +----------+ 1 row in set (0.000 sec)
MariaDB [test]> pager md5sum;select * from userlist_fix where id = 23456; PAGER set to 'md5sum' 0871c0088df3f4dbf37b12e56611a0b0 - 128 rows in set (3.535 sec)
MariaDB [test]> pager md5sum;select * from userlist_fix where id = 23456; PAGER set to 'md5sum' 0871c0088df3f4dbf37b12e56611a0b0 - 128 rows in set (3.535 sec)
MariaDB [test]> pager md5sum;select * from userlist_fix where userid like 'ltccjc%'; PAGER set to 'md5sum' Empty set (9.344 sec)
MariaDB [test]> pager md5sum;select * from userlist_fix where userid like 'ltccjc%'; PAGER set to 'md5sum' Empty set (9.325 sec)
You can see that each type of FIX type operates much faster than the DOS type, indicating that MariaDB takes advantage of the fixed-length feature in the processing of fixed-length formats and has better performance. Last line, the data we inserted is not aligned, MariaDB will automatically complete for you when it inserts the database:
MariaDB [test]> delete from userlist_fix where phonenum = '1811111111111'; Query OK, 128 rows affected (3.672 sec)
So when we build a table from the original text file, if it is possible to process it as FIX type, try to process it as FIX type.
About connect_work_size
Next let’s modify connect_work_size parameter, which should be the memory area used by the CONNECT engine from the description of the Knowledge Base website. We enlarge it to see if it works.
MariaDB [test]> select count(*) from userlist; +----------+ | count(*) | +----------+ | 13819650 | +----------+ 1 row in set (13.576 sec)
MariaDB [test]> pager md5sum;select * from userlist where id = 23456; PAGER set to 'md5sum' 8f3941802b868b2bbb43390ab5a2f1c8 - 256 rows in set (9.306 sec)
MariaDB [test]> pager md5sum;select * from userlist where id = 23456; PAGER set to 'md5sum' 8f3941802b868b2bbb43390ab5a2f1c8 - 256 rows in set (9.305 sec)
MariaDB [test]> pager md5sum;select * from userlist_fix where id = 23456; PAGER set to 'md5sum' 6daf8a0815666570cee0011046e3b568 - 128 rows in set (3.526 sec)
MariaDB [test]> pager md5sum;select * from userlist_fix where id = 23456; PAGER set to 'md5sum' 6daf8a0815666570cee0011046e3b568 - 128 rows in set (3.526 sec)
As you can see from the above output, it has no effect. When the CONNECT storage engine processes files, it does not cache the entire file, increasing the memory space, and is not useful. The reason we’re so fast is also due to the fact that our Huawei Cloud virtual machine uses SSD storage, which is faster.
CSV Table Type
The CSV table type corresponds to ordinary CSV text files, but the delimiter here can be specified, either as a comma or as other specified characters.
Let’s test the performance of the CSV table type.
We create a CSV file in following format, 603M in size, with over 13.9 million lines. Columns are separated by |.
create table userlist_csv ( id int not null, phonenum char(13) not null, userid char(29) not null) engine=CONNECT table_type=CSV file_name='/tmp/data.csv' header=0 sep_char='|' quoted=0;
Then we do some query ,insertion and deletion operations:
MariaDB [test]> select count(*) from userlist_csv; 9229c8871853852300fca7ce0bb33598 - 1 row in set (17.835 sec)
MariaDB [test]> select count(*) from userlist_csv; 9229c8871853852300fca7ce0bb33598 - 1 row in set (17.712 sec)
MariaDB [test]> pager md5sum;select * from userlist_csv where id = 23456; PAGER set to 'md5sum' 9a02ca9b5c3044a73f4441c10945ff91 - 140 rows in set (9.356 sec)
MariaDB [test]> pager md5sum;select * from userlist_csv where id = 23456; PAGER set to 'md5sum' 9a02ca9b5c3044a73f4441c10945ff91 - 140 rows in set (9.355 sec)
MariaDB [test]> pager md5sum;select * from userlist_csv where userid like 'ltccjc%'; PAGER set to 'md5sum' Empty set (21.788 sec)
MariaDB [test]> pager md5sum;select * from userlist_csv where userid like 'ltccjc%'; PAGER set to 'md5sum' Empty set (21.781 sec)
MariaDB [test]> delete from userlist_csv where id=76245 and phonenum='1811111111111'; Query OK, 141 rows affected (6.674 sec)
You can see that the performance is still good, count query and like query are slightly slower than the DOS type, both query and insert operations are similar to the DOS type, and delete operation is faster than the DOS type.
FMT Table Type
The FMT table type is more flexible than CSV, and you can specify the format and length of each field by format matching. FMT is useful when analyzing various types of logs.
We use it to analyze an operating system auth log.
This auth log has over 600 M, more than 5 million rows.
1 2 3 4
root@mariadb-arm-test:/var/log# ls -lhrt auth.log.1 -rw-r----- 1 syslog adm 625M Dec 14 15:11 auth.log.1 root@mariadb-arm-test:/var/log# wc -l auth.log.1 5890048 auth.log.1
Dec 12 16:39:23 localhost groupadd[32556]: group added to /etc/group: name=mariadb, GID=1000 Dec 12 16:39:23 localhost groupadd[32556]: group added to /etc/gshadow: name=mariadb Dec 12 16:39:23 localhost groupadd[32556]: new group: name=mariadb, GID=1000 Dec 12 16:39:23 localhost useradd[32560]: new user: name=mariadb, UID=1000, GID=1000, home=/home/mariadb, shell=/bin/bash Dec 12 16:39:43 localhost passwd[32569]: pam_unix(passwd:chauthtok): password changed for mariadb Dec 12 16:39:46 localhost chfn[32570]: changed user 'mariadb' information Dec 12 16:45:36 localhost su[1805]: Successful su for mariadb by root Dec 12 16:45:36 localhost su[1805]: + /dev/pts/0 root:mariadb Dec 12 16:45:36 localhost su[1805]: pam_unix(su:session): session opened for user mariadb by root(uid=0) Dec 12 16:45:36 localhost su[1805]: pam_systemd(su:session): Cannot create session: Already running in a session Dec 12 16:50:01 localhost sudo: mariadb : TTY=pts/0 ; PWD=/home/mariadb ; USER=root ; COMMAND=/usr/bin/apt-get install build-essential libncurses5-dev gnutls-dev libcurl4-gnutls-dev Dec 12 16:50:01 localhost sudo: pam_unix(sudo:session): session opened for user root by root(uid=0) Dec 12 16:50:04 localhost sudo: pam_unix(sudo:session): session closed for user root Dec 12 16:50:14 localhost sudo: mariadb : TTY=pts/0 ; PWD=/home/mariadb ; USER=root ; COMMAND=/usr/bin/apt-get install build-essential libncurses5-dev gnutls-dev libcurl4-gnutls-dev Dec 12 16:50:14 localhost sudo: pam_unix(sudo:session): session opened for user root by root(uid=0) Dec 12 16:50:25 localhost sudo: pam_unix(sudo:session): session closed for user root Dec 12 16:50:49 localhost sudo: mariadb : TTY=pts/0 ; PWD=/home/mariadb ; USER=root ; COMMAND=/usr/bin/apt-get install zlib1g-dev ccache libnuma-dev libxml2-dev cmake bison Dec 12 16:50:49 localhost sudo: pam_unix(sudo:session): session opened for user root by root(uid=0) Dec 12 16:51:08 localhost sudo: pam_unix(sudo:session): session closed for user root Dec 12 16:56:01 localhost CRON[7999]: pam_unix(cron:session): session opened for user root by (uid=0) Dec 12 16:56:01 localhost CRON[7999]: pam_unix(cron:session): session closed for user root Dec 12 17:17:01 localhost CRON[16493]: pam_unix(cron:session): session opened for user root by (uid=0) Dec 12 17:17:01 localhost CRON[16493]: pam_unix(cron:session): session closed for user root
Create the table:
1 2 3 4 5 6 7 8
create table authlog ( month char(3) not null field_format='%n%s%n', day char(2) not null field_format=' %n%s%n', time char(8) not null field_format=' %n%s%n', hostname char(9) not null field_format=' %n%s%n', module char(20) not null field_format=' %n%s%n', message char(255) not null field_format=' %n%255[^\n]%n') engine=CONNECT table_type=FMT file_name='/var/log/auth.log.1' ;
Query it , the format looks better now:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
MariaDB [test]> select * from authlog limit 10; +-------+-----+----------+-----------+----------------------+-----------------------------------------------------------------------+ | month | day | time | hostname | module | message | +-------+-----+----------+-----------+----------------------+-----------------------------------------------------------------------+ | Dec | 2 | 20:47:26 | localhost | systemd-logind[721]: | Power key pressed. | | Dec | 2 | 20:47:26 | localhost | systemd-logind[721]: | Powering Off... | | Dec | 2 | 20:47:26 | localhost | systemd-logind[721]: | System is powering down. | | Dec | 2 | 20:47:27 | localhost | systemd: | pam_unix(systemd-user:session): session closed for user root | | Dec | 11 | 10:07:26 | localhost | systemd-logind[960]: | New seat seat0. | | Dec | 11 | 10:07:26 | localhost | systemd-logind[960]: | Watching system buttons on /dev/input/event0 (Power Button) | | Dec | 11 | 10:07:26 | localhost | systemd-logind[960]: | Watching system buttons on /dev/input/event2 (QEMU QEMU USB Keyboard) | | Dec | 11 | 10:07:50 | localhost | sshd[2384]: | Server listening on 0.0.0.0 port 22. | | Dec | 11 | 10:07:50 | localhost | sshd[2384]: | Server listening on :: port 22. | | Dec | 11 | 10:07:51 | localhost | sshd[2384]: | Received signal 15; terminating. | +-------+-----+----------+-----------+----------------------+-----------------------------------------------------------------------+
Let’s do some queries:
1 2 3 4 5 6 7
MariaDB [test]> select count(*) from authlog where month='Dec' and day='12' and module like 'sshd%'; +----------+ | count(*) | +----------+ | 3170304 | +----------+ 1 row in set (15.609 sec)
The speed is acceptable.
Of course, if you use the grep command to filter in the following way, it will be faster:
Data is separated by fields and the results are more accurate.
If you want to import the log into the database, it’s very convenient. Just `create table tablename as select * from authlog’.
Sometimes it is convenient to create a table when you want to join the contents of the log and the contents of some tables.
Summary
CONNECT has many functions, such as parsing xml, json, remote querying other database functions, row and column conversion, virtual columns and so on, which we have not demonstrated. However, these features can be found to be supported on arm64 by the test cases previously run. And there are not many scenarios where these functions involve performance. Like XML and Json, you hardly can find some files which are several hundreds megabytes. Remote queries on remote databases actually rely more on the performance of remote databases themselves. For specific usage of these features, you can refer to MariaDB’s official Knowledge Base guide, which is detailed.
This article mainly tests several CONNECT storage engine scenarios may be related to performance, need to parse large files, found that the performance on arm64 is also good. In addition, the query scenarios tested in this paper are all unindexed scenes. If the data’s selectivity is good, you can choose to index and query again, then the result can be returned instantly. Of course, the index needs to take up a certain amount of disk space, and it also requires write permission to the disk. Please refer to the official guidance of MariaDB for specific usage.