Project update 13 of 20
One key differentiator of Ten64 from general-purpose and media-oriented appliances is the networking-oriented acceleration capabilities of Ten64’s LS1088 System-on-Chip.
The previous 10G Options & Performance post described some of the options available to improve packet routing performance - all the way up to the programmable offload engine (AIOP).
There are two other workloads you can accelerate on Ten64. In this post, we will describe how Ten64 can accelerate cryptography (important for VPNs) and AI workloads using an AI acceleration card.
The LS1088 SoC provides two separate methods of cryptography acceleration:
The ARMv8 Cryptography Extension provides acceleration for AES, SHA-1, SHA-224 and SHA-256. It is analogous to AES-NI in most modern x86 processors. It is an optional extension that is not present on all ARM-powered processors, but it is present on the LS1088. You can check whether it is available on your ARM machine by looking at the flags in /proc/cpuinfo:
With the extension (note the aes, pmull, sha1 and sha2 flags):

$ cat /proc/cpuinfo | grep Features | head -n 1
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

Without it:

$ cat /proc/cpuinfo | grep Features | head -n 1
Features : fp asimd evtstrm crc32 cpuid
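The same check can be scripted. The following is a minimal Python sketch, assuming the `Features` line has the layout shown in the two outputs above. The function name `crypto_flags` and the `CRYPTO_FLAGS` tuple are illustrative names chosen here, not part of any Ten64 tooling.

```python
# Flags that indicate the ARMv8 Cryptography Extension, taken from the
# /proc/cpuinfo output shown above. (The tuple name is our own choice.)
CRYPTO_FLAGS = ("aes", "pmull", "sha1", "sha2")


def crypto_flags(cpuinfo_text):
    """Return the crypto-related flags found on the first Features line.

    An empty list means none of the flags in CRYPTO_FLAGS were present
    (or that no Features line was found at all).
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("Features"):
            # Everything after the first colon is a space-separated flag list.
            flags = set(line.partition(":")[2].split())
            return [flag for flag in CRYPTO_FLAGS if flag in flags]
    return []
```

On a Linux machine, this could be called as `crypto_flags(open("/proc/cpuinfo").read())`. Only the first `Features` line is inspected, which mirrors the `head -n 1` in the shell command.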
To illustrate the difference, we ran OpenSSL’s speed benchmark on the Ten64, and Raspberry Pi 3 and 4. The Raspberry Pi 3 also uses the Cortex-A53 core (like the LS1088), but does not have the ARMv8 crypto extension.
The newer Raspberry Pi 4 uses the Cortex-A72 - a faster, out-of-order core, but also lacks the cryptography extension.
As we can see, the lack of AES acceleration is a major handicap — the LS1088 is 18-22x faster in this particular use case.
The ARMv8 Cryptography Extension is used by OpenSSL and wolfSSL, and by the Linux kernel through the arm/sha*-ce kernel modules, so most applications using these libraries should be able to take advantage of it.
The NXP SEC engine (also known as CAAM) is NXP’s encryption acceleration block. It is designed to accelerate communications workloads like IPSec, as well as some earlier versions of TLS and ciphers used in standards such as 3G/UMTS (Kasumi, Snow) and Wi-Fi. It also implements some older, but still relevant standards such as RSA and 3DES.
The SEC engine is best at accelerating packets travelling to and from the network stack in the kernel (or similar environments such as DPDK). It has higher latency than the ARMv8 crypto extensions because data packets need to be transferred in and out of it via DMA, whereas the crypto extensions are part of the CPU instruction set.
It is possible to use SEC from userspace, using mechanisms such as
but you might end up with better performance using the CPU instructions.
IPSec throughput comparison between ARMv8 crypto and SEC engine. We anticipate the SEC engine throughput can be improved even further in the future.
Nonetheless, you can get some impressive performance from the SEC engine for IPSec workloads, because it can accelerate not only the encryption cipher but also a chain of related operations such as AEAD and HMAC, as can be seen in /proc/crypto when the SEC drivers are compiled into the kernel:
$ cat /proc/crypto | grep aes | grep caam
driver : cmac-aes-caam
driver : xcbc-aes-caam
driver : seqiv-authenc-hmac-sha512-rfc3686-ctr-aes-caam
driver : authenc-hmac-sha512-rfc3686-ctr-aes-caam
driver : seqiv-authenc-hmac-sha384-rfc3686-ctr-aes-caam
driver : authenc-hmac-sha384-rfc3686-ctr-aes-caam
driver : seqiv-authenc-hmac-sha256-rfc3686-ctr-aes-caam
driver : authenc-hmac-sha256-rfc3686-ctr-aes-caam
driver : seqiv-authenc-hmac-sha224-rfc3686-ctr-aes-caam
driver : authenc-hmac-sha224-rfc3686-ctr-aes-caam
driver : seqiv-authenc-hmac-sha1-rfc3686-ctr-aes-caam
driver : authenc-hmac-sha1-rfc3686-ctr-aes-caam
driver : seqiv-authenc-hmac-md5-rfc3686-ctr-aes-caam
driver : authenc-hmac-md5-rfc3686-ctr-aes-caam
driver : echainiv-authenc-hmac-sha512-cbc-aes-caam
driver : authenc-hmac-sha512-cbc-aes-caam
driver : echainiv-authenc-hmac-sha384-cbc-aes-caam
driver : authenc-hmac-sha384-cbc-aes-caam
driver : echainiv-authenc-hmac-sha256-cbc-aes-caam
driver : authenc-hmac-sha256-cbc-aes-caam
driver : echainiv-authenc-hmac-sha224-cbc-aes-caam
driver : authenc-hmac-sha224-cbc-aes-caam
driver : echainiv-authenc-hmac-sha1-cbc-aes-caam
driver : authenc-hmac-sha1-cbc-aes-caam
driver : echainiv-authenc-hmac-md5-cbc-aes-caam
driver : authenc-hmac-md5-cbc-aes-caam
driver : gcm-aes-caam
driver : rfc4543-gcm-aes-caam
driver : rfc4106-gcm-aes-caam
driver : ecb-aes-caam
driver : xts-aes-caam
driver : rfc3686-ctr-aes-caam
driver : ctr-aes-caam
driver : cbc-aes-caam
(For a full output from /proc/crypto, see the cryptographic acceleration page in the Ten64 manual.)
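To separate the combined cipher-plus-HMAC entries from the plain cipher modes in a listing like this, the filtering can be done in a few lines of Python. This is only a sketch: it assumes the usual `key : value` layout of /proc/crypto and uses the substring `authenc` as a rough marker for the chained drivers. The function names are hypothetical.

```python
def caam_aes_drivers(proc_crypto_text):
    """Return driver names mentioning both 'aes' and 'caam'.

    Roughly equivalent to `grep aes | grep caam`, but restricted to
    the 'driver' lines of /proc/crypto.
    """
    drivers = []
    for line in proc_crypto_text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key.strip() == "driver" and "aes" in value and "caam" in value:
            drivers.append(value)
    return drivers


def split_by_kind(drivers):
    """Split drivers into (chained authenc entries, everything else).

    'authenc' in the name is used as a heuristic for entries that
    combine a cipher with an HMAC in a single request.
    """
    chained = [d for d in drivers if "authenc" in d]
    other = [d for d in drivers if "authenc" not in d]
    return chained, other
```

On a system with the SEC drivers loaded, `split_by_kind(caam_aes_drivers(open("/proc/crypto").read()))` would return the two groups as lists.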
IPSec may not be the easiest VPN solution to use (especially compared with alternatives like OpenVPN and WireGuard), but this is balanced by its ubiquity (many operating systems and network appliances implement it) and its ability to leverage hardware offloads such as the SEC engine.
Those of you working with machine learning and AI may be interested to know that the Coral AI EdgeTPU cards work in the Ten64. The Coral PCIe cards are available in both Mini PCIe and M.2 form factors.
The Coral Mini PCIe card installed on a Ten64 board
While we haven’t had an opportunity to piece together an AI/ML demo of our own, the TensorFlow Lite image classification example shows an impressive speedup:
Without Edge TPU acceleration:

----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
140.4ms
138.9ms
139.1ms
139.3ms
139.3ms
-------RESULTS--------
Ara macao (Scarlet Macaw): 0.77734

With the Edge TPU:

----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
12.6ms
2.5ms
2.4ms
2.4ms
2.4ms
-------RESULTS--------
Ara macao (Scarlet Macaw): 0.77734
That is a speedup of more than 50x, which opens up possibilities involving real-time processing, such as classifying objects from a video feed.
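The speedup figure can be reproduced from the timings printed by the two runs. The sketch below is one way to do the arithmetic, assuming the first inference of each run is discarded as warm-up (as the note in the output suggests) and the remaining times are averaged. The helper name `steady_state_ms` is our own.

```python
import re
from statistics import mean


def steady_state_ms(output_text):
    """Average the '<number>ms' timings in the text, skipping the first.

    The first inference is treated as warm-up because it includes
    loading the model. Needs at least two timings in the text.
    """
    times = [float(t) for t in re.findall(r"(\d+(?:\.\d+)?)ms", output_text)]
    return mean(times[1:])


# Timings copied from the two runs shown above.
slow_run = "140.4ms 138.9ms 139.1ms 139.3ms 139.3ms"
fast_run = "12.6ms 2.5ms 2.4ms 2.4ms 2.4ms"

speedup = steady_state_ms(slow_run) / steady_state_ms(fast_run)
print(f"steady-state speedup: {speedup:.1f}x")
```

With these numbers the ratio works out to roughly 57x, consistent with the "more than 50x" figure. Including the first inference in the average would lower the ratio, since model loading dominates that first measurement on the faster run.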
For information on how to set up a development environment for the Coral EdgeTPU, see our application note.