
Using the DroboFS crypto hardware acceleration

#1

I always wondered about the “Cryptographic Engine” that ships with the DroboFS’s motherboard (Feroceon-MV78200).

According to the functional specification, page 22 [1]:

So I wondered if this could be used in something like OpenSSL. It turns out it can [2].

To compile OpenSSL 1.0.1c with support for the hardware acceleration, two things are needed:

  1. Copy the kernel header file ./crypto/ocf/cryptodev.h into the crypto/ folder under the OpenSSL source folder.
  2. Use these arguments when configuring OpenSSL: -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS
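
The two steps above might look roughly like this in practice. This is only a sketch: the `KERNEL_SRC` and `OPENSSL_SRC` paths are hypothetical placeholders, `build_openssl_cryptodev` is a made-up helper name, and the `linux-armv4` Configure target is an assumption (the post does not say which target or toolchain was used).

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own kernel and OpenSSL source trees.
KERNEL_SRC="${KERNEL_SRC:-$HOME/src/linux}"
OPENSSL_SRC="${OPENSSL_SRC:-$HOME/src/openssl-1.0.1c}"

# The two defines named in the post.
CRYPTODEV_FLAGS="-DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS"

# Runs in a subshell so the caller's working directory is left untouched.
build_openssl_cryptodev() (
    # Step 1: copy the OCF header into OpenSSL's crypto/ folder.
    cp "$KERNEL_SRC/crypto/ocf/cryptodev.h" "$OPENSSL_SRC/crypto/" || exit 1
    cd "$OPENSSL_SRC" || exit 1
    # Step 2: configure with the cryptodev defines, then build.
    # "linux-armv4" is an assumed target, not taken from the post.
    ./Configure linux-armv4 $CRYPTODEV_FLAGS || exit 1
    make
)
```

Calling `build_openssl_cryptodev` after sourcing the script would perform both steps; optimization flags such as -O3 -marm could be appended to the Configure line in the same way.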

I compiled versions with -O3 and -Os for both -marm and -mthumb and here are the before/after results.

Before for -Os -mthumb:

$ sudo -s chmod o-rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 1110658 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 64 size blocks: 318630 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 82762 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 20894 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 2624 aes-128-cbc's in 3.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       5903.83k     6774.86k     7038.89k     7108.12k     7141.46k

Before for -O3 -mthumb:

$ sudo -s chmod o-rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 1333854 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 64 size blocks: 396645 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 97872 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 26297 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 3295 aes-128-cbc's in 3.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       7090.25k     8433.65k     8822.26k     8946.22k     8967.65k

Before for -O3 -marm:

$ sudo -s chmod o-rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 1676376 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 64 size blocks: 498816 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 130130 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 1024 size blocks: 33163 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 4165 aes-128-cbc's in 3.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       8910.97k    10606.05k    11067.53k    11282.03k    11335.44k

So far, the optimization level makes a visible but modest difference: at 8192-byte blocks, -O3 -marm is roughly 1.6x faster than -Os -mthumb (11335k vs 7141k). What happens if we enable the hardware acceleration?

After for -Os -mthumb:

ricardo@DroboFS:~/tmp$ sudo -s chmod o+rw /dev/crypto                                                                  
ricardo@DroboFS:~/tmp$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 103967 aes-128-cbc's in 0.17s
Doing aes-128-cbc for 3s on 64 size blocks: 88522 aes-128-cbc's in 0.11s
Doing aes-128-cbc for 3s on 256 size blocks: 56152 aes-128-cbc's in 0.09s
Doing aes-128-cbc for 3s on 1024 size blocks: 22848 aes-128-cbc's in 0.03s
Doing aes-128-cbc for 3s on 8192 size blocks: 3290 aes-128-cbc's in 0.01s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc       9785.13k    51503.71k   159721.24k   779878.40k  2695168.00k

After for -O3 -mthumb:

$ sudo -s chmod o+rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 105181 aes-128-cbc's in 0.12s
Doing aes-128-cbc for 3s on 64 size blocks: 89682 aes-128-cbc's in 0.08s
Doing aes-128-cbc for 3s on 256 size blocks: 56550 aes-128-cbc's in 0.05s
Doing aes-128-cbc for 3s on 1024 size blocks: 22903 aes-128-cbc's in 0.05s
Doing aes-128-cbc for 3s on 8192 size blocks: 3286 aes-128-cbc's in 0.00s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      14024.13k    71745.60k   289536.00k   469053.44k         infk

After for -O3 -marm:

$ sudo -s chmod o+rw /dev/crypto                                                                  
$ ./openssl speed -evp aes-128-cbc                                                                
Doing aes-128-cbc for 3s on 16 size blocks: 104514 aes-128-cbc's in 0.10s
Doing aes-128-cbc for 3s on 64 size blocks: 83879 aes-128-cbc's in 0.05s
Doing aes-128-cbc for 3s on 256 size blocks: 56418 aes-128-cbc's in 0.02s
Doing aes-128-cbc for 3s on 1024 size blocks: 22909 aes-128-cbc's in 0.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 3291 aes-128-cbc's in 0.00s
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      16722.24k   107365.12k   722150.40k  2345881.60k         infk

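If you are saying “wow!” now, I’m right there with you. From 3 seconds of CPU time to nearly 0 seconds. Keep in mind that the throughput column is computed from CPU time, so the number of operations completed in each 3-second run is the fairer comparison. By that measure, at 8192-byte blocks the software-only “-O3 -marm” build still does better than the accelerator (4165 vs 3291 operations, i.e., a 21% drop with acceleration), but “-Os -mthumb” gets beaten (2624 vs 3290, i.e., a 25% increase).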
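
The percentages can be checked with a small awk sketch. `pct_change` is a helper name made up for this example; it compares the number of 8192-byte operations completed in the 3-second runs, software-only first, then accelerated.

```shell
#!/bin/sh
# Percent change from a (software-only count) to b (accelerated count).
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

pct_change 4165 3291   # -O3 -marm:   prints -21.0
pct_change 2624 3290   # -Os -mthumb: prints 25.4
```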

Since it is a good idea to spare as much CPU time as we can on the FS, I’ll be posting hardware-accelerated versions of OpenSSH, OpenVPN and other apps that depend on OpenSSL as they get updated.

#2

Ricardo, this latest achievement is possibly your greatest yet.
Your knowledge of the DroboFS, related hardware, and how to compile and tune software is a marvel. Your willingness and ability to communicate clearly is rare and deeply appreciated.
We’re unreasonably lucky to have you around; Drobo Inc. is luckiest of all.

Thank you, and keep up the great work. If you’re ever near San Diego, the meal and drinks are on me. :slight_smile:

#3

The stuff you accomplish, ricardo, is nothing short of amazing. :slight_smile:

#4

Thank you for the kind words. :slight_smile:

#5

Status update: in a word, conflicted.

I noticed today that I missed a very important configuration option for openssl that enables very efficient assembler implementations of the same algorithms that the hardware accelerator offers.

So I ran a complete benchmark on openssl with and without the hardware accelerator. This time using -elapsed, so that the numbers should be more comparable. I added RC4 as a common baseline, since it isn’t accelerated.
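
Why -elapsed matters: by default `openssl speed` divides the bytes processed by user CPU time, so work that happens in the accelerator barely counts against the clock. A minimal sketch of that arithmetic, using the 8192-byte rows for -Os -mthumb from the first post (`throughput_k` is a made-up helper name, and the 3-second wall-clock figure is the nominal run length, not a measurement):

```shell
#!/bin/sh
# Throughput in 1000s of bytes per second: count * block size / seconds / 1000.
throughput_k() {
    awk -v n="$1" -v size="$2" -v secs="$3" \
        'BEGIN { printf "%.2f\n", n * size / secs / 1000 }'
}

throughput_k 2624 8192 3.01   # software only, CPU time:      7141.46
throughput_k 3290 8192 0.01   # accelerated, CPU time:        2695168.00
throughput_k 3290 8192 3      # accelerated, nominal 3 s run: 8983.89
```

The first two results reproduce the figures printed in the first post; the third shows what the accelerated run looks like when divided by the nominal wall-clock time instead.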

Without the accelerator:

With the accelerator:

With the accelerator, but without -elapsed:

So what to make of this? It seems that, yes, the hardware accelerator does reduce CPU usage a lot. By a lot, I mean 2 orders of magnitude.

On the other hand, the raw throughput of the CPU (with the -D*_ASM flags) is at best one order of magnitude faster (for block sizes up to 256 bytes), and even for more realistic block sizes (1k) it is still more than double the throughput of the hardware accelerator.

At this point, the only advantage of having hardware acceleration is to reduce CPU usage. But if the cost of that is less throughput, I’m not really sure it makes sense.

In fact, the best throughput comes from using RC4 (Cipher arcfour):

It is at least twice as fast as AES (more like 2.5x).

And to add insult to injury, OpenSSH does not like the hardware accelerator in the FS. When I try to connect to the server using AES-128 with the hardware acceleration enabled, this happens on the server side:

And it dies. If I remove hardware acceleration, then the connection goes through without any problems.

I’ve googled around to see if anyone has a workaround or patch for this, but so far I have found no solution. Worse than that, I’ve seen quite a few reports of the same kind of problem with other apps, such as lighttpd, openvpn, etc. My impression is that for this thing to work properly we would need a newer version of the kernel. You can guess what the odds of that happening are…

#6

it’s probably closely guarded as…
wait for it…
“the kernels secret recipe” :smiley:

#7

I know you are joking, but I have been investigating how I could upgrade the FS’s kernel by myself. Yeah, I’m that crazy.

#8

You’re a good kind of crazy, ricardo. :slight_smile:

#9

I have never before wanted to see a “Like” button on a web site.