Virtualization Performance Tests - Round One

In this update we explore the performance of virtualized Linux guests on an OpenPOWER Linux host with QEMU. Several tests are run, and the kernel compile and Unix Bench tests yield a somewhat surprising result: virtual machines actually provide a performance boost compared to native execution when the host SMT is set to 1! We suspect this is due to native host scheduling problems, but this also implies that there is considerable untapped potential latent within these OpenPOWER machines.

Test Setup

For all tests below, we use a Firestone reference server with dual 8-core 190 W CPUs, 4 Centaur memory buffers, and 256 GB RAM. While the absolute numbers will change on a Talos machine, the proportions should be nearly identical when comparing native execution to the two virtualized modes.

OpenPOWER machines under KVM/QEMU have two separate virtualization modes available, "Hypervisor" (kvm-hv) and "Problem" (kvm-pr). The hypervisor mode uses the native virtualization extensions of POWER7 and greater CPUs, and provides the best possible performance of any virtualization mode on POWER systems. However, this mode is limited to the host CPU generation or the prior CPU generation, and furthermore cannot be used from inside another virtual machine. In comparison, problem mode executes the virtual machine completely in user mode by utilizing the problem handlers of the POWER architecture, and emulates privileged instructions where needed. This virtualization mode can be used on any PPC / POWER hardware, can emulate any PPC / POWER CPU type or generation, and is suitable for nested virtualization, but carries a variable performance penalty based on workload.

One final variable is that POWER machines can be set to different SMT (Simultaneous MultiThreading) modes. POWER8 CPUs natively support 8 simultaneous threads (SMT 8), but some workloads (e.g. QEMU) require the native SMT support to be disabled (SMT 1). As a result, we benchmark the native SMT 8 performance alongside the native SMT 1 performance for direct comparison. It is hoped that over time, as QEMU on POWER matures further, this limitation can be removed.

Test 1 - Kernel Compile

Building on our previous kernel compilation tests, we ran timed compile tests on several native and virtualized configurations. As before, a snapshot of the Linux kernel source tree was pulled and compiled for POWER using the stock Debian configuration. The compilation took place entirely within a dedicated tmpfs mount. The command used to compile was:

time make -j<core count>

Configuration                       Wall Time
Native (SMT 8, 128 cores)           4m15.934s
Native (SMT 1, 16 cores)            23m33.949s
Virtualized HV (SMT 1, 16 cores)    7m13.634s
Virtualized PR (SMT 1, 16 cores)    20m37.722s
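
To put the wall times above on a common scale, the following minimal Python sketch converts the `time` output to seconds and expresses each configuration relative to the native SMT 1 run. The helper name parse_wall_time and the dictionary layout are illustrative choices made here, not part of the original test harness.

```python
import re

def parse_wall_time(text):
    """Convert a `time`-style string such as '4m15.934s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized wall time: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Wall times copied from the kernel compile results above.
wall_times = {
    "Native (SMT 8, 128 cores)": "4m15.934s",
    "Native (SMT 1, 16 cores)": "23m33.949s",
    "Virtualized HV (SMT 1, 16 cores)": "7m13.634s",
    "Virtualized PR (SMT 1, 16 cores)": "20m37.722s",
}

seconds = {name: parse_wall_time(t) for name, t in wall_times.items()}

# Speedup relative to native SMT 1 (values above 1 mean a faster build).
baseline = seconds["Native (SMT 1, 16 cores)"]
speedup = {name: baseline / s for name, s in seconds.items()}

for name, s in seconds.items():
    print(f"{name:34s} {s:9.3f} s  {speedup[name]:5.2f}x vs native SMT 1")
```

On these figures the kvm-hv guest finishes roughly 3.3 times faster than the native SMT 1 run, while native SMT 8 remains the fastest configuration overall.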

Test 2 - STREAM

Also building on our previous memory bandwidth tests, we ran STREAM benchmarks on all four configurations. The command used to run the benchmark was:

OMP_NUM_THREADS=<core count> ./stream

Native (SMT 8, 128 cores)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           32822.1     0.026917     0.019499     0.041350
Scale:          35293.1     0.027499     0.018134     0.035020
Add:            45206.4     0.025632     0.021236     0.031831
Triad:          43533.4     0.025338     0.022052     0.029733

Native (SMT 1, 16 cores)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           47365.7     0.014247     0.013512     0.015773
Scale:          51871.6     0.013212     0.012338     0.014527
Add:            58472.5     0.018189     0.016418     0.027140
Triad:          60131.6     0.016697     0.015965     0.018448

Virtualized HV (SMT 1, 16 cores)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           36221.7     0.019791     0.017669     0.022151
Scale:          32368.9     0.020795     0.019772     0.022489
Add:            38326.4     0.026114     0.025048     0.027989
Triad:          38551.0     0.026241     0.024902     0.027209

Virtualized PR (SMT 1, 16 cores)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           34471.4     0.022185     0.018566     0.026645
Scale:          32199.6     0.022841     0.019876     0.028890
Add:            37231.0     0.029773     0.025785     0.035886
Triad:          39228.5     0.027465     0.024472     0.034118
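
The four STREAM tables are easier to compare as ratios. The sketch below copies the "Best Rate MB/s" column from the output above and normalizes every configuration to native SMT 1; the function name relative_to and the data layout are assumptions made for illustration only.

```python
# Best Rate (MB/s) values copied from the STREAM output above.
stream_best_rate = {
    "Native (SMT 8, 128 cores)": {
        "Copy": 32822.1, "Scale": 35293.1, "Add": 45206.4, "Triad": 43533.4},
    "Native (SMT 1, 16 cores)": {
        "Copy": 47365.7, "Scale": 51871.6, "Add": 58472.5, "Triad": 60131.6},
    "Virtualized HV (SMT 1, 16 cores)": {
        "Copy": 36221.7, "Scale": 32368.9, "Add": 38326.4, "Triad": 38551.0},
    "Virtualized PR (SMT 1, 16 cores)": {
        "Copy": 34471.4, "Scale": 32199.6, "Add": 37231.0, "Triad": 39228.5},
}

def relative_to(baseline_name, table):
    """Return each configuration's bandwidth as a fraction of the baseline."""
    baseline = table[baseline_name]
    return {
        name: {kernel: rates[kernel] / baseline[kernel] for kernel in baseline}
        for name, rates in table.items()
    }

relative = relative_to("Native (SMT 1, 16 cores)", stream_best_rate)

for name, ratios in relative.items():
    cells = "  ".join(f"{kernel}: {ratio:4.0%}" for kernel, ratio in ratios.items())
    print(f"{name:34s} {cells}")
```

Unlike the compile test, these numbers show both guests reaching only about 62% to 76% of the native SMT 1 bandwidth, so the virtualization advantage does not extend to raw memory throughput in this run.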

Test 3 - UNIX Bench

Given the rather odd results shown above, a more comprehensive systemwide open-source benchmark was sought. Unix Bench gives detailed information on the speed of various system calls, process spawning, etc. and we ran this benchmark on all four of the test system configurations.

Native (SMT 8, 128 cores)
BYTE UNIX Benchmarks (Version 5.1.3)
System: alsvidr: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up  1:26,  4 users,  load average: 0.87, 0.52, 0.33; runlevel 2016-11-10
128 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables       28248374.7 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     3969.4 MWIPS (9.7 s, 7 samples)
Execl Throughput                               1226.4 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        593518.2 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          157303.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1914860.8 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1406112.2 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 157185.1 lps   (10.0 s, 7 samples)
Process Creation                               6354.3 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3976.9 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1312.8 lpm   (60.0 s, 2 samples)
System Call Overhead                        1459471.8 lps   (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   28248374.7   2420.6
Double-Precision Whetstone                       55.0       3969.4    721.7
Execl Throughput                                 43.0       1226.4    285.2
File Copy 1024 bufsize 2000 maxblocks          3960.0     593518.2   1498.8
File Copy 256 bufsize 500 maxblocks            1655.0     157303.0    950.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    1914860.8   3301.5
Pipe Throughput                               12440.0    1406112.2   1130.3
Pipe-based Context Switching                   4000.0     157185.1    393.0
Process Creation                                126.0       6354.3    504.3
Shell Scripts (1 concurrent)                     42.4       3976.9    937.9
Shell Scripts (8 concurrent)                      6.0       1312.8   2188.1
System Call Overhead                          15000.0    1459471.8    973.0
System Benchmarks Index Score                                        1003.9
128 CPUs in system; running 128 parallel copies of tests
Dhrystone 2 using register variables      474697222.9 lps   (10.1 s, 7 samples)
Double-Precision Whetstone                   196647.6 MWIPS (9.7 s, 7 samples)
Execl Throughput                               4955.9 lps   (29.5 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        323175.8 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           80165.2 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1178499.2 KBps  (30.1 s, 2 samples)
Pipe Throughput                            38865990.2 lps   (10.2 s, 7 samples)
Pipe-based Context Switching                4280137.2 lps   (10.0 s, 7 samples)
Process Creation                              69295.5 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  10687.0 lpm   (60.3 s, 2 samples)
Shell Scripts (8 concurrent)                   1066.3 lpm   (64.8 s, 2 samples)
System Call Overhead                        2407036.9 lps   (10.3 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  474697222.9  40676.7
Double-Precision Whetstone                       55.0     196647.6  35754.1
Execl Throughput                                 43.0       4955.9   1152.5
File Copy 1024 bufsize 2000 maxblocks          3960.0     323175.8    816.1
File Copy 256 bufsize 500 maxblocks            1655.0      80165.2    484.4
File Copy 4096 bufsize 8000 maxblocks          5800.0    1178499.2   2031.9
Pipe Throughput                               12440.0   38865990.2  31242.8
Pipe-based Context Switching                   4000.0    4280137.2  10700.3
Process Creation                                126.0      69295.5   5499.6
Shell Scripts (1 concurrent)                     42.4      10687.0   2520.5
Shell Scripts (8 concurrent)                      6.0       1066.3   1777.2
System Call Overhead                          15000.0    2407036.9   1604.7
System Benchmarks Index Score                                        4019.7

Native (SMT 1, 16 cores)
BYTE UNIX Benchmarks (Version 5.1.3)
System: alsvidr: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up  6:56,  5 users,  load average: 0.91, 1.38, 1.00; runlevel 2016-11-09
16 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables       29813165.1 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     4052.8 MWIPS (9.7 s, 7 samples)
Execl Throughput                               1236.7 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        624721.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          161424.4 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1967152.4 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1471144.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 181574.9 lps   (10.0 s, 7 samples)
Process Creation                               9996.8 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   4032.4 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1456.7 lpm   (60.0 s, 2 samples)
System Call Overhead                        1498750.4 lps   (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   29813165.1   2554.7
Double-Precision Whetstone                       55.0       4052.8    736.9
Execl Throughput                                 43.0       1236.7    287.6
File Copy 1024 bufsize 2000 maxblocks          3960.0     624721.1   1577.6
File Copy 256 bufsize 500 maxblocks            1655.0     161424.4    975.4
File Copy 4096 bufsize 8000 maxblocks          5800.0    1967152.4   3391.6
Pipe Throughput                               12440.0    1471144.3   1182.6
Pipe-based Context Switching                   4000.0     181574.9    453.9
Process Creation                                126.0       9996.8    793.4
Shell Scripts (1 concurrent)                     42.4       4032.4    951.0
Shell Scripts (8 concurrent)                      6.0       1456.7   2427.9
System Call Overhead                          15000.0    1498750.4    999.2
System Benchmarks Index Score                                        1088.8
16 CPUs in system; running 16 parallel copies of tests
Dhrystone 2 using register variables      469625912.8 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    64079.7 MWIPS (9.7 s, 7 samples)
Execl Throughput                               4840.5 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        458129.4 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          122260.4 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1928127.3 KBps  (30.0 s, 2 samples)
Pipe Throughput                            23057509.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                1414615.2 lps   (10.0 s, 7 samples)
Process Creation                              75094.7 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  14131.7 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1587.8 lpm   (60.4 s, 2 samples)
System Call Overhead                        3684855.9 lps   (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  469625912.8  40242.2
Double-Precision Whetstone                       55.0      64079.7  11650.9
Execl Throughput                                 43.0       4840.5   1125.7
File Copy 1024 bufsize 2000 maxblocks          3960.0     458129.4   1156.9
File Copy 256 bufsize 500 maxblocks            1655.0     122260.4    738.7
File Copy 4096 bufsize 8000 maxblocks          5800.0    1928127.3   3324.4
Pipe Throughput                               12440.0   23057509.5  18535.0
Pipe-based Context Switching                   4000.0    1414615.2   3536.5
Process Creation                                126.0      75094.7   5959.9
Shell Scripts (1 concurrent)                     42.4      14131.7   3333.0
Shell Scripts (8 concurrent)                      6.0       1587.8   2646.4
System Call Overhead                          15000.0    3684855.9   2456.6
System Benchmarks Index Score                                        3908.1

Virtualized HV (SMT 1, 16 cores)
BYTE UNIX Benchmarks (Version 5.1.3)
System: libreoffice-build-vm: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up 3 min,  1 user,  load average: 0.22, 0.06, 0.02; runlevel 2016-11-09
16 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables       29740611.7 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     4044.5 MWIPS (9.7 s, 7 samples)
Execl Throughput                               2065.3 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        492491.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          130002.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1608499.5 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1521715.9 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 165060.7 lps   (10.0 s, 7 samples)
Process Creation                               4405.1 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   5817.9 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   2778.9 lpm   (60.0 s, 2 samples)
System Call Overhead                        1619580.0 lps   (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   29740611.7   2548.5
Double-Precision Whetstone                       55.0       4044.5    735.4
Execl Throughput                                 43.0       2065.3    480.3
File Copy 1024 bufsize 2000 maxblocks          3960.0     492491.6   1243.7
File Copy 256 bufsize 500 maxblocks            1655.0     130002.0    785.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    1608499.5   2773.3
Pipe Throughput                               12440.0    1521715.9   1223.2
Pipe-based Context Switching                   4000.0     165060.7    412.7
Process Creation                                126.0       4405.1    349.6
Shell Scripts (1 concurrent)                     42.4       5817.9   1372.2
Shell Scripts (8 concurrent)                      6.0       2778.9   4631.5
System Call Overhead                          15000.0    1619580.0   1079.7
System Benchmarks Index Score                                        1094.4
16 CPUs in system; running 16 parallel copies of tests
Dhrystone 2 using register variables      465404814.5 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    63812.2 MWIPS (9.7 s, 7 samples)
Execl Throughput                              15151.1 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        384508.0 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           87708.3 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1554224.1 KBps  (30.0 s, 2 samples)
Pipe Throughput                            23429940.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                2449227.9 lps   (10.0 s, 7 samples)
Process Creation                              25233.1 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  49705.7 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   7294.8 lpm   (60.1 s, 2 samples)
System Call Overhead                        3708419.3 lps   (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  465404814.5  39880.4
Double-Precision Whetstone                       55.0      63812.2  11602.2
Execl Throughput                                 43.0      15151.1   3523.5
File Copy 1024 bufsize 2000 maxblocks          3960.0     384508.0    971.0
File Copy 256 bufsize 500 maxblocks            1655.0      87708.3    530.0
File Copy 4096 bufsize 8000 maxblocks          5800.0    1554224.1   2679.7
Pipe Throughput                               12440.0   23429940.5  18834.4
Pipe-based Context Switching                   4000.0    2449227.9   6123.1
Process Creation                                126.0      25233.1   2002.6
Shell Scripts (1 concurrent)                     42.4      49705.7  11723.0
Shell Scripts (8 concurrent)                      6.0       7294.8  12158.1
System Call Overhead                          15000.0    3708419.3   2472.3
System Benchmarks Index Score                                        4881.2

Virtualized PR (SMT 1, 16 cores)
BYTE UNIX Benchmarks (Version 5.1.3)
System: libreoffice-build-vm: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up 0 min,  1 user,  load average: 0.88, 0.28, 0.10; runlevel 2016-11-10
16 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables       29598703.7 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     4029.0 MWIPS (9.7 s, 7 samples)
Execl Throughput                                249.5 lps   (29.4 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         35533.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks            9273.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        144204.7 KBps  (30.0 s, 2 samples)
Pipe Throughput                               43923.1 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  10920.2 lps   (10.0 s, 7 samples)
Process Creation                                594.1 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   1078.5 lpm   (60.1 s, 2 samples)
Shell Scripts (8 concurrent)                    316.6 lpm   (60.1 s, 2 samples)
System Call Overhead                          32725.4 lps   (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   29598703.7   2536.3
Double-Precision Whetstone                       55.0       4029.0    732.6
Execl Throughput                                 43.0        249.5     58.0
File Copy 1024 bufsize 2000 maxblocks          3960.0      35533.1     89.7
File Copy 256 bufsize 500 maxblocks            1655.0       9273.0     56.0
File Copy 4096 bufsize 8000 maxblocks          5800.0     144204.7    248.6
Pipe Throughput                               12440.0      43923.1     35.3
Pipe-based Context Switching                   4000.0      10920.2     27.3
Process Creation                                126.0        594.1     47.1
Shell Scripts (1 concurrent)                     42.4       1078.5    254.4
Shell Scripts (8 concurrent)                      6.0        316.6    527.7
System Call Overhead                          15000.0      32725.4     21.8
System Benchmarks Index Score                                         127.2
16 CPUs in system; running 16 parallel copies of tests
Dhrystone 2 using register variables      464272669.8 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    63585.0 MWIPS (9.7 s, 7 samples)
Execl Throughput                               1195.5 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        179139.5 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           42037.4 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        713253.6 KBps  (30.0 s, 2 samples)
Pipe Throughput                              676627.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 125727.5 lps   (10.1 s, 7 samples)
Process Creation                               2225.8 lps   (30.1 s, 2 samples)
Shell Scripts (1 concurrent)                   3361.9 lpm   (60.2 s, 2 samples)
Shell Scripts (8 concurrent)                    412.6 lpm   (61.1 s, 2 samples)
System Call Overhead                         504498.4 lps   (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  464272669.8  39783.4
Double-Precision Whetstone                       55.0      63585.0  11560.9
Execl Throughput                                 43.0       1195.5    278.0
File Copy 1024 bufsize 2000 maxblocks          3960.0     179139.5    452.4
File Copy 256 bufsize 500 maxblocks            1655.0      42037.4    254.0
File Copy 4096 bufsize 8000 maxblocks          5800.0     713253.6   1229.7
Pipe Throughput                               12440.0     676627.3    543.9
Pipe-based Context Switching                   4000.0     125727.5    314.3
Process Creation                                126.0       2225.8    176.6
Shell Scripts (1 concurrent)                     42.4       3361.9    792.9
Shell Scripts (8 concurrent)                      6.0        412.6    687.6
System Call Overhead                          15000.0     504498.4    336.3
System Benchmarks Index Score                                         825.4
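
The "System Benchmarks Index Score" lines can be cross-checked against the per-test INDEX columns. The sketch below assumes that each index is 10 x RESULT / BASELINE and that the overall score is the geometric mean of the twelve indices; both assumptions match the numbers printed above, but they are inferred here from the output rather than quoted from the benchmark's documentation. The function names are hypothetical.

```python
import math

def index_value(result, baseline):
    """Per-test index, assumed to be 10 * RESULT / BASELINE."""
    return 10.0 * result / baseline

def index_score(indices):
    """Overall score, assumed to be the geometric mean of the indices."""
    return math.exp(sum(math.log(v) for v in indices) / len(indices))

# INDEX column of the kvm-pr single-copy run, copied from the output above.
pr_single_copy_indices = [
    2536.3, 732.6, 58.0, 89.7, 56.0, 248.6,
    35.3, 27.3, 47.1, 254.4, 527.7, 21.8,
]

score = index_score(pr_single_copy_indices)
print(f"Recomputed index score: {score:.1f}")

# Dhrystone line of the same run: RESULT 29598703.7, BASELINE 116700.0.
print(f"Recomputed Dhrystone index: {index_value(29598703.7, 116700.0):.1f}")
```

Because the score is a geometric mean, a handful of very slow tests (here the pipe, system call, and file copy tests) pulls the kvm-pr single-copy result down sharply even though its Dhrystone and Whetstone indices match the other configurations.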


As before, the highest performance is attained within the kvm-hv virtual machine, which still exceeds native performance. The kvm-pr virtual machine performs far worse than expected, only reaching 11.6% of the kvm-hv performance (a single-copy index score of 127.2 versus 1094.4) in these kernel-operation-heavy tests.
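
The percentages in this comparison follow directly from the index scores reported above. As a minimal sketch (the dictionary keys "single" and "parallel" and the helper percent_of are naming choices made here), they can be recomputed as follows:

```python
# "System Benchmarks Index Score" values copied from the Unix Bench output:
# "single" is the 1-copy run, "parallel" is the run with one copy per CPU.
index_scores = {
    "Native (SMT 8, 128 cores)": {"single": 1003.9, "parallel": 4019.7},
    "Native (SMT 1, 16 cores)": {"single": 1088.8, "parallel": 3908.1},
    "Virtualized HV (SMT 1, 16 cores)": {"single": 1094.4, "parallel": 4881.2},
    "Virtualized PR (SMT 1, 16 cores)": {"single": 127.2, "parallel": 825.4},
}

def percent_of(config, reference, mode):
    """Score of `config` as a percentage of `reference` for the given run mode."""
    return 100.0 * index_scores[config][mode] / index_scores[reference][mode]

HV = "Virtualized HV (SMT 1, 16 cores)"
PR = "Virtualized PR (SMT 1, 16 cores)"
NATIVE_SMT1 = "Native (SMT 1, 16 cores)"

for mode in ("single", "parallel"):
    print(f"{mode:8s} PR vs HV: {percent_of(PR, HV, mode):5.1f}%  "
          f"HV vs native SMT 1: {percent_of(HV, NATIVE_SMT1, mode):5.1f}%")
```

The 11.6% figure corresponds to the single-copy scores; with 16 parallel copies the kvm-pr guest reaches about 16.9% of kvm-hv, and kvm-hv itself scores about 124.9% of native SMT 1.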

The results do shed some light on the performance increase inside a kvm-hv virtual machine, however. It appears that the cost of some kernel operations, most notably execl(), is greatly reduced inside the kvm-hv virtual machine as compared to native execution, and this would plausibly explain the observed results for the timed compilation tests. Furthermore, disabling SMT produces a puzzling, massive drop in timed compile performance, but this drop is not reflected in the Unix Bench results above. Overall, these test results hint that the Linux kernel may not be properly tuned for native execution, and that our prior benchmarks on the campaign page and in the updates are likely significantly under-reporting OpenPOWER’s true performance limits. We will be forwarding these results to IBM for further analysis and hopefully a fix that unlocks more of OpenPOWER’s true potential!
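
To see which individual tests drive the difference, the sketch below compares a hand-picked subset of RESULT values from the 16-copy runs of the native SMT 1 host and the kvm-hv guest. The selection of tests and the variable names are choices made for this illustration; the numbers are copied from the output above.

```python
# RESULT column, 16 parallel copies, copied from the Unix Bench output above.
native_smt1 = {
    "Execl Throughput": 4840.5,
    "Pipe-based Context Switching": 1414615.2,
    "Process Creation": 75094.7,
    "Shell Scripts (1 concurrent)": 14131.7,
    "Shell Scripts (8 concurrent)": 1587.8,
    "System Call Overhead": 3684855.9,
}
kvm_hv = {
    "Execl Throughput": 15151.1,
    "Pipe-based Context Switching": 2449227.9,
    "Process Creation": 25233.1,
    "Shell Scripts (1 concurrent)": 49705.7,
    "Shell Scripts (8 concurrent)": 7294.8,
    "System Call Overhead": 3708419.3,
}

# Ratio above 1 means the kvm-hv guest outperformed the native SMT 1 host.
hv_over_native = {test: kvm_hv[test] / native_smt1[test] for test in native_smt1}

for test, ratio in sorted(hv_over_native.items(), key=lambda item: item[1], reverse=True):
    print(f"{test:30s} {ratio:5.2f}x")
```

In this subset the largest gains are in the Execl and Shell Scripts tests (roughly 3x to 4.6x), the dedicated System Call Overhead test is nearly identical between the two, and Process Creation is about three times slower inside the guest, which may help narrow down where the native host loses time.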

