Comments on: Ultra-Low Latency Querying with Java Streams and In-JVM-Memory
https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/
Build Fast Java Applications for the Fastest Business Performance
Fri, 01 May 2020 05:21:37 +0000

By: Per Minborg — Mon, 12 Aug 2019 15:36:37 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-24)
You are perfectly right. I think the article could be much clearer on this point. As you have pointed out many times, there are ways to send a message across a TCP/IP network with lower latency.

By: mp — Mon, 12 Aug 2019 15:34:24 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-25)
I only commented because you mentioned "theoretical minimal latency". The theoretical minimal latency can only be achieved when you modify everything available (in this case the kernel and all software layers involved).

By: Per Minborg — Sun, 11 Aug 2019 12:51:49 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-26)
I admire your persistence. Let me explain why these references are not relevant to the article:
https://www.openonload.org/ — bypasses the kernel. The article specifically includes the kernel.
http://www.coralblocks.com/index.php/coralreactor-performance-numbers/ — tests were made using the loopback interface on the same physical machine.
https://wiki.fd.io/images/f/f4/02_ld_Light_A_scalable_High_Performance_and_Fully_compatible_TCP_Stack.pdf — appears to bypass the kernel, but it is just a slideshow, so it is hard to tell.
https://eggert.org/papers/atc16-paper_yasukata.pdf — "StackMap leverages the best aspects of **kernel-bypass** networking into a..". Again, this is achieved by bypassing the kernel.

Another aspect is that my tests were run on an old MacBook Pro. It is likely that the latencies for the Streams would be even lower on modern hardware with the latest version of the C2 compiler. So, in fact, the ratio could be even higher than proposed in the article.

I am interested to learn if you have a reference with sub-15 µs latencies for the given requirements. Keep looking!

By: mp — Sat, 10 Aug 2019 07:51:16 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-27)
There are many more solutions to this problem:
"Application-to-application latency below 1.7us." from: https://www.openonload.org/
"TCP Latency Avg Time: 2.15 uS" (in Java, from one JVM to another) from: http://www.coralblocks.com/index.php/coralreactor-performance-numbers/
Latency of 5 µs: https://wiki.fd.io/images/f/f4/02_ld_Light_A_scalable_High_Performance_and_Fully_compatible_TCP_Stack.pdf
Proof of < 25 µs round-trip TCP latency by modifying Linux: https://eggert.org/papers/atc16-paper_yasukata.pdf

By: Per Minborg — Fri, 09 Aug 2019 19:43:51 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-28)
ASICs and specialized hardware are indeed used for all NICs, but the level of CPU involvement differs between solutions.

The new reference you provided is for UDP, not TCP, and involves bypassing the kernel, whereas the example in the article explicitly involved the kernel. Furthermore, the reference reported one-way latency, not a round trip.

By: mp — Fri, 09 Aug 2019 15:21:02 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-29)
Their transport over 10 Gbit/s is just a bit slower: "10Gb Ethernet (10GbE) solution with a low 3 µs deterministic latency", from: https://www.chelsio.com/nic/wire-direct/
ASICs are used in all network cards. Anyway, 25 µs is not a theoretical minimum, as Chelsio already has lower values.
The 25 µs is just something you can achieve with a standard kernel.
The 3 µs is from user space to user space (so it involves the CPU).

By: Per Minborg — Thu, 08 Aug 2019 16:52:14 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-30)
Hi, and thanks for your comment.

In the article I talked about nodes at 100 meters distance, interconnected via a 10 Gbit/s connection, where the CPU is involved in the round-trip delay. Your reference is something completely different: it is about a T5 connection at 40 Gbit/s with a specialized ASIC involved. Although interesting, it is not relevant to the preconditions presented in the article.

For a 10 Gbit/s connection, the time it takes to send a database packet of about 1,000 bytes (approximately 10,000 bits on the wire, including overhead) is about one microsecond one way. There is an additional propagation latency of about 300 ns per 100 meters of distance. With thread affinity and busy waiting, it is possible to come down to about 25 µs including context switches, packet handling, content decoding, memory latencies, etc.
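The back-of-envelope arithmetic above can be reproduced with a small calculation. This is only a sketch: the packet size, overhead and per-meter propagation figure are the numbers assumed in this comment, not measured values, and the class and method names are invented for illustration.

```java
public class LatencyEstimate {
    // Figures assumed in the comment above: ~1,000 payload bytes,
    // roughly 10,000 bits on the wire including overhead, over 10 Gbit/s.
    static final double LINK_BITS_PER_SECOND = 10e9;
    static final double PACKET_BITS = 10_000;
    // ~3 ns per meter, i.e. ~300 ns per 100 m as stated in the comment.
    static final double PROPAGATION_NS_PER_METER = 3.0;

    /** Time to put one packet on the wire, in microseconds. */
    static double serializationMicros() {
        return PACKET_BITS / LINK_BITS_PER_SECOND * 1e6;
    }

    /** Signal propagation time for the given distance, in microseconds. */
    static double propagationMicros(double meters) {
        return meters * PROPAGATION_NS_PER_METER / 1e3;
    }

    public static void main(String[] args) {
        // ~1 us serialization + ~0.3 us propagation for 100 m, one way.
        double oneWay = serializationMicros() + propagationMicros(100);
        System.out.println("one-way wire latency for 100 m (us): " + oneWay);
        // The bulk of the quoted ~25 us round trip is therefore not the wire,
        // but context switches, kernel packet handling and memory latency.
    }
}
```

The point of the sketch is that the raw wire time is only a small fraction of the 25 µs figure; the rest is spent in the kernel and on the CPUs at both ends.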

So, given these prerequisites, the claim stands. However, if you can produce a reference to someone who can achieve a faster round-trip delay for database connections over a 10 Gbit/s connection, I am eager to learn about it!

By: mp — Thu, 08 Aug 2019 15:19:53 +0000 (https://speedment.com/ultra-low-latency-querying-with-java-streams-and-in-jvm-memory/#comment-31)
"due to TCP/IP protocol handling, a single packet round-trip delay on a 10 GBit/s connection can hardly be optimized down to less than 25 us"
I think this is incorrect. There were stacks achieving user-mode TCP latency of 2 µs six years ago:
https://www.chelsio.com/chelsio-demonstrates-lowest-udp-tcp-and-rdma-over-ethernet-latency/

This invalidates your claim: "200 ns is more than 125 times faster than the theoretical minimum latency from a remote database (100 m) whose internal processing delay is zero and where a single TCP packet can convey both the query and the response."
