20/08/2020 | Jump to [TL;DR takeaway] or [Why?]
A project I was working on required the client to send ~300,000 ints to a server. This RPC was invoked 25 times with different values, which meant the network cost quickly became a bottleneck for a resource-constrained client. It was infeasible to expect the client to incur a
300K * 25 * 4 bytes = 30 MB (~28.6 MiB) cost.
The key property I ended up exploiting was that we weren't concerned about ordering; we just wanted this set of ints to go through. At a high level, if we sort and delta encode the ints, the resulting diff elements will be smaller than the original elements. Our bane, having to send ~300K ints, became a boon: the more elements you have to pack within 2^32, the smaller the difference between consecutive elements, which allows more compact packing.
Concretely, assuming a uniform distribution in the original list, each diff element will be
2^32 / 300K ≈ 14,316 on average, which fits in 14 bits. Of course, we can't use 14 bits as a hard upper bound for our elements; we'd probably need to build a packer which exploits the expected lower bit cost per element but can still encode the occasional element that needs more bits than average.
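To make the idea concrete, here's a minimal sketch of the sort-and-delta-encode step (the helper names `delta_encode`/`delta_decode` are mine, not from the project). With 300K uniform samples in [0, 2^32), the average gap lands near 2^32 / 300K ≈ 14,317:

```python
import random

def delta_encode(values):
    """Sort, then store each element as the gap from its predecessor."""
    s = sorted(values)
    return [s[0]] + [b - a for a, b in zip(s, s[1:])]

def delta_decode(deltas):
    """A running sum restores the sorted values (the original order is lost)."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

random.seed(0)
data = [random.randrange(2**32) for _ in range(300_000)]
deltas = delta_encode(data)

# Mean gap is (max - min) / (N - 1), roughly 2^32 / 300K for uniform samples.
avg_gap = sum(deltas[1:]) / (len(deltas) - 1)
print(int(avg_gap))  # roughly 14,300: fits comfortably in 14 bits
assert delta_decode(deltas) == sorted(data)
```

Note that decoding only recovers the sorted order, which is fine here since we're sending a set.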
Preliminary results showed that we probably don't need to handwrite this tricky piece of code ourselves: plain
gRPC with gzip enabled gave us the benefits we were looking for.
Running a quick and dirty experiment lets us check that this result isn't a fluke. Since there is no clear way to approximate a realistic workload for all systems, I tested two common scenarios: the list of 300K ints is sampled from either a uniform or a normal distribution. Across these two scenarios, I evaluated the upload cost with and without delta encoding.
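A rough stand-in for that experiment can be run locally with stdlib `zlib` (gzip's underlying codec) in place of the gRPC transport; the scenario setup here is my assumption, not the original benchmark. It packs each list as raw uint32s, compresses both the raw and the delta-encoded form, and compares sizes:

```python
import random
import struct
import zlib

def delta_encode(values):
    """Sort, then store each element as the gap from its predecessor."""
    s = sorted(values)
    return [s[0]] + [b - a for a, b in zip(s, s[1:])]

def gzipped_size(ints):
    # Pack as little-endian uint32s, then deflate-compress (what gzip uses).
    return len(zlib.compress(struct.pack(f"<{len(ints)}I", *ints), 6))

random.seed(42)
N = 300_000
scenarios = {
    # Uniform over the full uint32 range.
    "uniform": [random.randrange(2**32) for _ in range(N)],
    # Normal centered at 2^31, clamped into the uint32 range.
    "normal": [min(2**32 - 1, max(0, int(random.gauss(2**31, 2**29))))
               for _ in range(N)],
}

results = {}
for name, data in scenarios.items():
    results[name] = (gzipped_size(data), gzipped_size(delta_encode(data)))
    print(name, "raw:", results[name][0], "delta:", results[name][1])
```

In both scenarios the delta-encoded payload compresses far better, since the high bytes of each 14-ish-bit gap are mostly zeros.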
TL;DR: If you want to send a set of integers (or a list of small-valued integers) over
gRPC, you can easily save bandwidth by delta encoding your dataset, with negligible compute overhead.
My suggestion: if you're sending more than 10K ints, the benefits quickly become apparent.
While running trials for the previous experiment, the system config remained constant; in particular, the client's requests were always gzip'ed in flight. The improvement could therefore be due to either
gRPC or gzip, and it'd be better to reduce the number of variables in our experiment. Since it's easy to toggle compression, I measured the data cost with and without it.
The gap between an uncompressed raw stream and an uncompressed delta-encoded stream makes it clear that while gzip is doing useful work, it isn't what's giving us the bulk of the savings. What we're searching for now is what magic
gRPC, or rather Protobuf, is performing.
A quick search yields the answer:
Protobuf encodes ints as varints, which serializes them in a variable number of bytes. It uses LEB128, which has an overhead of 1 bit per byte: smaller-valued ints take fewer bytes, so a delta-encoded set occupies less space. This is exactly what we were looking for.
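For intuition, here's a small sketch of unsigned LEB128: each output byte carries 7 payload bits plus a continuation bit (the `varint_size`/`encode_varint` names are mine):

```python
def varint_size(n: int) -> int:
    """Bytes needed to store n as an unsigned LEB128 varint
    (7 payload bits per byte, 1 continuation bit of overhead)."""
    size = 1
    while n >= 1 << 7:
        n >>= 7
        size += 1
    return size

def encode_varint(n: int) -> bytes:
    """Emit n little-endian, 7 bits at a time; the high bit of each
    byte says whether more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(varint_size(14_316))     # 2: a typical 14-bit delta takes 2 bytes
print(varint_size(2**32 - 1))  # 5: a full-width raw uint32 takes 5 bytes
```

So the ~14-bit deltas from earlier serialize in 2 bytes each instead of the 4-to-5 bytes a raw uint32 would cost.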
We do stand on the shoulders of giants.
Aside: On the flip side, this means it'd take 5 bytes to serialize any int that is at least 2^(32-4) = 2^28. It'd be good to keep in mind to use
fixed32 instead of
int32 if your dataset contains predominantly large valued ints (no, I won't run experiments for that).
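The crossover is easy to verify with the LEB128 size rule (again using a hypothetical `varint_size` helper): fixed32 always costs 4 bytes, while a varint needs a 5th byte once the value reaches 2^28:

```python
def varint_size(n: int) -> int:
    """Bytes needed for n as an unsigned LEB128 varint (7 payload bits/byte)."""
    size = 1
    while n >= 1 << 7:
        n >>= 7
        size += 1
    return size

threshold = 1 << 28  # 2^28: the first value whose varint needs a 5th byte
print(varint_size(threshold - 1))  # 4: varint still ties fixed32's 4 bytes
print(varint_size(threshold))      # 5: fixed32 (always 4 bytes) now wins
```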