Common Systems Programming Optimizations & Tricks

Today’s blog post is an overview of some common optimization techniques and neat
tricks for doing “systems programming” – whatever that means today. We’ll walk
through some methods to make your code run faster, be more efficient, and to
squeeze just a little more juice from whatever you've got.

All of the examples we’ll go over are also on github at
paulcavallaro/systems-programming.

Cache Lines & False Sharing

False sharing is a fairly well-understood problem in optimizing multi-threaded
code on modern SMP systems. The problem has been written about fairly
extensively, but the basic idea is that physical memory on a machine isn't
infinitely granular, i.e. you can't just read a byte. Instead, when you want to
read a
single byte of memory, the processor will pull in and cache not just that byte,
but the surrounding data as well, on the assumption that it will likely be used
too. This unit of data that gets read and cached is called a “cache line”, and
is essentially the smallest piece of memory that can be accessed.

As of 2019, cache line sizes are powers of 2, usually in the range of 32 to 256
bytes, with 64 bytes being the most common.

Now, to keep memory coherent when multiple processors on a single machine are
reading and writing the same memory, only one processor at a time can have
exclusive access to a given cache line.

False sharing is when you accidentally put two unrelated pieces of data in the
same cache line. Now two processors updating separate pieces of data, say two
counters, will interfere with one another, as each processor attempts to gain
exclusive access to the cache line that houses both fields.

The explanation for the name false sharing is that even though these two
counters shouldn’t be affecting one another from a correctness standpoint, they
are – big air quotes incoming – “falsely sharing” a cache line for no good reason.

One solution is to force the data onto separate cache lines, which you can do in
C/C++ by forcing the alignment of the members of a struct/class. In
examples/cache-lines.cc we use absl's ABSL_CACHELINE_ALIGNED macro to achieve
this.

To show the effect in practice, we benchmark two different structs of
std::atomic counters, NormalCounters and CacheLineAwareCounters.

// NormalCounters is a straightforward, naive implementation of a struct of
// counters.
// Note: We also use ABSL_CACHELINE_ALIGNED on the NormalCounters struct, but
// not its members, so that the entire struct is aligned to a cache line.
// Otherwise the struct might be placed towards the end of a cache line,
// accidentally straddling two cache lines, which would reduce the false
// sharing and accidentally improve its performance.
struct ABSL_CACHELINE_ALIGNED NormalCounters {
  std::atomic<int64> success{0};
  std::atomic<int64> failure{0};
  std::atomic<int64> okay{0};
  std::atomic<int64> meh{0};
};

// CacheLineAwareCounters forces each counter onto a separate cache line to
// avoid any false sharing between the counters.
// Note: We must use ABSL_CACHELINE_ALIGNED for each member, since we want to
// pad every single counter so it will be forced onto its own separate cache
// line.
struct ABSL_CACHELINE_ALIGNED CacheLineAwareCounters {
  ABSL_CACHELINE_ALIGNED std::atomic<int64> success{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> failure{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> okay{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> meh{0};
};
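
For intuition, here is a minimal sketch of the kind of per-thread loop that
exercises these structs. The actual benchmark in the repo appears to use Google
Benchmark (judging by the output below); this standalone std::thread version,
including the BumpCounters name, is my own illustration:

#include <atomic>
#include <thread>
#include <vector>

// Spawns `num_threads` threads (1-4); each thread repeatedly bumps only its
// own counter. With NormalCounters the counters share one cache line, so the
// threads contend for it; with CacheLineAwareCounters they do not.
template <typename Counters>
void BumpCounters(int num_threads, int iters) {
  Counters counters;
  auto bump = [iters](auto* counter) {
    for (int i = 0; i < iters; ++i) {
      counter->fetch_add(1, std::memory_order_relaxed);
    }
  };
  std::vector<std::thread> threads;
  if (num_threads > 0) threads.emplace_back(bump, &counters.success);
  if (num_threads > 1) threads.emplace_back(bump, &counters.failure);
  if (num_threads > 2) threads.emplace_back(bump, &counters.okay);
  if (num_threads > 3) threads.emplace_back(bump, &counters.meh);
  for (auto& t : threads) t.join();
}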

The benchmark uses either 1, 2, 3, or 4 threads, each bumping a separate atomic
counter inside the struct 64K times. Here are the results on a 2013 MacBook Pro
with a Haswell processor:

Executing tests from //examples:cache-lines
-----------------------------------------------------------------------------
Cache Line Size: 64
sizeof(NormalCounters) = 64
sizeof(CacheLineAwareCounters) = 256
2019-08-13 01:16:18
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
BM_NormalCounters/threads:1                389 us        387 us       1812
BM_NormalCounters/threads:2               1264 us       2517 us        234
BM_NormalCounters/threads:3               1286 us       3718 us        225
BM_NormalCounters/threads:4               1073 us       3660 us        204
BM_CacheLineAwareCounters/threads:1        386 us        385 us       1799
BM_CacheLineAwareCounters/threads:2        200 us        400 us       1658
BM_CacheLineAwareCounters/threads:3        208 us        581 us       1152
BM_CacheLineAwareCounters/threads:4        193 us        721 us       1008

A note on these results: Time measures wall clock time per-thread and CPU
measures cpu time per-thread.

We can see that the sizes of the two structs are different:
sizeof(NormalCounters) = 64 and sizeof(CacheLineAwareCounters) = 256. This is
because of the alignment constraint we put on the individual fields, which
forces each member onto its own cache line. So instead of an int64 taking up 8
bytes like usual, each one takes up a full cache line, which is 64 bytes on my
machine.

We also see that in the single-threaded case, NormalCounters and
CacheLineAwareCounters perform indistinguishably. But as we add more
threads, the CacheLineAwareCounters perform much better than the naive normal
counters that suffer from the false sharing.

Interestingly, the wall clock time for CacheLineAwareCounters is higher with
one thread than with multiple threads. This could point to some subtle
benchmarking problem, or to a fixed amount of overhead that is now being
attributed across more threads, and so is smaller per-thread.

The Magic Power of 2: Division Is Slowaloo

On current hardware, division is one of the most expensive operations, where
expensive here means "longest latency". Agner Fog's listing of instruction
latencies gives Intel Skylake's DIV instruction operating on two 64-bit
registers a latency of 35-88 cycles, compared to an ADD instruction operating
on the same two 64-bit registers, which has a latency of 1 cycle. So it would
behoove us to avoid division where some other set of operations would work.

One place where division is commonly used, besides actually doing division, is
the modulo operator %. And a common place for the modulo operator is in hash
tables: to go from a hash to a bucket we compute HASH % TABLE_SIZE. Modulo is
used even more heavily in open-addressing schemes, since we're constantly
remapping values back into the hash table's bucket space.

So how do we go from using modulo to something else? Bit twiddling and the Magic
Power of 2!

First, let me give away the answer: We’re going to force all of our hash tables
to be a size that’s a power of 2.

We'll be able to take advantage of this property to replace division with
faster bit twiddling. This property is also easy to maintain, since we double
the size of our hash table whenever we grow in order to amortize the cost of
rehashing, so the size stays a power of 2 as we grow.

Now, we were using division – or modulo – for the purpose of mapping any hash
value to a bucket index in our hash table. Bucket indices must be strictly less
than our size, and this mapping should not lose any entropy from the hash value.

Instead of using division to do this, we'll use a bitmask that "masks" away all
set bits at or above our log2(N) bit, keeping only the less significant bits.
As with modulo, this keeps all the entropy in the least significant bits, but
it's much faster – Agner Fog lists bitwise AND at 1 cycle latency on the same
Intel Skylake architecture.

As a quick refresher on bit twiddling and to explain how we’ll choose our
bitmask, let’s look at some bit patterns.

Since numbers are represented in binary, we know that every power of 2 number
N only has one bit set. For example:

Decimal |    Binary
    2   | 00 00 00 10
    8   | 00 00 10 00
   32   | 00 10 00 00
  128   | 10 00 00 00

And that means every N - 1 has all of the bits below the log2(N) bit set. For
example:

Decimal |    Binary
    1   | 00 00 00 01
    7   | 00 00 01 11
   31   | 00 01 11 11
  127   | 01 11 11 11

So in order to replace our modulo operator in the HASH % N calculation, we
instead calculate HASH & (N - 1) using bitwise AND. This will only keep the
set bits that are less significant than our log2(N) bit, mapping any HASH
value onto a number in the interval
[0, N). We can even cache this bitmask if we want so we don’t have to
recompute it in the future.
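
As a concrete sketch (my own illustration, not code from the repo), the bucket
index computation might look like this, assuming the table size is kept a power
of 2:

#include <cassert>
#include <cstdint>

// Maps `hash` onto [0, num_buckets), assuming num_buckets is a power of 2.
// Behaves exactly like hash % num_buckets, but compiles to a single AND
// instead of a DIV.
uint64_t BucketIndex(uint64_t hash, uint64_t num_buckets) {
  // A power of 2 has exactly one bit set, so this check rejects other sizes.
  assert(num_buckets != 0 && (num_buckets & (num_buckets - 1)) == 0);
  return hash & (num_buckets - 1);
}

For example, BucketIndex(h, 64) returns the same value as h % 64 for any h.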

To show how much faster this bitmask trick is than the normal modulo operator,
I've written a small benchmark to measure executing 1M modulo operations versus
executing 1M bitmasks.

Executing tests from //examples:power-of-two
-----------------------------------------------------------------------------
2019-08-13 02:24:03
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
--------------------------------------------------------
Benchmark                 Time           CPU Iterations
--------------------------------------------------------
BM_Mod                 9261 us       9234 us         75
BM_BitMask              325 us        324 us       1984

Here we can see that the DIV instruction used by the modulo operator is on the
order of 28x slower, near the 35x slowdown predicted by Agner Fog's latency
numbers.

Since this trick is easy to do, and provides a nice win, it’s used by many
high-performance hash-tables, like absl’s “Swiss Tables”
flat_hash_set and flat_hash_map and ConcurrencyKit’s ck_ht_map.

Repurposing Top Bits

Oftentimes you want to store a bit or two of extra information along with a
pointer – in fact it's so common that Wikipedia has an article about it. One
way to do this is to take advantage of the fact that on many 64-bit systems,
such as Linux, virtual memory addresses are only 48 bits wide, even though we
use 8 bytes to store them.

This means we can store any old crap we want in the top 16 bits, as long as we
mask it out whenever we want to actually dereference the pointer. Here's some
example C++ code that uses the top bit of a pointer to store whether the
underlying data is "dirty" or not.

// The top bit of a 64-bit pointer, used to mark the data as dirty.
constexpr uintptr_t kDirtyBit = 0x8000000000000000;
// Every bit except the top one, used to recover the real pointer.
constexpr uintptr_t kPtrMask = 0x7fffffffffffffff;
static_assert(sizeof(uintptr_t) == 8, "Only works on 64-bit systems");

// Returns the pointer value with the dirty bit set.
template <typename T>
uintptr_t MarkDirty(T* t) {
  return reinterpret_cast<uintptr_t>(t) | kDirtyBit;
}

// Strips the dirty bit, recovering a pointer that is safe to dereference.
template <typename T>
T* GetPointer(uintptr_t t) {
  return reinterpret_cast<T*>(t & kPtrMask);
}
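
A quick usage sketch (the IsDirty helper below is my own addition for
illustration, not part of the example file):

#include <cassert>

// Returns true if the dirty bit is set on the tagged pointer value.
bool IsDirty(uintptr_t t) { return (t & kDirtyBit) != 0; }

void Example() {
  int x = 42;
  uintptr_t tagged = MarkDirty(&x);
  assert(IsDirty(tagged));
  // Mask the tag bit off before actually dereferencing.
  int* p = GetPointer<int>(tagged);
  assert(*p == 42);
}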

Interestingly though, since this is a feature of Linux’s memory
management/virtual address space, it is subject to change – and it actually
has!

LWN reported on the patchset in 2017
implementing five-level page tables in order to support larger amounts of
addressable memory. The change, if enabled, would move Linux from 48-bit virtual
addresses to 57-bit virtual addresses, increasing virtual address space from 256
TiB to 128 PiB – which ought to be enough for everyone 😀.

Part of why the change couldn’t be enabled by default is because various high
performance programs, notably various JavaScript engines and
LuaJIT, use this repurposing trick to pack some extra data
into pointers.

Lock Striping

Locks can be used for mutual exclusion when you want to have multiple threads
access shared data exclusively. The downside though is if the shared data is
frequently accessed and the critical section is non-trivial, your threads could
spend most of their time contending on the lock, instead of actually doing work.

One common way of solving this problem, is introducing more locks. Wait – what?

Well, instead of one lock that guards all of the data, you have many locks,
each responsible for just a piece of the data. In this way we shard the data
into separate, independent buckets that don't contend with each other. Assuming
data access is uniform-ish, increasing the sharding of the data proportionally
reduces the number of threads contending on any one lock.

Below is a small C++ example of two implementations of a thread-safe hash set.
The first implementation, ThreadSafeHashSet, just uses a single lock to guard a
single underlying absl::flat_hash_set. The second implementation,
LockStripedHashSet, has N separate locks guarding N separate
absl::flat_hash_sets.

// Simple thread-safe single-lock hash set
class ThreadSafeHashSet {
 public:
  void Insert(uint64 i) {
    absl::MutexLock lock(&mu_);
    hash_set_.insert(i);
  }

  bool Contains(uint64 i) {
    absl::MutexLock lock(&mu_);
    return hash_set_.contains(i);
  }

 private:
  absl::Mutex mu_;
  absl::flat_hash_set<uint64> hash_set_;
};

// Chunk the data into `kNumChunks` separate hash sets, guarded by separate
// locks.
template <size_t kNumChunks>
class LockStripedHashSet {
 public:
  void Insert(uint64 i) {
    // Mod the data to calculate which hash_set/lock to use
    const size_t idx = i % kNumChunks;
    absl::MutexLock lock(&mu_[idx]);
    hash_set_[idx].insert(i);
  }

  bool Contains(uint64 i) {
    const size_t idx = i % kNumChunks;
    absl::MutexLock lock(&mu_[idx]);
    return hash_set_[idx].contains(i);
  }

 private:
  std::array<absl::Mutex, kNumChunks> mu_;
  std::array<absl::flat_hash_set<uint64>, kNumChunks> hash_set_;
};
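
A quick usage sketch of the striped version (the values are arbitrary and for
illustration only). Since kNumChunks is a compile-time constant, choosing a
power of 2 like 8 also lets the compiler turn the i % kNumChunks above into the
bitmask trick from the previous section:

#include <cassert>

int main() {
  // 8 shards, each guarded by its own absl::Mutex.
  LockStripedHashSet<8> striped;
  striped.Insert(42);
  striped.Insert(1000);
  assert(striped.Contains(42));
  assert(!striped.Contains(7));
}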

To illustrate the benefits of lock striping, we benchmark the performance of the
two thread-safe hash sets in the presence of many threads, each inserting 1M
items. For the LockStripedHashSet we try with 4 chunks and with 8 chunks. The
results are below:

Executing tests from //examples:lock-striping
-----------------------------------------------------------------------------
2019-08-24 22:24:37
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
--------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations
--------------------------------------------------------------------------
BM_SingleLock/threads:1                    65 ms         65 ms         11
BM_SingleLock/threads:2                   140 ms        254 ms          2
BM_SingleLock/threads:3                   140 ms        332 ms          3
BM_SingleLock/threads:4                   142 ms        405 ms          4
BM_LockStriping_4_Chunks/threads:1         71 ms         69 ms          9
BM_LockStriping_4_Chunks/threads:2         90 ms        178 ms          4
BM_LockStriping_4_Chunks/threads:3         89 ms        248 ms          3
BM_LockStriping_4_Chunks/threads:4         82 ms        299 ms          4
BM_LockStriping_8_Chunks/threads:1         70 ms         69 ms         10
BM_LockStriping_8_Chunks/threads:2         74 ms        143 ms          4
BM_LockStriping_8_Chunks/threads:3         71 ms        198 ms          3
BM_LockStriping_8_Chunks/threads:4         60 ms        200 ms          4

Again, Time measures wall clock time per-thread and CPU measures cpu time
per-thread. Also note that since my machine only has 4 logical cores, this test
only goes up to 4 threads, since anything beyond that doesn’t actually introduce
any extra contention.

We can see that in the single-threaded case the LockStripedHashSet, no matter
the chunking, performs slightly worse on both wall clock and CPU time than the
simple ThreadSafeHashSet.

However, as more threads are added, increasing the contention on the locks, the
LockStripedHashSet performs much better, and the 8 chunk version outperforms
the 4 chunk version at higher thread counts.

While lock-striping can help alleviate contention on locks, it does have the
downside of increasing storage overhead for the locks. In our example the
overhead of 7 extra locks and the extra absl::flat_hash_set bookkeeping is
minuscule for a single instance in our benchmark, but if you replaced all the
hash sets in an application with 8-way striped thread-safe hash sets, you
could bloat its memory footprint by a significant amount.

Conclusion

While this is nowhere near an exhaustive list of the most common systems
programming tricks, hopefully it whets your appetite to learn more, provides
some tools for improving performance in your own applications, or at least makes
it easier to understand why performance-sensitive code is doing what it's doing.

Thanks to Mason Simon and Ray Yang for reviewing drafts of this post.


