Sorting binary records

My friend Sidney is working on a project where he needs to sort a big array of small binary records. Something like 10^11 records of 10 bytes each, adding up to 1 TB of data. He told me he encodes these binary records as ASCII strings and sorts them with the GNU sort utility.

I was surprised. "Surely a dedicated sorting routine will be much faster than that ancient Unix command", I told him. "Sort is made for variable-length text lines. That must be a lot slower than a dedicated tool which uses direct binary comparison."

I bluffed that I could make a sorting tool at least 8 times faster than sort. Sidney was not convinced because

  • GNU Sort has a sophisticated strategy for cases where the input is larger than available memory. There may not be much room for improvement.
  • Since the array is larger than memory, it will have to be read from disk several times during the sort process. The overall performance is probably limited by disk I/O, not by CPU operations. A little overhead for line handling will not make much difference.

These are very good points, unfortunately. So I had to put my code where my mouth was.

External sorting

Sorting an array in random access memory is a well-known task for which efficient algorithms are readily available. However, when the array does not fit in memory, we are dealing with a slightly different challenge.

Disk I/O is many times slower than RAM access. Furthermore, disk transfer rates depend a lot on the access pattern. Sequential data transfers can be reasonably fast, while accessing small amounts of data from various locations on the disk is horribly slow.

We thus need a sorting algorithm that structures its data flow such that data are accessed from disk as few times as possible, and always in big sequential blocks. A simple but efficient method is the sort-then-merge strategy:

  • Split the input into multiple blocks that fit in memory. Sort each block separately and write the sorted blocks to disk.
  • Merge a group of sorted blocks to form a larger sorted block.
  • Keep merging blocks until only one block remains. This is the final sorted array.

This same strategy is also used by GNU Sort. It writes each sorted block to a separate temporary file. Files are then merged to form bigger files. It uses 16-way merging by default (each step merges 16 temporary files to form one new file), but this can be configured.
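
To make the merge step concrete, here is a minimal sketch of K-way merging of sorted runs of fixed-size binary records, using a min-heap of the current head record of each run. This is only an illustration of the general technique, not the actual SortBin code: it reads and writes one record at a time, while a real implementation would use large buffered transfers.

    // kway_merge.cpp -- illustrative sketch, not the actual SortBin code.
    #include <cstring>
    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    // Merge K sorted runs of fixed-size records into one sorted output file.
    // Records are compared as raw byte strings (memcmp order).
    void merge_runs(const std::vector<std::string>& run_files,
                    const std::string& output_file,
                    size_t record_size)
    {
        struct Head {
            std::vector<char> record;   // current smallest record of this run
            size_t run_index;           // which run it came from
        };

        // Min-heap ordered by record bytes (priority_queue is a max-heap,
        // so the comparison is reversed).
        auto cmp = [record_size](const Head& a, const Head& b) {
            return std::memcmp(a.record.data(), b.record.data(), record_size) > 0;
        };
        std::priority_queue<Head, std::vector<Head>, decltype(cmp)> heap(cmp);

        // Open every run and push its first record on the heap.
        std::vector<std::ifstream> runs;
        for (const std::string& name : run_files)
            runs.emplace_back(name, std::ios::binary);
        for (size_t i = 0; i < runs.size(); i++) {
            Head h{std::vector<char>(record_size), i};
            if (runs[i].read(h.record.data(), record_size))
                heap.push(std::move(h));
        }

        // Repeatedly pop the smallest record and refill from the same run.
        std::ofstream out(output_file, std::ios::binary);
        while (!heap.empty()) {
            Head h = heap.top();
            heap.pop();
            out.write(h.record.data(), record_size);
            if (runs[h.run_index].read(h.record.data(), record_size))
                heap.push(std::move(h));
        }
    }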

To estimate performance, it is useful to know how many times the data are accessed on disk. The sort phase reads and writes the complete array once. The merge phase reads and writes the complete array at least once, but possibly more than once.

If the input file size is F and the memory block size is M, the sort phase produces F / M sorted blocks. If K-way merging is used, these blocks can be merged in a single pass only if F / M ≤ K. They can be merged in two passes if F / M ≤ K^2, and so on.

The total number of read+write passes over the data is therefore equal to:

1 + ⌈log_K(F / M)⌉
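
For example, with F = 100 GB, M = 1 GB and K = 16, the sort phase produces about 100 sorted blocks. Since 16 < 100 ≤ 16^2 = 256, two merge passes are needed, for a total of 3 read+write passes over the data.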

SortBin: a sort tool for binary data

I designed SortBin to play around with external sorting algorithms.
For the source code, see https://github.com/jorisvr/sortbin

The basic approach of SortBin is the sort-then-merge strategy as explained above. It uses 16-way merging by default. In addition, I applied the following design decisions:

  • Focus on sorting small binary records and getting good performance. Avoid additional functionality and avoid customizable options.
  • Use big sort blocks. Make the blocks as big as possible within the memory constraints specified by the user. Big blocks mean fewer blocks, which means faster merging. (This is probably not as important as I initially thought it would be. See below under conclusions.)
  • Use separate threads for I/O. Even if overall performance is limited by disk I/O, it is still important to keep the I/O system constantly busy. I don't want the disk transfers to stall for brief periods of time while the CPU is processing data. By running I/O in separate threads, I can hopefully saturate the I/O system with a continuous sequence of read/write requests (see the sketch after this list).
  • Optionally use multiple threads during the sorting phase. This did not occur to me until I noticed that GNU sort uses multiple threads by default. This can definitely speed things up, since the sorting phase requires a lot of work from the CPU.
  • Use a single temporary file instead of separate files for each block. The temporary file has the same size as the input, which makes it easier for users to estimate required disk space. And I think filesystems may handle a single huge file more efficiently than a directory containing hundreds of files, but I'm not very sure about that.
  • Pre-allocate the required size for the output file and the temporary file. This is useful because it avoids out-of-space errors halfway through a long sort job. And I hope it results in big sequential extents in the physical layout of the file on disk.
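
As a sketch of the separate-I/O-thread idea mentioned in the list above: a writer thread that consumes sorted buffers from a queue, so the sorting threads never wait for the disk. The class name WriterThread and the unbounded queue are simplifications for this example, not SortBin's actual design.

    // background_writer.cpp -- simplified sketch, not the actual SortBin code.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class WriterThread {
    public:
        explicit WriterThread(std::FILE* file)
          : m_file(file), m_done(false), m_thread([this] { run(); }) { }

        ~WriterThread() {
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_done = true;
            }
            m_cond.notify_one();
            m_thread.join();                 // drain remaining buffers, then stop
        }

        // Hand a filled buffer to the I/O thread; the caller can immediately
        // start filling the next buffer while this one is written to disk.
        void submit(std::vector<char> buffer) {
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_queue.push(std::move(buffer));
            }
            m_cond.notify_one();
        }

    private:
        void run() {
            std::unique_lock<std::mutex> lock(m_mutex);
            while (true) {
                m_cond.wait(lock, [this] { return m_done || !m_queue.empty(); });
                if (m_queue.empty())
                    return;                  // done and nothing left to write
                std::vector<char> buffer = std::move(m_queue.front());
                m_queue.pop();
                lock.unlock();
                // Write outside the lock so submit() is never blocked on disk I/O.
                std::fwrite(buffer.data(), 1, buffer.size(), m_file);
                lock.lock();
            }
        }

        std::FILE* m_file;
        bool m_done;
        std::queue<std::vector<char>> m_queue;
        std::mutex m_mutex;
        std::condition_variable m_cond;
        std::thread m_thread;
    };

A real implementation would bound the queue (or recycle a fixed pool of buffers) so that memory use stays within the limit given by the user; a similar reader thread can keep input data flowing during the merge phase.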

I implemented my own in-memory sorting routine. My initial plan was to use the C++ library function std::sort(), but I found that it cannot be used for arrays where the element size is unknown at compile-time. I then considered using qsort(), but the GNU implementation has a quadratic worst case, which I find distasteful even if it is unlikely to occur in practice. I ended up essentially reimplementing the algorithm used by std::sort(). It uses quicksort but falls back to heapsort when the recursion depth exceeds a certain threshold.
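
To show roughly what such a routine looks like, here is a condensed sketch of that scheme for records whose size is only known at run time: records are compared with memcmp and swapped through a scratch buffer, quicksort does the bulk of the work, and heapsort takes over when the recursion depth limit is reached. This is a simplified stand-in (plain middle-element pivot, no multi-threading), not the actual SortBin implementation.

    // record_sort.cpp -- simplified sketch, not the actual SortBin code.
    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Compare two records of 'size' bytes as unsigned byte strings.
    static bool record_less(const unsigned char* a, const unsigned char* b, size_t size)
    {
        return std::memcmp(a, b, size) < 0;
    }

    // Swap two records in place via a caller-provided scratch buffer.
    static void record_swap(unsigned char* a, unsigned char* b,
                            unsigned char* tmp, size_t size)
    {
        std::memcpy(tmp, a, size);
        std::memcpy(a, b, size);
        std::memcpy(b, tmp, size);
    }

    // Heapsort over 'n' records of 'size' bytes: the fallback for deep recursion.
    static void heapsort_records(unsigned char* base, size_t n, size_t size,
                                 unsigned char* tmp)
    {
        auto sift_down = [&](size_t root, size_t end) {
            while (2 * root + 1 < end) {
                size_t child = 2 * root + 1;
                if (child + 1 < end &&
                    record_less(base + child * size, base + (child + 1) * size, size))
                    child++;
                if (!record_less(base + root * size, base + child * size, size))
                    return;
                record_swap(base + root * size, base + child * size, tmp, size);
                root = child;
            }
        };
        for (size_t i = n / 2; i-- > 0; )
            sift_down(i, n);                                      // build max-heap
        for (size_t end = n; end-- > 1; ) {
            record_swap(base, base + end * size, tmp, size);      // move max to the end
            sift_down(0, end);
        }
    }

    // Quicksort with a depth limit; heapsort when the limit is exceeded,
    // insertion sort for small ranges (same overall scheme as std::sort).
    static void introsort_records(unsigned char* base, size_t n, size_t size,
                                  unsigned char* tmp, unsigned depth_limit)
    {
        while (n > 16) {
            if (depth_limit-- == 0) {
                heapsort_records(base, n, size, tmp);
                return;
            }
            // Partition around a copy of the middle record (Hoare scheme).
            std::vector<unsigned char> pivot(base + (n / 2) * size,
                                             base + (n / 2) * size + size);
            size_t i = 0, j = n - 1;
            while (true) {
                while (record_less(base + i * size, pivot.data(), size)) i++;
                while (record_less(pivot.data(), base + j * size, size)) j--;
                if (i >= j) break;
                record_swap(base + i * size, base + j * size, tmp, size);
                i++; j--;
            }
            introsort_records(base, j + 1, size, tmp, depth_limit);  // left part
            base += (j + 1) * size;                                  // loop on right part
            n -= j + 1;
        }
        for (size_t i = 1; i < n; i++)                       // insertion sort for the tail
            for (size_t k = i; k > 0 && record_less(base + k * size,
                                                    base + (k - 1) * size, size); k--)
                record_swap(base + k * size, base + (k - 1) * size, tmp, size);
    }

    // Sort 'n' records of 'size' bytes each, in place.
    void sort_records(unsigned char* base, size_t n, size_t size)
    {
        std::vector<unsigned char> tmp(size);
        unsigned depth_limit = 2;
        for (size_t m = n; m > 1; m >>= 1)
            depth_limit += 2;                                 // roughly 2 * log2(n)
        introsort_records(base, n, size, tmp.data(), depth_limit);
    }

Calling sort_records(buffer, num_records, 10) would sort a block of 10-byte records in place.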

I initially used posix_fallocate() to pre-allocate files on disk. This turned out to be bad. If the underlying filesystem does not support pre-allocation, posix_fallocate() silently falls back to very slowly writing zero-filled blocks to the file. When I tested SortBin on a networked filesystem, the posix_fallocate() call would have taken many hours to complete. The solution is to never use posix_fallocate(). Instead, directly call the Linux-specific function fallocate(), which returns an error code when pre-allocation is not possible.
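
A minimal sketch of that pre-allocation step, assuming Linux and glibc; the function name preallocate_file is made up for this example:

    // preallocate.cpp -- minimal sketch; error handling simplified.
    #include <cerrno>
    #include <cstring>
    #include <stdexcept>
    #include <string>
    #include <fcntl.h>   // fallocate() is Linux-specific (needs _GNU_SOURCE; g++ defines it)

    // Reserve 'size' bytes of disk space for the open file 'fd'.
    // With mode 0, fallocate() also extends the file size if necessary.
    // Unlike posix_fallocate(), it fails (typically with EOPNOTSUPP) when the
    // filesystem cannot pre-allocate, instead of silently writing zeros.
    void preallocate_file(int fd, off_t size)
    {
        if (fallocate(fd, 0, 0, size) != 0) {
            throw std::runtime_error(
                std::string("fallocate failed: ") + std::strerror(errno));
        }
    }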

Performance tests

I tested the performance of SortBin against GNU Sort. This comparison is not really fair to GNU Sort, which is a highly customizable tool designed for sorting lines of text. I forced it to sort small records and compared it to a program that was optimized for that job. Whatever the outcome of this test, it should not be seen as a criticism of GNU Sort.

The test case is sorting a 100 GB file while using only 1 GB memory.

I prepared a test file containing 10^10 lines. Each line consists of 9 random alphanumeric characters and a newline character. We thus have 10^10 records of 10 bytes, for a total of 100 GB.

My test system is a mid-range desktop PC:

  • CPU: AMD Ryzen 5 3600, 6 cores
  • Disk: Seagate IronWolf, 4 TB, 5980 rpm, SATA 3.1
  • OS: Debian GNU/Linux
  • Sequential disk transfer rate: 180 MB/s

My results should be taken with a pinch of salt. I don't know how specific these results are for my test system, or even how repeatable these results are on the same system. I tried to avoid other system activity during these tests, but I did not take specific precautions to ensure a reproducible system state.

The first run is with GNU Sort. Although I want to use just 1 GB memory, it seemed more fair to allow 2 GB for GNU Sort because it appears to use a significant amount of memory to manage lines of text. By default, GNU Sort uses 6 threads for parallel sorting. It does not support background I/O threads.

The next runs are with SortBin in various configurations. The base run is already faster than GNU Sort, but multi-threaded sorting and threaded I/O provide small additional improvements. The best result is 6.48 / 1.22 = 5.3 times faster than GNU Sort.

It is nice that my specialized niche-tool is faster than GNU Sort, but this is not the factor 8 that I was aiming for. Clearly I either underestimated GNU Sort, or overestimated my own ability to make a fast sorting tool.

 Table 1: 100 GB, 10 bytes/record, desktop PC

 Software  MemSize  SortThreads  IO thrd  MergeFactor  Duration  CPU time
 --------  -------  -----------  -------  -----------  --------  --------
 GNU sort    2 GB       6           -          16       6.48 hr   9.90 hr
 sortbin     1 GB       1          no          16       1.78 hr   1.03 hr
 sortbin     1 GB       4          no          16       1.36 hr   1.04 hr
 sortbin     1 GB       4          yes         16       1.22 hr   1.03 hr

In the last run above, SortBin creates 187 temporary blocks during the sort phase. It uses 16-way merging, so 2 merge passes are needed. Let's try to push this to 1 merge pass by using 187-way merging:

 Table 2: 100 GB, 10 bytes/record, desktop PC, different merge factor

 Software  MemSize  SortThreads  IO thrd  MergeFactor  Duration  CPU time
 --------  -------  -----------  -------  -----------  --------  --------
 sortbin     1 GB       4          yes         16       1.22 hr   1.03 hr
 sortbin     1 GB       4          yes        187       0.82 hr   1.01 hr

Reducing the number of merge passes makes the sort process almost 1.5 times faster! This shows that optimizing the merge tree is at least as important as low-level multi-threading optimizations. Unfortunately this does not count as an additional improvement over GNU Sort, because GNU Sort also has an option to adjust the merge factor (which I did not test).

How far can we crank up the merge factor? I tried sorting an input file of 858 GB, which involves merging 1600 blocks. This can be done either in two merge passes with merge factor 40, or in one merge pass with merge factor 1600. Although the single-pass run needs fewer disk transfers, it took more time to finish. So clearly there is a threshold beyond which it becomes advantageous to merge in two passes rather than one.

 Table 3: 858 GB, 10 bytes/record, desktop PC

 Software  MemSize  SortThreads  IO thrd  MergeFactor  Duration  CPU time
 --------  -------  -----------  -------  -----------  --------  --------
 sortbin     1 GB       4          yes         40      13.19 hr   9.87 hr
 sortbin     1 GB       4          yes       1600      17.43 hr  10.28 hr

Let's review the assumption that the overall sort process might be I/O limited. The best 100 GB run with 2 merge passes ran in 1.22 hours. During this time, it reads and writes (nearly) all data 3 times, roughly 600 GB of disk traffic in total. This comes down to an average data transfer rate of 137 MB/s. The run with 1 merge pass ran in 0.82 hours, which amounts to 135 MB/s. The best 858 GB run comes out at 108 MB/s.

This is slower than the 180 MB/s sequential transfer rate of my disk, but not very much slower. My guess is the process is still mostly I/O limited, but for some reason my code does not manage to reach the maximum possible I/O rate.

The purpose of SortBin is sorting small binary records. But I also checked how it runs on somewhat larger records: 2×10^9 records of 50 bytes = 100 GB. SortBin still needs approximately the same amount of time as before. However, GNU Sort runs much faster on these large records compared to the runs with small records. This suggests that GNU Sort may have a per-record overhead which weighs heavily on workloads with small records.

 Table 4: 100 GB, 50 bytes/record, desktop PC

 Software  MemSize  SortThreads  IO thrd  MergeFactor  Duration  CPU time
 --------  -------  -----------  -------  -----------  --------  --------
 GNU sort    2 GB       6           -          16       1.78 hr   1.93 hr
 sortbin     1 GB       4          yes         16       1.22 hr   0.38 hr

Finally, I ran a test on a high performance computer with a lot of memory. It accesses storage via a fast network file system.

 Table 5: 100 GB, 10 bytes/record, high performance computer

 Software  MemSize  SortThreads  IO thrd  MergeFactor  Duration  CPU time
 --------  -------  -----------  -------  -----------  --------  --------
 GNU sort   16 GB       4           -          16       6.16 hr  11.30 hr
 sortbin    16 GB       4          yes         16       0.46 hr   1.03 hr

Conclusions

  • My specialized binary sorting tool is faster than GNU Sort, but not as much faster as I thought it would be.
  • GNU Sort is a highly versatile tool, capable of reasonable performance even on workloads for which it was not designed.
  • Disk I/O is a significant limiting factor, but not the only factor. In my tests, the average disk I/O rate reached 75% of the maximum possible transfer rate. I needed multi-threaded sorting just to get to that point. This may turn out differently on other systems, depending on the relative speed of disk vs CPU and other factors.
  • The merge factor is an important performance tuning parameter. When a larger merge factor reduces the number of merge passes, it typically improves performance. High merge factors can still perform well: 187-way merging still runs close to the disk I/O limit. However, performance declines when the merge factor is pushed too high: 1600-way merging is much slower than 40-way merging, even when it saves a merge pass.
  • Memory size is also important, but perhaps not as important as it seems. Merely doubling the memory size is not likely to bring much improvement, unless it happens to cross the threshold where it reduces the number of merge passes.
  • posix_fallocate() should be avoided because it silently falls back to very slow zero-filling when the underlying filesystem does not support pre-allocation.