3n+1 iteration: FPGA vs CPU

An algorithm for computational verification of the 3n+1 conjecture is implemented on an FPGA and on a fast 64-bit CPU. The performance of the FPGA is shown to be of the same order of magnitude as the performance of the CPU.

Introduction

The 3n+1 sequence is obtained by starting with a given positive integer n and repeatedly applying the following iteration:

   T(n) = n / 2          if n is even
   T(n) = (3n + 1) / 2   if n is odd    (*1)

The 3n+1 conjecture states that, for any starting number n, this sequence will eventually reach the number 1. No proof of the conjecture is known (as of 2006). The conjecture has been computationally verified for many starting numbers, and no counterexample has been found.
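
To make the iteration concrete, here is a minimal C sketch of a single step (my own illustration, not code from a verification project; it assumes the intermediate value stays below roughly 2^62, so that 3n + 1 cannot overflow a 64-bit integer):

   #include <stdint.h>

   /* One step of the 3n+1 iteration (*1). */
   static uint64_t T(uint64_t n)
   {
       if (n % 2 == 0)
           return n / 2;            /* even case: halve */
       else
           return (3 * n + 1) / 2;  /* odd case: 3n+1, then the halving that always follows */
   }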

On this page, I examine implementation aspects of 3n+1 verification algorithms. I try to answer two questions:

  1. Can an FPGA outperform a normal CPU on this problem?
  2. Is a 64-bit CPU significantly faster than a 32-bit CPU on this problem?

(*1) A slightly different formulation of the 3n+1 iteration is sometimes given, with odd n mapping to 3n+1 without the division by two. Since 3n+1 is even whenever n is odd, the division by two in (*1) merely merges each odd step with the even step that necessarily follows it, so both formulations have the same convergence properties.

Stopping time and maximum excursion

A straightforward method to verify convergence for a given starting number is to iterate through the sequence until the number 1 is reached. However, a significant shortcut is possible by assuming convergence of all numbers smaller than the current starting number. This assumption is justified, since a verification project typically tests the full range of starting numbers up to a certain limit.

The stopping time of n, abbreviated as σ(n), is the minimum number of iterations needed to reach a value that is smaller than n itself. In other words, the stopping time is the number of iterations needed to demonstrate convergence, given that all smaller starting numbers are already known to converge.
A σ-record occurs when σ(n) is strictly larger than any σ(m) for m<n.

The maximum excursion of a starting number n is the largest number that occurs anywhere in its sequence.
An excursion record occurs when the maximum excursion of n is strictly larger than the maximum excursion of any smaller starting number.
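
As an illustration, the following C sketch computes both quantities for a single starting number (a hypothetical helper, not taken from the project code; it assumes n ≥ 2 and intermediate values that fit in 64 bits). Tracking the maximum only up to the stopping point is enough for record hunting: an excursion record must occur before the stop, because everything after the stop also occurs in the sequence of the smaller value reached there.

   #include <stdint.h>

   /* Stopping time sigma(n) and the largest value seen before the stop. */
   static void test_number(uint64_t n, uint32_t *sigma, uint64_t *excursion)
   {
       uint64_t v = n, max = n;
       uint32_t steps = 0;
       do {
           v = (v % 2 == 0) ? v / 2 : (3 * v + 1) / 2;
           steps++;
           if (v > max)
               max = v;
       } while (v >= n);
       *sigma = steps;
       *excursion = max;
   }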

3n+1 iteration in an FPGA

An FPGA is a reprogrammable chip. For this project, I used a Xilinx XC3S200 clocked at 50 MHz. The FPGA sits on a development board and talks to my PC through an RS232 interface.

I designed a component that iterates through a 3n+1 sequence at a rate of one step per clock cycle. A 128-bit register is used to store intermediate values; this is sufficient for the numbers tested in this project. The component operates on a range of starting values, testing them one at a time. Each test consists of iterating through a sequence until convergence occurs (the value drops below the starting number).
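
One plausible way to achieve one step per clock cycle is to note that, for odd v, (3v+1)/2 equals (v>>1) + v + 1: the shift is free in hardware, so a step costs one 128-bit addition plus a multiplexer. The C model below mirrors this per-cycle update (my reconstruction of the idea, not the actual HDL; unsigned __int128 is a GCC extension, used here only to model the 128-bit register):

   typedef unsigned __int128 u128;   /* models the 128-bit working register */

   /* One clock cycle of the iterator component. */
   static u128 next_state(u128 v)
   {
       u128 half = v >> 1;                     /* pure wiring in hardware */
       return (v & 1) ? half + v + 1 : half;   /* odd: (3v+1)/2, even: v/2 */
   }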

The FPGA has enough resources to fit 4 parallel instances of the iterator component, together with some control logic and an RS232 driver. Since the iterators work on different parts of the test range in parallel, the speed of the combined device is 4 times the speed of a single iterator.

The device computes the σ-record and excursion record for a given block of starting numbers, and sends the records back to the PC. Results are logged and analyzed on the PC. If a block record is also a global record, the block is recomputed in software to discover the exact n where the record occurs.

The table below summarizes test results for two variants of the design. The first, simple design uses one optimization: Only starting numbers that have remainder 3 modulo 4 are tested. This is justified because even numbers converge immediately and numbers that have remainder 1 modulo 4 converge in two steps: n = 4k+1 maps to 6k+2 and then to 3k+1, which is smaller than n for k ≥ 1.
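
In code, the resulting enumeration might look as follows (block_start and block_end are hypothetical names; block_start is assumed to be a multiple of 4):

   #include <stdint.h>

   /* Test every n ≡ 3 (mod 4) in [block_start, block_end);
      record keeping is omitted. */
   static void test_block(uint64_t block_start, uint64_t block_end)
   {
       uint32_t sigma;
       uint64_t excursion;
       for (uint64_t n = block_start + 3; n < block_end; n += 4)
           test_number(n, &sigma, &excursion);   /* from the earlier sketch */
   }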

The enhanced design uses two additional optimizations (not invented by me but copied from other authors).
Starting numbers that have remainder 2 modulo 3 are not tested, since any number of the form 3k+2 also occurs as the first iterate of the smaller starting number 2k+1.
Furthermore, for fixed w and q, σ(2^w + q) ≤ w implies that σ(k·2^w + q) ≤ w for any k ≥ 1. (To understand this, observe that the w low-order bits of a number completely determine the next w steps in the sequence.) This optimization is applied with w = 24 in the enhanced design: The numbers in the test range are visited in interlaced order, such that all numbers with the same 24 low-order bits are grouped together. If the first number of a group converges within 24 steps, the rest of the group is skipped.
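
The sketch below shows one way to realize the grouping in software (hypothetical code; the FPGA design tests the first number of each group rather than a fixed representative, but the principle is the same). Both the mod-3 filter and the 24-bit sieve are applied:

   #include <stdint.h>
   #include <stdbool.h>

   #define W 24   /* the enhanced design groups numbers on their 24 low-order bits */

   /* If the representative 2^W + q converges within W steps, then so does
      every number whose W low-order bits equal q, because the first W
      steps depend only on those bits. */
   static bool converges_within_w(uint64_t q)
   {
       uint64_t n = ((uint64_t)1 << W) + q;
       uint64_t v = n;
       for (int i = 0; i < W; i++) {
           v = (v & 1) ? (3 * v + 1) / 2 : v / 2;
           if (v < n)
               return true;
       }
       return false;
   }

   /* Interlaced traversal of [block_start, block_end), one group per q;
      block_start is assumed to be a multiple of 2^W. */
   static void test_block_sieved(uint64_t block_start, uint64_t block_end)
   {
       uint32_t sigma;
       uint64_t excursion;
       for (uint64_t q = 3; q < ((uint64_t)1 << W); q += 4) {   /* q ≡ 3 (mod 4) */
           if (converges_within_w(q))
               continue;                        /* skip the whole group */
           for (uint64_t n = block_start + q; n < block_end; n += (uint64_t)1 << W) {
               if (n % 3 == 2)
                   continue;                    /* 3k+2 is the first iterate of 2k+1 */
               test_number(n, &sigma, &excursion);  /* from the earlier sketch */
           }
       }
   }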

Results from FPGA experiments
   Design variant    Block size   Block time   Speed (*2)
   simple design     2^30         13.4 s       8.0×10^7
   enhanced design   2^34         42.0 s       40.9×10^7

(*2) Here, the speed is defined as the total rate of expansion of the tested range in numbers per second. Numbers that are skipped as a result of optimizations are still counted as part of the tested range. For example, the simple design covers a block of 2^30 numbers in 13.4 s, giving 2^30 / 13.4 ≈ 8.0×10^7 numbers per second. The speed is measured for the whole device (it is the sum of the contributions of the individual iterators).

3n+1 iteration on a 64-bit CPU

Processors of the x86-64 family support a new 64-bit instruction set in addition to the classic 32-bit instructions. It seems likely that 3n+1 sequences over large numbers can be computed significantly faster in terms of 64-bit operations than in terms of 32-bit operations. To quantify the difference, I ran some benchmarks.

The benchmark consists of computing the σ-record and excursion record for a range of starting numbers. The basic 3-mod-4 optimization is used, but more advanced optimizations were not attempted.

Two variants of the benchmark were considered: The first variant uses 64-bit variables for intermediate values. It can only be used on a limited range of starting numbers, as larger starting values may cause overflow. The second variant uses 128-bit variables for intermediate values. This variant is tested on larger numbers to ensure that the extra storage is actually used in the benchmark.
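
For the 128-bit variant, the step might be sketched in C as follows, using the 128-bit integer type that GCC provides on 64-bit targets (spelled unsigned __int128 in modern GCC; older versions expose it via __attribute__((mode(TI)))). This is also the reason the pure-C 128-bit variant has no i386 entry in the table below:

   typedef unsigned __int128 u128;   /* 64-bit targets only */

   /* 3n+1 step on a 128-bit working value: enough headroom for the
      large intermediate values that would overflow 64-bit variables. */
   static inline u128 step128(u128 v)
   {
       return (v & 1) ? (3 * v + 1) / 2 : v / 2;
   }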

Both variants were implemented in several different ways, in C and in assembly language, using 32-bit i386 instructions and using 64-bit x86-64 instructions. Benchmarks were compiled with GCC 3.4.4 and tested on an AMD Athlon64 3400+ 2.4 GHz. Results are summarized in the two tables below.
Source code is available: collatz_cpu.c.

Iteration using a 64-bit working value; test range: 3 ... 2^32
   CPU mode   Code                  Speed (*2)   Improvement
   i386       pure C                6.55×10^7
   i386       baseline assembler    6.77×10^7    3% faster than C
   i386       optimized assembler   8.61×10^7    27% faster than baseline
   x86_64     optimized assembler   13.2×10^7    54% faster than i386
   x86_64     pure C                9.92×10^7    52% faster than i386 C

On the 64-bit-value benchmark, I started with a straightforward i386 assembler implementation. An optimized version was obtained by rearranging the inner loop and eliminating a data dependent branch (the choice between odd and even). The x86-64 assembler code is based on the same principles but independently optimized for best performance. The result demonstrates that the best x86-64 code is significantly faster than the best i386 code.
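
The branch elimination rests on the same identity used earlier: for odd v, (3v+1)/2 = (v>>1) + v + 1. A mask derived from the low bit then selects between the two cases without a conditional jump. A C sketch of the idea (the assembler versions differ in detail):

   #include <stdint.h>

   /* Branch-free 3n+1 step: mask is all-ones for odd v, zero for even v,
      so the odd/even choice costs arithmetic instead of a branch. */
   static inline uint64_t step_branchfree(uint64_t v)
   {
       uint64_t mask = (uint64_t)0 - (v & 1);   /* 0 or 0xFFFF...FFFF */
       return (v >> 1) + (mask & (v + 1));
   }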

Iteration using a 128-bit working value; test range: 10^6×2^28 ... (10^6+16)×2^28
   CPU mode   Code                  Speed (*2)   Improvement / comment
   i386       pure C                -            Not supported: compiler provides no 128-bit integer type.
   i386       baseline assembler    5.18×10^7
   i386       optimized assembler   6.11×10^7    18% faster than baseline
   x86_64     optimized assembler   9.92×10^7    62% faster than i386
   x86_64     pure C                4.58×10^7    Compiler uses slow XMM registers for 128-bit operations.

On the 128-bit-value benchmark, the i386 assembler implementation is under high register pressure. Branch elimination is therefore not feasible, but rearranging the inner loop still yields a significant optimization. Again, the x86-64 code is clearly much faster than the i386 code. This improvement is only partly a result of using 64-bit operations; it is also due to the extra general purpose registers available on x86-64, solving the problem of register pressure.

Conclusions

The speed of the simple FPGA design (8.0×10^7 numbers per second) can be compared to the speed of the optimized x86_64 implementation (13.2×10^7). Both implementations use the same set of algorithmic optimizations. My conclusion is that the FPGA cannot keep up with the CPU, but still performs in the same order of magnitude.

Of more interest is the comparison of the speed of the best achieved FPGA design (40.9×10^7) to the speed of the best known software implementation. Oliveira e Silva currently seems to have the fastest program; I estimate its speed at around 2×10^9, about 5 times the speed of my best FPGA design. Perhaps some of the optimizations used in that software could be carried over to the FPGA design. But as it stands, the FPGA design is losing the race against the best known software.

The comparison between FPGA and CPU is inherently imprecise. After all, the FPGA performance can be trivially improved by upgrading to a larger or faster chip, just as the CPU performance can be improved by upgrading to a faster model. The results presented here therefore do not provide a definitive conclusion on which method is faster. They do indicate, however, that neither method has a huge performance advantage over the other.

The x86-64 instruction set was shown to be much faster than the i386 instruction set when tested on an Athlon64 processor. In general, benchmarks such as these lead to varying results on processors of a different brand or model. Still, I think it is reasonable to conclude that native 64-bit operations are important for performance on the 3n+1 iteration.

References

Gary T. Leavens and Mike Vermeulen (1992) report records and other statistics for all starting numbers up to 5.6×10^13.
Eric Roosendaal verified convergence of all numbers up to 2^59 and lists the corresponding excursion records.
Tomás Oliveira e Silva verified the convergence of all numbers up to 2^62 and computed σ-records and excursion records.

I would like to thank Eric Roosendaal for his detailed explanation of the sieve-based optimization.
I would like to thank Sidney Cadot for teaching me the basics of FPGA programming.