[Tech Blog] Accelaration of Collatz conjecture

CollatzFractal

The Wikipedia entry of the Collatz conjecture describes the simple algorithm used to generate so-called hailstone sequences:

Take any natural number n. If n is even, divide it by 2 to get n / 2. If n is odd, multiply it by 3 and add 1 to obtain 3n + 1. Repeat the process (which has been called “Half Or Triple Plus One”, or HOTPO) indefinitely. The conjecture is that no matter what number you start with, you will always eventually reach 1.

In the examples, it states:

Numbers with a total stopping time longer than any smaller starting value form a sequence beginning with:

1, 2, 3, 6, 7, 9, 18, 25, 27, 54, 73, 97, 129, 171, 231, 313, 327, 649, 703, 871, 1161, 2223, 2463, 2919, 3711, 6171, … (sequence A006877 in OEIS).

By looking up the A006877 sequence (“In the `3x+1′ problem, these values for the starting value set new records for number of steps to reach 1”) in OEIS, and following the link embedded in “T. D. Noe, Table of n, a(n) for n = 1..130 (from Eric Roosendaal’s data)” (under LINKS), one finds this list which — supposedly — lists the numbers with a larger stopping count than of any smaller number:

1
2
3
6
7
9
18
25
27
54
73
97
129
171
231
313
327
649
703
871
1161
2223
2463
2919
3711
6171
10971
13255
17647
23529
26623
34239
35655
52527
77031
106239
142587
156159
216367
230631
410011
511935
626331
837799
1117065
1501353
1723519
2298025
3064033
3542887
3732423
5649499
6649279
8400511
11200681
14934241
15733191
31466382
36791535
63728127
127456254
169941673
226588897
268549803
537099606
670617279
1341234558
1412987847
1674652263
2610744987
4578853915
4890328815
9780657630
12212032815
12235060455
13371194527
17828259369
31694683323
63389366646
75128138247
133561134663
158294678119
166763117679
202485402111
404970804222
426635908975
568847878633
674190078379
881715740415
989345275647
1122382791663
1444338092271
1899148184679
2081751768559
2775669024745
3700892032993
3743559068799
7487118137598
7887663552367
10516884736489
14022512981985
19536224150271
26262557464201
27667550250351
38903934249727
48575069253735
51173735510107
60650353197163
80867137596217
100759293214567
134345724286089
223656998090055
397612441048987
530149921398649
706866561864865
942488749153153
1256651665537537
1675535554050049
2234047405400065
2978729873866753
3586720916237671
4320515538764287
4861718551722727
6482291402296969
7579309213675935
12769884180266527
17026512240355369
22702016320473825
45404032640947650
46785696846401151

Here is the naive implementation of the hailstone sequence length generation algorithm:

static inline int hailstone(unsigned long n)
{
    int count = 0;
    
    while (n > 1)
    {
        if (n & 1)
            n = 3 * n + 1;
        else
            n >>= 1;

        ++count;
    }
    
    return count;
}

By mapping this function to the elements of the above list, one obtains the following list:

1: 0
2: 1
3: 7
6: 8
7: 16
9: 19
18: 20
25: 23
27: 111
54: 112
73: 115
97: 118
129: 121
171: 124
231: 127
313: 130
327: 143
649: 144
703: 170
871: 178
1161: 181
2223: 182
2463: 208
2919: 216
3711: 237
6171: 261
10971: 267
13255: 275
17647: 278
23529: 281
26623: 307
34239: 310
35655: 323
52527: 339
77031: 350
106239: 353
142587: 374
156159: 382
216367: 385
230631: 442
410011: 448
511935: 469
626331: 508
837799: 524
1117065: 527
1501353: 530
1723519: 556
2298025: 559
3064033: 562
3542887: 583
3732423: 596
5649499: 612
6649279: 664
8400511: 685
11200681: 688
14934241: 691
15733191: 704
31466382: 705
36791535: 744
63728127: 949
127456254: 950
169941673: 953
226588897: 956
268549803: 964
537099606: 965
670617279: 986
1341234558: 987
1412987847: 1000
1674652263: 1008
2610744987: 1050
4578853915: 1087
4890328815: 1131
9780657630: 1132
12212032815: 1153
12235060455: 1184
13371194527: 1210
17828259369: 1213
31694683323: 1219
63389366646: 1220
75128138247: 1228
133561134663: 1234
158294678119: 1242
166763117679: 1255
202485402111: 1307
404970804222: 1308
426635908975: 1321
568847878633: 1324
674190078379: 1332
881715740415: 1335
989345275647: 1348
1122382791663: 1356
1444338092271: 1408
1899148184679: 1411
2081751768559: 682
2775669024745: 685
3700892032993: 688
3743559068799: 794
7487118137598: 795
7887663552367: 808
10516884736489: 811
14022512981985: 814
19536224150271: 1585
26262557464201: 833
27667550250351: 846
38903934249727: 1617
48575069253735: 1638
51173735510107: 1651
60650353197163: 1659
80867137596217: 1662
100759293214567: 1820
134345724286089: 1823
223656998090055: 1847
397612441048987: 1853
530149921398649: 1856
706866561864865: 1859
942488749153153: 1862
1256651665537537: 1865
1675535554050049: 1868
2234047405400065: 1871
2978729873866753: 1874
3586720916237671: 1895
4320515538764287: 458
4861718551722727: 470
6482291402296969: 473
7579309213675935: 512
12769884180266527: 445
17026512240355369: 448
22702016320473825: 451
45404032640947650: 452
46785696846401151: 738

As it is easily visible, the first list (which is the left column in the second list) is “broken”: up to and including 1899148184679, the sequence lengths (right column in the second list) are indeed monotonically growing, but afterwards there are some numbers which have shorter sequence lengths than previous ones.

The problem becomes recalculating the sequence lengths for all odd numbers above 1899148184679 and checking if the current sequence length is greater than all previous ones.

Unless otherwise noted, tests were run on an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz with 4 GiB RAM running Ubuntu Linux 14.04

Using the naive implementation, one obtains ~1.37 M numbers checked per second (NCPS).

Using an inlined assembly language implementation of the function

static inline dword hailstone(qword n)
{
    dword retval;

    asm volatile("mov    %0,%%rax" : : "r"(n) : "rax");
    asm volatile(".intel_syntax noprefix");
    asm volatile("mov    rsi,rax" : : : "rsi");
    asm volatile("xor    r8d,r8d" : : : "r8");
    asm volatile(".align 16");
    asm volatile("top:");
    asm volatile("shr    rax,1" : : : "rax");
    asm volatile("test    rsi,1");
    asm volatile("lea    rsi,[rsi + 2 * rsi + 1]" : : : "rsi");
    asm volatile("cmovz    rsi,rax" : : : "rsi");
    asm volatile("inc    r8d" : : : "r8");
    asm volatile("mov    rax,rsi" : : : "rax");
    asm volatile("cmp    rsi,1");
    asm volatile("jnz    top");
    asm volatile(".att_syntax");
    asm volatile("mov    %%r8d,%0": "=r"(retval));

    return retval;
}

which uses conditional move instructions to avoid branch misprediction penalties improves the rate to ~2.02 M NCPS — a 48% improvement.

If we’re interested only in the sequence lengths of odd numbers, then the following function can be made to calculate them:

typedef unsigned int dword; 	/* 32 bit unsigned integer */
typedef unsigned long qword; 	/* 64 bit unsigned integer */

static inline dword bsfq(qword n)
{
    qword retval = -1;
    
    asm volatile("bsfq        %1,%0" : "=r"(retval) : "r"(n));
    
    return retval;
}

static inline int hailstone(unsigned long n)
{
    int count = 0;
    
    do
    {
        /* n is odd */
        dword shifter = bsfq(n = 3 * n + 1);
        
        /* accumulate the number of divisions by 2 and the initial
        tripling + 1 step */
        count += shifter + 1;
        
        /* n is even; do the required number of divisions by two */
        n >>= shifter;
        
    }
    while (n > 1);
    
    return count;
}

Using this function results in ~6.58 M NCPS — a 380% improvement over the naive implementation.

As all hailstone sequences eventually — presumably — end with 1, at least some of the numbers in the sequence are smaller than the initial number. Accordingly, it is possible to cache (precalculate and store) sequence lengths for some set of numbers (starting with 1) and once a particular sequence reaches a number which is in cache, accumulate the precalculated sequence length into the current count and terminate the calculation.

It is easily seen that numbers in hailstone sequences aren’t divisible by 3. Accordingly, one stores in the cache the sequence lengths of only those numbers that aren’t divisible by 3. This can be done by placing n’s sequence length in the cache at index n/3.

#define    CACHEFILE_SIZE    1012645888

word cache[CACHEFILE_SIZE / sizeof(word)];

void load_cache(void)
{
    FILE *fin = fopen("./hailstone-cache.bin", "rb");
    
    if (CACHEFILE_SIZE != fread(cache, sizeof(byte), CACHEFILE_SIZE, fin))
    {
        fprintf(stderr, "Read error on cache file\n");

        exit(-1);
    }
    
    fclose(fin);
}

static inline dword bsfq(qword n)
{
    qword retval = -1;
    
    asm volatile("bsfq        %1,%0" : "=r"(retval) : "r"(n));
    
    return retval;
}

static inline int hailstone(unsigned long n)
{
    int count = 0;
    
    do
    {
        dword shifter = bsfq(n = 3 * n + 1);
        
        count += shifter + 1;
        n >>= shifter;
        
        /* here n is odd and not divisible by 3 */
        if (n < sizeof(cache) / sizeof(word) * 3)
        {
            count += cache[n / 3];

            n = 1;
        }
    }
    while (n > 1);
    
    return count;
}

Using this ~1 GB sized cache (don’t ask why is the odd size) results in ~17.21 M NCPS — a 1156% improvement over the naive implementation.

The next step is to parallelize the sequence length calculations.

#define qCLKS_PER_SEC	3000000000UL	/* processor clock rate */

static inline qword rdtsc(void) {
    register qword acc128lo asm ("rax");
    register qword acc128hi asm ("rdx");
    
    asm volatile ("rdtsc" : "=r"(acc128lo), "=r"(acc128hi));
    
    return (acc128hi << 32) | acc128lo;
}

/* max_cnt has to be initialized to a nonzero value so it gets put in the DATA
section instead of the COMMON section */
volatile unsigned max_cnt = 1;
unsigned secs = 1;

unsigned word cache[<# of words in cache file>];

typedef struct
{
    qword starting_n;
    qword current_n;
    qword cnts;
    qword cached_cnts;
    dword increment;
} hailstone_t;

void *hailstone(void *parm)
{
    qword i;
    
    ((hailstone_t *)parm)->cnts = 0;
    ((hailstone_t *)parm)->cached_cnts = 0;

    for (i = ((hailstone_t *)parm)->starting_n; 1;
     i += ((hailstone_t *)parm)->increment)
    {
        qword n = i;
        dword cnt = 0;
        
        ((hailstone_t *)parm)->current_n = n;
        
        do
        {
            qword shifter, idx;

            n += 2 * n + 1;
            ++cnt;
            
            shifter = bsfq(n);
            n >>= shifter;
            cnt += shifter;

            idx = n / 3;
            if (idx < sizeof(cache) / sizeof(*cache))
            {
                cnt += cache[idx];
                ((hailstone_t *)parm)->cached_cnts += cache[idx];

                n = 1;
            }
        }
        while (n != 1);

        ((hailstone_t *)parm)->cnts += cnt;

        if (cnt > max_cnt)
        {
            max_cnt = cnt;
            printf("%lu: %u\n", i, cnt);
            fflush(stdout);
        }
    }
}

Which can be invoked by:

int main(int argc, char *argv[])
{
    FILE *fin;
    dword nthreads = atoi(argv[1]);
    dword i;
    pthread_t *pThreads;
    hailstone_t *pParms;
    qword next_sec;
    struct timespec req = { 0, 50000000 }, rem;
    
    if (!nthreads)
        nthreads = 1;
        
    if (!(fin = fopen(CACHEFILE, "rb")))
    {
        fprintf(stderr, "Couldn't open cache file '%s'\n", CACHEFILE);
        
        exit(-2);
    }

    if (1 != fread(cache, sizeof(cache), 1, fin))
    {
        fprintf(stderr, "Read error on cache file '%s'\n", CACHEFILE);
        
        exit(-2);
    }

    fclose(fin);
    
    if (!(pThreads = (pthread_t *)malloc(nthreads * sizeof(pthread_t))))
    {
        fprintf(stderr, "Couldn't allocate %lu bytes\n",
         nthreads * sizeof(pthread_t));
        
        exit(-3);
    }

    if (!(pParms = (hailstone_t *)malloc(nthreads * sizeof(hailstone_t))))
    {
        fprintf(stderr, "Couldn't allocate %lu bytes\n",
         nthreads * sizeof(hailstone_t));
        
        exit(-4);
    }
        
    next_sec = rdtsc() + qCLKS_PER_SEC;

    for (i = 0; i < nthreads; ++i)
    {
        int rc;

        pParms[i].starting_n = STARTING_N + i * 2;
        pParms[i].increment  = nthreads * 2;
        
        if (rc = pthread_create(pThreads + i, NULL,
                  hailstone, (void *)(pParms + i)))
        {
            fprintf(stderr,"Error - pthread_create() return code: %d\n", rc);
            exit(-100);
        }
    }
    
    while (1)
    {
        nanosleep(&req, &rem);
        
        if (rdtsc() >= next_sec)
        {
            qword min_n = pParms[0].current_n, all_cnts = pParms[0].cnts,
              all_cached_cnts = pParms[0].cached_cnts;
            dword h, m, s;
            
            for (i = 1; i < nthreads; ++i)
            {
                if (pParms[i].current_n < min_n)
                    min_n = pParms[i].current_n;
                    
                all_cnts += pParms[i].cnts;
                all_cached_cnts += pParms[i].cached_cnts;
            }

            next_sec += qCLKS_PER_SEC;
            s = secs++;
            m = s / 60;
            s %= 60;
            h = m / 60;
            m %= 60;
            
            fprintf(stderr, "\r%02u:%02u:%02u  %lu  %.0f/sec (%4.1f%%)",
           h, m, s, min_n, (min_n - STARTING_N + 1) / 2. / secs,
           100. * all_cached_cnts / all_cnts);
        }
    }

    /* Last thing that main() should do */
    pthread_exit(NULL);
}

Synchronized access to max_cnt is explicitly avoided as it presents a very significant performance hit (approx. one-third loss of performance) on parallel processing.

Using this (the final code) to run three threads results in ~29 M NCPS — an approx. 2000% improvement over the naive implementation.

On an Intel 4790K CPU with 4 physical cores and 4 virtual (HyperThreading) cores running at 4.6 GHz clock rate, using 32 GiB Crucial Ballistix DDR3 RAM overclocked to 2 GHz the code runs at 80 M NCPS. Unfortunately, a simple calculation shows that it would take 10 years at that rate to check all odd numbers up to 46785696846401151 (the last — wrong — element of the original list).

performance_result

Comments are closed.