Base64 is the kind of thing nobody thinks about until it’s the bottleneck. It’s just there — wrapping every JWT, every email attachment, every embedded image, every TLS certificate — quietly converting bytes to printable characters and back. A library call. Two lines.

Then one day you profile a service that decodes a few hundred megabytes of base64 a second and the flamegraph is seventy percent b64_decode. You stare at it for a while. Then you go figure out how to make it fast.

This is a walk through that journey: from a textbook scalar decoder doing about 400 MB/s, to a SIMD version that pushes 4 GB/s on a single core. Most of it is just learning to look at the problem sideways — base64 has more structure than it looks like, and SIMD is happy to exploit it.

Heads up: the playground, animations and benchmark in this post all run in your browser. Try them.

A quick refresher

Base64 takes 3 bytes of input (24 bits) and emits 4 ASCII characters, each carrying 6 of those bits. The alphabet is A–Z, a–z, 0–9, +, / — exactly 64 symbols, hence the name.

Type something below and watch the bits flow through.

ascii bits
6-bit groups
encoded

Notice how the colored boundaries don’t line up. The 8-bit byte boundaries (top row) and the 6-bit character boundaries (middle row) are deliberately misaligned. That’s the whole point: 8 and 6 share a common multiple of 24, so every 3 input bytes pack neatly into 4 output characters.

The scalar decoder

Going the other way — text back to bytes — is just inversion. For each input character, look up its 6-bit value, then mash four of them together into three bytes:

static const uint8_t lookup[256] = {
    /* 256-entry table: alphabet[i] -> i, everything else -> 0xFF */
    ['A' ... 'Z'] = 0,         /* placeholders; filled at init */
    ['a' ... 'z'] = 26,
    ['0' ... '9'] = 52,
    ['+'] = 62, ['/'] = 63,
};

size_t b64_decode_scalar(const char *src, size_t n, uint8_t *dst) {
    uint8_t *out = dst;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        uint32_t a = lookup[(uint8_t)src[i + 0]];
        uint32_t b = lookup[(uint8_t)src[i + 1]];
        uint32_t c = lookup[(uint8_t)src[i + 2]];
        uint32_t d = lookup[(uint8_t)src[i + 3]];

        uint32_t v = (a << 18) | (b << 12) | (c << 6) | d;
        *out++ = (v >> 16) & 0xFF;
        *out++ = (v >>  8) & 0xFF;
        *out++ =  v        & 0xFF;
    }
    return out - dst;
}

Clean. Correct. Roughly 400 MB/s on a modern x86 core with cached input. Now: where does the time actually go?

Where the cycles go

Three things hurt this loop:

  1. The lookup table. 256 bytes of cold-ish memory, four reads per iteration. Each access stalls for a few cycles when the cache line isn’t already there, and you can’t easily prefetch because the addresses depend on the input bytes.
  2. Branch mispredictions. A real decoder validates input. Every “is this a legal base64 character?” branch the CPU guesses wrong is ~15 wasted cycles.
  3. Throughput, not latency. Even with perfect cache behaviour, a CPU running at 4 GHz that processes 3 output bytes per iteration cannot exceed 12 GB/s, and the actual instruction count puts the ceiling well below that. We need to do more bytes per cycle, not the same bytes faster.

The fix for all three: stop processing one character at a time. Process sixteen.

SIMD: 16 bytes per instruction

A 128-bit SSE register holds 16 bytes. AVX2 doubles that to 32. AVX-512 goes to 64. The CPU has instructions that operate on every byte of the register simultaneously: a single paddb adds 16 pairs of bytes in one cycle. A pshufb does 16 parallel byte-lookups in one cycle. A pcmpgtb does 16 parallel range checks.

The animation below shows the same input being processed by a scalar loop (one byte at a time) versus the SIMD version (sixteen at once). Watch the cursors.

idle

By the time the scalar loop has crawled to the end, the SIMD version has finished sixteen full passes. The real-world speedup is less dramatic because SIMD instructions are slightly slower per instruction than their scalar counterparts — but the per-byte cost still drops by an order of magnitude.

To make this work we need two SIMD-friendly operations:

  1. Convert 16 ASCII bytes into their 16 corresponding 6-bit values, all at once.
  2. Pack those 16 six-bit values into 12 output bytes, all at once.

Both sound innocent. Both have a clever trick.

ASCII to 6-bit, without a table

The naive way to do step 1 is a 256-entry lookup table. But there’s no vectorized “look up 16 different addresses in memory in one cycle” instruction — pshufb only works on tables that fit in a register, which means 16 entries max.

The escape hatch: base64’s alphabet almost maps to value-minus-offset. If we split the input into 5 buckets, each bucket has a fixed offset:

range ASCII value offset
A–Z 65–90 0–25 −65
a–z 97–122 26–51 −71
0–9 48–57 52–61 +4
+ 43 62 +19
/ 47 63 +16

So we don’t need a value table — we need an offset table indexed by a compact category. And we can derive the category from the high nibble of the ASCII byte. Four bits of category, 16 entries, fits in a pshufb register.

// Five offsets, packed into a 16-byte vector indexed by category nibble.
const __m128i offset_lut = _mm_setr_epi8(
    /* 0..7 unused */ 0, 0, 0, 0, 0, 0, 0, 0,
    /* '+' '/' digits */ 19, 16, 4, 4,
    /* 'A'-'Z' */        -65, -65,
    /* 'a'-'z' */        -71, -71,
    0, 0
);

__m128i decode_chunk(__m128i in) {
    // Category: high nibble + a couple of fixups for the '+/' edge cases.
    __m128i hi  = _mm_srli_epi32(in, 4) & _mm_set1_epi8(0x0F);
    __m128i off = _mm_shuffle_epi8(offset_lut, hi);
    return _mm_add_epi8(in, off);   // 16 ASCII -> 16 six-bit values
}

Two instructions plus a shift. No memory loads. No branches. Sixteen characters converted in parallel.

Packing the bits

After step 1 we have 16 bytes in a register, each holding a 6-bit value in its low six bits — and two wasted bits in every byte. We need to compress that down: four 6-bit values per output triplet, three triplets per group of 12 input values, totalling 12 output bytes from 16 input bytes.

Step through what happens to a single quartet:

step 1 / 5

In SIMD-speak, this is two pmaddubsw and one pmaddwd for a chunk of 16 bytes — three multiply-and-add instructions that recombine the bits with the right shifts. Then one pshufb to drop the wasted bytes and pack the result tightly. Four instructions for 12 output bytes.

Putting it together

The full hot loop, in pseudo-intrinsics:

size_t b64_decode_simd(const char *src, size_t n, uint8_t *dst) {
    uint8_t *out = dst;
    while (n >= 16) {
        __m128i in   = _mm_loadu_si128((const __m128i *)src);
        __m128i vals = decode_chunk(in);       // 16 ASCII -> 16 six-bit values

        // multiply-add to merge adjacent 6-bit pairs into 12-bit words
        const __m128i merge_ab = _mm_set1_epi32(0x01400140);
        __m128i merged_ab = _mm_maddubs_epi16(vals, merge_ab);

        // merge again to land 24-bit triplets in 32-bit lanes
        const __m128i merge_cd = _mm_set1_epi32(0x00011000);
        __m128i merged_cd = _mm_madd_epi16(merged_ab, merge_cd);

        // pack: drop the high byte of each 32-bit lane, leave 12 bytes
        const __m128i shuf = _mm_setr_epi8(
            2, 1, 0,  6, 5, 4,  10, 9, 8,  14, 13, 12,
            -1, -1, -1, -1
        );
        __m128i packed = _mm_shuffle_epi8(merged_cd, shuf);

        _mm_storeu_si128((__m128i *)out, packed);

        src += 16;
        out += 12;
        n   -= 16;
    }
    // tail: fall back to the scalar decoder for the last <16 chars
    out += b64_decode_scalar(src, n, out);
    return out - dst;
}

Five SIMD instructions plus a load and a store, processing twelve output bytes per iteration. On a Zen 4 or recent Intel core that retires several SIMD instructions per cycle, this lands around 4 GB/s in the steady state — within shouting distance of memcpy. Which is wild, considering the loop is doing a non-trivial format conversion.

How fast is it really?

The plot below runs a scalar JavaScript decoder and the browser’s native atob against a 4 MB random base64 payload, right here, in your tab. Native atob in Chrome and Firefox uses essentially the same SIMD tricks described above (Lemire’s algorithm is also what they ship). The gap you see is roughly the gap between scalar C and SIMD C.

Results vary wildly by device — a recent laptop will easily clear 3 GB/s on the native path; a mid-range phone will sit closer to 1 GB/s. The ratio between the two columns is the part that’s stable: it’s the same algorithmic story everywhere.

What’s left on the table

A few directions to push this further if you genuinely need every last byte per second:

  • AVX2 / AVX-512. The exact same algorithm at 32-byte or 64-byte width gets you 2–4× more throughput at the cost of needing modern hardware. Lemire’s published benchmarks hit ~9 GB/s on AVX-512.
  • Validation. Real decoders need to reject invalid input. The trick is to compute a “is this byte legal?” mask in the same SIMD pass and OR-reduce it into a single error flag at the end. No branches in the hot loop.
  • URL-safe alphabet. Base64-URL uses - and _ instead of + and /. Re-tabling the offset LUT handles both variants with a compile-time switch.
  • ARM NEON. The same algorithm ports across cleanly. pshufb becomes tbl, pmaddubsw is a multiply-then-pairwise-add, the rest is shifts. Apple Silicon decodes around 4 GB/s using essentially this same recipe.

The deeper takeaway: SIMD isn’t magic, and it isn’t really about doing the same operation on more data. It’s about looking for latent parallelism in problems that look sequential — finding the structure (here, the contiguous ASCII ranges) that lets you collapse a memory-bound table lookup into a register-resident shuffle.

There’s a lot of it out there. Once you start seeing it, you can’t stop.