UTF-8 encoding

Everything inside a computer is represented as a combination of 0s and 1s. But we humans don’t really think in 1s and 0s. Even right now, as I’m writing this blog post, my computer is doing the hard work for me by converting some 1s and 0s and displaying them as text. But how do computers do that? This is a question I never really asked myself in my early years as a software engineer, but the Performance Engineering course from CSOsvita really opened my eyes to the topic. It comes down to one truth: we cannot interpret an arbitrary sequence of bits on its own; we also need to know how it’s encoded.

Let’s take a look at how UTF-8, the most popular text encoding on the internet, works.

How it works

First of all, UTF-8 operates on code points. A code point is just a number assigned to a character in Unicode. For example, the character “a” has the code point U+0061. UTF-8 then defines how those code points are converted to and from a binary representation.
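
To make the idea of a code point concrete, here is a small TypeScript sketch that prints a character’s code point in the usual U+XXXX notation. The helper name toCodePointLabel is made up for this illustration; codePointAt is the standard string method.

```typescript
// Hypothetical helper: format the first code point of a string as "U+XXXX".
function toCodePointLabel(ch: string): string {
  const cp = ch.codePointAt(0);
  if (cp === undefined) {
    throw new Error("expected a non-empty string");
  }
  // Unicode convention: uppercase hex, padded to at least 4 digits.
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toCodePointLabel("a")); // U+0061
console.log(toCodePointLabel("\u20AC")); // U+20AC (the euro sign)
```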

UTF-8 is a variable-length encoding; that is, a single code point takes from 1 to 4 bytes.

Here’s how code points are converted to UTF-8:

First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0xxxxxxx
U+0080             U+07FF            110xxxxx   10xxxxxx
U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

One of the cool features of UTF-8 is that ASCII characters are encoded exactly as they are in ASCII: one byte each, with the same value. As a result, for ASCII-heavy text it takes up less space than UTF-16 or UTF-32, whose code units are 16 and 32 bits wide, respectively. There’s a catch, though: UTF-16 is also a variable-length encoding, so a code point takes either 2 or 4 bytes. UTF-32 is fixed-width: always 4 bytes per code point.
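
The size difference is easy to check. The sketch below is only an illustration: encodedSizes is a hypothetical helper, it ignores byte order marks, and it relies on the built-in TextEncoder for the UTF-8 size and on the fact that a JS string’s length counts UTF-16 code units.

```typescript
// Hypothetical helper: size in bytes of `text` under each encoding (no BOM).
function encodedSizes(text: string): { utf8: number; utf16: number; utf32: number } {
  return {
    // TextEncoder always produces UTF-8.
    utf8: new TextEncoder().encode(text).length,
    // String length is measured in 16-bit code units.
    utf16: text.length * 2,
    // Array.from splits a string by code point; UTF-32 uses 4 bytes for each.
    utf32: Array.from(text).length * 4,
  };
}

console.log(encodedSizes("hello")); // utf8: 5, utf16: 10, utf32: 20
console.log(encodedSizes("\u65E5\u672C\u8A9E")); // "日本語" -> utf8: 9, utf16: 6, utf32: 12
console.log(encodedSizes("\u{1F609}")); // "😉" -> utf8: 4, utf16: 4, utf32: 4
```

Note the second line: for text made mostly of code points in the U+0800 to U+FFFF range, UTF-8 is actually larger than UTF-16.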

Here are a couple of example characters and how they’ll be encoded in UTF-8:
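
A few concrete characters can be checked with a small hand-written encoder that follows the table above row by row. This is a minimal sketch, not a production encoder: encodeCodePoint and toHex are hypothetical helpers written for this illustration.

```typescript
// Hypothetical helper: encode one Unicode code point into UTF-8 bytes.
function encodeCodePoint(cp: number): number[] {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + cp);
  }
  if (cp >= 0xd800 && cp <= 0xdfff) {
    throw new RangeError("surrogates cannot be encoded in UTF-8");
  }
  if (cp <= 0x7f) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    // 110xxxxx 10xxxxxx
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  }
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Hypothetical helper: print bytes as space-separated two-digit hex.
function toHex(bytes: number[]): string {
  return bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex(encodeCodePoint(0x61))); // "a"  -> 61
console.log(toHex(encodeCodePoint(0xe9))); // "é"  -> c3 a9
console.log(toHex(encodeCodePoint(0x20ac))); // "€"  -> e2 82 ac
console.log(toHex(encodeCodePoint(0x1f609))); // "😉" -> f0 9f 98 89
```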

Let’s now build a visualization of UTF-8.

Visualization

And we should actually start with an interesting fact: JavaScript (just like Java) uses UTF-16 to represent strings at runtime. This was a bit of a surprise to me; I always thought that JS used UTF-8. There are languages that use UTF-8 for strings, for example, Rust. So the task of visualizing UTF-8 with JS actually comes down to converting UTF-16 to UTF-8. I will go down that rabbit hole in a future post and try to implement it from scratch. But for now, let’s concentrate on UTF-8, and to do that, we’ll use a built-in JS API: TextEncoder.
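
The UTF-16 nature of JS strings is easy to observe. In the sketch below, utf16Units and utf8Bytes are hypothetical helper names; charCodeAt and TextEncoder are the standard APIs.

```typescript
// Hypothetical helper: the UTF-16 code units of a string, as hex.
function utf16Units(text: string): string[] {
  const units: string[] = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16));
  }
  return units;
}

// Hypothetical helper: the UTF-8 bytes of a string, as hex.
function utf8Bytes(text: string): string[] {
  return Array.from(new TextEncoder().encode(text), (b) =>
    b.toString(16).padStart(2, "0"),
  );
}

const wink = "\u{1F609}"; // 😉

console.log(wink.length); // 2, because length counts UTF-16 code units
console.log(utf16Units(wink).join(" ")); // d83d de09 (a surrogate pair)
console.log(utf8Bytes(wink).join(" ")); // f0 9f 98 89
```

One visible character, two UTF-16 code units in memory, four bytes once TextEncoder converts it to UTF-8.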

Input string hexadecimal representation

Character   Bytes
"h"         68
"e"         65
"l"         6c
"l"         6c
"o"         6f
" "         20
"😉"        f0 9f 98 89

Here’s the core of what powers this table:

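// `input` and `repr` are defined elsewhere in the component.
const map = useMemo(() => {
  // Number base and zero-padded width used to print a single byte.
  const radix = repr === "hex" ? 16 : 2;
  const pad = repr === "hex" ? 2 : 8;
  const textEncoder = new TextEncoder();
  const result: [string, string[]][] = [];
  // for...of walks the string by code point, so surrogate pairs stay together.
  for (const ch of input) {
    // UTF-8 bytes of this one code point.
    const encoded = textEncoder.encode(ch);
    const bytes = Array.from(encoded).map((b) =>
      b.toString(radix).padStart(pad, "0"),
    );
    result.push([ch, bytes]);
  }
  return result;
}, [input, repr]);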

Here, I’m iterating over the input string with for...of, which walks it one code point at a time (not one UTF-16 code unit at a time). Each code point is encoded into a Uint8Array with the help of TextEncoder, and then every byte is mapped to a hexadecimal or binary string.
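
If you want to play with the same logic outside of React, it can be pulled into a plain function. This is a sketch under a couple of assumptions: the function name encodePerCharacter and the Repr type are made up here, and "bin" is a guess at the name of the non-hex mode, since the snippet above only shows the "hex" value.

```typescript
// Assumed mode names; only "hex" appears in the original snippet.
type Repr = "hex" | "bin";

// Hypothetical standalone version of the memoized computation.
function encodePerCharacter(input: string, repr: Repr): [string, string[]][] {
  const radix = repr === "hex" ? 16 : 2;
  const pad = repr === "hex" ? 2 : 8;
  const textEncoder = new TextEncoder();
  const result: [string, string[]][] = [];
  // Iterate by code point and encode each one to UTF-8 separately.
  for (const ch of input) {
    const bytes = Array.from(textEncoder.encode(ch)).map((b) =>
      b.toString(radix).padStart(pad, "0"),
    );
    result.push([ch, bytes]);
  }
  return result;
}

for (const [ch, bytes] of encodePerCharacter("hi \u{1F609}", "hex")) {
  console.log(JSON.stringify(ch), bytes.join(" "));
}
// "h" 68
// "i" 69
// " " 20
// "😉" f0 9f 98 89
```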

I myself only recently learned about the TextEncoder API, and got to use it in practice for this blog post.

Experiments

Now, you can try inserting this family emoji into the input text box 🧑‍🧑‍🧒‍🧒, and observe the result.

Turns out, this emoji is actually composed of multiple code points:

  1. 🧑 (U+1F9D1)
  2. ‘ZERO WIDTH JOINER’ (U+200D)
  3. 🧑 (U+1F9D1)
  4. ‘ZERO WIDTH JOINER’ (U+200D)
  5. 🧒 (U+1F9D2)
  6. ‘ZERO WIDTH JOINER’ (U+200D)
  7. 🧒 (U+1F9D2)

Those “ZERO WIDTH JOINER” code points are displayed as an empty space in the table, because the joiner has no visible glyph of its own; its only job is to tell the renderer to combine the neighboring code points into a single glyph.
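
The same breakdown can be reproduced in code. In this sketch, listCodePoints is a hypothetical helper, and the emoji is written with escape sequences so the invisible joiners are easy to see in the source.

```typescript
// Hypothetical helper: list every code point of a string in U+XXXX notation.
function listCodePoints(text: string): string[] {
  // Array.from splits the string by code point, not by UTF-16 code unit.
  return Array.from(text, (ch) =>
    "U+" + (ch.codePointAt(0) as number).toString(16).toUpperCase().padStart(4, "0"),
  );
}

// The family emoji: four people joined by three ZERO WIDTH JOINERs (U+200D).
const family = "\u{1F9D1}\u200D\u{1F9D1}\u200D\u{1F9D2}\u200D\u{1F9D2}";

console.log(listCodePoints(family).join(" "));
// U+1F9D1 U+200D U+1F9D1 U+200D U+1F9D2 U+200D U+1F9D2

console.log(family.length); // 11 UTF-16 code units (4 * 2 + 3 * 1)
console.log(new TextEncoder().encode(family).length); // 25 UTF-8 bytes (4 * 4 + 3 * 3)
```

So what looks like a single character on screen is 7 code points and 25 bytes in UTF-8.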

You can try inputting the following example as well: র‍্য.

One more interesting thing I found is that the “heart” emoji ❤️ is actually a combination of two code points:

  1. ❤ ‘HEAVY BLACK HEART’ (U+2764)
  2. ‘VARIATION SELECTOR-16’ (U+FE0F)

Try to find something interesting as well!

Conclusion

UTF-8 is remarkable when you think about it. It keeps things simple for languages like English, while also giving us the power to represent every character from every language, and even emojis! It’s no wonder it became the go-to encoding for the web. In my opinion, knowing how UTF-8 works is essential for every programmer out there. It’s one of those fundamental technologies that, without a doubt, run our world.
