ASCII, Unicode and UTF-8: how text becomes bytes

Published June 13, 2026

Computers only store numbers, yet your screen is full of letters, punctuation and emoji. The bridge between the two is called character encoding: a set of rules that maps each character to a number, and each number to a pattern of bytes. When two programs disagree about which rules to use, you get the classic garbled text — Ã© instead of é, or a row of question marks where an emoji should be. Understanding the three layers below makes those bugs predictable instead of mysterious.

ASCII: 128 characters, one byte each

ASCII, from the 1960s, assigns a number from 0 to 127 to the English letters, digits, basic punctuation and a handful of control codes. The letter A is 65, a is 97, the digit 0 is 48. Because every value up to 127 fits in seven bits, each ASCII character fits in a single byte with one bit to spare. The catch is obvious: there is no room for accented letters, Arabic, Chinese, or the euro sign, let alone emoji.
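The mapping above can be checked directly. The following is a minimal Python sketch, not part of the original article; the helper name is_ascii is made up for illustration.

```python
# Look up the ASCII codes mentioned above and show them as seven-bit patterns.
codes = {ch: ord(ch) for ch in "Aa0"}
for ch, code in codes.items():
    print(ch, code, format(code, "07b"))

# Pure ASCII text encodes to exactly one byte per character.
ascii_bytes = "Hello".encode("ascii")
print(len(ascii_bytes))


def is_ascii(text: str) -> bool:
    """Hypothetical helper: True if every character fits in the 0-127 range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(is_ascii("cafe"))        # plain English letters
print(is_ascii("caf\u00e9"))   # contains é, which ASCII cannot represent
```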

Unicode: a number for every character

Unicode solves the limitation by giving a unique number, called a code point, to every character in every writing system, plus symbols and emoji. The code space runs from U+0000 to U+10FFFF, which allows 1,114,112 possible code points; they are written in hexadecimal, like U+00E9 for é or U+1F600 for a grinning face. Crucially, Unicode only assigns the numbers; it does not say how to store them as bytes. That job belongs to an encoding, and the most important one is UTF-8.
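To see code points without any bytes involved, here is a small Python sketch. The function name code_point_label is an illustrative assumption; ord, chr and sys.maxunicode are standard Python.

```python
import sys


def code_point_label(ch: str) -> str:
    """Hypothetical helper: format one character's code point as U+XXXX."""
    return f"U+{ord(ch):04X}"


e_acute = "\u00e9"        # é
grinning = "\U0001F600"   # grinning face emoji

print(code_point_label(e_acute))
print(code_point_label(grinning))

# chr() goes the other way: from a code point number back to the character.
print(chr(0xE9) == e_acute)

# The highest code point Python (and Unicode) allows is U+10FFFF.
print(hex(sys.maxunicode))
```

Note that nothing here says how many bytes each character occupies; that only becomes a question once an encoding is chosen.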

UTF-8: the encoding that won the web

UTF-8 is a variable-length scheme. Any ASCII character still takes exactly one byte, so old English text and source code stay byte-for-byte identical and fully compatible. Characters outside ASCII take two, three or four bytes as needed: é is two bytes, most Chinese characters are three, and most emoji are four. This backward compatibility plus universal coverage is why UTF-8 now encodes the overwhelming majority of the web.
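The byte counts above are easy to verify. This is a short Python sketch with sample characters chosen for illustration (中 stands in for "most Chinese characters").

```python
# One sample character for each UTF-8 length, written as escapes to avoid
# any ambiguity about how this snippet itself is saved.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]   # A, é, 中, grinning face

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", encoded.hex(" "))

# ASCII-only text produces the same bytes under both encodings.
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")
print(same_bytes)
```

Counting characters and counting bytes give different answers for the same string, which matters whenever a size limit is expressed in bytes.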

Why text gets garbled

Almost every encoding bug is a mismatch: text was written with one encoding and read with another. If a file saved as UTF-8 is opened as the older Windows-1252, each byte of a multi-byte character is misread as a separate symbol, so the two bytes of é show up as Ã©. The fix is rarely to 'repair' the text and usually to tell each program the right encoding. A few habits prevent most trouble:

Save and export as UTF-8 everywhere unless a system explicitly demands something else.
Declare the encoding in your files: <meta charset="utf-8"> for HTML, the Content-Type header for APIs.
When moving text through a URL or JSON, encode it first (percent-encoding for URLs, string escaping for JSON) so reserved characters and non-ASCII bytes survive the trip intact.
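The mismatch described in this section can be reproduced on purpose. The sketch below is illustrative Python, using the standard cp1252 codec as a stand-in for Windows-1252; it also shows JSON escaping as one way of encoding text before transport.

```python
import json

original = "\u00e9"                      # é
utf8_bytes = original.encode("utf-8")    # the bytes a UTF-8 file would contain

# Reading UTF-8 bytes with the wrong encoding: each byte becomes its own symbol.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# Reading the same bytes with the right encoding gives the original back.
correct = utf8_bytes.decode("utf-8")
print(correct == original)

# A four-byte emoji misread the same way turns into four symbols.
emoji_garbled = "\U0001F600".encode("utf-8").decode("cp1252")
print(len(emoji_garbled))

# JSON escaping: by default json.dumps emits ASCII-only output,
# and json.loads restores the original character.
as_json = json.dumps(original)
print(as_json)
print(json.loads(as_json) == original)
```

The bytes never changed in the garbled case; only the reader's assumption did, which is why declaring the encoding is usually the real fix.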

You can see these layers directly with our tools: the text-to-binary converter shows the raw bits behind each character, the URL encoder reveals how non-ASCII bytes become %-sequences, and Base64 demonstrates packing arbitrary bytes into safe printable text. Each one runs locally in your browser, so you can experiment with sensitive text without it ever leaving your device.
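For readers who prefer a script, the same three views can be approximated with Python's standard library. This is a rough sketch and not how the tools mentioned above are implemented.

```python
import base64
from urllib.parse import quote, unquote

text = "\u00e9"                 # é
raw = text.encode("utf-8")      # the underlying UTF-8 bytes

# View 1: the raw bits behind each byte.
bits = " ".join(f"{byte:08b}" for byte in raw)
print(bits)

# View 2: percent-encoding, where each non-ASCII byte becomes a %XX sequence.
percent = quote(text)
print(percent)
print(unquote(percent) == text)

# View 3: Base64, which packs arbitrary bytes into printable ASCII.
b64 = base64.b64encode(raw).decode("ascii")
print(b64)
print(base64.b64decode(b64) == raw)
```

All three outputs describe the same two bytes; they differ only in how those bytes are written down.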
