Unicode and UTF-8 Explained: Why Text Sometimes Turns Into Garbled Symbols

Almost everyone has seen garbled text somewhere — an apostrophe that renders as "â€™", or a name with an accent that turns into a string of question marks. This nearly always comes down to a mismatch between how text was encoded and how it's being decoded, and understanding the basic mechanism makes the problem much easier to diagnose.

Unicode: a universal numbering system for characters

Unicode assigns every character in every supported writing system a unique number, called a code point — the letter "A" is U+0041, the emoji 😀 is U+1F600, and the Devanagari letter "अ" is U+0905. Unicode itself doesn't say how those numbers should be stored as bytes on a disk or sent over a network — that's a separate job, handled by an encoding.
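As a minimal sketch of the idea, Python's built-in ord() and chr() convert between a character and its code point. The helper name code_point_label is made up for this illustration and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# The three characters mentioned above.
for ch in ["A", "\U0001F600", "\u0905"]:
    print(ch, code_point_label(ch))

# chr() goes the other way: from a code point number back to the character.
print(chr(0x0041))
```

Note that nothing here involves bytes yet: a code point is just a number until an encoding turns it into bytes.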

UTF-8: the encoding that actually stores the bytes

UTF-8 is the most widely used encoding for translating Unicode code points into the bytes computers actually read, write, and transmit. Its key design choice is variable length. The 128 ASCII characters (basic English letters, digits, and common punctuation) take just 1 byte each, while everything else takes 2 to 4 bytes: 2 for most accented Latin letters and for scripts such as Greek and Cyrillic, 3 for most other scripts and symbols, and 4 for emoji and other characters with code points above U+FFFF. This is why UTF-8 became the dominant encoding on the web: it's fully backward-compatible with ASCII, the older English-only encoding, so any valid ASCII file is already valid UTF-8, while still being able to represent every character Unicode defines.
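The variable-length behavior is easy to observe. The sketch below encodes one character from each size class with Python's str.encode and reports how many bytes result. The helper name utf8_length and the sample characters are chosen for this illustration only.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# One sample per size class: ASCII letter, accented Latin letter,
# Devanagari letter, emoji.
samples = ["A", "\u00e9", "\u0905", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # bytes.hex(" ") prints the raw bytes as space-separated hex pairs.
    print(ch, utf8_length(ch), encoded.hex(" "))
```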

Why garbled text happens

Garbled text ("mojibake") happens when bytes encoded in UTF-8 are decoded using the wrong encoding. The usual culprit is an older single-byte encoding like Windows-1252, which interprets each byte as one character regardless of how many bytes the original character actually used. A single UTF-8 character that used 2 bytes gets misread as two separate, unrelated characters. For example, "é" is stored in UTF-8 as the two bytes C3 A9, which Windows-1252 displays as "Ã©". This produces exactly the kind of scrambled symbol strings people are used to seeing when a webpage's encoding is misconfigured.
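The mismatch can be reproduced in a few lines. This sketch encodes a sample string as UTF-8 and then deliberately decodes the bytes as Windows-1252 (named "cp1252" in Python). The sample text is invented for the illustration. The last step reverses the mistake, which works here only because every byte survived the wrong decoding; text that has been re-saved or had characters replaced along the way may not be recoverable like this.

```python
# Sample text with a curly apostrophe (U+2019) and an accented e (U+00E9).
original = "It\u2019s caf\u00e9 time"

# Correct step: store the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# The mistake: read those same bytes as if they were Windows-1252.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# Undo the mistake: turn the garbled characters back into the original
# bytes, then decode them with the encoding that was actually used.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```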

Where HTML entities fit in

HTML entities (like &amp; for "&", &#39; for an apostrophe, or &eacute; for "é") are a separate, older mechanism for representing special characters safely inside HTML, independent of file encoding. They're useful specifically because certain characters, like < and &, have special meaning in HTML syntax, so writing them as entities keeps them from being misinterpreted as markup. Escape sequences serve a similar purpose in JavaScript and JSON, representing special or non-printable characters (like a newline, \n) in a way that survives being embedded in code or transmitted as plain text.
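Both mechanisms can be tried with Python's standard library. This is a sketch using the html and json modules; the sample strings are invented for the illustration. By default html.escape only rewrites the characters that are special in HTML (and quotes), leaving letters like "é" alone, while json.dumps escapes non-ASCII characters and control characters such as the newline.

```python
import html
import json

# Escape the characters that HTML would otherwise treat as markup.
raw_html_text = "a < b & c"
escaped = html.escape(raw_html_text)
print(escaped)

# Going the other way: turn entities back into the characters they name.
decoded = html.unescape("caf&eacute; &amp; bar")
print(decoded)

# JSON escape sequences: the accented letter and the newline are written
# as \u00e9 and \n inside the JSON string.
json_text = json.dumps("caf\u00e9\n")
print(json_text)

# Parsing the JSON restores the original characters.
round_tripped = json.loads(json_text)
```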

Try it yourself

Our Unicode Converter converts text to and from Unicode code points, JavaScript escape sequences, and HTML entities, which is useful for debugging encoding issues or safely embedding special characters in HTML or JSON.

This guide is for general understanding of text encoding concepts.

Frequently asked questions

Why do apostrophes and quotation marks sometimes turn into weird symbols?

This is a classic sign of UTF-8 text being decoded with the wrong encoding (often Windows-1252). Curly "smart" quotes and apostrophes each take three bytes in UTF-8, and a single-byte encoding displays each of those bytes as its own character, which is how a curly apostrophe (’) becomes "â€™".

Is UTF-8 the same as Unicode?

No — Unicode is the numbering system assigning a code point to every character; UTF-8 is one specific way (the most common one) of encoding those code points as bytes.

When do I need to use HTML entities instead of typing a character directly?

Mainly for characters with special meaning in HTML syntax, like <, >, and & — typing them directly can be misread as markup, so encoding them as entities keeps the intended character displaying correctly.