Complete Guide to Text Encoding

March 2025

Text encoding converts characters into formats that systems can store, transmit, or process safely. Base64 makes binary data safe for JSON and URLs. URL encoding handles special characters in query strings. HTML entities prevent markup from breaking. UTF-8 has become the dominant web encoding: W3Techs reports that over 98% of websites use UTF-8 as their character encoding. This guide explains when to use each scheme and how to work with them using our encoding tools.

Encoding issues surface in subtle ways: mojibake (garbled characters), broken links, failed API requests, or security vulnerabilities. A form that accepts "Café" but stores it as "CafÃ©" has an encoding mismatch: UTF-8 bytes are being read as Latin-1. A URL with an unencoded ampersand can truncate a query string. Understanding encoding prevents these bugs. For a shorter overview, see text encoding explained.

What is text encoding?

Encoding transforms data from one representation to another. Unlike encryption, encoding does not require a secret key. Anyone who knows the scheme can reverse it. The goal is compatibility: making data fit constraints of storage, transmission, or syntax.

A simple example: the character "A" has a numeric code (65 in ASCII). When you send "A" over a network, it travels as a byte. Different systems might interpret that byte differently depending on their character set. UTF-8, defined in RFC 3629, solves this by using a variable-length scheme that can represent every character in the Unicode standard while remaining backward-compatible with ASCII for the first 128 code points.
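A few lines of Python (standard library only) make the code-point-to-bytes step concrete:

```python
# Code points are numbers; UTF-8 turns each one into 1-4 bytes.
print(ord("A"))              # 65: the ASCII/Unicode code point for "A"
print("A".encode("utf-8"))   # b'A': the first 128 code points stay single bytes
print("é".encode("utf-8"))   # b'\xc3\xa9': U+00E9 takes two bytes
print("€".encode("utf-8"))   # b'\xe2\x82\xac': U+20AC takes three bytes
```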

The distinction between encoding and encryption matters. Base64 is encoding. AES is encryption. Base64 data can be decoded by anyone. Encrypted data requires a key. If you need to hide information, use encryption, not encoding. Our Base64 encode/decode tool handles the encoding side.

Base64 encoding

Base64 encodes binary data as ASCII text using 64 printable characters: A-Z, a-z, 0-9, plus + and /. The scheme is defined in RFC 4648. It expands data by roughly 33%: every 3 bytes become 4 characters. That tradeoff exists because many protocols (email, JSON, XML) were designed for text, not raw bytes.

The algorithm works in 24-bit chunks. Three 8-bit bytes (24 bits) are split into four 6-bit groups. Each 6-bit value maps to one of the 64 characters. Padding is added when the input length is not divisible by 3: one = for two bytes, two = for one byte. Decoders strip padding before reversing the mapping. Our Base64 glossary entry has a quick reference.
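Python's base64 module shows the 3-bytes-to-4-characters mapping and the padding rule directly:

```python
import base64

# 3 input bytes -> 4 output characters, no padding needed.
print(base64.b64encode(b"abc"))   # b'YWJj'
# 2 input bytes -> one "=" pad; 1 input byte -> two "=" pads.
print(base64.b64encode(b"ab"))    # b'YWI='
print(base64.b64encode(b"a"))     # b'YQ=='
# Decoding strips the padding and reverses the 6-bit mapping.
print(base64.b64decode(b"YQ=="))  # b'a'
```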

When would you use it? Embedding images in HTML or CSS as data URIs. Storing file contents in JSON. Sending binary attachments through APIs that expect text. Authentication tokens (like JWT) often use Base64 for the payload, though the payload itself may be encrypted or signed separately.

Data URIs start with data:image/png;base64, followed by the Base64 string. Browsers decode and display the image inline. This avoids an extra HTTP request but increases HTML size and prevents caching. Use for small icons; avoid for large images. The Base64 encode/decode tool can produce these strings.
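Building a data URI takes only a few lines; the helper name and placeholder bytes below are illustrative, and real usage would read actual image bytes from a file:

```python
import base64

def make_data_uri(data: bytes, mime: str = "image/png") -> str:
    # Prefix the Base64 payload with the data: scheme and MIME type.
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"

uri = make_data_uri(b"fake-image-bytes")
print(uri)  # data:image/png;base64,ZmFrZS1pbWFnZS1ieXRlcw==
```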

Base64 output is safe for URLs in most cases, but the standard alphabet includes + and /, which have special meaning in URLs. A URL-safe variant replaces + with - and / with _. Some implementations also strip padding (the = characters at the end). When working with URLs, verify which variant your system expects.
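The difference between the two alphabets is easy to demonstrate; the input bytes below are chosen so the standard output lands on index 62 (the + character) four times:

```python
import base64

raw = b"\xfb\xef\xbe"                  # four 6-bit groups of 111110 = 62
print(base64.b64encode(raw))           # b'++++': standard alphabet
print(base64.urlsafe_b64encode(raw))   # b'----': + replaced with -, / with _
```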

One limitation: Base64 is not compression. It increases size. If you have large binary data, consider compressing first, then encoding. The resulting string will often be smaller than encoding the raw data. Gzip before Base64 is a common pattern for APIs that need to send compressed payloads as text.
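A sketch of the compress-then-encode pattern using Python's gzip and base64 modules:

```python
import base64
import gzip

data = b"repetitive payload " * 200            # redundant data compresses well
plain_b64 = base64.b64encode(data)
gzip_b64 = base64.b64encode(gzip.compress(data))
print(len(plain_b64), len(gzip_b64))           # the compressed form is far smaller
# The receiver reverses the steps: decode, then decompress.
assert gzip.decompress(base64.b64decode(gzip_b64)) == data
```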

URL encoding (percent-encoding)

URL encoding, also called percent-encoding, replaces unsafe characters with % followed by two hexadecimal digits. A space becomes %20. An ampersand becomes %26. The RFC 3986 specification defines which characters are reserved and which must be encoded. The URL encoding glossary summarizes the scheme.

Unreserved characters (A-Z, a-z, 0-9, -, ., _, ~) never need encoding. Everything else might. Percent-encoding uses uppercase hex digits by convention (e.g., %2F not %2f), though most decoders accept both. Consistency helps when debugging and when URLs are logged or compared.

Reserved characters vary by context. In a path segment, / is reserved. In a query string, &, =, and + have special meaning. When you build URLs programmatically, you must encode each component correctly. JavaScript has encodeURIComponent() for query values and encodeURI() for full URLs; they treat characters differently. encodeURI() leaves : / ? # [ ] @ ! $ & ' ( ) * + , ; = unencoded. encodeURIComponent() encodes everything except A-Z, a-z, 0-9, and - _ . ! ~ * ' ( ). Use encodeURIComponent() for query parameter values. Python offers urllib.parse.quote() and quote_plus(). Getting this wrong leads to broken links or security issues.
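On the Python side, the quote()/quote_plus() split mirrors the path-versus-form distinction:

```python
from urllib.parse import quote, quote_plus

print(quote("new york"))        # 'new%20york': %20 for spaces (path style)
print(quote_plus("new york"))   # 'new+york': + for spaces (form style)
print(quote("a/b", safe=""))    # 'a%2Fb': pass safe="" to encode / inside a value
```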

Spaces deserve attention. Historically, + represented a space in query strings (application/x-www-form-urlencoded). In URL paths, %20 is standard. Some systems accept both; others do not. When in doubt, use %20. Our URL encode/decode tool handles both encoding and decoding so you can test behavior.

Internationalized domain names (IDNs) and non-ASCII characters in paths add complexity. Punycode converts Unicode domain labels to ASCII (xn--). For path segments, percent-encode UTF-8 bytes. A character like é (U+00E9) becomes %C3%A9 in UTF-8. Browsers often display the decoded form in the address bar while storing the encoded form internally.
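Both conversions are available in the Python standard library:

```python
from urllib.parse import quote

print(quote("café"))             # 'caf%C3%A9': é as percent-encoded UTF-8 bytes
print("münchen".encode("idna"))  # b'xn--mnchen-3ya': Punycode for a Unicode label
```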

Double encoding happens when already-encoded strings get encoded again. %20 becomes %2520. Decoders may or may not handle this correctly. Avoid double encoding by encoding only once, at the point where you construct the URL.
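Double encoding is easy to reproduce and to spot:

```python
from urllib.parse import quote, unquote

once = quote("a b")     # 'a%20b'
twice = quote(once)     # 'a%2520b': the % sign itself got encoded as %25
print(twice)
print(unquote(twice))   # 'a%20b': one decode removes only one layer
```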

HTML entities

HTML entities let you represent characters that would otherwise break markup or be invisible. The less-than sign < would start a tag. The ampersand & would start an entity. To display them literally, you use &lt; and &amp;. Browsers decode these when rendering.

Named entities exist for common characters: &nbsp; for non-breaking space, &copy; for the copyright symbol, &ndash; for an en dash. Numeric entities work for any Unicode character: &#8212; for an em dash, &#x263A; for a smiley (hex). The numeric form is more portable: it works in any HTML or XML context, while most named entities are defined only for HTML (XML predefines just five).

When displaying user-generated content in HTML, you must escape <, >, &, and quotes. Otherwise, attackers can inject scripts (XSS). Server-side frameworks usually provide an escape function. For one-off conversions, our HTML entities tool encodes and decodes both ways.
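In Python, the html module handles both directions:

```python
import html

user_input = '<script>alert("xss")</script> & friends'
safe = html.escape(user_input)   # escapes < > & and both quote characters
print(safe)  # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt; &amp; friends
print(html.unescape(safe) == user_input)  # True: escaping is reversible
```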

JSON has different rules. In JSON strings, you escape quotes and backslashes with backslash, but angle brackets and ampersands are fine. Do not confuse HTML escaping with JSON escaping. Each context has its own requirements.
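A short comparison makes the difference between the two escaping rules visible:

```python
import html
import json

s = 'He said "5 < 6" & left'
print(json.dumps(s))   # quotes escaped with backslashes; < and & untouched
print(html.escape(s))  # < and & escaped as entities; no backslash rules
```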

Character encodings: UTF-8, ASCII, and Unicode

ASCII uses 7 bits (128 code points) and covers English letters, digits, and basic symbols. It cannot represent accents, CJK characters, or emoji. Extended ASCII variants use the 8th bit for more characters, but they conflict with each other. Code page 1252 (Windows Western European) is different from ISO-8859-1 (Latin-1). Sending the wrong one produces mojibake. A file saved as CP1252 and opened as Latin-1 will show wrong characters for bytes 0x80–0x9F.
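Mojibake is easy to reproduce in Python by decoding UTF-8 bytes with the wrong codec:

```python
data = "Café".encode("utf-8")   # b'Caf\xc3\xa9'
print(data.decode("latin-1"))   # 'CafÃ©': each UTF-8 byte read as its own character
print(data.decode("utf-8"))     # 'Café': correct when decoded with the right codec
```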

Unicode assigns a code point to every character across all scripts. As of Unicode 15, there are over 149,000 characters. UTF-8 encodes Unicode as variable-length bytes. ASCII characters stay as single bytes (0-127). Other characters use 2, 3, or 4 bytes. UTF-8 is backward-compatible with ASCII and avoids null bytes in the middle of strings, which helps with C-style string handling.

The UTF-8 BOM (byte order mark) is the sequence EF BB BF at the start of a file. It signals UTF-8 and helps some programs detect the encoding. Many Unix tools dislike BOMs; Windows Notepad adds them. For web delivery, omitting the BOM is usually preferable. BOMs can break shebangs in scripts (e.g., #!/bin/bash) if the file starts with a BOM.
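Python distinguishes the two cases with the utf-8 and utf-8-sig codecs:

```python
bom_file = b"\xef\xbb\xbfhello"            # file contents starting with a UTF-8 BOM
print(repr(bom_file.decode("utf-8")))      # '\ufeffhello': BOM survives as a character
print(repr(bom_file.decode("utf-8-sig")))  # 'hello': BOM stripped on decode
```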

UTF-16 uses 2 or 4 bytes per character. Java and JavaScript historically used UTF-16 internally. UTF-16 has big-endian (UTF-16BE) and little-endian (UTF-16LE) variants; the BOM (FF FE or FE FF) indicates which. UTF-32 uses 4 bytes for every character; it is simple but wasteful. For the web, UTF-8 dominates. HTML5 specifies UTF-8 as the default. APIs and databases default to UTF-8. Use our character encoding converter to convert between UTF-8, UTF-16, and other encodings.

Declare encoding in your HTML: <meta charset="UTF-8"> in the head. Place it early; parsing can change if the charset is discovered late. For HTTP, send Content-Type: text/html; charset=UTF-8. The HTTP header overrides the meta tag when both exist. Without a declaration, browsers guess, and guessing fails on mixed content. Chrome uses a statistical guesser; results vary.

Binary and hexadecimal representation

Binary represents data as 0s and 1s. Each character in ASCII has an 8-bit (or 7-bit) binary form. "A" is 01000001. Binary is verbose but explicit. Useful for teaching, debugging low-level protocols, or understanding bitwise operations. When you need to see exactly how data is stored at the bit level, binary view is the clearest.

Hexadecimal (hex) uses 0-9 and A-F. Each hex digit represents 4 bits, so two hex digits represent one byte. "A" in ASCII is 0x41. Hex is more compact than binary and easier to read. Hex dumps, MAC addresses, and color codes (e.g., #FF5733) use hex. CSS hex colors omit the leading 0x: #FFFFFF for white. Some APIs expect hex without a prefix; others use 0x. Check the documentation.
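Python's format() and the bytes API cover both views:

```python
byte = ord("A")               # 65
print(format(byte, "08b"))    # '01000001': one byte as 8 binary digits
print(format(byte, "02X"))    # '41': one byte as two hex digits
print(b"Hi".hex())            # '4869': hex dump of a byte string
print(bytes.fromhex("4869"))  # b'Hi': and back again
```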

Our text to binary converter and hexadecimal text converter handle these conversions. They are helpful when you need to inspect raw byte representation or prepare data for systems that expect hex.

Morse code and ROT13 are different beasts. Morse code maps letters to dots and dashes; it is an encoding in the broad sense but not a binary or character encoding. ROT13 shifts letters by 13 positions. Neither provides security. Both are reversible by anyone. Use for obfuscation (e.g., hiding spoilers) or education, not for confidential data.
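ROT13 ships with Python as a codec; applying it twice restores the input:

```python
import codecs

hidden = codecs.encode("Spoiler ahead", "rot_13")
print(hidden)                           # 'Fcbvyre nurnq'
print(codecs.encode(hidden, "rot_13"))  # 'Spoiler ahead': ROT13 is its own inverse
```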

Note: binary and hex represent the numeric values of bytes. They do not change the underlying data. The same byte sequence can be viewed as binary, hex, or (if it forms valid text) as characters in a given encoding.

When to use which encoding

Situation and recommended encoding:
Binary data in JSON or XML: Base64
Query string parameters, URL path segments: URL encoding
User content in HTML: HTML entities
General text storage and transmission: UTF-8
Debugging, protocol analysis: Binary or hex

For a concise overview, see our shorter text encoding explained article. For hands-on work, browse all encoding and conversion tools or the encoding complete guide.

Encoding in APIs and databases

REST APIs typically expect UTF-8 in request bodies. Set Content-Type: application/json; charset=UTF-8. For form data, use application/x-www-form-urlencoded or multipart/form-data; both support UTF-8. When constructing query strings programmatically, encode each parameter value with encodeURIComponent() (JavaScript) or the equivalent in your language. Never concatenate user input into URLs without encoding.
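Python's urlencode() builds a whole query string safely; the host name below is illustrative:

```python
from urllib.parse import urlencode

params = {"q": "café & bar", "page": 2}
query = urlencode(params)
print(query)  # q=caf%C3%A9+%26+bar&page=2
# The URL structure stays literal; only the values were encoded.
url = "https://api.example.com/search?" + query
```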

Databases have their own encoding. MySQL's utf8mb4 supports full Unicode including emoji; the older utf8 is limited to 3-byte characters. PostgreSQL defaults to UTF-8. When migrating data, ensure the source and target encodings match. The character encoding converter can help diagnose and convert between charsets.

Common pitfalls

Encoding the wrong scope: encoding an entire URL instead of only its dynamic parts. Encoding a full URL can turn :// into %3A%2F%2F and break the scheme. The base URL structure (scheme, host, path delimiters) should stay literal; only parameter values and path segments that contain special characters need encoding.

Assuming UTF-8: legacy systems, file uploads, and some APIs may use Latin-1, Windows-1252, or another encoding. Check Content-Type headers, BOMs, and documentation. When reading files, specify the encoding explicitly. Python's open() has an encoding parameter; omitting it uses the system default, which differs across platforms.

Double encoding: encoding twice produces %2520 instead of %20. Decoding twice may fix it, but better to encode once. Similarly, HTML-escaping already-escaped content can turn & into &amp;amp;. Be aware of your pipeline: if a framework auto-escapes, do not pre-escape.

Mixing encodings: a page declared as ISO-8859-1 that includes UTF-8 content from an API will display mojibake. Standardize on UTF-8 everywhere. Databases, file systems, and external services may need explicit configuration to use UTF-8.

Truncation in the middle of a multi-byte character: UTF-8 uses 2–4 bytes for non-ASCII. Cutting a string at an arbitrary byte offset can split a character. Always truncate on character boundaries, or use a library that handles grapheme clusters if you care about emoji and combining marks.
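The failure mode and the fix are both visible in a few lines of Python:

```python
text = "naïve"
raw = text.encode("utf-8")   # b'na\xc3\xafve': ï is two bytes

try:
    raw[:3].decode("utf-8")  # byte slice cuts ï in half
except UnicodeDecodeError as err:
    print("broken:", err.reason)

print(text[:3])              # 'naï': slicing the str keeps characters whole
```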

Confusing encoding with hashing: Base64 is reversible; hashes (SHA-256, bcrypt) are not. Do not use Base64 to "encrypt" passwords. Use proper password hashing. For file integrity, use checksums. The checksum generator produces hashes; that is a different operation from encoding.
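The asymmetry is clear side by side:

```python
import base64
import hashlib

secret = b"hunter2"
print(base64.b64decode(base64.b64encode(secret)))  # b'hunter2': fully reversible
digest = hashlib.sha256(secret).hexdigest()
print(digest)                                      # one-way; there is no decode step
```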

Tools and workflow

The encoding and conversion tools section includes dedicated utilities for each format. The Base64 encode/decode tool handles both directions with optional URL-safe output. The URL encode/decode tool processes query strings and full URLs. For HTML, use HTML entities encode/decode. The character encoding converter switches between UTF-8, UTF-16, ASCII, and other charsets. Text to binary and hexadecimal text converter provide low-level views. All processing runs in the browser; no data is sent to servers.

