RubanTools

Unicode Character Lookup

Paste a character or type text to get the code point, UTF-8 bytes, HTML entity, CSS escape, and more for every character.


Unicode FAQ

Unicode is a universal character encoding standard that assigns a unique code point (a number) to every character across all writing systems: over 149,000 characters covering 161 scripts as of Unicode 15. Code points are written as U+ followed by four to six hexadecimal digits, for example U+0041 for "A" or U+1F600 for 😀. UTF-8, UTF-16, and UTF-32 are encodings that map these code points to bytes.
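As a rough illustration of the kind of lookup described here, the following Python sketch derives a few representations from a single character. The function name `describe` and the dictionary keys are hypothetical and chosen for this example only; this is not the code behind the tool.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return a few common representations of a single character.

    `describe` and its keys are illustrative names, not part of any tool or library.
    """
    cp = ord(ch)  # the Unicode code point as an integer
    return {
        # U+ followed by at least four uppercase hex digits
        "code_point": f"U+{cp:04X}",
        # Official character name, or a placeholder if the character has none
        "name": unicodedata.name(ch, "<unnamed>"),
        # UTF-8 bytes as space-separated hex pairs
        "utf8_bytes": ch.encode("utf-8").hex(" ").upper(),
        # Hexadecimal numeric character reference for HTML
        "html_entity": f"&#x{cp:X};",
        # CSS escape: backslash, hex digits, then a space that terminates the escape
        "css_escape": f"\\{cp:X} ",
    }


if __name__ == "__main__":
    for key, value in describe("€").items():
        print(f"{key}: {value}")
```

For the euro sign this gives the code point U+20AC, the UTF-8 bytes E2 82 AC, and the HTML reference `&#x20AC;`.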

UTF-8 uses 1 to 4 bytes per code point; ASCII characters take 1 byte, which keeps UTF-8 backward compatible with ASCII. It dominates the web. UTF-16 uses 2 or 4 bytes per code point and is used internally by Windows, Java, and JavaScript. UTF-32 always uses 4 bytes, which is simple but wastes memory. For HTML, JSON, and files, UTF-8 is the standard choice.
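The size differences can be checked with a small Python sketch. The helper name `encoded_sizes` is made up for this example; the little-endian codec variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding.

    `encoded_sizes` is an illustrative helper, not a standard function.
    The "-le" codecs are used because the plain "utf-16" and "utf-32"
    codecs prepend a byte order mark, which would inflate the counts.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in ("A", "€", "😀"):
        print(repr(sample), encoded_sizes(sample))
```

"A" takes 1, 2, and 4 bytes respectively; "€" takes 3, 2, and 4; "😀" takes 4 bytes in all three encodings.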

Characters outside the Basic Multilingual Plane (U+10000 and above, such as most emoji) require two 16-bit code units in UTF-16, called a surrogate pair. The high surrogate (U+D800–U+DBFF) and low surrogate (U+DC00–U+DFFF) combine to encode the full code point. JavaScript strings use UTF-16 internally, so such a character has a string length of 2.
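The surrogate pair arithmetic can be sketched in Python as follows. The function names are hypothetical; the calculation subtracts 0x10000 from the code point, then puts the top 10 bits of the result in the high surrogate and the bottom 10 bits in the low surrogate.

```python
from __future__ import annotations


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates.

    Illustrative helper; the name is not from any library.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(ord("😀"))
    print(f"U+{ord('😀'):X} -> {high:04X} {low:04X}")
```

For 😀 (U+1F600) the pair is D83D DE00, which matches the bytes produced by Python's built-in `utf-16-be` codec.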