Encoding
It’s always been sort of a mystery to me how characters are stored. I’ve always known it’s gonna be just 0s and 1s, but still, I guess I want to really understand it somehow.
This is a very good read on this subject.
This post is going to summarize my understanding.
ASCII
- ASCII is an encoding (more on that later); basically, it deals with how a computer stores a character.
- A character set is all the characters that an encoding can represent. ASCII defines 128 characters, which needs only 7 bits (2^7 = 128), but each character is normally stored in an 8-bit byte, which can hold 2^8 = 256 values.
- ASCII is not the only encoding out there, and it is not nearly enough to encode all the characters of all the languages in the world.
- ASCII only needed 128 values to represent what it was intended to (English letters, digits, punctuation and a number of control codes). Stored in 8 bits, it leaves the other 128 possible values unused.
- While ASCII is more than enough for English, imagine squeezing in a language that needs more characters. Many languages do, and different people came up with their own, mutually incompatible uses for the extra 128 values.
- Eventually, there are languages that have more than 256 characters by themselves.
This is when Unicode comes into play.
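To make the ASCII points above a bit more concrete, here is a small Python sketch (Python is just my choice for illustration). It checks that ASCII values fit in 7 bits, that a non-English character cannot be encoded as ASCII, and that the same byte above 127 reads differently under two legacy 8-bit encodings. The helper name `fits_in_ascii` is made up for this example.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character has a code below 128 (fits in 7 bits)."""
    return all(ord(ch) < 128 for ch in text)


# Every ASCII character maps to a number from 0 to 127.
hello_codes = [ord(ch) for ch in "Hello"]
hello_bits = [format(code, "07b") for code in hello_codes]  # 7 bits each

# 'é' is outside ASCII, so encoding it as ASCII raises an error.
try:
    "café".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same byte above 127 means different things in different 8-bit encodings.
high_byte = bytes([0xE9])
as_latin1 = high_byte.decode("latin-1")  # Western European code page
as_cp1251 = high_byte.decode("cp1251")   # Cyrillic code page

print(hello_codes, hello_bits)
print(fits_in_ascii("Hello"), fits_in_ascii("café"), ascii_failed)
print(as_latin1, as_cp1251)
```

The last two lines are the "everyone invented their own use for the extra values" problem in miniature: one byte, two different letters, depending on which code page you assume.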
Unicode
- Unicode maps each character to a number called a code point (a Unicode-ish concept). Its code space holds 1,114,112 code points (U+0000 to U+10FFFF), so over a million.
- Only a fraction of those code points are assigned so far, so in practice Unicode has plenty of room for every writing system.
- Unicode is not an encoding but a standard that maps characters to code points.
Unicode solves these problems:
- If everyone agrees on the character-to-code-point mapping, everybody understands that U+0048 U+0065 U+006C U+006C U+006F is Hello, no matter where they come from.
- You just invented a new language? No problem, let’s just add it to the Unicode standard, since it still has lots of unassigned code points.
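Here is a rough Python sketch of the character-to-code-point mapping, using the built-in `ord` and `chr`. The helper names `to_code_points` and `from_code_points` are hypothetical, invented for this example; the U+XXXX formatting follows the notation used above.

```python
def to_code_points(text: str) -> str:
    """Format each character as U+XXXX (at least four hex digits)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def from_code_points(notation: str) -> str:
    """Turn a string like 'U+0048 U+0065' back into text."""
    # token[2:] drops the leading 'U+' before parsing the hex number.
    return "".join(chr(int(token[2:], 16)) for token in notation.split())


encoded = to_code_points("Hello")
decoded = from_code_points("U+0048 U+0065 U+006C U+006C U+006F")
emoji = to_code_points("\U0001F600")  # a code point well beyond U+FFFF

print(encoded)
print(decoded)
print(emoji)
```

Note that nothing here says anything about bytes yet: a code point is just a number, and how that number ends up in memory is a separate question.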
But how are Unicode characters stored in the computer? That’s handled by an encoding.
Encoding
- Computers don’t understand or store characters. They only work with bits - 0 or 1.
- When we store a character, we are actually storing a sequence of 0s and 1s, according to the encoding that is used to encode the character.
- An encoding is simply a mapping of sequences of bits to characters.
- Unicode is usually encoded using UTF-8, a variable-width character encoding that uses a different number of bytes (one to four) for different characters (see more here).
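As a quick illustration of the points above, this Python sketch uses the standard `str.encode` and `bytes.decode` methods to show the bits that actually get stored, how many bytes UTF-8 spends on different characters, and what happens when bytes are decoded with the wrong encoding. The helper `utf8_bits` and the sample characters are my own picks for the example.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated bit strings."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))


# UTF-8 is variable-width: different characters take 1 to 4 bytes.
samples = ["A", "é", "€", "\U0001F600"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# The stored bytes only make sense if you decode with the same encoding.
raw = "é".encode("utf-8")
right = raw.decode("utf-8")    # the original character
wrong = raw.decode("latin-1")  # two unrelated characters instead

for ch in samples:
    print(repr(ch), widths[ch], utf8_bits(ch))
print(repr(right), repr(wrong))
```

The `wrong` value is the classic garbled-text effect you see when a page is read with a different encoding than it was written in.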