When coding, we often see encoding names in our source code, such as UTF-8 or GB2312. Do you know what these encodings mean and why we need them? In this post, Julián Solórzano will introduce the most widely used encoding specification in the world, one that accommodates all the different character sets out there.
UTF-8 is a method for encoding Unicode characters using 8-bit sequences. Unicode is a standard for representing a great variety of characters from many languages.
Back in the 1960s, the ASCII standard for encoding text was created. ASCII originally consisted of 128 characters, including lowercase and uppercase letters, numbers and punctuation, each one encoded using 7 bits.
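To see those 7-bit values for yourself, here is a quick sketch in Python (the language choice here is just for illustration):

```python
# Every character in plain English text has an ASCII value below 128,
# so each one fits comfortably in 7 bits (and therefore in a single byte).
for ch in "Hi!":
    print(ch, ord(ch), format(ord(ch), "07b"))
# H 72 1001000
# i 105 1101001
# ! 33 0100001
```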
Then came "extended ASCII" which used all 8 bits to accomodate for more characters like á, é, ü and so on. A lot of different code pages are used to account for those extra 128 character slots, like latin1, windows-1252, etc (i.e there is no unique correspondence chart for those extra 128 characters, it depends on region, language, operating system, etc).
It became apparent that neither 128 (7-bit) nor 256 (8-bit) slots were enough to represent such a large number of characters consistently, so Unicode was created as a standard to represent characters from nearly all writing systems. It currently defines more than 1,000,000 code points (written with the prefix "U+").

UTF-8 is a method for encoding these code points. A character in UTF-8 can be made up of one or more bytes. The encoding of the first 128 code points is identical to their ASCII counterparts, so they take a single byte. Higher code points are represented using more than one byte, and each continuation byte in a single character starts with a special bit sequence to signal that it still belongs to the same character.
Table from Wikipedia:

Bytes | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4
1     | U+0000           | U+007F          | 0xxxxxxx |          |          |
2     | U+0080           | U+07FF          | 110xxxxx | 10xxxxxx |          |
3     | U+0800           | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |
4     | U+10000          | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
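A quick way to see these different byte lengths in practice is to encode a few characters from different Unicode ranges, for example in Python:

```python
# Characters from different Unicode ranges need different numbers of UTF-8 bytes.
for ch in ("A", "á", "€", "😀"):
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} is {len(encoded)} byte(s): {hex_bytes}")
# U+0041 is 1 byte(s): 41
# U+00E1 is 2 byte(s): c3 a1
# U+20AC is 3 byte(s): e2 82 ac
# U+1F600 is 4 byte(s): f0 9f 98 80
```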
For example, the letter á is Unicode code point U+00E1, or 225 in decimal.
225 in binary is 11100001
Since 8 bits are needed to represent this number, we have to use 2 bytes to encode it in UTF-8 (only the first 128 code points, the ones that fit in 7 bits, use a single byte). So, using the table above as a reference, we can encode the letter á in UTF-8 like this:
11000011 10100001 (or C3 A1 in hexadecimal, as bytes are more commonly written)
The x bits (00011 followed by 100001) together spell out the number 225, while the fixed 110 and 10 prefixes are the bit patterns required by the encoding.
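If you want to double-check the bit arithmetic, a small Python sketch can build those two bytes by hand and compare them with what the built-in encoder produces:

```python
# Build the two UTF-8 bytes for code point U+00E1 (225) by hand.
code_point = 0xE1                               # 225 -> binary 11100001
byte1 = 0b11000000 | (code_point >> 6)          # 110xxxxx carries the top 5 bits: 00011
byte2 = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx carries the low 6 bits: 100001
print(f"{byte1:02x} {byte2:02x}")               # c3 a1
print(bytes([byte1, byte2]) == "á".encode("utf-8"))  # True
```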
This way, if you open a text file that contains the bytes c3 a1 and the program interprets the encoding as UTF-8, you will see an á. If, however, the program thinks the encoding is latin1 or something similar, you will instead see whatever c3 and a1 mean in that code page, i.e. you will see two characters, in this case Ã¡.
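You can reproduce this effect by decoding the same two bytes with both encodings, for example in Python:

```python
# The same two bytes interpreted with two different encodings.
data = bytes([0xC3, 0xA1])
print(data.decode("utf-8"))    # á   (one character built from two bytes)
print(data.decode("latin-1"))  # Ã¡  (two separate one-byte characters)
```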