String Encodings? Unicode? ASCII? UTF-8? ãéîöû?

Four years into my career as a Software Engineer, I had managed to put off the topic of character encoding and still sleep peacefully every night, until last night. A piece of Python code that I had written broke with a UnicodeEncodeError. My reaction was similar to the title of this post. Last night's sleeplessness was not because I had broken code, but because of why I had put off something that is needed every time I work with strings.

ASCII: The time when strings used to be simple!

Just like aliens were only sighted in America, computers too were sighted only in America (or at least only in English-speaking countries).

To deal with strings, only English characters were standardised, as ASCII, and each was mapped to a number between 32 and 127. The first 32 numbers were mapped to control characters (number 7 made your computer beep). Since computers worked with an 8-bit word size then, characters could easily be represented in 1 byte and there would still be 128 numbers left over.
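For example, Python's built-in ord and chr show this mapping directly (a quick illustration):

    >>> ord('A'), chr(65)
    (65, 'A')
    >>> chr(7)   # the ASCII "bell" control character that made the computer beep
    '\x07'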

But these were just English characters. They did not include any accented characters like ã or é. It was difficult to show French or Spanish strings, let alone a language with a different script.

Software developers started using the extra 128 numbers (the numbers after 127) to represent accented characters or characters of other languages. The IBM PC had something called the OEM character set, which was ASCII up to 127 but had accented European characters and various line-drawing characters for the numbers after 127. While this was smart, it caused chaos with the emergence of the Internet: 130 could mean two different things on two different systems. As it happened, when Americans sent their résumés to an Israeli computer, they would show up as rגsumגs.

The ANSI standard was created to deal with the various interpretations of these 8 bits. While 0-127 still mapped to control characters and English characters, the meaning of the remaining 128 numbers was interpreted using code pages. Code pages were different systems, each of which assigned its own meaning to these numbers. Some examples are here.
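Python still ships codecs for many of these code pages, so the résumé mishap is easy to reproduce (a small illustration using the cp437 and cp862 codecs):

    >>> b"\x82".decode("cp437")   # byte 130 on a US IBM PC
    'é'
    >>> b"\x82".decode("cp862")   # the same byte on an Israeli PC
    'ג'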

This effort of standardising the remaining 128 numbers per region/language was still not enough. Some languages had more than 1000 characters, and they would in no way fit in 8 bits. There were complex ways to handle this, but they weren't standard and were not a solution for a world that was now going to be on the Internet.

Unicode: An effort to capture all the characters of all the languages in the world.

Unicode was a new concept for working with strings. A character is a real-world entity that a language has, and it is the building block for words.

Unicode has code points to build strings. The code point of A is the same regardless of the font (A in Verdana is still the same code point), but A and a have different code points. Each real-world character is mapped to a Unicode code point.
A code point is represented by a hexadecimal number in the following way:
A = U+0041
B = U+0042
This way there is practically no limit to the number of code points we can have.
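In Python, ord and hex make these code points visible, and the \u escape goes the other way (just an illustration):

    >>> hex(ord('A'))
    '0x41'
    >>> '\u0041'
    'A'
    >>> hex(ord('ã'))   # accented characters get their own code points
    '0xe3'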

Encodings

The earliest idea for representing code points in systems was to use 2 bytes (16 bits), i.e. to encode each code point into 2 bytes. So Hello would be represented as:
00 48 00 65 00 6C 00 6C 00 6F
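In Python terms this corresponds to the UTF-16-BE codec, and a quick check reproduces the bytes above (illustrative only; bytes.hex with a separator needs Python 3.8+):

    >>> "Hello".encode("utf-16-be").hex(" ")
    '00 48 00 65 00 6c 00 6c 00 6f'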

Many did not adopt Unicode right away. Programmers who dealt only with English text were well off with their ASCII and needed only half the space compared to 2 bytes per character in Unicode.

This Unicode encoding also had to be handled properly across big-endian and little-endian systems, as Hello could also be represented as
48 00 65 00 6C 00 6C 00 6F 00
A solution to this was to have every encoded string start with FE FF (a byte order mark). If an encoded string started with FF FE instead, it meant that the bytes in every pair had to be swapped while reading.
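Python exposes these byte order marks in the codecs module, and its utf-16 codec honours them when decoding (a small sketch):

    >>> import codecs
    >>> codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE
    (b'\xfe\xff', b'\xff\xfe')
    >>> b"\xfe\xff\x00H\x00i".decode("utf-16")   # big-endian bytes behind the BOM
    'Hi'
    >>> b"\xff\xfeH\x00i\x00".decode("utf-16")   # little-endian bytes behind the BOM
    'Hi'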

UTF-8: Best of both worlds?

Unicode did manage to capture all the characters out there in the world, but it was not convenient for everyone. ASCII was prevalent and space-efficient for people who worked only in English.

The Unicode encoding above turned every code point into exactly two bytes, whereas UTF-8 is an encoding that uses 1 byte, 2 bytes, 3 bytes or more per code point. This lets the most frequently used code points (the English characters, as per ASCII) be encoded in 1 byte (values 0 to 127) and the others in more than 1 byte.

So anybody who wanted a string in a non-English language could still go with the standard UTF-8 encoding at the price of some extra space, while anybody stuck in the past, who doesn't care about internationalizing their apps, or who can't afford to waste an extra byte per character, can use UTF-8 as if it were just ASCII for English characters.
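A quick illustration of the variable length, and of UTF-8 being byte-for-byte identical to ASCII for English text:

    >>> [len(ch.encode("utf-8")) for ch in "Aé€😀"]
    [1, 2, 3, 4]
    >>> "Hello".encode("utf-8") == "Hello".encode("ascii")
    True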

The chart below shows the rise of UTF-8 since then:

(Image: Utf8webgrowth.svg, by Chris55, own work, CC BY-SA 4.0)

New way of thinking about strings

So now we know about encodings and why we need them. It is really important to start thinking about strings in the following manner:

  • There is no such thing as a plain string.
  • A string is always identified by its encoding: ASCII, UTF-8, UTF-16, etc. (see the small sketch after this list).
  • A character does not exist in the computer's world.
  • A code point is the building block for strings.
  • A character is represented by an encoding of a code point.
  • A code point can be encoded into 1, 2, 3, 4, 5 or 6 bytes.
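For example, the same five characters from this post's title become entirely different bytes depending on the encoding (a small sketch; latin-1 is used here just as one representative of the old 8-bit code pages):

    >>> s = "ãéîöû"
    >>> s.encode("utf-8")
    b'\xc3\xa3\xc3\xa9\xc3\xae\xc3\xb6\xc3\xbb'
    >>> s.encode("utf-16-be")
    b'\x00\xe3\x00\xe9\x00\xee\x00\xf6\x00\xfb'
    >>> s.encode("latin-1")
    b'\xe3\xe9\xee\xf6\xfb'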

String encodings in computer languages

Python

Things were pretty confusing in Python 2, as there were multiple default encodings depending on the type of data. Check this Stack Overflow answer for details.

Since Python 3.0, the language’s str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode. The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal.

Since encoded strings are just bytes, Python provides a decode method on bytes to create a string from encoded bytes; you specify the encoding's name as an argument to decode.

Similarly, there is an encode method on str to convert a Unicode string to bytes in Python.
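For example (and the UnicodeEncodeError that started this post shows up as soon as an encoding cannot represent a character):

    >>> "résumé".encode("utf-8")
    b'r\xc3\xa9sum\xc3\xa9'
    >>> b'r\xc3\xa9sum\xc3\xa9'.decode("utf-8")
    'résumé'
    >>> "résumé".encode("ascii")
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)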

Golang

A string in Golang is a slice of bytes (i.e. the encoded form). If one indexes or loops over a string position by position, it happens byte by byte, and remember that a code point can be encoded into more than one byte.

But Golang does not talk about strings directly in terms of Unicode code points; it uses something called a "rune". rune is an alias for int32, given its own name to make it clear that the value represents a character rather than an ordinary number. A rune is equivalent to a code point.

So in order to iterate character by character over a string in Golang, one should iterate rune by rune. This can be done using for-range or utf8.DecodeRuneInString, as in the sketch below.
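A minimal sketch of the difference (byte length vs. rune count, and rune-by-rune iteration with for-range):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "héllo" // "é" takes two bytes in UTF-8

        fmt.Println(len(s))                    // 6 -- length in bytes
        fmt.Println(utf8.RuneCountInString(s)) // 5 -- length in runes (code points)

        // for-range walks the string rune by rune, yielding the
        // starting byte index and the decoded rune.
        for i, r := range s {
            fmt.Printf("%d: %c (%U)\n", i, r, r)
        }
    }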

Source code in Go is defined to be UTF-8 text; no other representation is allowed.

There is a detailed official blog post about this.


References

  1. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  2. Unicode HOWTO
  3. Strings, bytes, runes and characters in Go
