How Unicode is handled in Elixir

Recently, I stumbled upon an article by Nikita Prokopov (Tonsky), "The Absolute Minimum Every Software Developer Must Know About Unicode in 2023". It's a really cool writeup on the peculiarities of Unicode and its encodings, and on how various languages provide (or, more often, don't provide) proper support for them. It also shows how many languages and platforms still can't operate on Unicode strings correctly without the help of third-party libraries. I was curious where Elixir stands in the author's opinion and found this bit in the article:

UPD: Erlang/Elixir seem to be doing the right thing, too.

So, apparently, despite an initially critical evaluation, it seems to be all fine! And well, it is, as long as you stick to the String module for string-specific operations!

iex(2)> s = "🚵🏻‍♀️"
"🚵🏻‍♀️"
iex(4)> String.length(s)
1
iex(5)> String.slice(s, 1, 0)
""
iex(6)> String.slice(s, 0, 0)
""
iex(7)> String.slice(s, 0, 1)
"🚵🏻‍♀️"
iex(8)> s = "👨‍🏭"
"👨‍🏭"
iex(9)> String.length(s)
1
iex(10)> String.slice(s, 0, 1)
"👨‍🏭"
iex(11)> String.slice(s, 1, 1)
""
iex(12)> String.equivalent?("Å", "Å")
true
iex(13)> "Å" == "Å"
true
iex(14)> "Å" === "Å"
true
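The comparisons at the end of that session get more interesting when the two strings look identical but are encoded differently. Here is a minimal sketch, assuming one string uses the precomposed codepoint U+00C5 and the other uses "A" followed by U+030A (COMBINING RING ABOVE). The escapes and variable names are picked for illustration and are not taken from the session above.

```elixir
# Two canonically equivalent spellings of "Å" (illustrative values).
precomposed = "\u00C5"   # single codepoint U+00C5
decomposed = "A\u030A"   # "A" followed by U+030A COMBINING RING ABOVE

# == compares the underlying bytes, so the two forms are not equal.
IO.inspect(precomposed == decomposed)
# => false

# String.equivalent?/2 compares the strings after canonical decomposition.
IO.inspect(String.equivalent?(precomposed, decomposed))
# => true

# Normalizing both sides to the same form makes == agree as well.
IO.inspect(String.normalize(precomposed, :nfc) == String.normalize(decomposed, :nfc))
# => true
```

In other words, `==` is only a safe equality check for text once both sides are known to be in the same normalization form.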

Funnily enough, multi-codepoint emojis display as multiple characters in my terminal, but that seems to be down to lacking Unicode support in the terminal itself, not in Elixir per se.

One important thing to note here: always operate on strings at the grapheme cluster level, not on codepoints or bytes (unless that is the explicit intent and you know what you are doing). That's what String.graphemes/1 is meant for.
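To make the three levels concrete, here is a small sketch that looks at the same string as bytes, codepoints, and graphemes. It assumes a recent Elixir version with emoji ZWJ sequence support, and it spells the factory worker emoji with escapes so the individual codepoints are visible.

```elixir
# U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F3ED FACTORY
s = "\u{1F468}\u200D\u{1F3ED}"

# Bytes: the UTF-8 encoding takes 4 + 3 + 4 bytes.
IO.inspect(byte_size(s))
# => 11

# Codepoints: three separate Unicode scalar values.
IO.inspect(length(String.codepoints(s)))
# => 3

# Graphemes: one user-perceived character.
IO.inspect(length(String.graphemes(s)))
# => 1

# Cutting at a byte offset can split a codepoint and leave invalid UTF-8.
IO.inspect(String.valid?(binary_part(s, 0, 2)))
# => false

# String functions cut on grapheme boundaries instead.
IO.inspect(String.slice(s, 0, 1) == s)
# => true
```

Functions such as String.length/1 and String.slice/3 count graphemes, while byte_size/1 and binary_part/3 work on raw bytes, so mixing the two kinds of offsets is where things tend to go wrong.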

All in all, Elixir definitely treats modern Unicode as a first-class citizen and Does the Right Thing™.