Tech
April 26, 2022
We released a blazingly fast Aho-Corasick implementation, written in Haskell, in 2019. This implementation was based on UTF-16 strings, since Haskell's text library uses that for its internal string representation. However, the most recent major update of text changed its internal string representation from UTF-16 to UTF-8. This is good news for us, since most of our customer’s data is ASCII, this update will cut our memory consumption in half. The big problem though is that our highly optimized string search library alfred-margaret assumes that its input is encoded as UTF-16 and uses that assumption to cut a few corners to improve performance. In this post we will illustrate the challenges we encountered implementing UTF-8 support in alfred-margaret and also give some insights into how we optimized our Haskell code for maximal performance.
Before we get started, we will have to specify some of the terms used in this post. As programmers, we usually use the term character differently than our users. Unicode provides us with terms that allow us to be more specific: User-perceived characters may be represented as multiple abstract characters. A visual example for this is the user-perceived character ✌🏿, consisting of the following two abstract characters:
U+270C VICTORY HAND
✌U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
🏿Similarly, the user-perceived character “x̣” consists of these abstract characters:
U+0078 LATIN SMALL LETTER X
xU+0323 COMBINING DOT BELOW
◌̣1When writing programs, we usually work with abstract characters and leave the user-perceived characters to the user’s GUI/font rendering system (of course there are counterexamples, e.g. the character count of a tweet). Abstract characters include simple letters, but also combining marks such as U+0323 COMBINING DOT BELOW
above and control characters such as U+000A LINE FEED (LF)*
. A code point is a value in the Unicode code space which ranges from 0 to 0x10FFFF. Unicode 14.0.0 (from September 2021) maps 144,697 of those 1,114,112 code points to abstract characters. We will use the notation U+XXXX
for code points, where XXXX
is a hexadecimal number. In order to avoid confusion, we will use the term code point in this post to refer to abstract characters. For further information, you can check out the (surprisingly readable) Unicode standard itself or the website/manifest UTF-8 Everywhere, which has a section discussing the different concepts one might associate with the term "character".
Other terms defined in Unicode that we will use are:
U+00DF LATIN SMALL LETTER SHARP S
: toCasefold(ß) = ss ≠ ß = toLowercase(ß)
U+AB7B CHEROKEE LETTER GU
: toCaseFold(Ꭻ) = Ꭻ ≠ ꭻ = toLowercase(Ꭻ)
In Channable’s Haskell codebase, we use the text package for string operations. This package provides a Text
type, which uses the UTF-16 encoding form to store Unicode strings in memory. For our use case, Text
is much more efficient than Haskell’s String, which is a linked list of boxed 32-bit integers.
In its recent update to version 2.0, the Text
type’s encoding was changed to UTF-8. This change has multiple advantages for us:
Text
values. This is a lot of overhead for what should essentially be a no-op.The second point is more important: It means that compared to UTF-16, where every code point takes at least two bytes to encode, we save a lot of memory! Consider this example:
The diagrams show the string "Beeblebröx" encoded using two different encoding forms: UTF-16 and UTF-8. The lowest row of both diagrams shows the code units used by the encoding in form. In UTF-16, a code unit is 16 bits; in UTF-8, it's 8 bits. For UTF-16, we have to keep in mind that endianness is important since code units span two bytes. Text
uses UTF-16 Little-Endian, meaning that the least significant byte goes first. The second row in the UTF-16 diagram shows the actual bytes of the encoded string.
This small example already shows an important point: Code points in the block Basic Latin (from U+0000
to U+007F
) only take up a single byte of memory in UTF-8 and two bytes in UTF-16. In total, "Beeblebröx" uses 20 bytes in UTF-16 and only 11 bytes in UTF-8.
U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
(the letter ö) is part of the second most used block Latin-1 Supplement (from U+0080
to U+00FF
). UTF-8 encodes this block as two bytes meaning we don’t gain or lose anything here compared to UTF-16. Less than 0.1% of the benchmark data set needs more bytes using UTF-8 than using UTF-16. Compared to the memory saved for Basic Latin code points, their contribution is tiny. In total, our benchmark data set weighs 3.08 GiB using UTF-16 and 1.55 GiB using UTF-8 (this excludes the 16 bytes used to store offset and length of the byte slice in each case). This means we save almost exactly 50% of memory for these strings by simply using another encoding.
When working with Text
values, we usually just import the Data.Text
module. This module acts as an interface which hides implementation details such as the encoding form and provides a view onto the string as a sequence of Char
, which is Haskell's built-in Unicode code point type. The upside of this abstraction is that it makes it easy for us to update to text-2.0
even though the underlying encoding changes from UTF-16 to UTF-8.
However, using this interface exclusively also has a downside: If the encoding used for these Char
s is variable-width, meaning that not all Char
s encode to the same number of bytes, random access into a Text
has a linear complexity. This example shows that both UTF-16 and UTF-8 are variable-width encodings:
In UTF-16, all valid code points in the range U+0000
to U+FFFF
are encoded as a single code unit consisting of two bytes. Code points outside that range such as the last one, U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
, need two code units and therefore take up four bytes. In UTF-8, a code point can be encoded as one, two, three or four single-byte code units, e.g. U+006F LATIN SMALL LETTER O
, U+00D7 MULTIPLICATION SIGN
, U+270C VICTORY HAND
and U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
, respectively.
Regardless of whether we use UTF-8 or UTF-16, in order to find the nth code point in a string, we need to do a linear search from its start. For example, in order to find out that the 8th code point in the UTF-8 string begins at byte 10, we need to iterate through "hello×" first and count the number of bytes for each code point.
In a previous article, we discuss our fast Haskell implementation of the Aho-Corasick string search algorithm. If you’ve already read that article, you can skip this section.
In order to conceptualize string search, we use the "Needle in a Haystack" idiom. We are searching for all occurrences of one or more strings (the needles) in another string (the haystack). The use case where Aho-Corasick shines is searching different haystacks for the same set of multiple needles of which the matches possibly overlap. In order to do this, we construct a finite state automaton where each state represents the progress we’ve made to a full match of one of the needles. Feeding this automaton the input string code point by code point is faster than simply searching for each needle separately because we only need to scrutinize each code point of the haystack once.
Consider this example matching the needles "shirts", "shorts" and "tshirt":
Suppose our haystack is "tshorts". Reading the code points "t", "s" and "h", we reach state 3. Because the next code point, "o", doesn’t match any transitions from state 3, we fall back to state 8, where there is a transition for "o", leading us to state 13. Reading the remaining code points "r", "t" and "s" we end up in state 16, having reached a match for the needle "shorts". This small example demonstrates that Aho-Corasick enables us to switch between needles without starting the search over. For the following sections, we assume that the code points used for transitions are in lowercase.
For our implementation of the Aho-Corasick algorithm in alfred-margaret we went past the external Char
-based Text
interface in order to be able to iterate through the input efficiently2. Instead, we accessed the underlying UTF-16 encoded byte array directly, viewing it as a 16-bit code unit sequence.
For an exact string search, comparing strings on the code unit level is sufficient. However, in order to power all necessary features in Channable, alfred-margaret provides functionality to not only match strings exactly, but also case-insensitively. A common approach for this type of problem is to simply run the case-sensitive search algorithm on a lowercase or case folded copy of the input. As we don’t want to keep what is essentially two copies of the same string in memory, our implementation lowercases each code point individually while looping through the input.
In order to do this, we first need to decode code points from our code unit sequence, because Unicode case conversions and case foldings are defined on code points, not code units. In UTF-16, the first 216 code points are encoded as a single 16-bit code unit containing the code point itself, shown on the left side of this diagram:
Other code points are encoded as two code units called a surrogate pair, shown on the right side. In order to tell apart a code unit that encodes a code point and the beginning of a surrogate pair (the high surrogate), Unicode reserves a range of code points that are to be used for surrogates only, namely U+D800
to U+DFFF
(in other words, the code units with the prefix 0b11011
). Surrogates may never appear outside a surrogate pair: A high surrogate must always be followed by a low surrogate. As an interesting aside, this means that UTF-16 can not actually encode all Unicode code points, while UTF-8 and UTF-32 can. However, such Unicode strings are considered ill-formed (see for example Unicode Section 3.9 D92).
A very useful property of UTF-16 is the fact that the code points which can be encoded as a single code unit (the first 216 code points) coincide with the Basic Multilingual Plane (BMP, from U+0000
to U+FFFF
) of Unicode. According to the introduction of the Unicode standard, the BMP encodes the "majority of the common characters used in the major languages of the world". Since the bulk of our customers' data falls into the BMP, we can get away with ignoring surrogates and lowercasing BMP code points only. This saves us the overhead of decoding multiple code units at a time. While it may seem like a questionable hack, the Unicode standard actually acknowledges this approach in Section 5.4 "Handling Surrogate Pairs in UTF-16":
UTF-16 enjoys a beneficial frequency distribution in that, for the majority of all text data, surrogate pairs will be very rare; non-surrogate code points, by contrast, will be very common. Not only does this help to limit the performance penalty incurred when handling a variable-width encoding, but it also allows many processes either to take no specific action for surrogates or to handle surrogate pairs with existing mechanisms that are already needed to handle character sequences.
Reflecting these facts we chose to treat the input as a sequence of fixed-width 16-bit code points, skipping surrogates. Consider how we would step the Aho-Corasick automation using as input the code unit sequence of the string "FOÖBAR🙈":
For each code unit, we determine whether it is a surrogate by checking if it is in the range U+D800
to U+DFFF
. If it isn’t, we interpret it as a code point and use its lowercase version to step the automaton (black arrows). For example, U+0046 LATIN CAPITAL LETTER F
lowercases to U+0066 LATIN SMALL LETTER F
. If the code unit is a surrogate, as in the case of the two code units U+D83D
and U+DB48
constituting U+1F648 SEE-NO-EVIL MONKEY
, we simply use it to step the automaton directly (gray arrows).
While the point of this post is not to explain our Aho-Corasick implementation, this should give you some context for the considerations we made when we updated alfred-margaret to work with UTF-8 strings. As opposed to most of our codebase, where we view a Text
value as an opaque sequence of Char
s, alfred-margaret depends on the encoding used for the internal byte slice.
So what’s the problem with switching to UTF-8? In a nutshell, using UTF-8 means that we can’t treat the input as a sequence of fixed-width BMP code points any more. Consider this example, where only code points consisting of a single code unit are converted to lowercase in the UTF-8-encoded string "FOÖBAR🙈":
This figure illustrates what happens if we just reuse the UTF-16 approach. In the case of UTF-8, only Basic Latin code points are encoded as a single code unit. While we can convert these, for any other code points such as U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS
above, we can’t, since it is encoded using two code units. At least on the BMP, the UTF-8 version should have the same lowercasing behavior as the UTF-16 version in order not to break compatibility. With UTF-16, BMP code points would always be encoded as two bytes; with UTF-8, BMP code points may be encoded as one, two or three bytes. This means that in order to stay compatible, we need to actually decode the UTF-8 sequence in order to lowercase code points on the fly.
If we want to stay compatible with the UTF-16 version, we need to lowercase BMP code points and leave any other code points as they are. Let's take a look at how the BMP is encoded in UTF-8.
Unlike UTF-16, UTF-8 does not use surrogates. Rather, depending on the range the code point falls into, it is encoded as one, two, three or four code units. The most significant bits (MSB) of the first code unit tell us how long this sequence is. This table from Unicode Section 3.9 visualizes how UTF-8 code points are encoded:
The first column shows the binary representation of a 21-bit code point in a specific range. We can read the table as:
How exactly the code units are composed is not relevant to our use case. What’s important is that, in order to know how many code units we need to read in order to read a full code point, we only need to scrutinize the first code unit of a code unit sequence:
Consider the UTF-8 encoding of the string "FÖ✓🙈":
0x46
is 0
. This means that the code point at position 0, U+0046 LATIN CAPITAL LETTER F
, is one code unit long.C3
at position 1 are 110
⇝ 2 code units.E2
at position 3 are 1110
⇝ 3 code units.F0
at position 6 are 11110
⇝ 4 code units.Code units that are not at the beginning of a code point always have the MSBs 10
, which means that it is trivial to skip to the previous or next code point from any position in a UTF-8 encoded string. Ultimately we need to decode any code points consisting of up to three code units in order to decode all BMP code points.
Knowing all this, we assumed that decoding every input code point would add a lot of processing overhead. In order to avoid this, we tried to change the way the Aho-Corasick automaton is built instead. Rather than lowercasing code points on the fly, we would add two transitions for each needle code point to the automaton: One for its lowercase version and another for its uppercase version. The advantages of this idea are:
The following example illustrates this using the needle "Größe" (the German word for size, appearing quite often in the context of E-Commerce). Encoded using UTF-8, it becomes the code unit sequence 0x47 0x72 0xC3 0xB6 0xC3 0x9F 0x65
. This Aho-Corasick automaton matches that sequence:
Whenever we read a character without a designated transition at any state, we fall back into state 0. To show how we can extend the automaton, consider transitioning from state 2 to state 4 in this automaton. This transition means "we have read the code point ö (code units 0xC3 0xB6
)". We can extend this automaton by adding extra transitions and intermediate states where necessary to make it mean "we have read either the code point ö (0xC3 0xB6
) or the code point Ö (0xC3 0x96
)":
We could now use the same stepping code we used for the previous automaton to match our needles case-insensitively, removing the on-the-fly lowercasing code entirely. This can’t be implemented in a way that stays compatible with the old BMP-based solution though. The reason is that for any given c :: Char
, the property
(isUpper c || toUpper (toLower c) == c)
&& (isLower c || toLower (toUpper c) == c)
doesn’t necessarily hold. Here are a few examples:
U+1E9E LATIN CAPITAL LETTER SHARP S
, the uppercase "Eszett"): Char.toLower 'ẞ' == 'ß'
(U+00DF LATIN SMALL LETTER SHARP S
) but Char.toUpper 'ß' == 'ß'
again. This means that the UTF-16 version would match "GRÖẞE", but the new approach would not3.U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
). In Haskell, Char.toLower 'İ' == 'i'
(U+0069 LATIN SMALL LETTER I
) and Char.toUpper 'i' == 'I'
(U+0049 LATIN CAPITAL LETTER I
). The letter is also one of the few case conversions that are locale-sensitive; in a turkish locale, Char.toUpper 'i' == 'İ'
holds, breaking the property for "I".U+212A KELVIN SIGN
) to "k" (U+006B LATIN SMALL LETTER K
) to "K" (U+004B LATIN CAPITAL LETTER K
). In total, there are around 30 such code points in Unicode.An avid reader and Unicode buff will be pulling their hair out by now. That's because the official way to do case-insensitive string comparisons is to apply Unicode case folding to both the needles and the haystack instead of applying a case conversion to a specific case. There are a few reasons alfred-margaret does not do this:
Char -> Char
mapping but Char -> [Char]
mapping. This makes it a bit harder to construct the automaton since it adds a few edge cases.Unicode defines a simple case folding algorithm that is a Char -> Char
mapping solving at least the first problem. It is also mostly compatible with lowercasing. However, Haskell's base
package does not implement it. In an internal Hackathon, we hacked together an implementation but didn't have enough time to polish it to a point where it's usable.
Finally we discarded this approach. While nice in theory, the subtle bugs it certainly would’ve caused down the road weren’t worth it in practice. However, we will probably examine this again at some point in the future.
At this point we caved and decided to just decode the UTF-8 input into code points for lowercasing. In order for this to work we’d have to change the automaton construction: So far, a single transition in our Aho-Corasick state machine has the following representation in memory using a 64-bit unsigned integer:
Each transition has a code unit that it activates on as well as the state it leads to. When the wildcard bit is set, the transition activates on every code unit. Each state is represented by an array of these transitions. The last transition in this array always has the wildcard bit set and leads to a fallback state. To illustrate this, let's loop back to the "shirts", "shorts" and "tshirts" example:
Using the layout above, states 3 and 8 would be represented like this:
State 3 has two transitions: Transition 0 leads to state 0b100
(= 4) with code unit 0b1101001
(= "i"). Transition 1 is a wildcard transition that falls back into state 0b1000
(= 8). State 8 has three transitions: Transitions 2 and 3 leads to states 9 and 13 on the code units corresponding to "i" and "o" respectively, and transition 4 is a wildcard transition that back into state 0.
Because the UTF-8 code uses code points rather than code units for transitions, we had to change this layout just a little bit, increasing the number of bits available for the automaton input:
Other than that, the layout stays the same. In Haskell, the code that handles transitions depends on a few type aliases:
type Transition = Word64
type State = Word32
type CodeUnit = Word16
type CodePoint = Char
Those make the following function definitions a bit nicer to read. To work with transitions, we exclusively use these functions:
transitionCodeUnit :: Transition -> CodeUnit
transitionState :: Transition -> State
transitionIsWildcard :: Transition -> Bool
newTransition :: CodeUnit -> State -> Transition
newWildcardTransition :: State -> Transition
This means that changing the layout of the transitions is purely a matter of updating their implementation and changing CodeUnit
to CodePoint
. GHC helps us find any other places where we need to change the code by pointing out type errors where CodePoint
is to be expected according to the updated interface but CodeUnit
is given. Working backwards from there we updated the remaining code. Finally, we decided to handle code points encoded as 4 code units correctly instead of staying 100% true to the UTF-16 implementation, which ignored those since they were encoded using surrogates. In practice this doesn’t make much of a difference since most of our input data is in the BMP anyways.
By now it should be clear that we care a lot about the performance of alfred-margaret. Optimizing Haskell code can be a little tricky, so we added this section to show some of the things that you should watch out for.
Laziness is a neat tool for writing high-level code while also saving a few CPU cycles here and there by avoiding unnecessary evaluation. However, it can also get in your way if you’re not careful, as it can keep the compiler from applying specific optimizations. Let’s take a look at consumeInput
, the entry point of the automaton stepping code4:
consumeInput :: CodeUnitIndex -> CodeUnitIndex -> a -> State -> a
consumeInput !_offset 0 !acc !_state = acc
consumeInput !offset !remaining !acc !state =
let
...
in
followCodePoint (offset + codeUnits) (remaining - codeUnits) acc state casedCodePoint
Each parameter of this function is prefixed by an exclamation point. This syntax, enabled by the extension bang patterns, effectively allows us to give GHC a helping hand during demand analysis. Demand analysis (a GHC-specific term that includes strictness analysis) determines how a binding is used after its definition. Consider the following example:
f :: Int -> Int -> Int
f x y = case x of
0 -> y
_ -> x
In f
, the function parameter x
is always evaluated as it needs to be pattern-matched on. We call x
a strict parameter. Meanwhile y
is not used in every code path, making it a lazy parameter. Adding a bang pattern to y
forces GHC to evaluate y
whenever f
is entered, which means it will be treated as strict in any case. While GHC's demand analysis handles most cases quite well it is sometimes necessary to add these annotations manually when working with performance-sensitive code. You can find more information of demand analysis in the GHC commentary and this example-driven blog article.
Demand analysis is interesting to us because its results drive some optimizations in GHC, most notably the so-called worker/wrapper transformation. This transformation splits a function into a worker function that has an improved type and a wrapper function that observes the original type. In particular, the wrapper typically unboxes boxed arguments in order to reduce allocations and memory accesses.
Why is this so important if we are working with 32-bit Char
s anyways? The reason is that in GHC's implementation of Haskell, even simple values such as Char
s, Int
s or Bool
s are per default "boxed", i.e. they are pointers to some data on the heap. For example, in this square
function, x is actually a pointer to a heap-residing Int
object:
square :: Int -> Int
square x = x * x
This means that this function has to (at least) dereference x
, perform the multiplication and allocate a new heap object for the result. The worker-wrapper transformation saves us here by splitting this function into something akin to:
-- wrapper
square :: Int -> Int
square (I# x) = I# ($wsquare x)
-- worker
$wsquare :: Int# -> Int#
$wsquare x = x *# x
The worker function $wsquare
now uses unboxed types and operations only (those marked with #
) and performs no allocations. In the wrapper, you can read the pattern match (I# x)
as a dereference to an Int#
and the call to I#
as an allocation on the heap. Wherever the unboxed arguments to square
are available in the remaining codebase, GHC can now simply substitute for $wsquare
to minimize intermediate allocations.
Let’s backtrack to alfred-margaret now and check out the real-world implications. What happens to consumeInput
when we leave GHC’s demand analysis to its own devices and remove some of those bang patterns?
consumeInput :: CodeUnitIndex -> CodeUnitIndex -> a -> State -> a
consumeInput _offset 0 acc _state = acc
consumeInput !offset !remaining !acc !state =
let
...
in
followCodePoint (offset + codeUnits) (remaining - codeUnits) acc state casedCodePoint
In this piece of code, the bang patterns in the first branch of consumeInput
have been removed. We can make GHC dump the results of demand analysis by passing the -ddump-stranal
option. GHC then prints out an intermediate representation of our program in the Core language, which is a glorified lambda calculus that it uses for optimizations specific to functional languages. In the dumped Core, the declaration of consumeInput
's worker looks like this:
consumeInput [...]
:: Int -> Int -> a -> Int -> a
[...
Str=<L,_><S,_><S,_><L,_>,
...]
CodeUnitIndex
and State
have been resolved to Int
. The Str
field in consumeInput
's metadata gives us the results of demand analysis: In order, it lists strictness and usage information (which is not important for our case and therefore elided) for each of the function's parameters. While remaining
and acc
are used in both branches of consumeInput
, offset
and state
are not and are therefore treated as lazy parameters. To see the results of the worker/wrapper transformation, we use -ddump-worker-wrapper
. The worker for consumeInput looks like this:
$wconsumeInput [...]
:: Int -> Int# -> a -> Int -> a
Because offset
and state
are lazily used, they must always be provided as boxed Int
s even to the worker function. consumeInput
is called with a different offset
for each code point in the input, so this implies a lot of small allocations, which is terrible for garbage collection! For this reason, we added a bang pattern to all patterns in the first branch as well:
consumeInput :: CodeUnitIndex -> CodeUnitIndex -> a -> State -> a
consumeInput !_offset 0 !acc !_state = acc
consumeInput !offset !remaining !acc !state =
...
The demand information and the worker declaration now look like this:
-- -ddump-stranal
consumeInput [...]
:: Int -> Int -> a -> Int -> a
[...
Str=<S,_><S,_><S,_><S,_>,
...]
-- -ddump-worker-wrapper
$wconsumeInput [...]
:: Int# -> Int# -> a -> Int# -> a
Because they are dubbed to be strict parameters, offset
and state
are unboxed Int#
s now as well! Overall, this three-character change cut our benchmark times in half and saved loads of allocations.
Basic functional programming courses often place a lot of emphasis on tail recursion – and for a good reason: Tail recursive function calls can be compiled into jumps, thereby avoiding the overhead that usually comes with a call. This is already described in the original alfred-margaret blog article in the section "From State Machines to Mutual Recursion".
Finally, we did another low-level optimization that had a surprisingly big impact. We represent a Aho-Corasick state machine using the following data structure:
data AcMachine v = AcMachine
{ machineValues :: !(Vector [v])
-- machineValues can’t be an UnboxedVector because we don’t have Unbox v
, machineTransitions :: !(UnboxedVector Transition)
, machineOffsets :: !(UnboxedVector Int)
, machineRootAsciiTransitions :: !(UnboxedVector Transition)
} deriving (Generic)
An UnboxedVector
(canonical name: Data.Vector.Unboxed.Vector
) is an array-based data structure that doesn't store pointers to its elements but the elements itself. In C, we'd declare uint64_t machine_transitions[]
. Using unboxed vectors provides a way to store those transitions from before neatly packed adjacently in memory. However, on a hunch, we replaced them by a thin wrapper around ByteArray
from the primitive
package:
newtype TypedByteArray a = TypedByteArray ByteArray
As it turns out, this simple change led to a speedup of around 15% for array reads:
We can see why by checking out the definition of UnboxedVector
:
data Vector a = Vector Int Int ByteArray
An UnboxedVector
is actually just a slice into a ByteArray
! This means that every time we read an element, we save an addition by simply using a ByteArray
instead. Since we know that our fields are not slices but occupy a whole ByteArray
which we don’t need to resize, this is a sensible choice. The TypedByteArray
wrapper made it easy to replace the type where necessary:
data AcMachine v = AcMachine
{ machineValues :: !(Vector [v])
, machineTransitions :: !(TypedByteArray Transition)
, machineOffsets :: !(TypedByteArray Int)
, machineRootAsciiTransitions :: !(TypedByteArray Transition)
} deriving (Generic)
In exchange for just a small amount of additional code, this gave us a noticeable performance boost not only in the microbenchmark above but also across our whole benchmark data set.
By using UTF-8 in our updated version of alfred-margaret, we now save close to 50% of memory when working with Text
values. However, it was also important to preserve the library's performance. To measure this we ran three versions of alfred-margaret on our benchmark data set of roughly 1.6 billion code points:
Text
and UnboxedVector
s.TypedByteArray
improvement.ByteArray
improvement.The task is the same for each version: Across 242 test cases, count the number of occurrences for a set of needles in a haystack. Of course we made sure that all versions deliver the same results. The distribution of code points across the whole benchmark looks like this:
Unicode Block | Share | Number of Code Points |
---|---|---|
Basic Latin | 99.83% | 1,652,853,274 |
Latin-1 Supplement | 0.14% | 2,314,640 |
Latin Extended-A | 0.01% | 242,392 |
Latin Extended-B | 0.00% | 746 |
Other Block, 2 UTF-8 Code Units | 0.00% | 4,114 |
Other Block, 3 UTF-8 Code Units | 0.02% | 371,955 |
Other Block, 4 UTF-8 Code Units | 0.00% | 0 |
Here’s how our candidates did; red bars indicate time needed for case-insensitive search and blue bars indicate time needed for case-sensitive search:
Overall, running time actually improved compared to the original UTF-16 implementation. Unsurprisingly, the UTF-8 implementation takes a bit longer than the improved UTF-16 implementation with the same TypedByteArray
improvement because it needs to decode each code point, even if it’s just a single code unit. One curious observation is that in spite of that, the UTF-8 implementation is actually faster than the improved UTF-16 implementation doing a case-insensitive search. At this point we haven’t found out why that is and are still investigating.
We showcased some internals of alfred-margaret and some approaches of moving it to UTF-8. Having done so enables us to move our codebase to text-2.0
, cutting our memory consumption for strings in half! Even though the UTF-8 version has additional decoding overhead, we managed to speed it up compared to the old UTF-16 version. All in all our journey makes it clear that it is absolutely possible to write low-level code in Haskell to tackle performance-sensitive problems. This is especially nice since Haskell offers us a vast amount of tools to hide this complexity and integrate it cleanly into high-level code.
The section Haskell Performance Tuning shows that it is crucial to have at least some understanding of what happens inside GHC though. It helps a lot to have colleagues well-versed in GHC internals and the tooling that is available in the Haskell ecosystem. I'd like to thank mine for their guidance and the opportunity to work on this project.
Moving all of Hackage to text-2.0
is likely going to take a while, since it contains a lot of open-source contributions. However, since the module Data.Text
is encoding-agnostic, most updates shouldn't need as much deliberation as this one.
1: The user-perceived character ◌̣ actually consists of multiple abstract characters as well, U+25CC DOTTED CIRCLE
and U+0323 COMBINING DOT BELOW
. It just looks nicer that way. ↩
2: Of course Text
s can be traversed using the Data.Text
module only, for example using any one of these Folds or iter
. However, both of these methods construct intermediate data structures on the heap and we don't want to rely on GHC to optimize those away. ↩
3: Unicode actually defines toUppercase(ß) = SS
. This is because U+1E9E LATIN CAPITAL LETTER SHARP S
is a very recent addition to the German writing system, officially established in 2017. In Germany, you will find a lot of street signs where "Straße" (street) is written "STRASSE", see for example this article. However, Haskell's implementations of toUpper
and toLower
have the type Char -> Char
, only ever mapping to single code points. They use the simple case conversion algorithms which Unicode provides exactly for this use case. ↩
4: Working with UTF-encoded strings, we have (at least) two ways to index a string: By code points and by code units. Using CodeUnitIndex
here makes it clear what kind of index we are dealing with: Although we ultimately want a CodePoint, we want to use the fixed-width code units to access the string in constant time. CodeUnitIndex
is a newtype
which means that GHC throws errors if we use an Int
instead. ↩
Are you interested in working at Channable? Check out our vacancy page to see if we have an open position that suits you!
Apply now