Consider a task quite popular in programming: parsing a string into an enum value.
You have an enum declared in your code (C/C++ here, but it could be any modern programming language):
enum FormatterType {
  simple_formatter,
  integer_formatter,
  decimal_formatter,
  currency_formatter,
  date_formatter,
  date_local_formatter,
  time_formatter,
  time_local_formatter,
  enum_formatter,
  duration_formatter
};
And the text representations of these values arrive as tokens – strings:
"text", "integer", "decimal", "currency", "date", "date-local", "time", "time-local", "enum", "duration"
The task
Write a function that returns one of the enum FormatterType constants when given a token – let it be a std::string for brevity:
FormatterType t2e(std::string s) { ... }
In this article I will test three straightforward implementations of such a string-to-enum function:
- linear lookup – supposedly the slowest, as it has O(N) performance;
- std::map lookup – a binary search tree, known for its O(log N) lookup;
- std::unordered_map lookup – a hash table, which should be the fastest since, as we have been told, it has O(1) lookup complexity.
Linear search
The function is just a chain of if-else-if's (I am using the ternary operator here, which is conceptually the same if):
FormatterType if_t2e(std::string name) {
  aux::chars name_chars = aux::chars_of(name);
  return name_chars == CHARS("text")       ? simple_formatter
       : name_chars == CHARS("integer")    ? integer_formatter
       : name_chars == CHARS("decimal")    ? decimal_formatter
       : name_chars == CHARS("currency")   ? currency_formatter
       : name_chars == CHARS("date")       ? date_formatter
       : name_chars == CHARS("date-local") ? date_local_formatter
       : name_chars == CHARS("time")       ? time_formatter
       : name_chars == CHARS("time-local") ? time_local_formatter
       : name_chars == CHARS("enum")       ? enum_formatter
       : name_chars == CHARS("duration")   ? duration_formatter
       : simple_formatter;
}
Where aux::chars is a simple slice structure { const char* start; unsigned int length; } that has operator== defined as:

bool aux::chars::operator == ( const chars& r ) const {
  if( length != r.length )
    return false;
  for( unsigned int i = 0; i < length; ++i )
    if( start[i] != r.start[i] )
      return false;
  return true;
}
As you can see, it is dead simple – nothing in it is worth writing home about.
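The CHARS macro and aux::chars_of helper used above are not shown in the article. A minimal self-contained sketch of how they could look (my reconstruction for illustration, not the actual Sciter SDK code):

```cpp
#include <string>

namespace aux {
  // minimal slice over constant character data: pointer + length, no ownership
  struct chars {
    const char*  start;
    unsigned int length;
    bool operator==(const chars& r) const {
      if (length != r.length) return false;
      for (unsigned int i = 0; i < length; ++i)
        if (start[i] != r.start[i]) return false;
      return true;
    }
  };
  // wrap a std::string without copying its characters
  inline chars chars_of(const std::string& s) {
    return chars{ s.data(), (unsigned int)s.length() };
  }
}

// for a string literal the length is known at compile time;
// sizeof includes the terminating NUL, hence the -1
#define CHARS(lit) aux::chars{ lit, sizeof(lit) - 1 }
```

The point of the slice is that comparing against a literal costs no allocation and no strlen – the literal's length is baked in at compile time.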
Binary search tree lookup
std::map<std::string, FormatterType> map_index = {
  { "text",       simple_formatter },
  { "integer",    integer_formatter },
  { "decimal",    decimal_formatter },
  { "currency",   currency_formatter },
  { "date",       date_formatter },
  { "date-local", date_local_formatter },
  { "time",       time_formatter },
  { "time-local", time_local_formatter },
  { "enum",       enum_formatter },
  { "duration",   duration_formatter }
};

FormatterType map_t2e(std::string name) {
  auto it = map_index.find(name);
  return (it != map_index.end()) ? it->second : simple_formatter;
}
Simple and obvious. And so is…
Hash table lookup
std::unordered_map<std::string, FormatterType> unordered_map_index = {
  { "text",       simple_formatter },
  { "integer",    integer_formatter },
  { "decimal",    decimal_formatter },
  { "currency",   currency_formatter },
  { "date",       date_formatter },
  { "date-local", date_local_formatter },
  { "time",       time_formatter },
  { "time-local", time_local_formatter },
  { "enum",       enum_formatter },
  { "duration",   duration_formatter }
};

FormatterType unordered_map_t2e(std::string name) {
  auto it = unordered_map_index.find(name);
  return (it != unordered_map_index.end()) ? it->second : simple_formatter;
}
Testing
We will test these functions on a random sequence of input strings built from our tokens:
std::vector<std::string> input;

void generate_input() {
  for (int n = 0; n < TOTAL_RUNS; ++n)
    input.push_back(std::string(input_tokens[std::rand() % ITEMS_IN(input_tokens)]));
}
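The timing loop itself is not shown in the article. A minimal sketch of how such a benchmark could be driven – the token set, TOTAL_RUNS value, and the volatile sink are my assumptions, not the article's actual harness:

```cpp
#include <chrono>
#include <cstdlib>
#include <string>
#include <vector>

// hypothetical stand-ins for the article's setup
#define ITEMS_IN(a) (sizeof(a) / sizeof((a)[0]))
static const char* input_tokens[] = { "text", "integer", "decimal", "currency" };
static const int TOTAL_RUNS = 100000;

enum FormatterType { simple_formatter, integer_formatter,
                     decimal_formatter, currency_formatter };

// function under test (a small linear chain, for the sake of the example)
static FormatterType t2e(const std::string& s) {
  return s == "integer"  ? integer_formatter
       : s == "decimal"  ? decimal_formatter
       : s == "currency" ? currency_formatter
       : simple_formatter;
}

std::vector<std::string> input;   // note the element type

void generate_input() {
  for (int n = 0; n < TOTAL_RUNS; ++n)
    input.push_back(std::string(input_tokens[std::rand() % ITEMS_IN(input_tokens)]));
}

// time the lookup over the whole input, returning elapsed seconds
double run_benchmark() {
  volatile int sink = 0;          // keep the optimizer from deleting the loop
  auto start = std::chrono::steady_clock::now();
  for (const auto& s : input)
    sink = sink + (int)t2e(s);
  auto end = std::chrono::steady_clock::now();
  (void)sink;
  return std::chrono::duration<double>(end - start).count();
}
```

Generating the inputs up front matters: it keeps allocation and rand() out of the timed region, so only the lookup itself is measured.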
To show the trend, I will run these functions on enum sets of different sizes: 4, 5, 6, 7, 8, 9 and 10 items per enum.
The result
The smaller the number below, the faster the function performs. Yellow cells mark the best cases.
And here is the surprise – linear lookup, which is supposed to be the slowest, is actually the fastest on small enums:
| enum items | std::map (sec) | std::unordered_map (sec) | linear (sec) |
|-----------:|---------------:|-------------------------:|-------------:|
| 4          | 3.44           | 2.88                     | 2.13         |
| 5          | 3.60           | 2.93                     | 2.36         |
| 6          | 3.52           | 3.13                     | 2.57         |
| 7          | 3.58           | 3.18                     | 2.64         |
| 8          | 3.74           | 3.23                     | 2.78         |
| 9          | 3.76           | 2.76                     | 2.81         |
| 10         | 3.73           | 2.76                     | 2.83         |
And the chart to visualize the results:
As lower the line – the better. That gray line above is our linear lookup results. As you see on small enums it is the champion.
Summary
- On small enum sets linear lookup is better – that is, O(N) outperforms O(1)!
- On larger enum sets a hash table will work better.
- std::map (a binary search tree) is the worst possible tool for implementing such lookups.
- The best option is a so-called perfect hash function, like the one produced by gperf. It generates an FSA (finite state automaton) unique to the given set of enum items. The gperf-based function takes 1.86 seconds to perform the lookup on an enum of 9 items.
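Actual gperf output is verbose, but the idea behind a perfect hash for a small fixed token set can be sketched by hand. The following is my own illustration (not gperf's generated code): dispatch on string length first, then on one distinguishing character, so each input is compared against at most a couple of candidates:

```cpp
#include <cstring>
#include <string>

enum FormatterType { simple_formatter, integer_formatter, decimal_formatter,
                     currency_formatter, date_formatter, date_local_formatter,
                     time_formatter, time_local_formatter, enum_formatter,
                     duration_formatter };

// hand-written "perfect" dispatch: the (length, first char) pair narrows
// the token set down to one or two candidates; a single memcmp confirms
FormatterType fsa_t2e(const std::string& s) {
  const char* p = s.data();
  switch (s.length()) {
    case 4:
      switch (p[0]) {
        case 't': if (std::memcmp(p, "text", 4) == 0) return simple_formatter;
                  if (std::memcmp(p, "time", 4) == 0) return time_formatter;
                  break;
        case 'd': if (std::memcmp(p, "date", 4) == 0) return date_formatter;
                  break;
        case 'e': if (std::memcmp(p, "enum", 4) == 0) return enum_formatter;
                  break;
      }
      break;
    case 7:
      if (std::memcmp(p, "integer", 7) == 0) return integer_formatter;
      if (std::memcmp(p, "decimal", 7) == 0) return decimal_formatter;
      break;
    case 8:
      if (std::memcmp(p, "currency", 8) == 0) return currency_formatter;
      if (std::memcmp(p, "duration", 8) == 0) return duration_formatter;
      break;
    case 10:
      if (std::memcmp(p, "date-local", 10) == 0) return date_local_formatter;
      if (std::memcmp(p, "time-local", 10) == 0) return time_local_formatter;
      break;
  }
  return simple_formatter;  // same fallback as the other variants
}
```

Both switches compile to jump tables, so the cost is essentially one or two indexed jumps plus one short memcmp – no hashing pass over the key and no tree walk.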
So why is that?
The reason is that each lookup method carries its own hidden preparation cost.
- My linear lookup implementation has no preparation code at all – comparisons start immediately.
- For the hash map to even start the lookup, it must compute the hash of the given string – an O(n) task, where n is the length of the string.
- In a binary search tree, each left/right comparison is an O(n) task in the worst case.
- The perfect hash (FSA) also needs no preparation code, and the input string is scanned exactly once.
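To make that hidden cost concrete: before std::unordered_map can even pick a bucket, it must hash the entire key. The exact hash is implementation-defined, but a typical string hash such as FNV-1a (shown here purely for illustration) touches every byte:

```cpp
#include <cstdint>
#include <string>

// FNV-1a, 64-bit: one xor + one multiply per input byte,
// so hashing alone is O(n) in the key length --
// roughly the cost of one full character-by-character comparison
std::uint64_t fnv1a(const std::string& s) {
  std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ull;                    // FNV prime
  }
  return h;
}
```

So for a 10-item enum the hash table pays for a full scan of the key, a bucket probe, and a final equality check – while the linear chain can often reject a candidate on the very first length comparison.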
Afterword
You can download the test application from terrainformatica.com/experiments/t2e.zip
The aux::chars primitive is a reduced version of the aux-slice.h file from the Sciter SDK. Sciter uses slices instead of strings pretty much everywhere; strings are used too, but only as storage of characters.
In fact, that slice construct is part of what makes the D language of Walter Bright the Great so fast – arrays and strings there are slices: structures containing only {start, length} fields.
There is also Ragel – a very efficient FSM C-code generator driven by a regex-like DSL, capable of producing minimalistic, CPU-cache-friendly code.
This rather old technique is experiencing a rebirth nowadays, as threads, sockets, memory copying, heavy data structures like std::map, and even the OS kernel have turned out to be too-heavy abstractions for ultra-high loads (>1M RPS). Direct NIC and SSD access, a thread per core per NIC with minimal context switches/syscalls/mallocs/memcpys, and explicit finite state automata are the keys to success there.
Look at the Seastar framework (the foundation of the very fast ScyllaDB) and its simple example of a state-machine-based memcached clone: https://github.com/scylladb/seastar/tree/master/apps/memcached
They use Ragel for parsing.