Consider quite popular in programming task of string-to-enum value parsing.
You have enum declared in your code (C/C++ but can be any modern programming language)
enum FormatterType {
simple_formatter,
integer_formatter,
decimal_formatter,
currency_formatter,
date_formatter,
date_local_formatter,
time_formatter,
time_local_formatter,
enum_formatter,
duration_formatter
};
And text representation of these values that are coming as tokens – strings:
"text",
"integer",
"decimal",
"currency",
"date",
"date-local",
"time",
"time-local",
"enum",
"duration"
The task
To write a function that returns one of enum FormatterType constants when given a token – let it be std::string for brevity :
FormatterType t2e(std::string s) { ... }
In this article I will test three straightforward implementations of such string2enum function:
- linear lookup – that supposedly shall be slowest as it has
O(N) performance.
- std::map lookup – binary search tree that is known of having
O(log(N)) lookup performance.
- std::unordered_map lookup – hash table, should be fastest as we’ve been told it has
O(1) complex lookup.
Linear search
The function is just a chain of if-elif-elif-else’es (I am using ternary operator here that is conceptually the if):
FormatterType if_t2e(std::string name)
{
aux::chars name_chars = aux::chars_of(name);
return
name_chars == CHARS("text") ? simple_formatter
: name_chars == CHARS("integer") ? integer_formatter
: name_chars == CHARS("decimal") ? decimal_formatter
: name_chars == CHARS("currency") ? currency_formatter
: name_chars == CHARS("date") ? date_formatter
: name_chars == CHARS("date-local") ? date_local_formatter
: name_chars == CHARS("time") ? time_formatter
: name_chars == CHARS("time-local") ? time_local_formatter
: name_chars == CHARS("enum") ? enum_formatter
: name_chars == CHARS("duration") ? duration_formatter
: simple_formatter;
}
Where aux::chars is simple slice structure { const char* start; unsigned int length; } that has operator== defined as:
bool aux::chars::operator == ( const slice& r ) const
{
if( length != r.length )
return false;
for( unsigned int i = 0; i < length; ++i )
if( start[i] != r.start[i] )
return false;
return true;
}
As you see it is dead simple – nothing in it to write home about.
Binary search tree lookup
std::map<std::string, FormatterType> map_index = {
{ "text", simple_formatter },
{ "integer", integer_formatter },
{ "decimal", decimal_formatter },
{ "currency", currency_formatter },
{ "date", date_formatter },
{ "date-local", date_local_formatter },
{ "time", time_formatter },
{ "time-local", time_local_formatter },
{ "enum", enum_formatter },
{ "duration", duration_formatter }
};
FormatterType map_t2e(std::string name)
{
auto it = map_index.find(name);
return (it != map_index.end()) ? it->second : simple_formatter;
}
Simple and obvious. And so is…
Hash table lookup
std::unordered_map<std::string, FormatterType> unordered_map_index = {
{ "text", simple_formatter },
{ "integer", integer_formatter },
{ "decimal", decimal_formatter },
{ "currency", currency_formatter },
{ "date", date_formatter },
{ "date-local", date_local_formatter },
{ "time", time_formatter },
{ "time-local", time_local_formatter },
{ "enum", enum_formatter },
{ "duration", duration_formatter }
};
FormatterType unordered_map_t2e(std::string name)
{
auto it = unordered_map_index.find(name);
return (it != unordered_map_index.end()) ? it->second : simple_formatter;
}
Testing
We will test these functions on random set of input strings that contains our tokens:
std::vector input;
void generate_input() {
for (int n = 0; n < TOTAL_RUNS; ++n)
input.push_back(std::string(input_tokens[std::rand() % ITEMS_IN(input_tokens)]));
}
To show the tendency I will test these functions on different sizes of enum sets : 4, 5, 6, 7, 8, 9 and 10 items per enum:
The result
As smaller the number below as faster the function performs. Yellow cells mark best cases.
And here is the surprise – linear lookup, that is supposed to be the slowest, is actually the fastest on small enums:
| enum items |
std::map (sec) |
std::u_map (sec) |
linear (sec) |
| 4 |
3.44 |
2.88 |
2.13 |
| 5 |
3.6 |
2.93 |
2.36 |
| 6 |
3.52 |
3.13 |
2.57 |
| 7 |
3.58 |
3.18 |
2.64 |
| 8 |
3.74 |
3.23 |
2.78 |
| 9 |
3.76 |
2.76 |
2.81 |
| 10 |
3.73 |
2.76 |
2.83 |
And the chart to visualize the results:
As lowest the line – the better. That gray line above is our linear lookup results. As you see on small enums it is the champion.
Resume
- On small enum sets linear lookup is better – that is
O(N) outperforms O(1) !
- On larger enum sets hash table will work better.
- std::map (binary search tree) is the worst possible tool for implementation of such things.
- The best one is so called perfect hash function like the one produced by the gperf . It generates FSA (finite state automata) unique to the given set of items in enum. gperf based function takes 1.86 seconds to perform lookup on enum having 9 items.
So why is that?
The reason is that each lookup method has its own hidden cost of lookup preparation.
- My linear lookup implementation has simply no preparation code – comparisons starts immediately.
- For the hash map to start doing lookup it needs to compute hash value of given string – O(n) task, where n is the length of the string.
- In binary search tree each left/right comparison is a O(n) task in worst case.
- And perfect hash (FSA) does not require preparation code, yet input string is scanned strictly once.
Afterwords
You can take test application terrainformatica.com/experiments/t2e.zip
aux::chars primitive is a reduced version of aux-slice.h file aux-slice.h file from Sciter SDK. Sciter uses slices instead of strings pretty much everywhere. Strings are used too but only as a storage of characters.
In fact that slice construct is what makes D language of Walter Bright the Great so fast – arrays and strings there are slices – structures containing only {start,length} fields.