When linear search O(N) is faster than hash table O(1) lookup

Consider quite popular in programming task of string-to-enum value parsing.

You have enum declared in your code (C/C++ but can be any modern programming language)

enum FormatterType {
  simple_formatter,
  integer_formatter,
  decimal_formatter,
  currency_formatter,
  date_formatter,
  date_local_formatter,
  time_formatter,
  time_local_formatter,
  enum_formatter,
  duration_formatter
};

And text representation of these values that are coming as tokens – strings:

"text",
"integer",
"decimal",
"currency",
"date",
"date-local",
"time", 
"time-local", 
"enum", 
"duration"

The task

To write a function that returns one of enum FormatterType constants when given a token – let it be std::string for brevity :

FormatterType t2e(std::string s) { ... }

In this article I will test three straightforward implementations of such string2enum function:

  1. linear lookup – that supposedly shall be slowest as it has O(N) performance.
  2. std::map lookup – binary search tree that is known of having O(log(N)) lookup performance.
  3. std::unordered_map lookup – hash table, should be fastest as we’ve been told it has O(1) complex lookup.

Linear search

The function is just a chain of if-elif-elif-else’es (I am using ternary operator here that is conceptually the if):

FormatterType if_t2e(std::string name)
{
  aux::chars name_chars = aux::chars_of(name);
  return
    name_chars == CHARS("text") ? simple_formatter
     : name_chars == CHARS("integer") ? integer_formatter
     : name_chars == CHARS("decimal") ? decimal_formatter
     : name_chars == CHARS("currency") ? currency_formatter
     : name_chars == CHARS("date") ? date_formatter
     : name_chars == CHARS("date-local") ? date_local_formatter
     : name_chars == CHARS("time") ? time_formatter
     : name_chars == CHARS("time-local") ? time_local_formatter
     : name_chars == CHARS("enum") ? enum_formatter
     : name_chars == CHARS("duration") ? duration_formatter
     : simple_formatter;
}

Where aux::chars is simple slice structure { const char* start; unsigned int length; } that has operator== defined as:

      bool aux::chars::operator == ( const slice& r ) const
      {
        if( length != r.length )
          return false;
        for( unsigned int i = 0; i < length; ++i )
          if( start[i] != r.start[i] )
            return false;
        return true;
      }

As you see it is dead simple – nothing in it to write home about.

Binary search tree lookup

std::map<std::string, FormatterType> map_index = {
  { "text", simple_formatter },
  { "integer", integer_formatter },
  { "decimal", decimal_formatter },
  { "currency", currency_formatter },
  { "date", date_formatter },
  { "date-local", date_local_formatter },
  { "time", time_formatter },
  { "time-local", time_local_formatter },
  { "enum", enum_formatter },
  { "duration", duration_formatter }
};

FormatterType map_t2e(std::string name)
{
  auto it = map_index.find(name);
  return (it != map_index.end()) ? it->second : simple_formatter;
}

Simple and obvious. And so is…

Hash table lookup

std::unordered_map<std::string, FormatterType> unordered_map_index = {
  { "text", simple_formatter },
  { "integer", integer_formatter },
  { "decimal", decimal_formatter },
  { "currency", currency_formatter },
  { "date", date_formatter },
  { "date-local", date_local_formatter },
  { "time", time_formatter },
  { "time-local", time_local_formatter },
  { "enum", enum_formatter },
  { "duration", duration_formatter }
};

FormatterType unordered_map_t2e(std::string name)
{
  auto it = unordered_map_index.find(name);
  return (it != unordered_map_index.end()) ? it->second : simple_formatter;
}

Testing

We will test these functions on random set of input strings that contains our tokens:

std::vector input;

void generate_input() {
  for (int n = 0; n < TOTAL_RUNS; ++n)
    input.push_back(std::string(input_tokens[std::rand() % ITEMS_IN(input_tokens)]));
}

To show the tendency I will test these functions on different sizes of enum sets : 4, 5, 6, 7, 8, 9 and 10 items per enum:

The result

As smaller the number below as faster the function performs. Yellow cells mark best cases.

And here is the surprise – linear lookup, that is supposed to be the slowest, is actually the fastest on small enums:

enum items std::map (sec) std::u_map (sec) linear (sec)
4 3.44 2.88 2.13
5 3.6 2.93 2.36
6 3.52 3.13 2.57
7 3.58 3.18 2.64
8 3.74 3.23 2.78
9 3.76 2.76 2.81
10 3.73 2.76 2.83

And the chart to visualize the results:

As lowest the line – the better. That gray line above is our linear lookup results. As you see on small enums it is the champion.

Resume

  1. On small enum sets linear lookup is better – that is O(N) outperforms O(1) !
  2. On larger enum sets hash table will work better.
  3. std::map (binary search tree) is the worst possible tool for implementation of such things.
  4. The best one is so called perfect hash function like the one produced by the  gperf . It generates FSA (finite state automata) unique to the given set of items in enum. gperf based function takes 1.86 seconds to perform lookup on enum having 9 items.

 

So why is that?

The reason is that each lookup method has its own hidden cost of lookup preparation.

  1. My linear lookup implementation has simply no preparation code – comparisons starts immediately.
  2. For the hash map to start doing lookup it needs to compute hash value of given string – O(n) task, where n is the length of the string.
  3. In binary search tree each left/right comparison is a O(n) task in worst case.
  4. And perfect hash (FSA) does not require preparation code, yet input string is scanned strictly once.

Afterwords

You can take test application terrainformatica.com/experiments/t2e.zip

aux::chars primitive is a reduced version of aux-slice.h file aux-slice.h file from Sciter SDK. Sciter uses slices instead of strings pretty much everywhere. Strings are used too but only as a storage of characters.

In fact that slice construct is what makes D language of Walter Bright the Great so fast – arrays and strings there are slices – structures containing only {start,length} fields.