Splitting strings

Have read Splitting Strings by Chris Zetter mini saga…

What this expression:

"".split(",")

shall produce?

Python, JS, Sciter (now):

"".split(",") -> [""]

Ruby, Perl, AWK, Sciter (while ago):

"".split(",") -> []

That reminded me that initially in Sciter I’ve implemented the function in Ruby/Perl way as it looked more natural at that moment.

And only recently redesigned the String.split() to follow JavaScript way.

When linear search O(N) is faster than hash table O(1) lookup

Consider quite popular in programming task of string-to-enum value parsing.

You have enum declared in your code (C/C++ but can be any modern programming language)

enum FormatterType {
  simple_formatter,
  integer_formatter,
  decimal_formatter,
  currency_formatter,
  date_formatter,
  date_local_formatter,
  time_formatter,
  time_local_formatter,
  enum_formatter,
  duration_formatter
};

And text representation of these values that are coming as tokens – strings:

"text",
"integer",
"decimal",
"currency",
"date",
"date-local",
"time", 
"time-local", 
"enum", 
"duration"

The task

To write a function that returns one of enum FormatterType constants when given a token – let it be std::string for brevity :

FormatterType t2e(std::string s) { ... }

In this article I will test three straightforward implementations of such string2enum function:

  1. linear lookup – that supposedly shall be slowest as it has O(N) performance.
  2. std::map lookup – binary search tree that is known of having O(log(N)) lookup performance.
  3. std::unordered_map lookup – hash table, should be fastest as we’ve been told it has O(1) complex lookup.

Linear search

The function is just a chain of if-elif-elif-else’es (I am using ternary operator here that is conceptually the if):

FormatterType if_t2e(std::string name)
{
  aux::chars name_chars = aux::chars_of(name);
  return
    name_chars == CHARS("text") ? simple_formatter
     : name_chars == CHARS("integer") ? integer_formatter
     : name_chars == CHARS("decimal") ? decimal_formatter
     : name_chars == CHARS("currency") ? currency_formatter
     : name_chars == CHARS("date") ? date_formatter
     : name_chars == CHARS("date-local") ? date_local_formatter
     : name_chars == CHARS("time") ? time_formatter
     : name_chars == CHARS("time-local") ? time_local_formatter
     : name_chars == CHARS("enum") ? enum_formatter
     : name_chars == CHARS("duration") ? duration_formatter
     : simple_formatter;
}

Where aux::chars is simple slice structure { const char* start; unsigned int length; } that has operator== defined as:

      bool aux::chars::operator == ( const slice& r ) const
      {
        if( length != r.length )
          return false;
        for( unsigned int i = 0; i < length; ++i )
          if( start[i] != r.start[i] )
            return false;
        return true;
      }

As you see it is dead simple – nothing in it to write home about.

Binary search tree lookup

std::map<std::string, FormatterType> map_index = {
  { "text", simple_formatter },
  { "integer", integer_formatter },
  { "decimal", decimal_formatter },
  { "currency", currency_formatter },
  { "date", date_formatter },
  { "date-local", date_local_formatter },
  { "time", time_formatter },
  { "time-local", time_local_formatter },
  { "enum", enum_formatter },
  { "duration", duration_formatter }
};

FormatterType map_t2e(std::string name)
{
  auto it = map_index.find(name);
  return (it != map_index.end()) ? it->second : simple_formatter;
}

Simple and obvious. And so is…

Hash table lookup

std::unordered_map<std::string, FormatterType> unordered_map_index = {
  { "text", simple_formatter },
  { "integer", integer_formatter },
  { "decimal", decimal_formatter },
  { "currency", currency_formatter },
  { "date", date_formatter },
  { "date-local", date_local_formatter },
  { "time", time_formatter },
  { "time-local", time_local_formatter },
  { "enum", enum_formatter },
  { "duration", duration_formatter }
};

FormatterType unordered_map_t2e(std::string name)
{
  auto it = unordered_map_index.find(name);
  return (it != unordered_map_index.end()) ? it->second : simple_formatter;
}

Testing

We will test these functions on random set of input strings that contains our tokens:

std::vector input;

void generate_input() {
  for (int n = 0; n < TOTAL_RUNS; ++n)
    input.push_back(std::string(input_tokens[std::rand() % ITEMS_IN(input_tokens)]));
}

To show the tendency I will test these functions on different sizes of enum sets : 4, 5, 6, 7, 8, 9 and 10 items per enum:

The result

As smaller the number below as faster the function performs. Yellow cells mark best cases.

And here is the surprise – linear lookup, that is supposed to be the slowest, is actually the fastest on small enums:

enum items std::map (sec) std::u_map (sec) linear (sec)
4 3.44 2.88 2.13
5 3.6 2.93 2.36
6 3.52 3.13 2.57
7 3.58 3.18 2.64
8 3.74 3.23 2.78
9 3.76 2.76 2.81
10 3.73 2.76 2.83

And the chart to visualize the results:

As lowest the line – the better. That gray line above is our linear lookup results. As you see on small enums it is the champion.

Resume

  1. On small enum sets linear lookup is better – that is O(N) outperforms O(1) !
  2. On larger enum sets hash table will work better.
  3. std::map (binary search tree) is the worst possible tool for implementation of such things.
  4. The best one is so called perfect hash function like the one produced by the  gperf . It generates FSA (finite state automata) unique to the given set of items in enum. gperf based function takes 1.86 seconds to perform lookup on enum having 9 items.

 

So why is that?

The reason is that each lookup method has its own hidden cost of lookup preparation.

  1. My linear lookup implementation has simply no preparation code – comparisons starts immediately.
  2. For the hash map to start doing lookup it needs to compute hash value of given string – O(n) task, where n is the length of the string.
  3. In binary search tree each left/right comparison is a O(n) task in worst case.
  4. And perfect hash (FSA) does not require preparation code, yet input string is scanned strictly once.

Afterwords

You can take test application terrainformatica.com/experiments/t2e.zip

aux::chars primitive is a reduced version of aux-slice.h file aux-slice.h file from Sciter SDK. Sciter uses slices instead of strings pretty much everywhere. Strings are used too but only as a storage of characters.

In fact that slice construct is what makes D language of Walter Bright the Great so fast – arrays and strings there are slices – structures containing only {start,length} fields.

Epic discussion: VS Code Uses 13% CPU When Idle Due to Blinking Cursor Rendering

VS Code Uses 13% CPU When Idle Due to Blinking Cursor Rendering

Discussions: Reddit and Hacker News

Couple of citations:

Assuming 1 million users of VS Code This blinking cursor will waste $3 million per year in electricity costs, and output 32,000 tons of c02 per year.

2150: “Grandpa, why did the ocean’s dry up?”
“well sonny, long ago, Microsoft didn’t optimize their code for a code editor’s cursor and pushed earth’s climate over the edge”

TL;DR:

All this is about Microsoft Visual Code editor that is Electron application — Chromium based.

They blink cursor by this CSS declaration:

@keyframes monaco-cursor-blink {
    50% {
        opacity: 0;
    }
    100% {
        opacity: 1;
    }
}

.cursor-blink {
    animation: monaco-cursor-blink 1s step-start 0s infinite;
}

Problems with this solution:

  • In chromium / webkit this animation runs with animation frame rate — each 16ms — 60 FPS.
  • As Chromium (WebKit with Skia backend) uses mostly CPU rasterization than simple task of filling 1px rectangle eats CPU (and so drains batteries). On each frame rate whole window gets updated.

In principle all this can be solved by something simple as:

setTimer(function() { caret.toggleClass("visible") }, 1000);

When you want to ask for help in programming …

then use the following template:

  • Explain details that you see as much as possible. Environment, problem, etc.
  • Explain exactly what you want – what you think should be happening.
  • Explain what is actually happening.
  • Explain why you think it should be working differently.

By walking through all these steps you may find answer by yourself. In any case this will greatly increase a chance of getting adequate answer fast.