Performance Improvements Using Micro-Optimizations

D. Richard Hipp
drh@sqlite.org

Abstract

Over the past two years, the number of CPU cycles needed to run a reference workload in SQLite has been reduced by half. This performance improvement results from countless micro-optimizations, each scarcely measurable by itself but collectively providing a significantly faster product. This technical note provides an overview of how micro-optimizations were used to improve the performance of SQLite, with emphasis on how similar enhancements might be accomplished in the TCL core.

1.0 Background

SQLite is a TCL extension that has escaped into the wild. SQLite is an embedded, transactional, SQL database engine, and the most widely deployed database engine in the world today, being found on all smartphones as well as on most PCs, television sets and set-top boxes, automobile dash-board systems, smart appliances, and so forth. There are many billions of SQLite installations in active use.

SQLite is written in ANSI C code, like TCL. The SQLite core is about 100,000 lines of code, exclusive of blank lines and comments, making it roughly half the size of the TCL core.

Because SQLite is so widely used, it is important that it be efficient. Figure 1 shows the number of CPU cycles required by SQLite to perform a particular reference workload. The graph shows that performance was relatively flat from 2010 through the middle of 2013. But then in the second half of 2013 the slope of the performance curve turned abruptly downward, so that by the middle of 2015 the number of CPU cycles had been reduced by over half.

Figure 1: CPU cycles required to run the SQLite reference workload, 2010 through mid-2015

The downward bend beginning in the middle of 2013 corresponds to the development of a new optimization technique for C-language programs, namely the implementation of many small "micro-optimizations", each of which is unnoticeable by itself but which have a large cumulative effect. The purpose of this report is to further explain the micro-optimization technique in the hope that it can be used to obtain similar performance improvements in TCL.

2.0 Procedure

The optimization process used on SQLite can be summarized as follows:

  1. Implement a deterministic reference workload that approximates real-world usage.

  2. Run the reference workload using cachegrind.

  3. Generate a performance report from the cachegrind data using a small TCL script that processes the raw output of cg_annotate.

  4. Study the performance report. Find optimization opportunities and implement them.

  5. Verify that the implemented micro-optimization is correct (that no new bugs were introduced) and that it really does make the software run slightly faster.

  6. Check in the change and return to step (2).

2.1 The Workload

A key factor in successful optimization is finding a workload that is representative of real-world usage of the software. This step necessarily involves a great deal of experience and engineering judgement. There is no correct answer, and reasonable people may disagree over which of two candidate workloads is best. In the case of SQLite, we normally use a 1,200 line C program called speedtest1.c. For TCL, a script of a few hundred lines will likely suffice.

A critical property of the workload is that it be deterministic. It must run the same (or nearly the same) number and sequence of CPU instructions from one execution to the next. In other words, it must be repeatable. Otherwise it will be difficult to determine if a particular micro-optimization has done any good. This means that the workload should not use multiple threads nor make calls to random()-like functions. There will always be some amount of variation in performance from one invocation of the workload to the next, but it is best if the number of CPU cycles reported by cachegrind is consistent to 6 or 7 digits.
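The determinism requirement can be made concrete with a small sketch. The fragment below (in C, with hypothetical names) replaces a random()-style call with a fixed-seed generator, so that the workload executes the same instruction sequence on every run and cachegrind's cycle counts stay repeatable:

```c
/* Hypothetical fixed-seed generator: unlike random() seeded from the
** clock, this produces the identical sequence on every invocation,
** which keeps cachegrind's cycle counts repeatable. */
static unsigned lcgState;

static void lcgSeed(unsigned seed){
  lcgState = seed;                            /* fixed seed, never time()-based */
}

static unsigned lcgNext(void){
  lcgState = lcgState*1103515245u + 12345u;   /* classic LCG constants */
  return (lcgState >> 16) & 0x7fff;
}

/* Run the hypothetical workload once and return a checksum of its results.
** Because the generator is reseeded with a constant, two consecutive calls
** return the same checksum. */
static unsigned runWorkload(void){
  unsigned checksum = 0;
  int i;
  lcgSeed(2015);
  for(i=0; i<100000; i++){
    checksum += lcgNext();                    /* deterministic "work" */
  }
  return checksum;
}
```

The same principle applies to a TCL workload script: seed any pseudo-random sequence with a constant, and avoid threads.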

2.2 Cachegrind

The cornerstone of this procedure is the use of cachegrind to measure the exact number of CPU cycles used by each source line of the software under test. Cachegrind is one of the valgrind family of tools. Cachegrind is, in essence, a CPU simulator. It is able to provide precise and repeatable counts of the number of executions of each line of code in the software under test. This is far more accurate than other profiling tools (for example, gprof), which only provide a statistical sampling of where the software is spending its time.

Cachegrind is essential for micro-optimization, since each micro-optimization might only improve performance by 0.05%. Statistical profiling tools such as gprof are unable to resolve this level of detail since their time reports vary by as much as 1.0% or more between consecutive runs. Furthermore, statistical profiling tools do not provide the same level of detail as to what parts of the software are using time. Gprof might identify a subroutine that is heavily used, but cachegrind can easily identify a particular line of code within that subroutine, or even a particular operation within that line. And cachegrind can do this repeatably.

There are downsides to using cachegrind. First, cachegrind is a CPU simulator, so software run under cachegrind runs more slowly. To a first approximation, cachegrind on a high-end workstation performs about the same as native code on a first-generation smartphone. Secondly, cachegrind is not available on all platforms. There is some (spotty) support for cachegrind on recent Macs and on Solaris, but for the most part cachegrind requires a Linux workstation.

The command-line to run cachegrind on a Linux workstation would typically look something like this:

valgrind --tool=cachegrind ./tclsh sample_workload.tcl

It is not necessary to compile the software being analyzed in any special way. However, we find that it works best to compile the software using the -Os optimization option (optimize for size). The -Os optimization level on recent GCC and Clang implementations does most kinds of optimization, but omits radical code movement, loop unrolling, and excessive function in-lining that can make analysis of the code difficult.

2.3 Report Generation

The performance data output by cachegrind is contained in a binary file named "cachegrind.out.NNNNN" where NNNNN is the process ID. The cg_annotate command (included in a standard installation of cachegrind) will convert the binary cachegrind.out.NNNNN file into a human-readable format. However, cg_annotate requires arcane command-line arguments and the report it generates includes information on individual source files in a different order for each run, complicating comparisons using "diff". For this reason, it is best to run cg_annotate using a simple TCL script that automatically invokes cg_annotate with appropriate options and sorts the output into a consistent order. The following is one suggestion:

#!/usr/bin/tclsh
#
# A wrapper around cg_annotate that sets appropriate command-line options
# and rearranges the output so that annotated files occur in a consistent
# sorted order.
#

set in [open "|cg_annotate --show=Ir --auto=yes --context=40 $argv" r]
set dest !
set out(!) {}
while {[gets $in line] >= 0} {
  set line [string map {\t {        }} $line]
  if {[regexp {^-- Auto-annotated source: (.*)} $line all name]} {
    set dest $name
  } elseif {[regexp {^-- line \d+ ------} $line]} {
    set line [lreplace $line 2 2 {#}]
  } elseif {[regexp {^The following files chosen for } $line]} {
    set dest !
  }
  append out($dest) $line\n
}
close $in
foreach x [lsort [array names out]] {
  puts $out($x)
}

Readers are invited to revise this script to suit their own preferences.

The cg_annotate output includes source code listings with cycle-count information added on the left. The addition of new text on the left messes up the tabs typically found in TCL source files, and so the "string map" line had to be added to convert the tabs into spaces. (The SQLite style guidelines prohibit the use of tabs and so this was never a problem when optimizing SQLite.)

2.4 Finding Micro-Optimization Opportunities

The output from cg_annotate shows source code listings with the number of cycles on the left margin. The search for micro-optimizations involves reading this output and looking for places where the numbers on the left margin are large and where some simple code changes might make the numbers smaller.

Here is an example micro-optimization taken from SQLite, specifically from SQLite check-in [618d8dd4ff4] on 2015-09-03. The code before the change looked like this:

         .    /* It is acceptable to use a read-only (mmap) page for any page except
         .    ** page 1 if there is no write-transaction open or the ACQUIRE_READONLY
         .    ** flag was specified by the caller. And so long as the db is not a 
         .    ** temporary or in-memory database.  */
   514,274    const int bMmapOk = (pgno!=1 && USEFETCH(pPager)
   771,867     && (pPager->eState==PAGER_READER || (flags & PAGER_GET_READONLY))
         .  #ifdef SQLITE_HAS_CODEC
         .     && pPager->xCodec==0
         .  #endif
         .    );  
   514,274    if( pgno==0 ){
         .      return SQLITE_CORRUPT_BKPT;
         .    }

The page number (pgno) is checked to see if it is 1 (the initial page of the database file that contains the database file header) and then immediately checked for 0 (which is illegal and indicates database corruption). The check for pgno==0 is always false except in the very rare circumstance where a database file has been corrupted in a devious way. So we would like to omit the second test and save half a million CPU cycles. But we cannot do that, in general, without making applications vulnerable to attack from adversaries that are able to tamper with database files.

We cannot completely eliminate the test of pgno==0, but we can reduce its impact as follows (the changes are the "pgno>1" comparison and the new "pgno<=1" prefix):

         .    /* It is acceptable to use a read-only (mmap) page for any page except
         .    ** page 1 if there is no write-transaction open or the ACQUIRE_READONLY
         .    ** flag was specified by the caller. And so long as the db is not a 
         .    ** temporary or in-memory database.  */
   514,274    const int bMmapOk = (pgno>1 && USEFETCH(pPager)
   514,578     && (pPager->eState==PAGER_READER || (flags & PAGER_GET_READONLY))
         .  #ifdef SQLITE_HAS_CODEC
         .     && pPager->xCodec==0
         .  #endif
         .    );  
       304    if( pgno<=1 && pgno==0 ){
   257,289      return SQLITE_CORRUPT_BKPT;
         .    }

The pgno variable is unsigned and for the initialization of the bMmapOk variable we do not care about the case of pgno==0, so we can change the initial test from "pgno!=1" into "pgno>1". Then, as a prefix to the pgno==0 test, we add the opposite conditional "pgno<=1".

The (optimizing) C-compiler recognizes that "pgno>1" and "pgno<=1" are opposites and that the two tests occur right after each other, so it codes them both using a single branch opcode. Hence, the CPU does no extra work to evaluate the added "pgno<=1" prefix before the "pgno==0" test. But that prefix means that the "pgno==0" test is omitted whenever pgno is greater than one.

The example above was compiled with gcc 4.8.4 using -Os. At that optimization setting, there is some code movement, and cg_annotate can get slightly confused and assign cycle counts to a line adjacent to the line on which the work actually occurred. So, for example, we see 257,289 cycles occurring on the "return SQLITE_CORRUPT_BKPT;" line even though that line is never executed. The key point is that by making a few simple adjustments, the total number of cycles is reduced by 513,970. That is about a 0.07% improvement for the workload being analyzed. Obviously it will take a great many such changes to make a noticeable performance difference. Perseverance is an important character trait for those who seek to make significant performance improvements using micro-optimizations.
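The transformation can also be checked in isolation. The sketch below (hypothetical helper names, not SQLite code) captures the two versions of the tests; since pgno is unsigned, the new forms agree with the old ones everywhere the result matters:

```c
typedef unsigned int Pgno;   /* page numbers are unsigned, as noted above */

/* Original tests */
static int mmapOkOld(Pgno pgno){ return pgno!=1; }
static int corruptOld(Pgno pgno){ return pgno==0; }

/* Optimized tests: "pgno>1" and "pgno<=1" are opposites, so the
** compiler folds them into the single branch it already needs; the
** pgno==0 comparison is then skipped whenever pgno is greater than one. */
static int mmapOkNew(Pgno pgno){ return pgno>1; }
static int corruptNew(Pgno pgno){ return pgno<=1 && pgno==0; }
```

The only disagreement is mmapOkOld(0) versus mmapOkNew(0), and pgno==0 is precisely the corruption case in which the value of bMmapOk is irrelevant.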

2.4.1 Reduction Of Function Setup And Breakdown

Many micro-optimizations are ad hoc changes such as shown in the previous example. But there is one kind of micro-optimization that has been especially useful at speeding up SQLite: reducing or eliminating the stack pushes and stack pops that occur upon entering and exiting subroutines.

The following is a snippet of code taken from a recent (circa 2015-09-22) trunk of the TCL core. Comments and blank lines have been elided for brevity, but the code is otherwise unchanged.

        .  static int
        .  CompareVarKeys(
        .      void *keyPtr,                /* New key to compare. */
        .      Tcl_HashEntry *hPtr)        /* Existing key to compare. */
2,016,738  {
        .      Tcl_Obj *objPtr1 = keyPtr;
  336,123      Tcl_Obj *objPtr2 = hPtr->key.objPtr;
        .      register const char *p1, *p2;
        .      register int l1, l2;
  672,246      if (objPtr1 == objPtr2) {
  672,246          return 1;
        .      }
   90,126      p1 = TclGetString(objPtr1);
   30,042      l1 = objPtr1->length;
   90,126      p2 = TclGetString(objPtr2);
        .      l2 = objPtr2->length;
  240,336      return ((l1 == l2) && !memcmp(p1, p2, l1));
2,352,861  }

In the code above, the largest cycle counts are 2,352,861 on the closing curly-brace, and 2,016,738 on the opening curly-brace. These are, respectively, time spent restoring register contents by popping values off the stack upon function exit, and time spent preserving register contents by pushing values onto the stack upon function entry.

The alarming thing here is that much more time is spent preserving register contents than is spent doing actual work.

2.4.1.1 Function Calling Conventions

In order to see how these stack pushes and pops can be avoided, it is necessary to have a rudimentary understanding of how function calls are coded on modern processors. Calling conventions vary by CPU architecture (x64 is different from x86 which is different from ARM) and to some extent by compiler (Microsoft has their own calling conventions that are different from gcc/clang). But while the details may differ, there are some general patterns:

  1. The first few function arguments (perhaps as many as 6) are passed in registers. Excess arguments are stored on the stack. The function argument registers are not preserved. The function is free to return with different values in those registers.

  2. The return value is in one designated register.

  3. The caller expects the subroutine to preserve the values of some registers but allows the contents of other registers to be modified. Call the registers that do not need to be preserved "scratch registers".

The first point to note is that, in order to comply with these rules, the function implementation needs to preserve register contents on the stack if it changes any register other than the argument registers and scratch registers. Hence, the more local variables a function uses and the more computation it does, the more likely it will need to save some register contents on the stack.

The second point is that if the function calls subfunctions, then the function will need to preserve the contents of argument and scratch registers that are used after the subfunction returns. The need to have preserved registers available in which to save intermediate results across subfunction calls is the most common reason for pushing and popping at function entrance and exit.

Consequently, functions that invoke subfunctions frequently spend a lot of time pushing and popping registers so they will have space available to save intermediate results across subfunction calls. Note, however, that if all subfunction calls occur as the very last operation before the function returns, and if the return value of the subfunction is the same as the return value of the function, then nothing ever needs to be preserved across a subfunction call, and the pushing and popping can be reduced or eliminated.

The previous paragraph is essential to understanding what is about to happen. So if you did not quite understand it, please go back and reread it now.
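As a generic illustration of the pattern (hypothetical names; a sketch, not TCL-core code), the hot-path function below makes no calls except a single call in tail position whose return value it passes through unchanged, so the compiler need not push any registers on entry to it:

```c
#include <string.h>

#if defined(__GNUC__)
# define NOINLINE __attribute__((noinline))
#else
# define NOINLINE
#endif

/* Cold path: rarely runs, so its register pushes and pops rarely run. */
static NOINLINE int compareSlow(const char *a, const char *b){
  return strcmp(a, b)==0;
}

/* Hot path: the only subfunction call is the terminal call to
** compareSlow(), whose return value is returned unchanged, so no
** intermediate results need be preserved across the call. */
static int compareFast(const char *a, const char *b){
  if( a==b ) return 1;           /* common case: identical pointers */
  return compareSlow(a, b);      /* tail position: no registers saved */
}
```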

2.4.1.2 Putting Infrequent Work In A Tail-Recursion Subfunction

If you look closely at the timings of the unoptimized CompareVarKeys() implementation, you will see that the function usually returns upon seeing that objPtr1==objPtr2. None of the code prior to that point uses subroutines. All of the subroutine calls happen only in the unusual case where objPtr1!=objPtr2. In other words, the part of the CompareVarKeys() routine that runs most often never actually needs to push any registers onto the stack. The need to push registers onto the stack only arises if objPtr1!=objPtr2.

So to reduce the amount of stack pushing and popping, we can split CompareVarKeys() into two nested routines. The outer routine handles the common case that does no subfunction calls and does not need to preserve registers to the stack. The inner routine handles the exceptional case that does preserve registers.

        .  static int
        .  CompareVarKeys(
        .      void *keyPtr,               /* New key to compare. */
        .      Tcl_HashEntry *hPtr)        /* Existing key to compare. */
        .  {
        .      Tcl_Obj *objPtr1 = keyPtr;
  336,123      Tcl_Obj *objPtr2 = hPtr->key.objPtr;
  672,246      if (objPtr1 == objPtr2) {
        .          return 1;
        .      } else {
   30,042          return CompareDistinctVarKeys(objPtr1, objPtr2);
        .      }
  612,162  }

        .  static NOINLINE int CompareDistinctVarKeys(
        .      Tcl_Obj *objPtr1,
        .      Tcl_Obj *objPtr2)
  150,210  {
        .      register const char *p1, *p2;
        .      register int l1, l2;
   90,126      p1 = TclGetString(objPtr1);
   30,042      l1 = objPtr1->length;
   90,126      p2 = TclGetString(objPtr2);
        .      l2 = objPtr2->length;
  270,378      return ((l1 == l2) && !memcmp(p1, p2, l1));
  120,168  }

The outer routine is still CompareVarKeys(). The new inner routine is CompareDistinctVarKeys().

The CompareVarKeys() routine no longer makes calls to subfunctions, except for the one call to CompareDistinctVarKeys() and that is a terminal call for which no intermediate results need to be saved, and so no preserved registers are needed to accommodate that call. Hence, the time needed to push and pop register values from the stack in CompareVarKeys() is completely eliminated. (The 612,162 cycles on the close curly-brace of CompareVarKeys() is really the time spent doing "return 1;".) Pushes and pops are still required for the CompareDistinctVarKeys() routine, but that subfunction only runs about 5% of the time, so the impact of the pushing and popping is greatly reduced.

For the workload that provided the previous traces (one in which global variables were heavily used) the change above improved performance by 1.6%.

2.4.1.3 Prevent Function Inlining

The NOINLINE macro on the CompareDistinctVarKeys() function is essential for the previous micro-optimization to work. Without NOINLINE, an optimizing compiler will recognize that CompareDistinctVarKeys() was only called from a single place and will inline that routine, which will completely undo the code transformation that provides the 1.6% performance gain.

Different compilers have different ways to prevent a function from being inlined. Hence a macro must be devised for portability. Something like the following might work:

#if defined(__GNUC__)
#  define NOINLINE  __attribute__((noinline))
#elif defined(_MSC_VER) && _MSC_VER>=1310
#  define NOINLINE  __declspec(noinline)
#else
#  define NOINLINE
#endif

2.5 Verification

Micro-optimizations can be tricky. The examples above were relatively straightforward. Real-world examples tend to be more subtle and more complex. In the case of SQLite, several of the more productive optimizations involve replacing entire subsystems with a new implementation, possibly using a new algorithm. Hence, it is essential that all micro-optimizations be thoroughly vetted before being committed to trunk, to avoid introducing bugs.

3.0 Other Considerations

SQLite is normally delivered in the form of the "amalgamation" - a single large source code file named "sqlite3.c". The amalgamation is a build product. A TCL script (mksqlite3c.tcl) is used to concatenate all of the canonical SQLite source files, and a few machine-generated files, into the amalgamation.

The SQLite amalgamation is often viewed as a deployment convenience - it is easier to add a single "sqlite3.c" source file to a project than to add a directory hierarchy full of files and Makefiles. But the amalgamation also offers significant speed advantages. Because all the source code is contained within a single translation unit, the compiler is able to do more cross-function optimization. A speed test of SQLite using the amalgamation is between 5% and 6% faster than the same test built using separate source files. It is unknown if similar speedups might be obtained by creating a TCL amalgamation source file, but it seems worth trying the experiment.

4.0 Retrospective

A large two-year effort was involved in doubling the performance of SQLite. Nobody believed that a doubling of performance was possible when the effort was first started. After each new optimization, the developers would think "surely there is nothing more that can be done!" But with further study, new optimization opportunities would emerge.

The amount of effort involved is perhaps too large to be justified for an ordinary application. But for an infrastructure component like SQLite, or like TCL, that is used by countless other projects, an investment in making the core library run faster is multiplied across many applications.