Brett Wilson

Using SQLite in client applications

I’ve written some of the SQLite code shipping in both the Mozilla Firefox and Google Chrome web browsers, in particular major parts of the history system for each browser. Over several years of using SQLite, I’ve learned some things that work well and, more importantly, some things to watch out for.

SQLite is not magic

When I encountered SQLite for the first time, my impression was “great, this solves all my problems! Since it’s SQL, it will be superfast, and it will handle all the I/O and optimization for me. All I have to worry about is figuring out the SQL syntax, and my system will be awesome.”

In fact, SQLite really only has one I/O strategy and two search algorithms: a brute-force scan over an entire table, and a binary search over an index. Since it has to be general purpose, it is likely to be slower than a good implementation of something domain-specific. I’ve heard stories of companies using SQLite in a project and discovering too late that it doesn’t provide the performance they expected in the real world. You don’t want to be one of them.

SQLite is not magic. It’s important to think of it not as a black box that stores your data, but as a collection of canned storage algorithms that can be applied to your problem. Unless you have extremely simple needs, you have to take an active role in managing your data to get good performance.

SQLite will not solve all your performance problems

It’s easy to look at SQL and assume that SQLite will optimize it so that it runs quickly. But SQLite’s optimizer is not the most advanced, and in some cases a query that could be optimized won’t be. In other cases, if you actually work through what has to happen to execute your SQL, you’ll notice that no optimization is possible with the indices you have. Because SQL is so far removed from the actual execution, this can be difficult to see.

I’ve caught several people trying to check in SQL with the justification that “SQLite should be able to optimize this.” Never assume SQLite can optimize a statement unless you’re sure it’s a simple query over an indexed column. To check your more complicated queries, use EXPLAIN QUERY PLAN from the handy command-line tool:

sqlite> CREATE TABLE a (x, y);
sqlite> CREATE TABLE b (n, m);
sqlite> EXPLAIN QUERY PLAN SELECT m FROM a JOIN b ON a.y = b.m WHERE x = 34;
0|0|TABLE a
1|1|TABLE b

Here, we’re selecting an element from a join. “TABLE a” means a scan of the given table for the element without an index. You can see there are two such brute-force scans here, one for x = 34 in table a, and one for the row in b associated with a. If we add some indices, we can see the lookup starts using them:

sqlite> CREATE INDEX a_index ON a(x);
sqlite> EXPLAIN QUERY PLAN SELECT m FROM a JOIN b ON a.y = b.m WHERE x = 34;
0|1|TABLE b
1|0|TABLE a WITH INDEX a_index
sqlite> CREATE INDEX b_index ON b(m);
sqlite> EXPLAIN QUERY PLAN SELECT m FROM a JOIN b ON a.y = b.m WHERE x = 34;
0|0|TABLE a WITH INDEX a_index
1|1|TABLE b WITH INDEX b_index

In the first example, an index is used for only one of the two lookups, which only gets us halfway there. Adding the second index gives us two indexed lookups. An indexed lookup is a binary search, which is dramatically faster than a full scan once tables get large.

SQLite will not solve your memory problems

A nice feature of SQLite is that it will page data in as necessary, so it doesn’t have to have everything in memory to work with it. This can lower the memory footprint of your application. But it means you’re going to be spending more time waiting for data to be paged in and out by the operating system. In many cases, the operating system will cache your data and you won’t notice any problems, but this means that your users will get unexpectedly poor performance when their system is low on memory. Don’t forget to test in low-memory situations!

One thing you will find yourself doing is adding more and more indices over your data as you optimize your queries this way. However, indices are not free. They take time to update on every insert, and they use a lot of space. Google Chrome’s history database is typically almost half indices by size, rather than actual data. This ends up being many megabytes of extra space, and it’s important that it doesn’t get any larger. You can use the sqlite3_analyzer program from the SQLite download page to see where the space in your databases is going, so you can make good trade-offs.

SQLite will not solve your I/O problems

SQLite uses a page-based disk structure. Pages are read lazily from disk into a cache, and pages are re-used when they become empty. SQLite does not rearrange pages on disk in normal operation. This means your file won’t get smaller when you delete data: you have to run VACUUM for the space to be released. Unfortunately, VACUUM is very slow, so you will want to avoid it if possible.
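Before deciding whether a VACUUM is worth it, you can check from the command-line tool how much reclaimable space has actually accumulated (a quick illustration; run it against your own database):

sqlite> PRAGMA page_count;     -- total pages in the database file
sqlite> PRAGMA freelist_count; -- empty pages waiting to be re-used
sqlite> VACUUM;                -- rebuilds the file; slow on big databases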

The re-use of pages means that the file can get fragmented. If you perform many large updates to your database, your pages will get quasi-randomly distributed around the SQLite file, which will give poor read performance (remember that each seek requires a mechanical head to move and wait for the correct spot on a spinning metal disk). On top of that, if your user has a very full hard disk, the different parts of the file will be located on very different parts of the disk, and you will get a perfect storm of suckage. If you do large updates on machines you don’t control, you may want to look into more application-specific storage.

The paging system also means that reading pages into the cache for the first time requires a lot of seeking around the file. For small databases, or pages close to each other, the operating system’s cache will prevent an actual disk seek. But for larger databases, even unfragmented ones, you will get wild seeks around the file, which can hurt performance.

Google Chrome optimizes the initial I/O for the main history database, which is used frequently and soon after startup. A custom modification to SQLite primes the cache by reading the first N pages off disk in one large read operation. This prevents the many seeks that would be required to bring all those pages in organically and substantially improves first-use performance. Keep in mind it only works if most of your database will fit into SQLite’s cache and that it only speeds up large operations that happen soon after startup.
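Chrome’s change is a patch inside SQLite itself, but the core idea is easy to sketch. This illustrative version (not Chrome’s code; the function name and parameters are made up) simply reads the start of the database file sequentially before SQLite opens it, so the pages are already in the operating system’s cache when SQLite asks for them:

#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative sketch: one big sequential read to warm the OS file
// cache, instead of letting SQLite fault pages in with many seeks.
void PrimeDatabaseCache(const char* path, std::size_t bytes_to_prime) {
  std::vector<char> buffer(bytes_to_prime);
  std::FILE* file = std::fopen(path, "rb");
  if (!file)
    return;
  std::size_t bytes_read = std::fread(buffer.data(), 1, buffer.size(), file);
  (void)bytes_read;  // We only care about the side effect on the OS cache.
  std::fclose(file);
}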

Committing and fsyncs

Atomic commits require calling fsync on the file, which (hopefully) forces the data to be written to the disk. Because this means waiting on the mechanical hard drive, it can be very slow. If you don’t need the ability to roll back, you can batch your transactions for better performance. Google Chrome’s history system keeps an open transaction that it commits periodically. The history system is a perfect application for this technique because the “durable” property of an ACID database is less important: if the user loses the history entry for the page they visited a few seconds before crashing, they probably won’t notice or care.
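With the raw SQLite C API, the batching pattern looks something like this (a sketch only; Chrome does this through its own wrapper, and the timer driving the periodic commit is up to you):

#include <sqlite3.h>

// Open one long-lived transaction after opening the database.
void BeginBatch(sqlite3* db) {
  sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
}

// Call this periodically, e.g. from a timer. Everything since the last
// commit becomes durable at once, paying for a single fsync, and a
// fresh transaction is opened for subsequent work.
void CommitBatch(sqlite3* db) {
  sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
  sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
}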

Commits on Linux can be even more expensive. When running in “ordered” mode, the ext3 filesystem flushes all pending writes when it gets an fsync. This means your commit won’t just wait for the disk to seek and write your data, but potentially for many megabytes of data from random files throughout the system. So avoid unnecessary commits. Study SQLite’s atomic commit article, and also consider whether your application can run with synchronous = OFF.
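Turning syncing off is a single pragma. With it, SQLite hands data to the operating system and never calls fsync, so an application crash alone can’t corrupt the database, but a power failure or OS crash can; weigh that against how important your data is:

sqlite> PRAGMA synchronous = OFF;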

SQLite will not solve your concurrency problems

SQLite allows more than one connection to the same database. Some people think “great, now I can run more than one copy of my program sharing the same data.” Although this will work, it’s generally best avoided.

Depending on the state of other processes, the file may be locked for writes, or for both reads and writes. If another program wants to access it, it must wait or somehow retry later, and you have to program this explicitly in your application. All the disk-sync problems I talked about before now get a lot worse: even reading the data might mean waiting on a disk sync caused by another process’s write. Plus, SQLite must throw away its memory caches when another program has mutated the data. Hopefully the OS has cached the data, but it still involves extra work.
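If you must share a database between processes, at least install a busy handler so that hitting someone else’s lock means a short wait rather than an immediate error. A minimal version with the C API (the timeout value is just an example):

#include <sqlite3.h>

// Retry for up to 250 ms when another connection holds a conflicting
// lock. After the timeout, calls still fail with SQLITE_BUSY, so the
// application must handle that error explicitly anyway.
void SetUpLockRetries(sqlite3* db) {
  sqlite3_busy_timeout(db, 250);
}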

SQLite will not solve your data integrity problems

When you work on a popular client application, your databases and code will be on millions of devices, and every bad thing will happen on some of those machines. Unlike a big Oracle installation where people spend millions of dollars ensuring database integrity and doing backups, your customers will spend zero. Some hard drives and filesystems lie about syncing files, so commits are not actually atomic. Sometimes a memory bug in another part of your program will clobber a database page in memory, and the corrupted page will later get flushed to disk. You need to be able to handle errors, and to recover or at least continue when possible.
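SQLite can at least tell you when a file is damaged. Below is a sketch of a startup check; what “recover” means for you (delete and recreate the database, restore a backup, salvage what you can) is an application decision:

#include <cstring>
#include <sqlite3.h>

// Returns true if SQLite's built-in consistency check passes.
bool DatabaseIsHealthy(sqlite3* db) {
  sqlite3_stmt* statement = NULL;
  if (sqlite3_prepare_v2(db, "PRAGMA integrity_check(1)", -1,
                         &statement, NULL) != SQLITE_OK)
    return false;
  bool healthy = false;
  if (sqlite3_step(statement) == SQLITE_ROW) {
    const unsigned char* result = sqlite3_column_text(statement, 0);
    healthy = result != NULL &&
              std::strcmp(reinterpret_cast<const char*>(result), "ok") == 0;
  }
  sqlite3_finalize(statement);
  return healthy;
}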

Other hints

Use pre-compiled statements

When you execute SQL, SQLite parses the string and builds a set of operations that run in a simple virtual machine. You can avoid this compilation step by re-using statements that have been compiled previously. Firefox’s history system creates all the statements it needs when it starts up and uses those compiled statements later. This works well, but adds some startup overhead, especially since some of those statements aren’t used very often.
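With the plain C API, the prepare-once pattern looks like this (a sketch; the visits table and these function names are made up for illustration):

#include <sqlite3.h>

// Compiled once at startup, then re-used for every insert.
static sqlite3_stmt* g_add_visit = NULL;

void InitStatements(sqlite3* db) {
  sqlite3_prepare_v2(db, "INSERT INTO visits (url, time) VALUES (?, ?)",
                     -1, &g_add_visit, NULL);
}

void AddVisit(const char* url, sqlite3_int64 time) {
  // Binding and stepping an already-compiled statement skips the
  // parse/compile step entirely; reset makes it re-usable next time.
  sqlite3_bind_text(g_add_visit, 1, url, -1, SQLITE_TRANSIENT);
  sqlite3_bind_int64(g_add_visit, 2, time);
  sqlite3_step(g_add_visit);
  sqlite3_reset(g_add_visit);
}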

The other problem with creating all statements up-front is that the SQL ends up far removed from the code that uses it. When a lot of columns are being queried, it can be difficult to keep track of how your application code interfaces with the SQL when the two are hundreds of lines apart. The first pass of the Google Chrome history system did not use precompiled statements, in an effort to keep the SQL and the C++ code close together. However, profiling revealed SQLite was spending almost as much time compiling statements over and over as actually executing them.

Google Chrome’s SQLite wrapper now includes an automatic caching system keyed on the file and line number of the calling code. This means that the SQL commands are close to the code that uses them, that statements are cached, and that they are compiled lazily when they are first used. Typically, we do not re-use identical statements in different places, preferring instead to write a C++ function that executes the common SQL operation. Here’s an example of how the calling code looks in the download database, where SQL_FROM_HERE is a macro that captures the file and line number of the caller:

bool DownloadDatabase::UpdateDownload(int64 received_bytes,
                                      int32 state,
                                      DownloadID db_handle) {
  sql::Statement statement(GetDB().GetCachedStatement(SQL_FROM_HERE,
      "UPDATE downloads SET received_bytes=?, state=? WHERE id=?"));
  if (!statement)
    return false;

  statement.BindInt64(0, received_bytes);
  statement.BindInt(1, state);
  statement.BindInt64(2, db_handle);
  return statement.Run();
}

However you do it, always re-use the compiled copy of a statement that is likely to be executed many times. Keep in mind, too, that it’s wasteful to keep a compiled statement open if it will be used only once (this happens most often with the CREATE TABLE statements at the beginning of your code).
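In the C API, sqlite3_exec is a natural fit for those one-shot statements: it compiles, runs, and frees the statement in a single call, so nothing stays allocated afterward (the schema here is illustrative):

#include <sqlite3.h>

// One-shot schema setup; no compiled statement survives this call.
void CreateSchema(sqlite3* db) {
  sqlite3_exec(db,
      "CREATE TABLE IF NOT EXISTS visits (url TEXT, time INTEGER)",
      NULL, NULL, NULL);
}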

You don’t necessarily have to be all relational-databasey all the time

The relational database people have a lot of rules about how you’re supposed to design relational databases, rules that help ensure your data is consistent and normalized. When you’re a bank keeping critical customer data and have a team of professional database administrators, this is super important. If you’re doing the database backend for a web server, even a small one, you will want to be careful with this as well.

But if you’re using SQLite, you’re most likely writing an embedded or client app in another language and are using SQLite as a convenient storage layer. Probably most people on your team don’t know SQL at all, and you might not have a deep understanding of it yourself. I think it’s perfectly OK to use SQLite as a very fancy lookup table rather than as a relational database with triggers and constraints to keep everything consistent. When the point of your application is the database and you have administrators for it, complicated constraints and triggers keep your database maintainable by other administrators. But when you’re on a team of C++ programmers, using simple SQL and perhaps more complicated C++ can actually be more maintainable for the team.

Keep the relational database rules in mind because they’re good ideas. A non-normalized schema that duplicates data is wasteful and more prone to bugs. But you don’t need to feel guilty disobeying some of the rules dictated by the High Relational Database Masters in your embedded application.

Use SQLite only from a background thread

Because of all the I/O, and in particular the fsyncs which can take a very long time, it’s important to not use SQLite in a way that will block your user interface. Google Chrome is carefully written to run SQLite only on a background thread which the rest of the program interacts with asynchronously. This helps a lot with the responsiveness of the application. This is something that’s hard to add later, so design your program from the start to be asynchronous.
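What this looks like in miniature, using C++11 threads rather than Chrome’s actual message-loop machinery (a sketch; all names here are illustrative): the rest of the program posts closures to a queue, and a single dedicated thread owns the database connection and drains them.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// All SQLite work happens on one background thread. Other threads only
// enqueue closures, so the UI never waits on a disk seek or an fsync.
class DatabaseThread {
 public:
  DatabaseThread() : done_(false), thread_(&DatabaseThread::Run, this) {}
  ~DatabaseThread() {
    Post([this] { done_ = true; });  // Runs after all earlier tasks.
    thread_.join();
  }

  // Callable from any thread; returns immediately.
  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(std::move(task));
    wake_.notify_one();
  }

 private:
  void Run() {
    while (!done_) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return !queue_.empty(); });
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();  // Open the database, run statements, etc., off the UI thread.
    }
  }

  std::mutex mutex_;
  std::condition_variable wake_;
  std::deque<std::function<void()>> queue_;
  bool done_;
  std::thread thread_;
};

Results travel back the same way: the posted closure finishes by handing another closure, carrying the answer, to the UI thread’s own event loop.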

© 2009 Brett Wilson