Analyzing Performance Analysis Performance

Why is EXPLAIN ANALYZE slow?

“Why is EXPLAIN ANALYZE so slow?”, I asked into the void one day. Specifically, why is PostgreSQL’s EXPLAIN ANALYZE an order of magnitude slower than the query it’s analyzing when run in a container in Docker for Mac?

(See Summary for the summary)

Background

Benchling is a data platform for life science research. Our users upload DNA sequences, proteins, notebook entries, and other research data, and search over them to find the data they need to run experiments. I was investigating a PostgreSQL query used in search that was too slow. It was a tricky query with many JOINs, multiple UNIONs – the kind that doesn’t fit in small console windows.

The UI for the search I was debugging.

We use MacBooks and run Docker for Mac to get pretty close to our production systems, so I tried running the query locally to investigate. Running it with just EXPLAIN gave me a pretty big query plan, so, as usual, I ran it with EXPLAIN ANALYZE to get some timing numbers to help me focus on the slow part. But it turned out EXPLAIN ANALYZE was taking almost a minute when the original query took only a few seconds.

That’s the start of how I ended up going down a rabbit hole for an hour and reasoning about a time warp and at this point you think this is an Alice in Wonderland joke but it’s not. I’m writing the post that I wished existed when I started this because, although all the information here exists elsewhere, I didn’t find it put together for our particular stack.

Spoiler: the answer is restart Docker for Mac.

Setup

We can easily reproduce the issue without needing the full Benchling development setup.

docker run --rm --name db postgres:9.6.9 gives us a PostgreSQL server. Once it’s up, we can get a shell with docker exec -it db psql -U postgres.
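If you’d rather script this than watch the startup logs, something along these lines should work (a sketch, not from the original setup; pg_isready ships in the postgres image, and -d just runs the server in the background):

$ # Start the server detached, wait until it accepts connections, then open psql.
$ docker run --rm --name db -d postgres:9.6.9
$ until docker exec db pg_isready -U postgres; do sleep 1; done
$ docker exec -it db psql -U postgres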
And now, for the fun part:

postgres=# \timing
Timing is on.
postgres=# SELECT count(*) FROM generate_series(1, 100000) AS numbers1 JOIN generate_series(1, 100000) AS numbers2 on numbers1 = numbers2;
count
--------
100000
(1 row)
Time: 154.999 ms
postgres=# EXPLAIN ANALYZE SELECT count(*) FROM generate_series(1, 100000) AS numbers1 JOIN generate_series(1, 100000) AS numbers2 on numbers1 = numbers2;
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=212.16..212.17 rows=1 width=8) (actual time=7640.240..7640.246 rows=1 loops=1)
   ->  Merge Join  (cost=119.66..199.66 rows=5000 width=0) (actual time=2987.783..6887.173 rows=100000 loops=1)
         Merge Cond: (numbers1.numbers1 = numbers2.numbers2)
         ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=1477.798..2266.221 rows=100000 loops=1)
               Sort Key: numbers1.numbers1
               Sort Method: external sort  Disk: 1368kB
               ->  Function Scan on generate_series numbers1  (cost=0.00..10.00 rows=1000 width=4) (actual time=18.623..729.027 rows=100000 loops=1)
         ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=1509.904..2303.086 rows=100000 loops=1)
               Sort Key: numbers2.numbers2
               Sort Method: external sort  Disk: 1760kB
               ->  Function Scan on generate_series numbers2  (cost=0.00..10.00 rows=1000 width=4) (actual time=8.836..735.784 rows=100000 loops=1)
 Planning time: 0.247 ms
 Execution time: 7644.305 ms
(13 rows)
Time: 7646.616 ms

So the query itself ran in well under a second, but EXPLAIN ANALYZE on that same query took over 7 seconds. What gives?
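(Spoiler for the next section: the overhead comes from timing calls. One quick way to sanity-check that is EXPLAIN (ANALYZE, TIMING OFF), which still collects row counts but skips the per-row clock reads, so if timing really is the culprit it should run close to the raw query time:)

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM generate_series(1, 100000) AS numbers1 JOIN generate_series(1, 100000) AS numbers2 ON numbers1 = numbers2;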

“Timing”

The PostgreSQL documentation has a caveats section for EXPLAIN.

Second, the measurement overhead added by EXPLAIN ANALYZE can be significant, especially on machines with slow gettimeofday() operating-system calls. You can use the pg_test_timing tool to measure the overhead of timing on your system.

pg_test_timing, you say?

$ docker exec -it db pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 7353.94 nsec
Histogram of timing durations:
  < usec   % of total      count
       1      0.00000          0
       2      0.00000          0
       4      0.00000          0
       8     82.43007     336271
      16     14.80805      60409
      32      2.27505       9281
      64      0.25101       1024
     128      0.18336        748
     256      0.04020        164
     512      0.00907         37
    1024      0.00221          9
    2048      0.00074          3
    4096      0.00000          0
    8192      0.00025          1

Hm. Is that good?

Interpreting results
Good results will show most (>90%) individual timing calls take less than one microsecond. Average per loop overhead will be even lower, below 100 nanoseconds. This example from an Intel i7-860 system using a TSC clock source shows excellent performance:
Testing timing overhead for 3 seconds.
Per loop time including overhead: 35.96 nsec
Histogram of timing durations:
  < usec   % of total      count
       1     96.40465   80435604
       2      3.59518    2999652
       4      0.00015        126
       8      0.00002         13
      16      0.00000          2

So… no. Their results show 96% of calls taking under a microsecond, averaging 36 nanoseconds. My results show most calls taking 4–8 microseconds (with plenty taking longer), averaging 7.4 microseconds. But what’s a 200× difference between friends? Maybe it was just slow that one time because something else was hogging my CPU. Let’s try again.

$ docker exec -it db pg_test_timing
Testing timing overhead for 3 seconds.
Detected clock going backwards in time.
Time warp: -34612 microseconds

Well, that’s downright unfortunate. “Going backwards in time” is not something you want to see until you’ve very carefully prepared your chronosphere. pg_test_timing calls gettimeofday in a tight loop, and here one call returned a time 34 milliseconds earlier than the call before it.
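You can see the same symptom without pg_test_timing. Here’s a crude stand-in (my own sketch, far coarser than the real tool since it forks date on every iteration, but still enough to catch multi-millisecond jumps): read the clock in a loop and complain whenever it goes backwards.

$ docker exec db bash -c '
    prev=$(date +%s%N)
    for i in $(seq 1 2000); do
      now=$(date +%s%N)
      if [ "$now" -lt "$prev" ]; then
        echo "clock went backwards by $(( (prev - now) / 1000 )) microseconds"
      fi
      prev=$now
    done'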

So here’s where we’re at: something is causing PostgreSQL’s performance analysis to perform badly, and the time warps mean we can’t dig into it much further, because we don’t have a trustworthy clock to measure with. Let’s figure out why.

Time Warp

There’s a GitHub issue with 50+ comments about time drift between macOS and the Linux VM that runs all the Docker stuff. It is really bad when your clock goes backwards, like it did above. Docker’s fix for this is

Use Simple NTP to minimise clock drift between the VM and the host.

which works when we’re concerned about second-level resolution, but not when we’re talking about microseconds.

So we know there’s a clock sync problem. Do we care to fix it? We’re not doing anything that’s sensitive to microsecond-level precision (time-based authentication, etc.). It would be nice if our EXPLAIN ANALYZEs were super accurate, but mostly we just want them to run in a reasonable amount of time.

TSC vs. HPET

There’s another detail in the PostgreSQL docs that is interesting. Requoting with emphasis:

This example from an Intel i7-860 system using a TSC clock source shows excellent performance

What’s that?

This is the simplest explanation I could find:

The preferred clock source is the Time Stamp Counter (TSC), but if it is not available the High Precision Event Timer (HPET) is the second best option.

OK. What are we using?

$ docker run -it postgres:9.6.9 cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm
$ docker run -it postgres:9.6.9 cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

That doesn’t look good. HPET is our current clock source and TSC isn’t even an available clock source.
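On a normal Linux machine you could try switching back without a reboot by writing to that same sysfs file (from a privileged container here, since it changes the clock source for the whole VM). I wouldn’t expect it to help in this case, though: tsc isn’t in the available list at all anymore, so the kernel should simply reject the write.

$ docker run --privileged -it postgres:9.6.9 \
    sh -c 'echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource'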

Why aren’t we using TSC? The answer lies in the kernel, so things get a bit tricky. We need to search our Docker for Mac’s Linux VM’s kernel logs to see what happened to our clock source. If your Docker for Mac VM hasn’t been running very long (but long enough to experience enough drift), you can find the answer with
docker run --privileged -it postgres:9.6.9 dmesg | egrep '(tsc|hpet)'
Otherwise, you may have to go digging through other logs:
docker run --privileged -v /var/log:/logs -it postgres:9.6.9 egrep '(tsc|hpet)' /logs/kmsg.log.0
(This -v mount gives us /var/log from the Linux VM, not from macOS, because /var/log isn’t one of the host directories shared into the VM.) In any case, you should see something along the lines of

clocksource: timekeeping watchdog on CPU2: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'hpet' wd_now: eb454b8a wd_last: ea528dea mask: ffffffff
clocksource: 'tsc' cs_now: 2c195dabad88 cs_last: 2c0e63f53a4a mask: ffffffffffffffff
clocksource: Switched to clocksource hpet

So besides making pg_test_timing angry, the time warps were also making Linux switch from TSC to HPET. What if we just run our EXPLAIN ANALYZE before it’s marked unstable? Restarting Docker for Mac, we see

$ docker run -it postgres:9.6.9 cat /sys/devices/system/clocksource/clocksource0/{available,current}_clocksource
tsc hpet acpi_pm
tsc

and rerunning the first commands gives

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM generate_series(1, 100000) AS numbers1 JOIN generate_series(1, 100000) AS numbers2 on numbers1 = numbers2;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=212.16..212.17 rows=1 width=8) (actual time=184.858..184.858 rows=1 loops=1)
   ->  Merge Join  (cost=119.66..199.66 rows=5000 width=0) (actual time=128.732..179.602 rows=100000 loops=1)
         Merge Cond: (numbers1.numbers1 = numbers2.numbers2)
         ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=69.437..81.579 rows=100000 loops=1)
               Sort Key: numbers1.numbers1
               Sort Method: external sort  Disk: 1368kB
               ->  Function Scan on generate_series numbers1  (cost=0.00..10.00 rows=1000 width=4) (actual time=12.373..24.829 rows=100000 loops=1)
         ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=59.291..73.667 rows=100000 loops=1)
               Sort Key: numbers2.numbers2
               Sort Method: external sort  Disk: 1760kB
               ->  Function Scan on generate_series numbers2  (cost=0.00..10.00 rows=1000 width=4) (actual time=9.795..19.561 rows=100000 loops=1)
 Planning time: 0.229 ms
 Execution time: 188.242 ms

a much better result!

Summary

After running for a while, Docker for Mac experiences time drift, which causes the Linux VM’s clock source to switch from TSC to HPET, which makes PostgreSQL sad. My performance analysis tool, EXPLAIN ANALYZE, was running slowly enough that I had to analyze its performance, and the solution ended up being turning Docker for Mac off and on again.

At this point, you’re probably wondering if there’s a more permanent fix than “restart Docker for Mac every time the VM switches to HPET”. I am, too. I assume if this were easily fixable, Docker, Inc. would have shipped something less kludgy than an NTP sync.
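In the meantime, a cheap habit (my own suggestion, not an official fix): before trusting any EXPLAIN ANALYZE numbers, check which clock source the VM is on, and restart Docker for Mac if it has fallen back to hpet.

$ docker run --rm postgres:9.6.9 cat /sys/devices/system/clocksource/clocksource0/current_clocksource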

Was this worth the investigation time? The point wasn’t to save a minute waiting for EXPLAIN ANALYZE to finish. When the overhead of your performance analysis dominates the time spent, it’s hard to trust the resulting analysis: if it says some section of the query took 1.23 seconds, how much of that 1.23 seconds is just gettimeofday() being slow? So I spent the hour not to speed up my performance analysis, but to make the tool usable at all.
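For the toy query above, a rough back-of-envelope answer to that question (my own estimate, assuming the instrumentation reads the clock about twice per row per plan node): five plan nodes each emit 100,000 rows, which is on the order of a million clock reads, and at ~7.4 microseconds per read that accounts for most of the 7.5-second gap between the raw query and EXPLAIN ANALYZE.

$ # ~5 nodes x 100,000 rows x 2 clock reads x 7.4 µs, expressed in milliseconds:
$ echo $(( 5 * 100000 * 2 * 74 / 10000 )) ms
7400 ms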

As always, if you’re interested in working on interesting technical challenges or solving problems in the life science industry, Benchling is hiring.

Discuss on Hacker News

Thanks to Josh, Somak, and Vineet for reading drafts of this.

