Using Uber's Pyflame and Logs to Tackle Scaling Issues

Here I was again, looking at my screen in complete disbelief.

This time, it was different though. My distributed web crawler seemed to be slowing down over time. Adding more nodes only gave a temporary performance boost; the overall crawling speed gradually declined afterwards. Simply put, it couldn't scale. But why?

In this post, you'll find out what techniques and tools I used to diagnose scaling issues - and to an extent, more general performance issues - in my Python-based web crawler.

The problem

First off, I had designed my crawler as follows:

The main components were:

  1. A crawler dispatcher, responsible for dispatching URLs to be crawled to the N crawler supervisors.
  2. N crawler supervisors, responsible for supervising N child processes (crawlers). Those child processes performed the actual crawling.
  3. Two MongoDB servers. The first one stored the initial seed URLs as well as the crawled HTML data; the second one stored robots.txt files and timestamps indicating when a domain was last crawled.

Additionally, all interprocess communication across servers happened through ZeroMQ queues. So in theory, it would be easily scalable: I could add more supervisors and expect performance to increase linearly, or so I thought.
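To make the dispatch path concrete, here's a minimal sketch of how a dispatcher could push URLs to supervisors over ZeroMQ. It assumes a PUSH/PULL pattern and invents the port number and the dispatch_urls helper for illustration; the actual implementation differed in its details.

import zmq

# Hypothetical dispatcher side: a PUSH socket fans URLs out to
# whichever supervisors are connected with PULL sockets.
context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.bind("tcp://*:5557")  # port number is an assumption

def dispatch_urls(urls):
    for url in urls:
        sender.send_json({"url": url})

# Supervisor side (one PULL socket per supervisor node):
# receiver = context.socket(zmq.PULL)
# receiver.connect("tcp://dispatcher:5557")
# task = receiver.recv_json()

With PUSH/PULL, ZeroMQ round-robins messages across connected peers, which is why adding supervisors looks trivially scalable on paper.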

But as the crawl went on, I discovered a curious thing: the overall crawl rate was actually decreasing over time, and the effect was even more pronounced when I added more nodes. Here's an overview with 2 supervisors:

And here's what it looked like with 4 supervisors:

What the hell! There was such a clear trend that it couldn't be just a coincidence. Something was putting the brakes on my crawler. Big time! And I absolutely needed to determine where the bottleneck was; otherwise, my crawler would run forever, instead of finishing in a few days or weeks.

Attempt #1: Analyzing my application's CPU and memory use

So first of all, I looked at the CPU and memory use of the dispatcher node. That way, I would know whether a lack of resources was a contributing factor.

[acme@dispatcher ~]$ dstat -tcmdsn -N eth0  --top-cpu-adv --top-mem
----system---- ----total-cpu-usage---- ------memory-usage----- -dsk/total- ----swap--- --net/eth0- -------most-expensive-cpu-process------- --most-expensive-
     time     |usr sys idl wai hiq siq| used  buff  cach  free| read  writ| used  free| recv  send|process              pid  cpu read write|  memory process
19-12 20:30:15| 25   4  71   0   0   0| 268M 5644k  197M 18.7M|   0   176k|  34M  466M|  18k   19k|dispatcher           5145  32%8355B 385k|dispatcher 155M  
19-12 20:30:16| 22   2  75   0   0   0| 268M 5652k  197M 18.7M|   0    48k|  34M  466M|9622B   18k|dispatcher           5145  26%8013B 340k|dispatcher 155M  
19-12 20:30:17| 28   4  68   0   0   0| 268M 5652k  197M 18.7M|   0  4096B|  34M  466M|  12k   21k|dispatcher           5145  34%9983B 283k|dispatcher 155M  
19-12 20:30:18| 25   4  71   0   0   0| 268M 5652k  197M 18.7M|   0  4096B|  34M  466M|  15k   14k|dispatcher           5145  29%  21k 303k|dispatcher 155M  
19-12 20:30:19| 28   3  68   0   0   1| 268M 5652k  197M 18.7M|   0     0 |  34M  466M|  27k   18k|dispatcher           5145  33%8955B 199k|dispatcher 155M  
19-12 20:30:20| 29   4  60   5   0   1| 268M 5652k  197M 18.7M|   0  3800k|  34M  466M|  16k   17k|dispatcher           5145  30%  11k 334k|dispatcher 155M  
19-12 20:30:21| 22   5  73   0   0   0| 268M 5660k  197M 18.7M|   0    48k|  34M  466M|  11k   16k|dispatcher           5145  29%  12k 322k|dispatcher 155M  
19-12 20:30:22| 24   3  72   0   0   1| 268M 5660k  197M 18.7M|   0     0 |  34M  466M|  13k   15k|dispatcher           5145  28%  12k 245k|dispatcher 155M  
19-12 20:30:23| 24   4  72   0   0   0| 268M 5660k  197M 18.7M|   0  4096B|  34M  466M|  10k   15k|dispatcher           5145  28%7097B 315k|dispatcher 155M  
19-12 20:30:24| 26   4  71   0   0   0| 268M 5660k  197M 18.7M|   0  4096B|  34M  466M|  15k   18k|dispatcher           5145  32%  12k 320k|dispatcher 155M  
...

And here are the same stats for one of the supervisor nodes:

[acme@supervisor-1 ~]$ dstat -tcmdsn -N eth0  --top-cpu-adv --top-mem
----system---- ----total-cpu-usage---- ------memory-usage----- -dsk/total- ----swap--- --net/eth0- -------most-expensive-cpu-process------- --most-expensive-
     time     |usr sys idl wai hiq siq| used  buff  cach  free| read  writ| used  free| recv  send|process              pid  cpu read write|  memory process
20-12 20:33:35|  8   2  90   0   0   0| 466M  296k 15.5M 7956k|   0    40k|  22M  478M|  10k 5357B|crawlers/management/cr2638 9.0% 954B1860B|crawlers/man65.0M  
20-12 20:33:36| 13   2  84   0   0   0| 465M  296k 15.5M 7988k|   0     0 |  22M  478M|  15k   10k|crawlers/management/cr2658 9.0%3578B4614B|crawlers/man65.0M  
20-12 20:33:37|  7   2  90   0   0   1| 465M  296k 15.5M 7992k|   0     0 |  22M  478M|3761B 3350B|crawlers/management/cr2638 9.0%   8B1945B|crawlers/man65.0M  
20-12 20:33:38| 10   3  87   0   0   0| 465M  296k 15.5M 7992k|   0     0 |  22M  478M|5596B   11k|crawlers/management/cr2638 9.0% 115B1454B|crawlers/man65.0M  
20-12 20:33:39| 10   2  88   0   0   0| 465M  296k 15.5M 7992k|   0    40k|  22M  478M|6360B 6489B|crawlers/management/cr2638 9.0%   8B1802B|crawlers/man65.0M  
20-12 20:33:40| 21   4  74   0   0   0| 466M  304k 15.5M 7812k|   0    44k|  22M  478M|  25k   19k|crawlers/management/cr2658  10%2856B3903B|crawlers/man65.0M  
20-12 20:33:41| 39   2  59   0   0   0| 466M  304k 15.5M 7824k|   0     0 |  22M  478M|  55k   19k|crawlers/management/cr2658  18%9145B4954B|crawlers/man65.0M  
20-12 20:33:42| 19   3  78   0   0   0| 466M  304k 15.6M 7840k|   0     0 |  22M  478M|  27k   14k|crawlers/management/cr2657  12%4302B4217B|crawlers/man65.0M  
20-12 20:33:43| 14   2  84   0   0   0| 466M  304k 15.6M 7784k|   0     0 |  22M  478M|  20k   14k|crawlers/management/cr2638  10%1650B2963B|crawlers/man65.0M  
20-12 20:33:44| 24   3  73   0   0   0| 466M  304k 15.6M 7796k|   0     0 |  22M  478M|  56k   11k|crawlers/management/cr2658  17%  37k3771B|crawlers/man65.0M  
...

In both cases, CPU and memory use were under control, so they couldn't explain the slowdown I was observing.

Attempt #2: Analyzing my application's logs

Then I looked at my application's logs1. Analyzing how frequently a few strategic lines appeared would reveal whether any particular operation was slowing down over time.

Here's an excerpt of a log file generated by an individual crawler child process2:

2016-12-20 14:29:19,244 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.perdu.com  
2016-12-20 14:29:20,086 - requests.packages.urllib3.connectionpool - DEBUG - "GET /robots.txt HTTP/1.1" 200 None  
2016-12-20 14:29:20,135 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://www.perdu.com/' will be crawled.  
2016-12-20 14:29:20,136 - crawlers.web.standalone_web_crawler - DEBUG - Crawling URL 'http://www.perdu.com/'.  
...
2016-12-20 14:29:22,239 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.theansweris42.com  
2016-12-20 14:29:23,103 - requests.packages.urllib3.connectionpool - DEBUG - "GET /robots.txt HTTP/1.1" 404 315  
2016-12-20 14:29:23,137 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://www.theansweris42.com/my' will be crawled.  
2016-12-20 14:29:23,137 - crawlers.web.standalone_web_crawler - DEBUG - Crawling URL 'http://www.theansweris42.com/my'.  
2016-12-20 14:29:23,139 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.theansweris42.com  
2016-12-20 14:29:24,072 - requests.packages.urllib3.connectionpool - DEBUG - "HEAD /my HTTP/1.1" 404 0  
2016-12-20 14:29:24,074 - crawlers.web.standalone_web_crawler - ERROR - Could not retrieve content for URL 'http://www.theansweris42.com/my'.  
2016-12-20 14:29:24,099 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://maddox.xmission.com/' was postponed.  
...
2016-12-20 14:29:31,368 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): wikipedia.org  
2016-12-20 14:29:31,439 - crawlers.web.throttling.url_action_manager - ERROR - Unable to determine if the URL 'http://wikipedia.org/' is allowed to be crawled.  
...
2016-12-20 14:29:31,524 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://wikipedia.org/' was ignored.
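To quantify this, a small script can bucket a strategic line by minute and print the counts, so a slowdown shows up as a declining trend. Below is a minimal sketch; the log file name and the marker string are assumptions made for illustration.

from collections import Counter
from datetime import datetime

MARKER = "will be crawled"  # the strategic line to track (assumption)

counts = Counter()
with open("crawler.log") as f:  # log file name is an assumption
    for line in f:
        if MARKER in line:
            # Timestamps look like: 2016-12-20 14:29:20,135
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            counts[ts.strftime("%Y-%m-%d %H:%M")] += 1

for minute, count in sorted(counts.items()):
    print(minute, count)

If the per-minute count of crawled URLs drops steadily, the logs confirm the slowdown and hint at which operation to time next.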