I've been spending a lot of time in the web-scraping world these days, and I've had pretty great success parsing out good content from about 90% of the websites I come across. The other 10% are the result of a newfangled technology called AJAX that is sweeping across the web. AJAX lets web developers send a very minimal HTML page for the initial request and then deliver the rest of the actual content after the fact. The technique is quickly gaining traction because load times are slower on mobile devices, and oftentimes getting even the skeleton of a page will keep users from losing interest.
The solution to my AJAX problem is to find some way to actually let the javascript (AJAX) run before I start parsing the HTML. That is impossible with cURL alone - cURL is, very simply, an HTTP client. It has no browser functionality, aside from the fact that it knows the HTTP standards up and down and is pretty quick at making requests. The common solution is to borrow the engines that power modern browsers to run the javascript, and parse afterwards. The most popular library for doing this today is PhantomJS.
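To give a rough idea of what that looks like in practice, here's a minimal sketch (not the exact script I use for scraping) that drives PhantomJS from the shell and dumps a page's rendered HTML. The render.js filename, the two-second wait, and the example URL are just placeholders:
[code]#!/bin/bash
# Minimal sketch: render a page with PhantomJS so its javascript runs
# before we parse the HTML. Filename, wait time, and URL are placeholders.
cat > render.js <<'EOF'
// PhantomJS script: open the page, give AJAX a moment to finish, dump the DOM
var page = require('webpage').create();
var url  = require('system').args[1];
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
    } else {
        window.setTimeout(function () {
            console.log(page.content);  // the fully rendered markup, not the skeleton
            phantom.exit(0);
        }, 2000);                       // crude two-second wait for asynchronous requests
    }
});
EOF

phantomjs render.js "http://example.com" > rendered.html[/code]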
Truth be told, PhantomJS is actually a wonderful product. I've used other packages built on top of WebKit before and usually ended up with countless errors and warnings, and often got incorrect results. The only thing holding me back from completely stripping cURL out of my scraping solutions was making sure my load times wouldn't skyrocket because of PhantomJS. I do a lot of scraping, and about 60% of my load time is already dedicated to the cURL requests - I would hate to have that double and make my system nearly unusable. So, I wrote a benchmarking utility to compare PhantomJS to cURL. Set up the utility from the gist above, and then simply run the command as follows:
[code]./benchmark <url> <# of times to load>[/code]
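The gist has the full details, but roughly speaking such a utility boils down to something like the following simplified sketch (not the actual gist; it assumes a render.js along the lines of the earlier snippet):
[code]#!/bin/bash
# Simplified sketch of a PhantomJS vs. cURL benchmark (not the actual gist).
# Usage: ./benchmark <url> <# of times to load>
url=$1
count=$2

phantom_start=$(date +%s)
for ((i = 0; i < count; i++)); do
    phantomjs render.js "$url" > /dev/null   # render.js as sketched earlier
done
phantom_end=$(date +%s)

curl_start=$(date +%s)
for ((i = 0; i < count; i++)); do
    curl -s "$url" > /dev/null
done
curl_end=$(date +%s)

echo "PhantomJS: $((phantom_end - phantom_start)) s total"
echo "cURL:      $((curl_end - curl_start)) s total"[/code]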
Note: Be very careful with higher load counts (the second parameter). This utility will actually make double that number of requests - half for PhantomJS, half for cURL. Depending on your ISP as well as the host that you are checking, this may be flagged as a DoS attack.
The obvious first choice is to simply run this utility with a large load count and see how the numbers stack up as it scales. In order to avoid looking like a DoS attack, I decided to run the utility against my local router with 100 requests. My results are below:
Local Router 100 Request Benchmark
Trial # | PhantomJS Load Time (s) | cURL Load Time (s)
--------|-------------------------|--------------------
1       | 36                      | 3
2       | 30                      | 2
3       | 26                      | 1
Average | 30.666                  | 2
On average, every PhantomJS request took 15.333 times longer than a regular cURL request.
Well, that's pretty unfortunate - I guess there's no way I will be using PhantomJS. However, something seems wrong with those results. There's no way PhantomJS could be as popular as it is with its requests taking 15x longer than cURL - hell, I could almost do it manually in that much time! This test also isn't a reflection of the real world: I'm not going to be scraping things that are on the same LAN as me, I'm going to be making requests to a hostname instead of an IP address, and I'm going to be making so few requests at a time that real data needs to be measured in milliseconds, not seconds.
We can make a little modification to our script to use milliseconds by changing all of the date commands to:
[code]date +%s%N | cut -b1-13[/code]
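For context, a millisecond timing block built on that command looks something like this (GNU date; the variable names are just illustrative):
[code]start=$(date +%s%N | cut -b1-13)   # epoch time, first 13 digits = milliseconds
curl -s "$url" > /dev/null
end=$(date +%s%N | cut -b1-13)
echo "Request took $((end - start)) ms"[/code]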
Now we just need to find a site that loads content via javascript. I will be using Gawker, because they've been a particularly nasty thorn in my side with this javascript problem. I will also be dropping my request count to 10 to avoid any DoS problems (I tried 20, but cURL kept hanging for some reason). I also decided to move to a different LAN so that my available bandwidth didn't fluctuate based on other people on the network. The results are below:
Gawker 10 Request Benchmark
Trial # | PhantomJS Load Time (ms) | cURL Load Time (ms)
--------|--------------------------|---------------------
1       | 78810                    | 31590
2       | 62050                    | 2308
3       | 27397                    | 12858
Average | 56085.666                | 15585.333
On average, every PhantomJS request took 3.599 times longer than a regular cURL request.
That's a significantly better result! I never expected PhantomJS to be quite as quick as cURL, so only 3.6x longer really isn't that bad. Especially on a site that loads almost 100% of its content via javascript, I'd call that a good result. However, I'm still a bit uneasy about the idea of multiplying all of my load times by 3, so there are a couple more tests I would like to run.
The first test is a site that doesn't load any content via javascript, but has a fair amount of javascript built into the actual design of the site. This should speed things up because PhantomJS won't be making any extra HTTP (AJAX) requests, though it will still have some processing to do. For this one I will be running the test against my portfolio homepage.
WegnerDesign 10 Request Benchmark
Trial # | PhantomJS Load Time (ms) | cURL Load Time (ms)
--------|--------------------------|---------------------
1       | 62585                    | 10353
2       | 64234                    | 10132
3       | 59841                    | 14073
Average | 62220                    | 11519.333
On average, every PhantomJS request took 5.401 times longer than a regular cURL request.
The final test will be on a site that pretty much runs no javascript at all, to see if PhantomJS tacks much time on top of a regular HTTP request. I will be using the Google homepage for this test.
Google 10 Request Benchmark
Trial # | PhantomJS Load Time (ms) | cURL Load Time (ms)
--------|--------------------------|---------------------
1       | 7435                     | 1577
2       | 11497                    | 1673
3       | 7145                     | 1433
Average | 8692.333                 | 1561
On average, every PhantomJS request took 5.568 times longer than a regular cURL request.
Those are some odd results - I never expected that PhantomJS would start performing relatively worse than cURL as the amount of javascript decreased. I can't think of any logical reason for that to happen - my guess is that it's some sort of random fluctuation that would even itself out if I ran significantly more trials. One way or another, though, it seems that a best-case scenario is to expect PhantomJS to take three times longer than cURL. For me, 3x longer is probably going to be a deal breaker due to my high load, but it's not a terrible stat and may work perfectly well in other scenarios.
One of our readers, Ariya, pointed out in the comments that these results don't exactly reflect what I was trying to test in this post. For my own uses, I intentionally left load-images and load-plugins on, but for a barebones benchmark those will just skew the load times. I've redone the tests, and have posted the drastically different results in a new post. Thanks Ariya!
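For anyone curious, those settings are ordinary PhantomJS command-line switches, so a bare-bones run would look something like this:
[code]phantomjs --load-images=false --load-plugins=false render.js "$url" > /dev/null[/code]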