From 512kb.club to GTmetrix API library

How reading one habr.com blog post set off a chain of events that led to me writing good documentation for a pet project

512kb.club

Via a habr.com blog post (ru) I learned about a number of websites which aim to promote caring about the size of websites by listing and praising those which fit within some limit (2Mb, 1Mb, 512kb, 128kb, etc.). The most appropriate one for my blog was 512kb.club, which lists sites where the uncompressed size of the main page together with all its assets fits within 512kb, with a further separation into "teams" of sites under 100kb, 250kb, and 512kb.

Lazier lazyblog

At that time, the main page of this website was just a bit over 100kb, which put me in the orange team instead of the (slightly) more prestigious green team. Hence, the time had come for some optimisation.

It is worth mentioning that lazyblog was initially designed with performance in mind, not size. One of the optimisations which sacrificed page size (made it bigger) in favour of performance (made it faster) was storing the data for each blog post (title, intro, dates) twice in the index page. The reason was that back in Dec 2016 I found that the fastest way for JavaScript to get information about each blog post (for the purposes of tags, search and sorting) was to split a single string containing all this data into parts, like this:

var object=document.querySelector('…');
var string=object.innerHTML;
var data=string.split('\n');

instead of any other method that parsed the HTML one way or another (picking the title, description, tags and dates with individual document.querySelector('…') calls was the slowest one in my test).

At the same time I didn't want to lose compatibility with visitors who have JavaScript disabled, so I had to keep all the same data in the HTML code. Moreover, this optimisation came rather late in lazyblog development; otherwise I would probably have done it differently (maybe with a single <script> tag holding one big JSON with the data for all blog posts).

However, after several years of using the website, I realised that data which is needed only for search is not needed until a search is actually performed. :) Moreover, search is performed rather rarely, and almost never during the initial load.

Hence, a proper way of loading data is:

Only the data needed for initial page rendering (tags and creation date) is loaded as fast as possible during the first load.
Data needed for search (titles and descriptions) is loaded later. Since it doesn't block the initial page load, it can be loaded using a slower method.
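
The site itself does this in JavaScript. Purely as an illustration of the idea, here is a minimal Python sketch of the same eager/lazy split. The class name `BlogIndex`, the `|` field separator and the `load_search_data` callable are all assumptions made up for this example, not lazyblog's actual code.

```python
from functools import cached_property


class BlogIndex:
    """Holds data for all posts; search data is loaded only on first search."""

    def __init__(self, fast_blob, load_search_data):
        # Eager part: tags and creation dates, one post per line,
        # parsed with a plain split (the fast path needed for first render).
        # The "|" separator is an assumption for this sketch.
        self.posts = [line.split("|") for line in fast_blob.split("\n")]
        # Lazy part: keep only a callable; nothing is loaded yet.
        self._load_search_data = load_search_data

    @cached_property
    def search_data(self):
        # Runs once, on first access; the result is cached on the instance.
        return self._load_search_data()

    def search(self, word):
        word = word.lower()
        return [title for title, description in self.search_data
                if word in title.lower() or word in description.lower()]
```

The loader can be as slow as it likes: it is never called unless somebody actually searches, and then only once.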

This way lazyblog became even more lazy - now it doesn't load search data until it's really necessary!

If you are interested, you can see the commits in the lazyblog repo and in this blog's repo.

Site size checker script

On the 512kb.club F.A.Q. page and in their issue tracker on GitHub they mention that it's rather hard for them to periodically recheck the sizes of all the sites.

Having some experience with Webdriver, I naturally decided to help.

The first version of my script started Google Chrome, opened the website in it, and measured the total uncompressed size of all received resources (note that "uncompressed" here means only un-gzipping, not decoding JPEGs to BMPs).
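
To make "uncompressed" concrete, here is a minimal sketch of the counting rule only, not of the browser automation. It assumes the responses have already been captured as `(body_bytes, content_encoding)` pairs, which is a hypothetical shape chosen for this example.

```python
import gzip


def uncompressed_size(responses):
    """Sum body sizes after undoing transfer compression only.

    `responses` is an iterable of (body_bytes, content_encoding) pairs,
    a hypothetical shape chosen for this sketch.
    """
    total = 0
    for body, encoding in responses:
        if encoding == "gzip":
            # Un-gzip the transferred bytes; image formats are left alone.
            body = gzip.decompress(body)
        total += len(body)
    return total
```

A gzipped HTML page counts with its full decompressed length, while a JPEG counts with exactly the bytes that came over the wire.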

However, the values it produced for various websites didn't always match those measured by the GTmetrix website, which 512kb.club uses as a reference. After a few ~~hours~~ days of debugging, it turned out to be quite hard to match them exactly. As a minimum, one should:

disable any system-wide adblocker
use the same browser window size (because some images might get loaded only when they are close to the visible area of the page, some CSS styles are loaded only at specific page widths, etc.)
restart the browser between sites in order to avoid reusing cached resources

Hence, a different script was written, which used the GTmetrix API. Basically, instead of measuring the site size itself, it asked GTmetrix to do it and just presented the number. Nothing really fancy.

GTmetrix API library

When writing that script, I used the Python GTmetrix API library, but didn't really like it. First, it rejected my email address, which was perfectly valid for the GTmetrix API itself. Second, it hung when asked to test a non-existent website (or, rather, any webpage for which the GTmetrix test ended with an error). And lastly, it looked unmaintained: at least, none of my PRs got any reaction.

Hence, a new library was born. It also became the implementation of my long-planned idea to make a "good code" project, unlike my previous hobby projects, where I tried to add new features as fast as possible without thinking about making the code look good or spending time on refactoring. Here I don't hesitate to stop, rethink, refactor, or even rename some public classes (I can afford that while the library is not used by anyone else).

The first goal of the "good code" project (and the first attribute of good code) was 100% code coverage. To achieve this in an API library we need, well, to mock the API. For that, the pytest-httpserver library comes in handy: basically, it's a fixture for pytest which spawns a helper HTTP server and lets you specify which requests this server should expect and which responses it should reply with. Exactly what we need for testing the API!

By the way, this is called dependency injection. I don't actually expect anyone to reimplement the GTmetrix API, but for the purpose of testing, the API endpoint is passed as an optional argument to the library constructor: normal users will just ignore it and use the official API endpoint, while unit tests can use the mock HTTP server instead. The same goes for the time.sleep function: where real users wait for a few seconds between retries of a request (for example, while waiting for a test to complete), unit tests can just log that Python was instructed to sleep for N seconds and keep running at full speed. To make things easier, this "logging" is currently implemented by sending HTTP requests to the same mock server.
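
A minimal sketch of this kind of constructor injection might look as follows. Everything named here is an assumption for illustration: the `Client` class, the placeholder `DEFAULT_ENDPOINT`, the `tests/<id>` path and the `"state"` field are not taken from the actual library. The sketch also injects the HTTP transport (`get_json`) so that it can run without any network, and it records sleeps in a plain list rather than on a mock server.

```python
import json
import time
import urllib.request

# Placeholder value for this sketch, not the real API URL.
DEFAULT_ENDPOINT = "https://api.example.invalid/"


def _http_get_json(url):
    # Default transport: a plain HTTP GET that decodes a JSON body.
    with urllib.request.urlopen(url) as response:
        return json.load(response)


class Client:
    def __init__(self, endpoint=DEFAULT_ENDPOINT, sleep=time.sleep,
                 get_json=_http_get_json):
        # Normal callers pass nothing; tests inject replacements.
        self.endpoint = endpoint.rstrip("/") + "/"
        self._sleep = sleep
        self._get_json = get_json

    def wait_for(self, test_id, delay=3, max_attempts=10):
        """Poll until the (hypothetical) "state" field says "completed"."""
        url = f"{self.endpoint}tests/{test_id}"
        for _ in range(max_attempts):
            data = self._get_json(url)
            if data.get("state") == "completed":
                return data
            self._sleep(delay)  # real code waits; tests only record the call
        raise TimeoutError(f"test {test_id} did not complete")
```

In a test, passing `sleep=recorded.append` with `recorded = []` turns every wait into a list entry, so the test can assert how long the code would have slept without actually waiting.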

It is worth noting that having 100% code coverage (measured by lines) in unit tests doesn't protect you from all bugs. First, there are various code coverage metrics (line coverage, branch coverage, expression coverage, etc.), and full coverage by one does not imply full coverage by the others. Second, passing unit tests only means that the code of your program matches the expectations in your unit tests; it doesn't guarantee that it matches reality. For example, I once misspelled the name of a GTmetrix API endpoint: I wrote "report" (singular) instead of "reports" (plural). I did it both in the program code (which made requests to the "report" endpoint) and in the tests (which expected to receive requests to the "report" endpoint). Hence, the tests were passing, but when I tried to use the library with the real API, the server responded with a "you made a wrong request" error code. Hence, another kind of test is needed: integration tests.
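
The mirrored typo can be boiled down to a few lines. This is a toy reconstruction, not the library's code: `report_path`, the path layout and the `SERVER_RESOURCES` set (a stand-in for "what the real service accepts") are all assumptions for the example.

```python
# Assumption for this sketch: the resources the real service knows about.
SERVER_RESOURCES = {"reports"}


def report_path(report_id):
    # The typo under discussion: singular "report" instead of "reports".
    return f"/report/{report_id}"


def unit_test():
    # The test was written with the same typo, so it agrees with the code.
    return report_path("abc") == "/report/abc"


def integration_test():
    # Stand-in for a call to the real API: only known resources are accepted.
    resource = report_path("abc").split("/")[1]
    return resource in SERVER_RESOURCES
```

Here `unit_test()` returns `True` while `integration_test()` returns `False`: every line is covered and the unit test is green, yet the code disagrees with the service it talks to.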

The second goal of the "good code" project is error checking. Unlike my other hobby projects, which follow the "garbage in, garbage out" principle and may produce unexpected results or crash in unexpected ways when given unexpected inputs, this library carefully checks all data received from the GTmetrix API and raises an exception explaining what was unexpected.
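
As a rough sketch of what such checking can look like: the exception name `UnexpectedResponse`, the function `parse_test` and the field names `data`, `id` and `state` below are hypothetical, picked only to show the pattern of validating each step before using it.

```python
class UnexpectedResponse(Exception):
    """Raised when a reply does not have the shape the caller relies on."""


def parse_test(reply):
    # Field names ("data", "id", "state") are hypothetical, for illustration.
    if not isinstance(reply, dict):
        raise UnexpectedResponse(
            f"expected a JSON object, got {type(reply).__name__}")
    data = reply.get("data")
    if not isinstance(data, dict):
        raise UnexpectedResponse('"data" is missing or is not an object')
    for key in ("id", "state"):
        if not isinstance(data.get(key), str):
            raise UnexpectedResponse(
                f'"data.{key}" is missing or is not a string')
    return data["id"], data["state"]
```

The point is that a malformed reply fails early with a message naming the offending field, instead of surfacing later as a `KeyError` or `TypeError` somewhere unrelated.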

And lastly, the third goal is good documentation. I'm trying to keep as much of the documentation as possible in docstrings (which are part of the source code), hoping that it never gets out of sync with the actual implementation, and I use some Sphinx autodoc magic to convert them into a nice website.
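
For readers who haven't seen this style, here is a small example of a docstring written with the reStructuredText field lists that Sphinx's `sphinx.ext.autodoc` extension can render (for instance via an `autofunction` directive). The function `fits_limit` is invented for this example and is not part of the library; treating 1 kB as 1024 bytes is also an assumption.

```python
def fits_limit(size_bytes, limit_kb=512):
    """Tell whether a page fits within a size limit.

    :param int size_bytes: uncompressed size of the page and its assets,
        in bytes.
    :param int limit_kb: limit in kilobytes; 1 kB is taken as 1024 bytes
        here, which is an assumption of this example.
    :returns: ``True`` if the page fits within the limit.
    :rtype: bool
    :raises ValueError: if ``size_bytes`` is negative.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    return size_bytes <= limit_kb * 1024
```

Because the description sits right above the code it documents, a change to the behaviour and a change to the docs naturally end up in the same diff.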

Conclusion

That's the weird sequence of events which led from a random habr.com blog post to me writing a Python library and registering on the PyPI and Read the Docs websites. I wonder if it ends here or if there will be something more?