Last month we had a brief discussion on debian-devel about what images would be good to have for lenny. We’re apparently up to about 30 CDs or 4 DVDs per architecture, which over 12 architectures adds up to about 430GB in total. That’s a lot, given it’s only one release, and meanwhile the entire Debian archive is only 324GB.

The obvious way to avoid that is to make use of jigdo, which lets you recreate an iso from a small template and the existing Debian mirror network. I’ve personally never used jigdo much, partly because I don’t usually use isos anyway, and partly because the few times I have tried jigdo it always seemed unnecessarily slow. So the other day I tried writing my own jigdo download tool, focussed on making sure it was as fast as possible.

The official jigdo download tool, to the best of my knowledge, is jigdo-lite, which you give a .jigdo file and the URL of a local mirror. It then downloads the first ten files using wget, and once they’re all downloaded, it calls jigdo-file to get them merged into the output image. This gets repeated until all the files have been downloaded.

By doing the download in sequence like this, you miss out on using your full network connection in two ways: one during the connection setup latency when starting to download the next package, and also while jigdo-lite stops downloading to run jigdo-file. And if you’ve got a fast download link, but a slower CPU or disk, you can also find yourself constrained in that you’re maxing those out while running jigdo-file, but leaving them more or less idle while downloading.

To avoid this, you want to do multiple things at once: most importantly, to be writing data to the image at the same time as you’re downloading more data. With jigdodl (the name I’ve given to my little program), I went a little bit overboard, and made it not only do that, but also manage four simultaneous downloads and the decompression of the raw data from the template. That’s partly due to not being entirely sure what needed to be done to get a speedy jigdo program, and partly because the communicate module I’d just written to deal with this sort of parallelism made that somewhat natural.
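To make the idea concrete, here is a minimal sketch of overlapping downloads with image writes. It is not jigdodl's actual code: it uses Python's standard threading, queue and concurrent.futures modules rather than the communicate module, it leaves out the decompression of template data, and the names assemble, fetch and parts are hypothetical. The "downloads" are simulated from an in-memory dictionary so the example runs without a network.

```python
import io
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def assemble(parts, fetch, image, workers=4):
    """Fetch parts concurrently while a separate thread writes them out.

    parts  -- list of (offset, name) pairs (hypothetical layout)
    fetch  -- callable taking a name and returning that part's bytes
    image  -- seekable, writable binary file object
    """
    ready = queue.Queue()

    def download(part):
        offset, name = part
        ready.put((offset, fetch(name)))

    def write_loop():
        # Write each part at its offset as soon as it arrives,
        # in whatever order the downloads happen to finish.
        while True:
            item = ready.get()
            if item is None:  # sentinel: nothing more will arrive
                return
            offset, data = item
            image.seek(offset)
            image.write(data)

    writer = threading.Thread(target=write_loop)
    writer.start()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces iteration so any download error is re-raised here.
            list(pool.map(download, parts))
    finally:
        ready.put(None)
        writer.join()


# Simulated mirror: part name -> contents (made-up data).
blobs = {"a": b"AAAA", "b": b"BB", "c": b"CCC"}
image = io.BytesIO()
assemble([(0, "a"), (4, "b"), (6, "c")], blobs.__getitem__, image)
```

The writer never waits for a whole batch to finish, and the downloaders never wait for the writer, which is the property the batch-at-a-time approach lacks.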

In the end, it works: from wireless over ADSL to my ISP’s Debian mirror, I get the following output:

Jigsaw download:
Filename: debian-40r3-amd64-CD-1.iso
Length:   675477504
MD5sum:   d3924cdaceeb6a3706a6e2136e5cfab2
Total: 679 s; d/l: 586 MB at 883 kB/s; dump: 57 MB at 57 MB/s

Finished!


which is only slightly short of maxing out my downstream bandwidth, taking a total of about 11m20s. Running jigdodl with a closer mirror works pretty well too, though evidently some of my more recent changes weren’t so great, because I’ve gone from 9153 kB/s on a 100 Mbps link down to 7131 kB/s or lower. The CPU usage also seems a bit high, hovering between five and ten percent at 900 kB/s.

For comparison, running jigdo-lite on the same file took 17m41s, which is about 566 kB/s, with the overhead being about 6m20s. What that means is if I doubled my bandwidth to about 20Mbps, jigdodl would halve its time for the download to about 5m40s, while jigdo-lite would still have about the same non-download overhead, and thus take about 12m00s, which is still 68% of its original time. Going from 10Mbps ADSL speed to 100Mbps LAN gets jigdodl down to 1m31s (13% of the time, with optimal being 10%), while jigdo-lite would be expected to still be about 7m51s (44% of its original time).
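The estimates above come from a very simple model: jigdo-lite's time is taken to be the pure download time plus a fixed overhead of about 6m20s that does not shrink with bandwidth. The sketch below just redoes that arithmetic; the function names are made up for illustration, and the results may differ by a few seconds or a percentage point from the rounded figures quoted in the text.

```python
def fmt(seconds):
    """Format a duration in seconds as e.g. '12m00s'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"


JIGDODL_ADSL = 11 * 60 + 20     # measured jigdodl run over ADSL, about 680 s
JIGDO_LITE_ADSL = 17 * 60 + 41  # measured jigdo-lite run over ADSL, 1061 s
OVERHEAD = 6 * 60 + 20          # jigdo-lite's non-download overhead, about 380 s


def jigdo_lite_estimate(download_seconds):
    """Assumed model: download time plus a bandwidth-independent overhead."""
    return download_seconds + OVERHEAD


def percent_of_original(seconds):
    return round(100 * seconds / JIGDO_LITE_ADSL)


doubled = jigdo_lite_estimate(JIGDODL_ADSL / 2)  # bandwidth doubled
lan = jigdo_lite_estimate(1 * 60 + 31)           # jigdodl's measured 1m31s on the LAN

print("doubled bandwidth:", fmt(doubled), f"({percent_of_original(doubled)}%)")
print("100Mbps LAN:      ", fmt(lan), f"({percent_of_original(lan)}%)")
```

The point of the model is that the fixed overhead dominates as the link gets faster, so jigdo-lite benefits far less from extra bandwidth than a tool that overlaps its work.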

I suspect the next thing to do is to rewrite the downloading code to use python-curl instead of running curl, so that multiple files can be downloaded over a single connection, and to tweak the code so that it writes the file in order, rather than updating whichever parts are ready first.
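One way the in-order writing could look is a small reordering buffer: parts that finish early are held back until every earlier part has been written, so the output file only ever grows sequentially. This is only a sketch under assumed names (write_in_order, and parts numbered from 0), not what jigdodl currently does.

```python
import heapq
import io


def write_in_order(completed, image):
    """Write parts sequentially even if they complete out of order.

    completed -- iterable of (index, data) pairs in arbitrary order,
                 with indexes 0..n-1 each appearing once (an assumption)
    image     -- writable binary file object
    """
    pending = []     # min-heap of parts that arrived ahead of their turn
    next_index = 0   # index of the next part that may be written
    for index, data in completed:
        heapq.heappush(pending, (index, data))
        # Flush every part that is now contiguous with what has been written.
        while pending and pending[0][0] == next_index:
            _, chunk = heapq.heappop(pending)
            image.write(chunk)
            next_index += 1
    if pending:
        raise ValueError(f"part {next_index} never arrived")


out = io.BytesIO()
write_in_order([(2, b"cc"), (0, b"a"), (1, b"bbb")], out)
```

The trade-off is memory: anything that arrives early has to be buffered until its turn comes, whereas writing parts wherever they belong needs no buffer but makes the disk seek around.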

Anyway, debs are available for anyone who wants to try it out, along with source in the new git source package format.