Torrenting the Debian Archive

Continuing the theme from my previous post — the first and fundamental thing any distro needs to have, and thus the first and fundamental thing to think about disintermediating, is some way of distributing software. That might be burning CDs and sending them out to resellers or OEMs, or having your stuff available for download, whether by putting it up on your own server, hosting it via a dedicated high-reliability provider like Akamai, or maintaining a network of mirror sites and keeping them up to date.

And, of course, anyone who wants to distribute their own software has to do the exact same thing — which means either doing it themselves, missing out on the scalability the distributor has already made work, or going through the distribution and dealing with the problems that entails, along with the benefits.

In this case, disintermediation and decentralisation are pretty much one and the same thing, and decentralising content distribution is already a well-understood problem: that’s what peer-to-peer is all about, and peer-to-peer distribution of, well, distributions is already very successful — at least when it comes to CD (and DVD) images. That means just about anyone can create a CD image and distribute it to the world in exactly the same way a major organisation like Debian or Red Hat would — upload a torrent file, run a seed, and let the magic of BitTorrent do its thing. Scalability is no longer something you need a distribution organisation to manage; instead it’s built directly into the technology itself.
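
If you’ve never poked inside one, there isn’t much magic to a torrent file itself: it’s a bencoded dictionary of fixed-size SHA1 piece hashes plus a tracker URL. Something like the following sketch would do for a single CD image (the image name and tracker address are placeholders, naturally):

```python
# A sketch of what a CD-image torrent boils down to: hash the file in
# fixed-size pieces and bencode the resulting metainfo dictionary.
# The image name and tracker URL below are made up for illustration.
import hashlib
import os

def bencode(obj):
    # Just enough of the bencoding rules for a metainfo dict.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode("utf-8"))
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # Dictionary keys must be byte strings, in sorted order.
        items = sorted((k.encode("utf-8"), v) for k, v in obj.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(obj)

def make_torrent(path, tracker, piece_length=256 * 1024):
    pieces = []
    with open(path, "rb") as f:
        while True:
            block = f.read(piece_length)
            if not block:
                break
            pieces.append(hashlib.sha1(block).digest())
    info = {"name": os.path.basename(path),
            "length": os.path.getsize(path),
            "piece length": piece_length,
            "pieces": b"".join(pieces)}
    return bencode({"announce": tracker, "info": info})

with open("debian-cd.torrent", "wb") as out:
    out.write(make_torrent("debian-40r0-i386-CD-1.iso",
                           "http://tracker.example.org/announce"))
```

Point any ordinary BitTorrent client at the result as the initial seed, and that’s the whole distribution infrastructure.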

Unfortunately BitTorrent is designed for large static images — not something you update on a daily basis. And for better or worse, my preferred distribution (Debian testing, currently lenny) does update that frequently.

Fortunately, though, BitTorrent’s an open protocol and has dozens of open source implementations — so last year I was able to propose it as part of the Google Summer of Code, and was lucky enough to get a few responses and a student to work on it. It didn’t get much further than that, unfortunately — the student lost internet access not long into the programme (and for most of its duration), and that was pretty much that. So when this year’s SoC rolled around, I didn’t really expect much, and didn’t put the idea up for consideration again; but lo and behold someone came through and asked if it was still possible and if there was any more information. When I forwarded on some of the mails from the previous year, we ended up with a second chance at the project.

So far it’s looking pretty good — we’ve had a lot more success at keeping in touch, and thanks to the extended schedule for the SoC this year we’ve been able to do a much better job of keeping on top of what’s going on. So much so, in fact, that there’s a first (alpha) release out before the SoC is officially due to start! Wonderful stuff!

What it does at the moment is allow you to take a Packages file (the stuff “apt-get update” downloads, which includes descriptions of all the packages that are available, how they inter-depend, and so forth), and from that information create a usable torrent through which the packages themselves can be obtained, shared and distributed.
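
For the curious, a Packages file is just a series of email-header-style stanzas, one per package, and the fields a torrent would need (Filename, Size, checksums) are all right there. A deliberately simplified parser (a sketch only; a real implementation would presumably lean on apt’s own machinery) looks something like this:

```python
# A Packages stanza carries real apt fields like Package, Version,
# Filename, Size and MD5sum/SHA1; this simplified parser just splits
# the stanzas up and ignores any of apt's cleverness.
def parse_packages(path):
    """Yield one dict of fields per stanza in a Packages file."""
    stanza, last = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                # blank line: stanza is complete
                if stanza:
                    yield stanza
                stanza, last = {}, None
            elif line[0] in " \t":      # continuation (e.g. Description)
                stanza[last] += "\n" + line.strip()
            else:
                last, _, value = line.partition(":")
                stanza[last] = value.strip()
    if stanza:
        yield stanza

for pkg in parse_packages("Packages"):
    print(pkg["Package"], pkg["Filename"], pkg["Size"], pkg.get("SHA1"))
```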

There are two crucial steps in that: the first is allowing the torrent to work without requiring huge amounts of extra information to be distributed (which would introduce scalability problems of its own, instead of solving them); the second is selecting the pieces that make up the torrent in a way that matches the actual packages, so that uploading a single new package only makes a minor change to the torrent, rather than (on average) redefining half of its pieces (which would, again, introduce scalability problems rather than solve them).
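
To see why that second point matters, compare slicing the concatenation of all the packages into fixed-size pieces against restarting the piece boundaries at each package. A toy demonstration, with made-up data and a silly piece size:

```python
import hashlib

PIECE = 4  # absurdly small piece size, purely for the demo

def naive_pieces(files):
    # slice the concatenation of every package into fixed-size pieces
    data = b"".join(files)
    return [hashlib.sha1(data[i:i + PIECE]).digest()
            for i in range(0, len(data), PIECE)]

def aligned_pieces(files):
    # restart the piece boundary at every package
    return [hashlib.sha1(f[i:i + PIECE]).digest()
            for f in files
            for i in range(0, len(f), PIECE)]

before = [b"AAAABBBBCCCC", b"MMMMNNNNOOOO"]         # two "packages"
after = [b"AAAABBBBCCCC", b"xyz", b"MMMMNNNNOOOO"]  # one new upload

for scheme in (naive_pieces, aligned_pieces):
    old, new = scheme(before), scheme(after)
    kept = sum(1 for h in old if h in new)
    print("%s keeps %d of %d old pieces" % (scheme.__name__, kept, len(old)))
```

With a real archive the effect is the same: a day’s uploads only invalidate the pieces of the packages that actually changed, so existing peers keep nearly all of their verified data.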

There’s more information on the DebTorrent wiki page and Cameron’s blog if you’re interested.

Anyway, it’s just an alpha release at the moment, which means that while there’s code that does something, it’s not actually all that useful yet. The current plan is to next add some code to make it automatically prioritise packages based on what you’re actually using — so that rather than downloading all the debs it sees, it’ll only download the ones you’ve got installed, or maybe only newer versions of the ones you’ve got installed, which should get us pretty close to the point where it’s actually useful for something.
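
Conveniently, dpkg’s status file uses the same stanza format as Packages and records exactly which packages are installed, so the rough idea could look like this (a sketch of the concept, not DebTorrent’s actual code):

```python
# dpkg's status file and a Packages file share the same stanza format;
# installed packages carry "Status: install ok installed", so a client
# could use that to fetch only what's already on the system.
def stanzas(path):
    """Split an apt-style control file into one dict of fields per stanza."""
    with open(path, encoding="utf-8") as f:
        for block in f.read().split("\n\n"):
            fields = {}
            for line in block.splitlines():
                if line[:1] not in ("", " ", "\t"):  # skip continuations
                    name, _, value = line.partition(":")
                    fields[name] = value.strip()
            if fields:
                yield fields

installed = {s["Package"] for s in stanzas("/var/lib/dpkg/status")
             if s.get("Status") == "install ok installed"}

wanted = [s for s in stanzas("Packages") if s.get("Package") in installed]
print("would fetch %d of the available packages" % len(wanted))
```

The “only newer versions” variant additionally needs dpkg’s version-comparison rules, which is one more reason a real implementation would want apt’s help rather than naive string comparisons.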

The end result, of course, is a tool that you can point at a Debian archive and run on a machine connected to the Internet, and without doing anything more you’ll have a reliable, scalable and reasonably efficient means of allowing your users to distribute and update their systems. In this case, scalable means that if you end up with as many users as Debian or Ubuntu, your users will have a comparable experience, as if you’d arranged for a similarly comprehensive mirror network, without actually having to do the leg work.

And heck, presuming that works, it doesn’t even matter if no one else actually does that — it’s worth it even if all it does is save Debian or Ubuntu the effort of keeping track of a mirror network by hand.

There are interesting possibilities at smaller scales too, of course. :)