Side project #1: Pageant

So as per my post from a week ago, here comes the description of my first little side project. But first a quick reiteration of the aim: I’m trying to get a feel for what it’s like actually doing a tech startup; so not charging for my time, but rather making something once that I can then sell repeatedly without having to do a lot more work. This is intended to make me more experienced rather than wealthy, so “success” means learning something, rather than making much money. As a consequence I’m aiming for business ideas that are in the bad-to-mediocre range, that will nevertheless involve some interesting/useful technology. That way if the business part goes badly, I don’t feel like I’ve screwed up a chance to make a bazillion dollars, or wasted my time doing something pointless.

So the first interesting-tech/mediocre-business idea I have is related to popcon. I like to think a comment I made once helped inspire popcon’s existance back in the day:

I think It’d be interesting to have a debian-survey style package that when installed, informs the `project’ (stats@debian.org?)  who’s using which packages. This would allow us to get a *much* better indication on which packages’s are in fact moderately stable and tested, and which are just gathering dust; and give us a better idea of what’s appropriate for inclusion in stable and/or unreleased.

Sadly that mail disappeared from the web (it was in the archives mentioned at the bottom of one of my initial posts to debian-devel regarding (what became) the testing suite, but disappeared after an upgrade/reinstall of www.debian.org) — but it was nominally in the public domain as of late July 1998, and lo and behold, popularity-contest appeared some three months later, doing everything I’d thought of and more. (For all I know, my comment played absolutely no part in Avery’s implementation, but I still like to think it did :)

Anyway, cool as popcon (and my original idea!) is, there’re interesting ways you could extend it, getting more information, and doing more with it. You could, for instance, survey more information about packages — what version’s installed would give you hints about how many people are pulling from backports, or mixing stable and unstable, or Debian and Ubuntu; or checking conffiles against their original md5sum might give you useful information about how often the default configuration is sufficient. Or you could analyse the information more thoroughly — eg, seeing if there are any unexpected correlations between people who use particular combinations of packages, or doing a netflix-like “I see you use package foo, many other people who use it also use bar, maybe that might be worth investigating.” (I once tried to do that sort of analysis on the popcon data, but all I ended up with was a pretty animated gif, that apparently crashed some people’s browsers… Red dots were systems, blue dots packages, with a package being installed on a system implying attraction, and uninstalled applying repulsion)

You could also gather completely different data — like information about the hardware, or things like the default language or timezone, or potentially even things from logs. That would let you answer questions like “do many people run Debian on HP hardware?” or “which IBM hardware is popular with Linux users?” which might influence future hardware development or purchases; or tell you surprising things about where Linux is actually being used; or give you some feedback on questions like “is the OOM killer a common occurence?” or “is IPv6 adoption actually going anywhere?”

As well as just gathering data from otherwise passive users, you could also use the data collection as an opportunity to make introductions between users — having established you’re running Debian and have a particular Intel graphics card, you could be automatically given the address of a section of the Debian wiki that’s dedicated to issues with that card; with the idea being that you can see any helpful solutions other users have already come up with to problems you’re having, or leave your own tips for future users. The same principle potentially applies to other sorts of data: if you have an old version of wordpress installed, it might be reasonable to point you at some security alerts that apply to it, or having determined you’re running Debian on some HP server, you might get directed at some updated management software that enables some extra features.

Another interesting improvement I think you could make is to provide ways users can aggregate and anonymise their own data. Even in the age of social networks and ubiquitous transparency, managing privacy of this sort of data is important: it would be spectacularly bad to provide a website that told people exactly which machines were vulnerable to which secuirty exploit, but that’s exactly what a list of which machines have which versions of which packages installed would provide. The popularity-contest software goes to some lengths to avoid that, by identifying data against a randomly generated UUID rather than an internet address, email or username; by not storing detailed information about package versions; and by restricting who has the ability to run any detailed analysis on the data. But you can go further than that by aggregating and filtering the data even before it makes its way to a centralised server — eg, rather than have each individual machine on a network reports its statistics to Debian, you could have the information sent to a proxy server that aggregates all the packages into a single report (30 computers, 10 of which have apache, 15 of which have exim, …), thus removing certain correlations (do all the machines running apache also run exim? or do none of them?), and potentially filtering things like the UUID (which might reveal something about the random number generator, particularly given Debian’s recent issue with randomness…) popcon version (which gives an indication what version of Debian is in use, and in some cases how recently it’s been updated) or timestamp (that may give away that the machine has been down). And if you’re running a network that’s intended to be somewhat locked down, it might be more reasonable to have computers reporting to a machine that you control, rather than one just out their in the wild.

So that, in very rough terms, is the spec for this project, which is currently going by the name “pageant” (ie, a popularity contest that takes itself a bit more seriously…) The technical goal is to provide a pageant client that people can run on their systems, which can report potentially arbitrary information to a central server and can receive and present relevant snippets of advice related to that information; a pageant proxy that can intermediate and filter pageant clients to provide a slighter higher level of anonymity/privacy; and a pageant server that can collect the data, provide relevant advice to clients, and analyse the data. I think it’s feasible to do an interesting job of that, that should go a little further than existing programs, and be usable by actual people, though I suspect the server side will have to be a bit beta-ish to be finished within a  week or so.

The business goal, obviously, is to turn some of the hypothetical benefits touched on above into actual income, ideally without turning it into a vast NSA-like data hoarding corporate conspiracy. I figure there’s a few reasonable ways to approach that:

  • First, I figure that providing the same information other systems currently do at no charge makes sense: so getting basic stats on how many Debian users have nickle installed, or Ubuntu users have network-manager, or Fedora users have a Synaptics touchpad should be free.
  • Second, I figure providing further analysis for companies and researchers should probably be possible, and cost something: probably more depending on how complicated the analysis is. Possibly there could be an extra fee for the analysis to not be also made available to the public; that could be entertaining.
  • Third, I figure that it probably should be possible for companies to at least provide advice to users of their hardware through the system, and that at least in some cases, that probably should be for a fee. I’m not sure if there’s a line in there somewhere between necessary advice (security updates?), helpful tips (here’s some non-free drivers for that hardware?), or outright advertising (buying our hard drives will give you 200% better performance!) that might mean “advice” should vary between free, paid and blocked. An approach might be to say distros’ advice is free, other people pay.
  • Fourth, I think it would be interesting to allow users to optionally pay a fee to register their hardware. This could have a couple of benefits: it provides a low-maintenances way to discourage ballot stuffing — it’s not at all difficult to hack up popcon to pretend you have thousands of servers running your favourite package to try to bias the statistics, but it’s somewhat harder to come up with even a few dollars thousands of times; and possibly more interestingly, it provides an easy means to link a small payment for “using Linux” with the software that’s being used — so distributing 80%-90% of those fees to the authors of the software that’s actually being used might be an efficient way of helping support free software development.

Anyway, that’s the project! My notes have a few other things in them worth mentioning — there’s a couple of not entirely little complications in a few of the above ideas, for one — but this is already long enough, and it’s not like I can’t blog again later. Even though there’s a few similar projects around (popcon and smolt in particular) I’m planning on taking a NIH approach and starting from scratch, on the basis that current stuff is mostly pretty basic to reimplement, and getting an architecture I’m comfortable with is pretty important in making it appropriately generic. As always, helpful tips, questions and/or any general encouragement appreciated, either by email or the comment link…

One Comment

  1. Hi Anthony,

    This sounds like an interesting project alright – I’m curious to see how much of it you can do in a few weeks. Another aspect that people might be willing to pay for is having a local pageant server (the kind of organisations that don’t like letting data offsite are often also the kind of organisations with money).

    Nice graphical output is a must – will you be using R which seems like flavour of the month for this sort of thing :)

    Anyway, best of luck with it.

Leave a Reply