Changing The Security Infrastructure

One of the most exciting things about working on Debian is that it’s developed in the open: when you want to make changes, everyone sees them, so depending on what you’re working on, the risk of breaking stuff can provide a real adrenaline rush while you’re just sitting in your chair. So what better adventure sport could there be than hacking up Debian’s security support?

One idea that comes to mind: doing it with only a minimal idea how it works at the moment.

Prior to finishing the plan I mentioned previously, then, the task was to get a decent idea of what the actual code looked like, and perhaps more importantly, to get the code on security.debian.org synchronised to the code in CVS (which had only recently been synchronised with the code on ftp-master). Naturally that was a two-way sync, with changes specific to security having to be kept, and changes from CVS and ftp-master having to be merged in; in the end it looked something like:

        * Merge of changes from klecker, by various people

        * amber: special casing for not passing on amd64 and oldstable updates
        * amber: security mirror triggering
        * templates/amber.advisory: updated advisory structure
        * apt.conf.buildd-security: update for sarge's release
        * apt.conf-security: update for sarge's release
        * cron.buildd-security: generalise suite support, update for sarge's release
        * cron.daily-security: update for sarge's release, add udeb support
        * vars-security: update for sarge's release
        * katie.conf-security: update for sarge's release, add amd64 support,
        update signing key

        * docs/README.names, docs/README.quotes: include the additions

From the looks of things, the security changes had actually been synced to a slightly later version of CVS than the ftp-master changes, which had subsequently been lost, months ago — hence the miscellaneous changes to the READMEs.

Along with that came the initial changes in preparing to support the new security queues, namely:

        * Changed accepted_autobuild to queue_build everywhere.
        * Add a queue table.
        * Add a "queue" field in the queue_build table (currently always 0)

        * jennifer: Restructure to make it easier to support special
        purpose queues between unchecked and accepted.
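
For anyone who hasn’t poked at the projectb database, the shape of that change is roughly the following: a queue table naming each special purpose queue, and a queue column on queue_build referring to it, with 0 (accepted) the only value in use so far. The column names and types here are guesses for illustration, not the real schema:

import pg

# illustrative only -- connection details and column names are guesses;
# dak's db_access.py sets up the real projectB handle itself
projectB = pg.connect(dbname='projectb')

# a table naming each special purpose queue...
projectB.query("""CREATE TABLE queue (
    id SERIAL PRIMARY KEY,
    queue_name TEXT NOT NULL
)""")

# ...and a reference to it from queue_build, defaulting to 0 (accepted)
projectB.query("ALTER TABLE queue_build ADD COLUMN queue INT4 NOT NULL DEFAULT 0")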

These changes had already been rolled out on ftp-master for a little while, and had been working fine, so it seemed fairly safe. But hey, one thing at a time lets you at least limit the damage, if there’s going to be any, which of course there isn’t.

<Joey> dak@klecker is broken.

Or maybe there is. The error turned out to be:

File "/org/security.debian.org/katie/db_access.py", line 269, in get_or_set_fingerprint_id
projectB.query("INSERT INTO fingerprint (fingerprint) VALUES ('%s')" % (fingerprint));
pg.ProgrammingError: ERROR: duplicate key violates unique constraint "fingerprint_pkey"

Which is a relief — I hadn’t touched the fingerprint table, so it’s definitely not my fault. Of course, it also makes no sense: sure we want unique fingerprints, but that’s why we have a SELECT just before the above and only do the INSERT if we don’t already have that fingerprint. And indeed there’s no such fingerprint in the table, and running the INSERT statement by hand doesn’t work either. At this point I grabbed James “I don’t have a blog” Troup as the voice of experience, to see if he had any idea. He did have an idea, namely “maybe klecker ran out of space and the database is corrupt”, and proceeded to drop the database and restore it from the most recent backup. Sadly, that didn’t change anything. Poking around further, he noticed that the constraint above doesn’t actually limit the uniqueness of the fingerprint, but rather the primary key — in this case an autoincrementing id. Turns out somewhere along the line that had reset from around 140 back to 8, and the error was because there was already a fingerprint with id 8. It further turns out, as we discovered by poking through the regular database dumps to see when it had reset, that this happened mid-November, which was coincidentally when postgresql had been upgraded. The reason it hadn’t shown up for a few weeks since then, it seems, was that all the security uploads had been signed by keys already in the table — and the other autoincrementing ids that had been reset would only come into play when we next released, or added a new component or similar.
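
For the record, spotting and repairing that sort of sequence reset is straightforward once you know to look for it; something along these lines would do, though the sequence name here is just the postgres default for a serial column and I’m not claiming this is what actually got run on klecker:

import pg

# connection details are illustrative; dak's db_access.py sets up projectB itself
projectB = pg.connect(dbname='projectb')

# the INSERT takes its id from the sequence behind the serial column, so once
# the sequence falls below max(id) the next value collides with an existing row
# and postgres reports a duplicate key even for a brand new fingerprint
seq = projectB.query("SELECT last_value FROM fingerprint_id_seq").getresult()[0][0]
max_id = projectB.query("SELECT MAX(id) FROM fingerprint").getresult()[0][0]
print "sequence at %s, table at %s" % (seq, max_id)

# put the sequence back in step with the table
projectB.query("SELECT setval('fingerprint_id_seq', (SELECT MAX(id) FROM fingerprint))")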

I’d like to emphasise the next point: all of this means that the problem was, demonstrably and conclusively, not my fault!!!

At around this point, clever as I am, I came up with a second way to make this more exciting, namely “tell the only security team member who’s actually talking to you that this is all taking too long and he’d better answer you now…. or else!”. I tell you, if you could bottle the heart-pounding effect of

<Joey> well, if you rather want to fuck up the security infrastructure, don’t expect the security team to love you.

drug authorities the world over would be declaring you enemy #1 and adventure sports junkies would be getting your face tattooed on their nipples. This is also the point at which I recall that I’m not, actually, an adventure sport junky. Oops.

Anyway, with the stakes thus raised, and fall guy determined, and more importantly, the dak install on security practically begging to be updated to actually include the real changes, we come to:

        * katie.py: Move accept() autobuilding support into separate function 
        (queue_build), and generalise to build different queues

        * db_access.py: Add get_or_set_queue_id instead of hardcoding accepted=0

        * jennifer: Initial support for enabling embargo handling with the
        Dinstall::SecurityQueueHandling option.
        * jennifer: Shift common code into remove_from_unchecked and move_to_dir
        functions.

        * katie.conf-security: Include embargo options
        * katie.conf-security: Add Lock dir
        * init_pool.sql-security: Create disembargo table
        * init_pool.sql-security: Add constraints for disembargo table

Along with some bugfixes in those changes and some other tweaks, that leaves us with the ability to upload a package into the “OpenSecurityUploadQueue” and have it go to the unembargoed queue rather than follow the embargoed uploads (which currently still go into accepted), have it autobuilt by the same buildds that build the embargoed uploads, and have the builds automatically follow the source into the unembargoed queue if and when they’re authorised and uploaded.
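
In sketch form, the interesting bit of that routing amounts to something like the following; this isn’t the actual jennifer code, and apart from the option and queue names mentioned above everything in it is invented:

# toy sketch of the queue routing -- cnf is just a dict standing in for the
# real configuration, and only the option/queue names are taken from the changes
def pick_queue(cnf, upload_dir):
    if not cnf.get("Dinstall::SecurityQueueHandling"):
        # option disabled: behave exactly as before, straight through to accepted
        return "accepted"
    if upload_dir == cnf.get("OpenSecurityUploadQueue"):
        # uploads dropped in the open queue skip the embargo
        return "unembargoed"
    # everything else follows the embargoed path (which, for now, still just
    # means going into accepted)
    return "embargoed"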

Of course, following all that coolness, you then find in your mailbox:

Could somebody please explain us why not all security updates were uploaded to ftp-master? It is *REALLY* annoying having to veryfy whether files were uploaded or not and having to move files around manually.

This week, the following uploads were missing: […]

Sadly, at least for my blood pressure, this is one of the things I’d touched. Still, I hadn’t touched it that much, so it did seem a bit odd. Poking through the logs on ftp-master naturally showed that the uploads had been made… on the day Joey sent the above mail, all around the same time… and thus presumably by Joey himself, manually fixing the breakage before complaining about it. Not what I’m looking for. Looking at the security logs, though, showed no evidence of the package; not in December, anyway, since it turns out the one I was looking at was an advisory from early November. Aha! That probably means it’s not my fault. Excellent. Looking through the November logs on ftp-master shows that an upload was made — but only for the sparc build, which naturally got rejected because the source wasn’t there. Same thing for another package. Oddly, the sparc upload for at least the first package had been delayed for a week compared to the other builds, presumably some buildd problem at the time. Nothing seemed obviously different between the changes files to make one get accepted and the rest ignored, but the fact it was the last upload getting through seemed odd — maybe the other changes files were just being ignored? But apparently not — running “amber -n” on some current changes files to see what would happen indicates all the debs and whatever getting uploaded. Oh well, let’s look at the code. Oh, hrm, the changes get uploaded at the end — which makes sense since you want all the stuff to be accepted at once, so you don’t mistakenly REJECT a build before the source is uploaded. And… ah.

if not changes.has_key(upload_uri):
    changesfiles[upload_uri] = [];
changesfiles[upload_uri].append(changes_file);

changes happens to be a real variable (it stores the contents of the changes file we’re looking at) and never has the appropriate key, so changesfiles gets cleared every time the loop runs, and hence only ever contains the very last entry it was meant to.

The best part? Not only was it obviously a bug from before my changes, but adding redundant trailing semicolons to python code is one of Mr Troup’s serious problems, so putting those elements together, we come up with the following equation: 1 + 1 = not my fault!!!! Oh, also, pretty easy to fix once found, which is always nice.
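
The fix itself, for completeness, presumably being just a matter of testing the dictionary that’s actually being built:

# check changesfiles, the dict being built up, rather than the parsed changes file
if not changesfiles.has_key(upload_uri):
    changesfiles[upload_uri] = [];
changesfiles[upload_uri].append(changes_file);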

Unfortunately we’ve run out of time to go into the minor, indeed utterly trivial, problem (if you could even call it that!) involving updating the dak install on security to new code that requires a lockdir without actually creating that lockdir, and thus causing security uploads to get stuck in the unchecked queue and random “jennifer broke” cron spam to get sent to ftp-master and the security team every five minutes once something got uploaded… And anyway, as long as it got fixed, who really cares whose fault it was?
