Henry Tax Review

The Henry Tax Review is supposed to be released tomorrow. Since that might warrant a blog post, and possibly even some criticism, I thought it might be interesting to note down some criteria beforehand to remove one avenue for bias.

One issue for regulatory reform is whether changes make the entire system simpler or more complex — more complex regulation potentially handles trickier situations more “fairly”, but at the same time forces everyone to incur the cost of understanding all the complications, even if only to be sure they don’t apply in their situation. The Rudd government made an election promise to that effect:

Labor believes that when making new regulations, governments should remove an existing regulation and should design rules with small businesses in mind. We call this approach ‘think small’. It will require government departments and agencies to better understand the realities faced by businesses on the ground. Labor will adopt a ‘one-in, one-out’ principle for federal government regulation. This means that when a new regulation is proposed it must be accompanied by a proposal to remove an existing regulation.

There’s a deregulation group as part of the Department of Finance, but I haven’t seen much talk either way as to how this promise has been holding up. In theory, based on this principle, the Henry review should be proposing about as much reduction in regulation as new regulation.

One of the obvious ways to reduce the complexity of the tax system would be to remove the various GST-free categories of goods (unprocessed food, etc). It would probably be appropriate to compensate that with a small increase in some welfare payments.

It’s probably also one of the few changes to the GST that’s within the review’s purview, given the clause in its terms of reference that goes “The review will reflect the government’s policy not to increase the rate or broaden the base of the goods and services tax (GST); preserve tax-free superannuation payments for the over 60s; and the announced aspirational personal income tax goals”. It’ll be especially interesting to see how closely the Henry review has stuck to that policy, given the conclusions being drawn from Rudd’s hospital plan that a backflip is coming there.

Personally, I quite like the “Reform 30/30” proposal, which involves a massive simplification of both welfare payments and income tax. Supposedly it would boost government revenue by $15B per year, which is a significant fraction of the $125B in income tax or $43B in GST received in the 2008/9 financial year. On the other hand, it comes at the cost of not giving welfare bonuses to people doing good things (having kids, buying houses, studying, etc) and taking less account of the various other ways in which you might be rich other than having a high-paying job (rich parents, rich spouse, money already in the bank, a nice house, etc).

Presumably anything like that would be a non-starter politically, but some movement in that direction ought to be plausible. There’s been some talk for a while now about having a simplified tax return, so that you can just tick a box and accept whatever the ATO says rather than fill out a bunch of forms — basically heaps easier and quicker, but you don’t get to claim lots of deductions. Given the ATO’s electronic systems and reporting of interest payments by banks, and PAYG contributions by employers, that ought to be pretty plausible to set up, and might start paving the way for cutting out lots of personal tax deductions — why keep them if barely anyone’s using them, after all?

That, at least, is kind-of like cutting welfare payments — a tax deduction for $1000 is roughly the same as receiving a cheque from the government for $300 if you’re at a 30% tax rate. Of course that means that deductions are more considerate of the welfare of people paying more tax, which is to say, more considerate of the people who least need consideration.
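The equivalence is just marginal-rate arithmetic; as a quick sketch (the function here is made up for illustration):

```python
def deduction_value(deduction, marginal_rate):
    """Tax saved by claiming a deduction: it reduces taxable income,
    so the benefit is the deduction times your marginal rate."""
    return deduction * marginal_rate

# A $1000 deduction at a 30% marginal rate saves $300 --
# about the same as a $300 cheque from the government.
print(deduction_value(1000, 0.30))   # 300.0

# The same deduction is worth more to someone on a higher marginal rate:
print(deduction_value(1000, 0.45))   # 450.0
```

So the higher your marginal rate, the bigger the effective handout the same deduction gives you.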

I can’t see how the Henry review will be able to recommend much in the way of cutting welfare expenditures in general ($125B of expenses in 2008/9), but they’ve at least been told “The review should take into account the relationships of the tax system with the transfer payments system and other social support payments, rules and concessions, with a view to improving incentives to work, reducing complexity and maintaining cohesion”. So maybe there will be some ideas on this.

Maybe this will also mean the Ergas review will be revealed soon too. It looks like it’s even more out there than the 30/30 proposal, with a roughly flat 20% income tax, raising tax on income from superannuation, and taxing the family home. I’m pretty surprised that there’s anything out there more wacky than what the Liberal Democratic Party came up with, but maybe that’s due to its progenitor — supposedly Turnbull ordered the review as shadow treasurer without bothering to even tell Brendan Nelson. Still, it would be interesting to be able to compare its reasoning and recommendations to the Treasury Secretary’s in tomorrow’s report.

WoBloMo 2, Epilogue

This year’s woblomo was a bit more consistent than last time — every post was either on the appropriate odd day of March, or before midday the next day. (I did backdate a few posts that actually got posted between midnight and about 4am the next day, just to keep the calendar widget in the sidebar pretty.)

I felt a bit pressured this time around on what I was posting — there were a couple of topics I would’ve liked to have posted on, but didn’t because I wasn’t sure I’d be able to finish them in time. On the other hand most of the posts were interesting to me at least, and I learnt a few things in writing them (first time I’ve played with R, or done a youtube screencast, in particular). Overall I’d call it a pretty good experience.

I think for April I’m going to try to do a bunch of blog posts again, but aiming to be a bit more bursty (so if I want to post about X, I can spend a couple of days thinking about it first). I’m trying out Erica.biz’s getting things done tips at the moment too, which I think should work okay with that plan.

Excuses to use gearman

Some time ago I stumbled across gearman and thought it looked cool — map/reduce and distributed processing for shell commands? Neato! Unfortunately, at the time I was looking for a distributed database (and found couchdb) so that didn’t go anywhere. The session at lca reminded me how cool it was, but I didn’t get much further on an actual use.

I’m thinking now, though, that it might be a good match for my notmuch usage: that is, I could have incoming mail tagged as unread and filed into an inbox or list or whatever, and then separately have all that mail sent through gearman to get flagged as spam. The win there being that mail gets delivered immediately, still gets flagged pretty much ASAP, and can easily get flagged as spam before I see it even if the initial check doesn’t flag it as spam. Gearman (and notmuch’s tagging) should allow that to be rate limited, queued, and handled asynchronously without much hassle.

Fingers crossed. :)

Email: how much does it suck?

Yes, this post is going to mention notmuch. Whether that’s the answer to the question posed is another matter…

My email habits have defaulted to mutt and procmail for quite some years now (and prior to mutt, pine which was essentially the same except older and less nifty). I had a brief interlude under OS X with Thunderbird and IMAP right up until poor filesystem performance drove me back to Linux, and with it mutt. Spam has obviously also been a problem, and I’ve variously used spamassassin, greylisting and dspam to combat it. Greylisting was great when it started, but now seems mostly useless, at least for my mail — and the delay it causes when you register with websites that want to send you a confirmation email gets annoying. dspam was working great for a while, but ended up taking too much time and CPU the way I was using it, and trying to move it from my laptop to my server broke it almost completely.

When I went off to linux.conf.au this year I decided to take my netbook, but it wasn’t worth trying to move my email over from my laptop (having 40k mails in your inbox isn’t very workable, least of all when 80% of the recent ones are spam), so I arranged to have all my new mail copied to my gmail account instead. I figured when I got back I could work something out, and in the meantime, gmail was at least easy.

The only real work I ended up needing for that was to set up a bunch of filters for incoming list mail (I’ve got 19 at the moment) and a couple more to do away with stupid automated mails I don’t actually want to ever see. That worked, but what was really shocking was just how wonderful it was that all my spam just went away — my gmail spam folder currently has 7057 messages from the last 30 days, but I only ever actually see maybe a couple a day. Presumably the fact that gmail sees heaps of emails and has lots of users pressing “Report spam” makes it a lot less likely for any individual user to see any individual spam, too. And having effectively declared email bankruptcy was nice too; a clean inbox really is much easier to work with. I’m not especially convinced by tags; I’m not really getting much more value out of them than folders, but having filters that can automatically (and quickly) apply to existing mails as well as new mail is quite useful. Another advantage Gmail has is that it’s easy to access from multiple devices — different laptops and my mobile phone in particular.

Last time I looked at Gmail I dismissed it with the comment:

GMail is kind-of nice, but I like to be able to read my mail offline, so whatever.

And though I suppose I could try Google Gears or something to deal with that, ultimately I still feel pretty much the same way. There have been a couple of times Gmail’s been not working quite right (not loading an email in particular), and it’s often a little bit slow just due to network lag, and, honestly, I want all my email archives in one place, and I don’t particularly want to upload it all to Google. Gmail also seems somewhat unfriendly to people who want to send/read their mail in fixed-width, 80-column format.

So the question becomes how to run my own email system again and have it not only be usable, but have it be as pleasant as I’m finding Gmail.

So far, I’m thinking that (despite what I just said above) tagging is probably a key feature: not so much from making it different to use, but making it easier to work with mails. Tagging messages as spam, versus moving them from one Maildir to another just seems lots easier to deal with. And not having messages get constantly renamed when being read or replied to would be nice too. So using the aforementioned notmuch seems like a win on that score.

The idea, I think, would be to basically mimic gmail: have tags for “inbox”, “unread”, “starred”, “spam” and “trash” as well as tags for different lists I follow, and tags for other collections of things (receipts for tax purposes, particular projects). Make sure those tags are applied automatically to incoming mail, and then have some folders corresponding to various tags.

I’m not quite sure how I want to deal with spam. A major problem with dspam for me was that once a few spams got through (perhaps because I simply wasn’t reading mail), the hit rate would drop off and even more would get through afterwards. I’m not sure if that’s simply to be expected, a configuration error on my part, or what, but I think the solution needs to ultimately be doing spam scans on unread but already delivered mail, as well as during delivery. That way once I mark a few mails as spam, the system can recheck others and make them go away too, reducing my workload. I think I’d also like to start using something like Vipul’s Razor, to maintain some of the benefit of many other people noticing something is spam. I’m still not sure whether I prefer spamassassin or dspam or both, though.

For mobile access, IMAP seems like a no-brainer of a way to go; especially since Gmail’s already come up with a pretty reasonable behaviour mapping from IMAP to tags. IMAP also has a SEARCH command which might possibly be a reasonable way of exposing notmuch’s searching. An IMAP server would also solve the problem that you can currently only really use notmuch from emacs or vim, neither of which appeals to me. I guess it would also support disconnected operation to some extent by way of offlineimap. Ideally, though, I guess there’d be a way that synchronises the Xapian database. In theory it ought to be possible to make extra notmuch functionality available via IMAP extensions, but you’d need the client to support those too.

I’m not sure what client I’d want to use — I suspect it’d just be a toss up between mutt and Thunderbird. The only possible drawback there is that neither of them have the clever per-thread view of messages that notmuch and Gmail have adopted. I’m not sure how much I’d miss that.

Anyway, I’ve currently imported my old mail into notmuch (mostly), and I’m currently running my old inbox through spamassassin to try to clear out all the garbagey spam stuff. The notmuch stuff went quickly enough, but spamassassin’s currently only about a quarter of the way through.

(And yeah, I’m claiming the Hawaii/New York woblomo exemption again. Even missed UTC this time…)

Consumer reports: Mobicity

When I was looking at what new smartphone to get, the cheapest place to get it seemed to be Mobicity, which appears to be a local shopfront for a Hong Kong warehouse. That there was a discount voucher on tjoos helped too. Anyway, the price was right and it got here reasonably quickly, so it all sounded good.

The other week I put it on to charge overnight and woke up the next morning to find it was dead — the “charging” light would come on, but nothing else would do anything, plugged in or not. It came with a one-year warranty, via Australian Warranty Services (who seem to sell warranties for imported phones generally too, maybe), and you fill out a web form to do claims. Despite Mobicity being just a suburb away, I got an email saying that I’d need to send my phone to Sydney Cellular Repairs. As it happened I was going to Sydney for the LA F2F the next weekend so I dropped it off on the way; and at that point things were pretty straightforward: they took the phone, got some receipt number for Mobicity, checked the phone over and a couple of days later I got an email that it was fixed and on the way back to me. And it arrived working, so… sweet! Cheap and supported. Who knew?

On not being able to think straight

When I last posted about my pygame/trigrid hax0ring, I said:

I’m now at the point where that all works, but there’s no intelligence — peons will buy and transport goods without checking first that anyone actually wants them.

Getting past that is proving troublesome, so this is me thinking out loud about it.

The key scenario is having two markets, with an agent who can move goods between them. In the example that should hopefully appear at the right, we’ve got two visible markets (represented by blue circles), and an agent who can carry goods from one to the other (represented by the smaller green circle on the red line). There are three other agents connected to each of the markets, but we’re not concerned about them for the time being. The agent might already own some goods, which could either be stored at one or the other of the markets, or be being carried right now. The agent can move between the two markets (which takes some time), possibly carrying some goods. While at a market, the agent can drop off any goods being carried, or pick up some goods to carry.

As far as trade is concerned, each market maintains a list of offers to sell goods that agents can either add to or accept — in the example pictured, one market has a good offered for $9, and the other has a good offered for $15. Once an offer is accepted, the agent that made the offer is expected to ensure the goods are at the market (by delivering them, eg), and to indicate when this happens, at which point the market transfers ownership to the agent that accepted the offer, and transfers the associated payment in the other direction. The offers are stored in a list in each market, and at the moment, each agent follows the following simple, and stupid, procedure:

while True:
    # grab the first offer that turns up at either market, and accept it
    (src, jobid) = wait_for_a_job()
    (cargo, price) = accept_offer(src, jobid)
    # wait until the seller has actually delivered the goods to the market
    wait_for_job_completion(src, jobid)

    # carry the cargo to the other market and offer it at a markup,
    # having paid the supplier before being paid ourselves
    dst = other_end(src)
    dst_jobid = make_offer(dst, cargo, price + profit)
    # once our offer is accepted, hand the goods over and mark it fulfilled
    job_is_completed(dst, dst_jobid)

What it should be doing is something more like:

  • scanning available offers at one market and making a new offer with some additional profit at the other market
  • only accepting an offer when its offer is accepted
  • getting payment upfront, so you don’t need to pay the supplier before you’re paid
  • dealing with offers and acceptances asynchronously with actually fulfilling them

Somehow the code for that should look something like:

def update_offers(profit, deliver_time):
    for (src, dst) in [(left, right), (right, left)]:
        for (cargo, price, arrive_time) in get_offers(src):
            add_offer(dst, cargo, price + profit, arrive_time + deliver_time)

That risks two instances of unbounded recursion: if get_offers() returns offers made by me, I’ll be making offers to deliver from left to right to left to right to left to… with huge costs and delays. So get_offers() shouldn’t return offers you made. But also, if I offer from my left to my right, someone else offers from right to “up”, and someone else offers from “up” to my left, we have the same problem.

I figure that’s best solved by adding a “contingent” field, to say “this offer can only be accepted by you if the offerer is able to accept the contingent offer; otherwise it’s void”. If you have a chain of offers to get from A to B to C to D, that allows the markets to accept all the offers simultaneously, rather than giving some other agent time to accept the offer from B, deliver it to E, and muck up the whole transaction.

def update_offers(profit, deliver_time):
    for (src, dst) in [(left, right), (right, left)]:
        for (jobid, cargo, price, arrive_time) in get_offers(src):
            contingency = (src, jobid)
            add_offer(dst, cargo, price + profit, arrive_time + deliver_time, contingency)

If you let markets keep track of the “root” of the contingency tree, that also lets you limit the possibility of multi-agent offer loops.
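As a toy sketch of that check (the Offer class and helper names here are made up for illustration; having the market track just the root would be an optimisation of the same idea):

```python
class Offer:
    def __init__(self, jobid, cargo, price, contingent_on=None):
        self.jobid = jobid
        self.cargo = cargo
        self.price = price
        self.contingent_on = contingent_on  # the Offer this one depends on, or None

def chain(offer):
    """Yield the jobids along an offer's contingency chain, root last."""
    while offer is not None:
        yield offer.jobid
        offer = offer.contingent_on

def safe_to_relay(offer, own_jobids):
    """Refuse to relay an offer whose chain already passes through one of
    our own offers: re-offering it would just grow the loop indefinitely."""
    return not any(jobid in own_jobids for jobid in chain(offer))

original = Offer(1, "wool", 9)                       # plain offer at the left market
ours = Offer(2, "wool", 12, contingent_on=original)  # our contingent relay to the right
relayed = Offer(3, "wool", 15, contingent_on=ours)   # someone else relaying ours onward

print(safe_to_relay(original, {2}))  # True: fine to relay the original offer
print(safe_to_relay(relayed, {2}))   # False: this chain loops through our own offer
```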

That then needs to be hooked into actually doing the deliveries. When an offer of ours is accepted, two things happen: we’re committed to handing over some goods at one market at time t, and someone else is committed to doing the same at the other market at time t-d. We can thus maintain two lists for each market we interact with: a set of times when some goods are expected to arrive at this market, and a set of times when we’re meant to deliver goods to this market. We can then resolve our obligations by noting that either cargo will arrive at each market before it needs to be delivered, and we can collect our profit without actually doing any work; or else (exactly) one market will require some goods before they arrive, and we’ll have to transport some goods from the other market to satisfy this. Maintaining a schedule of what we should do, and working out a minimum deliver_time for our additional offers, should be straightforward at that point.
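A sketch of that resolution step, per market (the function and its arguments are my own framing, not code from the game; times are just numbers):

```python
def shortfalls(arrivals, deadlines):
    """Given the times goods are due to arrive at a market and the times
    we're committed to hand goods over there, return the commitments not
    covered by an earlier arrival -- the ones we'll have to satisfy by
    carrying goods over from the other market ourselves."""
    pending = sorted(arrivals)
    unmet = []
    for due in sorted(deadlines):
        if pending and pending[0] <= due:
            pending.pop(0)   # an arrival covers this commitment: pure profit
        else:
            unmet.append(due)
    return unmet

# Goods arrive at t=5 and t=20; we owe deliveries at t=10 and t=15.
# The t=10 commitment is covered by the t=5 arrival; t=15 isn't.
print(shortfalls([5, 20], [10, 15]))   # [15]
```

An empty result means every commitment is covered without moving anything; each unmet time is a trip we have to schedule.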

(For completeness, agents should be able to be penalised for failing to deliver, and potentially rewarded for delivering goods early. Occasionally it’ll work out that the penalties for dropping one commitment will be outweighed by the rewards of delivering on something else — I’m optimistic that just coding that logic should make the overall system much more dynamic while remaining fairly understandable and predictable)

Hmm, I think that covers the next step. Guess we’ll find out when we start coding…

Thoughts on auditing systems

One of the XP systems I look after had a trojan this month — looks like it came from a fake “UPS package” mail with a zipped attachment that got clicked on, then stuck its tendrils into the registry and all over the place, and started popping warnings about viruses and instructions on how to pay for a fix. After a couple of attempts at removing the infection and finding it just coming back, looks like a reinstall is going to be easier.

Of course, on a free operating system it’s at least theoretically reasonable to know what everything on the system is supposed to be doing, so it should be possible to fix that sort of problem. There’s been a bit of discussion this past month about the md5sums control files which goes some of the way to handling that for Debian — but of course, that assumes your md5sum files aren’t compromised along with the rest of your system.

Ultimately you want two things to cope with potential compromises like this — one is to detect them as early as possible, and the other is to work out what’s infected and what’s recoverable. Which basically means you need a description of how things should be and the ability to compare that to how things actually are.

In some respects, that’s difficult to do: “how a system should work” is hard to define, and tends to change over time — and often people don’t think their systems work as they “should” even when they’re freshly installed and completely uncompromised. But if you aim a little lower, you can at least get somewhere. You could say “my system should be built from the latest Debian testing packages” and verify that, for example. Or you could keep a running tally of packages installed and removed, and say “each entry in my running tally should say what happened and be dated and match my recollection, and the packages from that tally should be Debian packages, and the files on my system should match those packages”.

Knowing what packages you’re meant to have is probably the first challenge — maybe you’re running puppet or similar and have an easy answer to that, but if you just run apt-get and aptitude whenever you want something, it’s a bit harder to tell. Are you running an ircd because you thought one day it’d be a fun thing to do, or because some warez kiddies are using it to control their botnet?

Once you know you’re meant to have, say, python-llvm installed, you need to know which version it’s meant to be. You could say “the last version I installed, of course” — except of course your only record of that might be on your compromised system. You might say “well, I follow testing, so the latest version in that”, except that there might have been an update to that package in testing while you were compromised, or you might have installed something from unstable or experimental (or backports, or compiled it from scratch). You certainly want to know the architecture, version number and whether it was from Ubuntu, Debian or somewhere else.

Going from that step to knowing what the contents of the package is meant to be is slightly harder. If you happen to know you’re looking for the current version of python-llvm in Debian testing, then you can establish a trusted path to verify what its contents should be by downloading the testing Release, Release.gpg and Packages.gz files, which will give you a verified download of a deb file (assuming you trust gpg and sha256, which is reasonable for the moment at least).
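Assuming you’ve already verified the Packages stanza via Release and Release.gpg, the final hash check is simple; a sketch (the stanza here is made up, and real stanzas have many more fields):

```python
import hashlib

def parse_stanza(stanza):
    """Pull the simple "Field: value" lines out of one Packages stanza."""
    fields = {}
    for line in stanza.splitlines():
        if line and not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields

def deb_matches(stanza, deb_bytes):
    """Check a downloaded .deb against the SHA256 recorded in its
    (already gpg-verified) Packages stanza."""
    return hashlib.sha256(deb_bytes).hexdigest() == parse_stanza(stanza)["SHA256"]

# A made-up stanza standing in for a verified Packages.gz entry:
deb = b"not really a .deb"
stanza = "Package: python-llvm\nSHA256: " + hashlib.sha256(deb).hexdigest() + "\n"
print(deb_matches(stanza, deb))                   # True
print(deb_matches(stanza, b"tampered contents"))  # False
```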

If you’re running an outdated version of the package, you’ve got more problems. You could find the original .changes file uploaded with the package to verify it based on the developer’s signature — but that will only tell you that that developer built that package, not that it was uploaded to Debian, distributed far and wide, and installed on your machine. You could find the Release/Packages files that were current when you downloaded it, and verify them, but that’s something of a chore in and of itself. You could make a note of the name, version, location and sha256sum of every package you install and keep it somewhere secure, but that’s a chore too. The easiest solution I can think of is just to treat “outdated” as “potentially compromised”, and install the current version of the package anyway. (For locally generated packages, you should presumably be able to either find an uncompromised version to compare against easily enough, or you’ll have to rebuild it from scratch as part of your recovery anyway)

Once you’ve downloaded the deb file, it’s a relatively simple matter to verify the package is correctly unpacked; a good approximation is something like:

CKSUM=md5sum    # or sha1sum, sha256sum, etc
export CKSUM    # exported so the per-file subshell tar runs can see it
# DEB_PATH names the .deb being checked; HASHES is a scratch file for the results

# for each file in the deb, emit a "<hash>  <filename>" line in $CKSUM -c format
TAR_CMD='printf "%s%s\n" "$($CKSUM - | sed s/-$//)" "${TAR_FILENAME#./}"'
ar p "${DEB_PATH}" data.tar.gz | tar --to-command="$TAR_CMD" -xzf - > "$HASHES"
# then compare those hashes against the files actually installed under /
(cd / && $CKSUM -c) < "$HASHES"

(Caveats: assumes data.tar.gz, some debs have data.tar.bz2 instead; the extraction command above takes about 7m on my netbook (HP mini 2133) for the 420 or so debs that happen to be in my /var/cache/apt/archives (about 480MB worth); the above assumes that you have a trustworthy ar, GNU tar, gzip (or bzip2), md5sum (or sha1sum etc), and filesystem, as well as a copy of the .deb; the above includes conffiles in /etc, many of which will have been intentionally modified; some, but very few, .debs expect some of their distributed files outside /etc to be modified too)

You can skip the first command in that sequence if you use the md5sums files shipped with debs, but that comes with a few drawbacks, in that you're forced to rely on the md5sums files, which can be lost, not present, incomplete or, if you're using the local cache of the hashed files that dpkg keeps in /var/lib/dpkg/info, potentially compromised along with the rest of your system. The upside is there's an existing tool to verify them (debsums).

Personally, I'm now running a patched version of dpkg that generates its own .hashes files as packages are installed. That doesn't do anything about lost or compromised files, but it does ensure they're complete and at least initially present.

But even if all the files that are meant to be installed are exactly as they should be, that's not enough. You've also got to worry about extra files -- maybe your "ls" command isn't invoking "/bin/ls" but "/usr/local/bin/ls" which has been compromised. To some extent that's easy enough with tools like cruft, but there are quite a few places where extra files can screw you over.
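The core of what cruft does amounts to a set difference; a toy sketch, with made-up data standing in for the union of the dpkg file lists and a filesystem walk:

```python
def unexpected_files(packaged, on_disk):
    """Files present on disk that no installed package claims to own."""
    return sorted(set(on_disk) - set(packaged))

# Toy stand-ins for /var/lib/dpkg/info/*.list and a walk of the PATH directories:
packaged = ["/bin/ls", "/bin/cat"]
on_disk = ["/bin/ls", "/bin/cat", "/usr/local/bin/ls"]
print(unexpected_files(packaged, on_disk))   # ['/usr/local/bin/ls']
```

The hard part in practice is deciding which directories to walk and which unclaimed files are legitimately local, not the comparison itself.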

Probably the hardest part is checking your configuration files are correct. On both Linux and Windows, you can do a great job of taking over a system just by messing with configuration files, whether that be zeroing a password field, or adding a preload so that every time you run a program a trojan starts up as well, or a timed job to start your trojan back up if it gets disabled. If there's enough configuration data, you might be able to hide a copy of your trojan in there, so that it'll be re-extracted even if the rest of the system is completely cleaned up.

I'm not really sure what the solution here is. For Windows and its registry (and other configuration scattered about the place) I don't think it's solvable; there's just too much of it, that's changed in too many ways to really control. So as far as I can see, it's a matter of scrubbing everything, and reinstalling from scratch there.

For Debian and Linux in general, things still probably aren't great, but there's at least a few things you can do. You can probably rely on /etc not changing too much, which means you can do things like track changes with something like etckeeper and review the diffs to make sure they're sensible. Unfortunately reviewing configuration diffs is probably something of a chore, but with distributed version control and a remote append-only repository you've got a chance of that being at least feasible to leave until you're looking to recover your system.

That doesn't help you with dot-files in your home directory though, and honestly I'm not sure anything will. Compared to 10MB in /etc on my netbook, there's 86MB in ~/.mozilla alone for me, often in inscrutable XML and binary files. Worse, applications feel free to create their own dot files at any time, and also to hide them underneath other directories (.config/gnome-session, .gnome2/evince, .kde/share/apps/ScanImages etc). Some need to be per-machine, others don't.

You could imagine having a .bashrc that sets LD_PRELOAD to include some file in .mozilla/firefox/194653e1.default/Cache/36A45162d01, which then checks every now and then to see if it can run "sudo" without needing a password to give itself root permissions, for example. Perhaps a .bashrc and LD_PRELOAD would be noticeable (though I think not to many people), but there's also .xsession and a myriad of other bits of configuration that'll let you get a trojan started up that way.

On the other hand, the amount of valuable configuration in dotfiles isn't that large -- manually deciding which dotfiles are interesting and keeping them in version control, while scrubbing the rest every now and then (when things break, when you switch computers, once a week, whatever) could be feasible.

Another place to worry about stuff is /var. It's usually a little safer in that it's generally full of data, so it won't spontaneously launch software quite so much, but not completely so. Adding something to /var/spool/cron/crontabs/root could get you into trouble pretty quickly, eg. If you modified /bin/ls to do evil things, and someone tried reinstalling with apt-get, if you'd also added some code to /var/lib/dpkg/info/coreutils.prerm you could make sure /bin/ls was reinfected immediately.

I'm honestly not too sure what there is to be done about that either. It might be feasible to monitor just the "risky" parts of /var in a useful way, but it would be pretty easy to miss things. It might be possible to classify great swathes of /var as "not-risky" and treat the other bits similarly to /etc, but I don't think there are tools to do that at present. It might be possible to get programs to move the risky bits into /etc, /usr or users' home directories, but I know people were talking about some of those things over a decade ago, so it's not likely to happen soon.

Finally, there's the disk and filesystem in general -- having /etc/shadow be world readable or having a misplaced setuid bit can ruin your whole day, and you can put a fair bit of information in extended attributes these days if you're looking to hide it from the suspicious admin. You also want to make sure your boot isn't compromised -- perhaps your bootloader is jumping to code other than the kernel you thought you were pointing at, or your BIOS firmware has some code to setup a timer and a ring-0 trap that'll take control of your kernel a little while after it's booted. On the upside, there's nothing inherently difficult with dealing with that: just reflash your system and your bootloader; all your configuration should be elsewhere in the filesystem, so that should be easy. (Whether it is or not is just a matter of how good your tools are)

Ever tried modelling?

Subtitled: David Pennock’s Wall-Street pick up lines

Dr Pennock’s latest post is about fitting stockmarket data — he comes up with a nicely matching randomly generated histogram based on a Laplace distribution over the daily log differences (that is, take the log of the ratio between daily close prices — so if you gained 20% in a day, take the log of 1.2). As well as pretty pictures, logs of differences have the nice property that their sums and averages are actually meaningful — if you invested $p, at an average log difference of x over n days, then your total at the end of the n days is p*e^(nx).

Dr Pennock doesn’t state the figures he came up with, but by my maths (well, R‘s maths assuming I issued the right incantation, and Yahoo’s data) the 60 year average (between 1950 and 2010) daily difference for the S&P 500 is 0.0004596 (with a variation of b=0.006505). Annualising that (ie, multiplying it by 365) and converting it to a percentage ((e^x - 1)*100) gives an 18.2% annual return over the sixty year period. All very reminiscent of the way of thinking about interest via logarithms I posted about some time ago.
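
That arithmetic can be checked directly with the figures above:

```python
import math

# mean daily log difference for the S&P 500, 1950-2010 (figure quoted above)
mean_log_diff = 0.0004596

annual_log = mean_log_diff * 365               # annualise
annual_pct = (math.exp(annual_log) - 1) * 100  # convert to a percentage
# comes out at about 18.3%, matching the figure above up to rounding
```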

Of course, you only get that result by averaging some really good years and some really bad years, but there’s no reason you have to apply the model to exactly that sixty year period — you could, eg, apply it to 365 59-year periods starting anywhere up to a year after the start of data.

One of the things Dr Pennock notes is:

At the aggregate level the stock market is well behaved: it’s randomness is remarkably predictable. It’s amazing that this social construct — created by people for people, and itself often personified — behaves so much like a physical process, more so than any other man-made entity I can think of.

If the stockmarket were a random physical process — like beta decay or similar — the parameters pulled from the statistical fitting would have a physical meaning, and scientists would look at them to see if they were fundamental constants or if (and how) they varied depending on external influences. These parameters probably can’t be given too much meaning because they only relate things to the US dollar, which has all sorts of other influences, but at least we can have a look at how the parameters change over time.

(I sat up late reading Feynman anecdotes last night. I’m trusting that taking a physics-esque approach to questions will be a short-lived consequence)

Anyway, taking 20-year periods gives us 40-years worth of data points (ie, investments beginning from 1950 to 1990; or equivalently ending between 1970 and 2010). Graphing the mean and variation for 400 of those periods gives something like the following:

S&P 500 20yr

An interesting thing to note from that is that the mean is both positive and fairly consistent — meaning that if you invested your money in the S&P 500 for 20 years, it doesn’t much matter when you did it, the log of your daily returns would average between 0.00025 and 0.0005 (generally in the 0.0003-0.0004 range) — which by the maths above means an annual return of between 9.5% and 20% (generally 11.5% to 15.7%), which compounded over 20 years is between 521% and 3757% (generally 794% to 1757%). It’d be interesting to see how that changed when adjusted for inflation.
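
The conversion from a mean daily log difference to a total percentage gain over the whole period is just the maths above folded together; a small helper (my sketch, multiplying by 365 days per year as before):

```python
import math

def total_return_pct(daily_log_mean, years):
    # compound a mean daily log difference over the whole period,
    # then convert the resulting multiple into a percentage gain
    return (math.exp(daily_log_mean * 365 * years) - 1) * 100
```

With the figures above, total_return_pct(0.0003, 20) comes out near 794, and total_return_pct(0.0004, 20) near 1754 — close to the quoted range, up to rounding.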

The other interesting aspect of that chart is that the variation seems to be gradually increasing — meaning that while the overall result of the 20 year investment is roughly the same (in so far as a 6x return and a 38x return is “the same”), on a day to day basis you can expect to see both larger gains and larger losses in more recent investments.

If you start reducing the investment period things get a bit more lively, though. With a ten year investment, if you have particularly lousy timing you might have no more money than you started with:

S&P 500 10yr

The variation isn’t as stable here either — you can pick some periods of fairly constant variation, some increases and this time even some decreases. There’s also a very sharp increase in the variation fit for investments that span the last couple of years.

Shortening the period still further to a five year investment gives us the possibility of ending up with less cash than we started with:

S&P 500 5yr

Though it’s worth noting both that losses are still pretty rare at that point (pretty much limited to people trying to cash out during the 1970s by the looks), and that even investments that ended anytime in the past five years or so look like they should have made a reasonably healthy profit (financial crisis or not).

Investing for a period of just one or two years is still somewhat reasonable, but you’re starting to have some bigger risks of losing money, and it’s getting hard to predict just how chaotic things are going to seem if you check your net worth every day.

S&P 500 2yr

S&P 500 1yr

Of course, the modelling is breaking down at this point too; without lots of data, guesses at the mean and variation aren’t going to be incredibly meaningful. So if you shorten the period further, to just a month or a quarter, you get pretty useless results:

S&P 500 qtr

S&P 500 month

In general, though, the Laplace analysis seems to support ideas about index funds and long term investing being productive and relatively safe ways of dealing with the sharemarket, and possibly provides some interesting ways to analyse different funds.

At least, if I’ve been doing my maths right, anyway…

The simple scripts in life are often the best

Possibly my longest blog post title ever?

Anyway, here’s a link to today’s little bit of scripting. I’ve now written this script three or four times, so I figure that means it’s useful and maybe worth keeping around. I’m calling it dir2tree and all it does is take a (sorted) list of pathnames and convert it into a tree structure. So, eg:

$ dpkg -L samba | grep man.*gz


$ dpkg -L samba | grep man.*gz | sort | dir2tree

So yeah. There you go.

(A related useful tool is tree, which generates a prettier tree and does the directory walking itself. I wanted something that I could use with find, and I couldn’t spot anything that already existed.)
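
The script itself is linked above; here’s a rough Python re-creation of the idea — assuming sorted input paths and simple two-space indentation, so the real dir2tree may differ in details:

```python
def dir2tree(paths):
    # turn a sorted list of pathnames into an indented tree, emitting
    # each path component only where it first differs from the
    # previous path's components
    prev = []
    out = []
    for path in paths:
        parts = path.strip().strip("/").split("/")
        # find the length of the common prefix with the previous path
        common = 0
        while (common < len(prev) and common < len(parts)
               and prev[common] == parts[common]):
            common += 1
        # emit the new components, indented by their depth
        for depth in range(common, len(parts)):
            out.append("  " * depth + parts[depth])
        prev = parts
    return "\n".join(out)

# e.g. feed it the sort of list "dpkg -L ... | sort" produces:
# print(dir2tree(sys.stdin.read().split()))
```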

Free software and the future

So this past weekend I had (hopefully!) my last Linux Australia face to face meeting — handed over the chequebook to the new treasurer, passed on some advice, and whatnot — which more or less ends my major existing responsibilities to the open source world. That happened to more or less coincide with a tweet from some random guy that was retweeted by some other random guy I follow, which sums up one of my concerns about free software these days:

Open Source is fighting the last war. It is a sideshow in cloud/online. What matters now is Open Data.

9:43 PM Mar 13th via UberTwitter

That happens to be from a Software-as-a-Service guy, so it’s a bit talking-his-own-book, but it still rings true I think. Consider Canonical’s recent attempts at building their SaaS storage application or their music store, and the “exciting news on the Linux front” that Ubuntu has a new theme that’s purple instead of brown and the window buttons have moved to the top-left.

Sure, open source is still interesting — but mostly because it’s a cheap/free way to build proprietary apps provided over the web, that you can then charge for on a monthly basis.

That does have a bunch of advantages. For vendors, piracy’s no longer much of a concern, because you’re not even giving people a copy of your binaries, let alone source. It’s easy to keep contact with your users because they interact with your computers every time they use your software. It’s easy to distribute bug fixes, because they only need to go on your own computers. It’s easy to reproduce bugs, because you can track every interaction every user has with your system. And you can sell direct, so you can charge retail prices for your product rather than wholesale ones, and change prices immediately rather than worrying about lag as your distributors deal with old stock. And there’s no incentive to make random changes, because your users are already paying you every month; they don’t need to be forced to buy a new version.

For users, you’ve got a single point of contact for support — you can’t have your application vendor telling you it’s your OS or hardware that’s the problem, because it’s all running on their systems. It’s easy to use any program you want on any system you have, because all you need is a free web browser and maybe a couple of common plugins. It’s easy to scale, because you only need a username/password to verify you’ve paid for the product, not a bunch of license codes or a hardware dongle to prove you’re not a pirate.

That’s not without its problems: it’s somewhat harder to manage data export this way — vendors can’t just leave it to users to extract their data from the files the program uses, they’ve got to provide an export UI, and if there’s no export UI users have to go through screenscraping. SaaS products can be changed or removed at any time, and users have absolutely no recourse — you can’t keep on using old versions of Google Docs in the same way you can use old versions of Microsoft Office, if you happened to not appreciate the latest set of changes.

And, of course, there’s very little opportunity to customise anything. You can’t make it run complete on your laptop so you can use it when you don’t have internet, you can’t tweak the source code and rebuild it to make it work a little better, and you certainly can’t find and fix bugs that are getting in your way.

But for almost everyone, those are things they don’t do anyway. So with all the advantages (for both you and your users) of grabbing existing free stuff like Linux, Apache and MySQL and writing proprietary webapps on top of it, why would you release your killer new idea as actual free software?


So much like last year, the Linux Australia face to face meeting has somewhat spoiled my WoBloMo posting frequency. Though, technically it’s still the 13th in UTC, New York and Hawaii, so there’s that.

Anyway, I’ve bitten the bullet and signed up for the Upstarta Meetup. I’ve been in two minds about Upstarta for a while — on the plus side, cool local people chatting about startups; but on the minus side, I disagree with some of the Upstarta Principles. But I’ve decided I like them in practice enough not to care.

But like that’s going to stop me disagreeing about them on my blog! The main one I don’t accept philosophically (as opposed to in practice) is the first one:

Neither a borrower nor a lender be: no credit or external funding.

In practice, or on a day to day basis, I think it’s a good idea — Paul Graham’s essays on ramen profitability or the challenges of fundraising argue for a similar approach for startups specifically, and in the broader sense of things you don’t have to look far for negative consequences of either borrowing or lending more than (it later turns out) was affordable.

But on the other hand there are a bunch of times when I think borrowing and lending is useful.

I wouldn’t have had the opportunity to go to university straight out of high school without borrowing money in some form or another — as it happened it was paid via HECS which meant paying a fairly small portion of my course’s cost to the government explicitly either upfront or through my taxes at a low compounding interest rate, and having the rest of it paid by taxpayers at the government’s discretion. The new HELP system does the same. If I’d had to earn enough cash to pay for that education in full in advance on my own — no loans from family even — I can’t see how I would have had the opportunity to learn the same stuff, which would have resulted in not knowing how to make Debian’s “testing” actually happen.

In theory at least, I’m also something of a fan of self-funded pensions — that is, investing some of your income over the course of your life, so that you can eventually live off the proceeds without having to work. Personally, I’d love to be able to put a million dollars or so in the bank at 5% or more and get fifty grand or so every year before I even have to lift a finger — but the only way that works is if that million dollars is being lent to some borrower, who’s making enough use of that million dollars that they can afford to give me fifty grand just for the privilege.

And it seems like in at least some cases investment of cash — borrowing — is a valuable contribution: per this criticism of StackOverflow’s VC hunt, giving a bunch of Starbucks franchisees money to open stores to follow a proven business model seems to have worked out pretty well for pretty much everyone involved.

But there’s a lot of space between all of those and giving people millions of dollars to spend on foosball tables and beanbags because they’ve had an idea for a webapp; and ultimately it’s probably not worth insisting on a clear distinction between things that are “good advice in the here and now” and “fundamental principles to be adhered to forever” when the aim of the moment is to make an interesting business from nothing.


Ooops. Emergency woblomo post coz I forgot. Here’s the link to the other day’s screencast that apparently didn’t make it through aggregators. Hohum.

Some more PyGame

So here’s where I’m up to with my “trigrid” project. The idea is you’ve got a bunch of roads on a triangular grid (hence the name), with little peons on each segment of road that will carry goods from one end to the other. Some segments spontaneously create goods, others destroy them. The point is to try to get these guys to self-regulate by adding market prices to the goods — so they cost $1 when created, and every peon charges an extra $1 for carrying them around. I’m now at the point where that all works, but there’s no intelligence — peons will buy and transport goods without checking first that anyone actually wants them. Anyway, tada:

Screencast created using pyvnc2swf and vnc4server to create a FLV for youtube. I tried recordmydesktop first, but something about the output completely confused youtube when I tried uploading it.
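
The pricing rule described above is simple enough to sketch; this is my toy illustration of it, not the actual trigrid code — goods cost $1 when created, and each peon adds $1 for carrying them across its segment:

```python
class Good:
    def __init__(self):
        self.price = 1  # goods cost $1 when created at a source segment

class Peon:
    def carry(self, good):
        # each peon charges an extra $1 for moving the good
        # along its segment of road
        good.price += 1
        return good

# a good carried across three road segments ends up priced at $4
g = Good()
for peon in [Peon(), Peon(), Peon()]:
    g = peon.carry(g)
```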


I don’t have anything interesting to say today, so instead I’m going to link to an oldish post on my junkcode wiki (rss feed). Namely redcat — a program that merges ed-format diffs, so they can be applied in a single pass. That makes a big difference when dealing with anywhere more than a few small patches to a single large file, most particularly when working with apt and pdiffs. There’s no actual plans to improve apt along these lines as far as I know — at least, there wasn’t any response when I poked the apt list some time ago — but it’s still got a couple of coding techniques folks might find interesting.

The Red Pill

ajtowns – Scamming my way onto “Team Samba” (“hey, I use it!”) was a good idea. Winners! #lca2010 #hackoff

Wellington Perl Mongers were awesome enough to run the Hackoff during LCA 2010. It consisted of a couple of hours of team hacking to decode craftily hidden eight character tokens. I’d seen Rusty carefully putting the “Samba Team” together earlier (“Here’s what’ll happen: I’ll put together a team so awesome that they won’t want me on it”), so when I wandered in with Biella to check out what was going on and found myself sticking around to see how it went, it seemed obvious which team I needed to latch onto.

The first four problems went down pretty well, albeit with a chunk of brute force rather than any elegance. The first question was some text munging of an HTML document, pretty much perfectly designed for a perl solution. Personally, I got stuck on trying to get the damn thing to display on my underpowered netbook, but that’s what teams are for, right? Problem two was pretty similar — it had an ascii punched tape that you had to treat as binary (holes are one, untouched is zero), translate to hex, translate to ascii, then actually read the resulting text rather than just plugging in the promising-looking eight characters from the end. I think I abused iprint and shell to get that one out — see: users can be contributors too! Number three was one for the spreadsheet mungers: a fixed width text database where you had to pull out various bits of information. We solved this one as a team: people worked on different queries however they liked; Andrew Bartlett and I imported it into openoffice.org and used its sorting facilities; other people did cleverer things. The fourth problem, which was the last one anyone got out before the organisers had to call time (and that after an extension), was an OpenOffice document inside a tarfile with various letters highlighted; putting the highlighted letters together gives you the answer. I think it was Jelmer who managed to pull the document apart and programmatically extract the answer from that, and with it the win.

The next question we got was somewhat horrific. It was an animated gif, consisting of 1300 odd frames of green text falling down the screen Matrix-style, with one or two characters randomly adorned with a blue or red square. The instructions on the first frame suggest choosing the “red pill” (the rabbit hole one) or the “blue pill”, with little pointers at the different coloured frames. So the answer seems to be to go through every frame and pull out the characters highlighted in red, and see what they end up saying.

We were kinda stymied by this, and after some back and forth, pulling the gif apart, converting to png, and basically getting nowhere we ended up just assigning chunks of 1000 frames to different team members. We got a bunch of frames done, but were still a ways off an answer when it got shut down. Then it was off to the nearby pub for dinner and beers (and Tridge’s analysis of the “Harry Read Me” climate data stuff).

(We’d only looked at the red squares, trusting in fandom to not let us down; one of the other teams had actually looked at the blue highlighted letters though — turns out they were just DEADBEEF repeating. Nice.)

After heading back I spent a bit of time playing with my pygame project; I think I got from just having a triangular grid to having little moving circles on top of the grid, or something equally compelling. But that then inspired me to have another poke at that problem. So I pulled out pygame, and the 1349 frames converted into PNM, and set about automatically extracting the relevant characters.

I made a few assumptions: that the characters would be laid out on a nice rectangular grid, that all the characters would look the same (every A is the same combination of pixels as every other A), and that the highlights would be in a predictable position relative to the character grid. I pulled up a frame in gimp to work out the pixel width of each character and the offsets for the frames, at which point it was just a simple matter of programming.
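
The grid-snapping arithmetic looks something like this sketch — the cell size and offsets here are made-up placeholders; the real values came from measuring a frame in gimp:

```python
# hypothetical per-character cell size and grid origin, in pixels --
# placeholders for the numbers measured from a frame in gimp
CELL_W, CELL_H = 9, 15
OFF_X, OFF_Y = 4, 2

def cell_for_pixel(x, y):
    # snap the pixel position of a highlight square back onto the
    # character grid: which column and row of text it covers
    return ((x - OFF_X) // CELL_W, (y - OFF_Y) // CELL_H)
```

With the grid cell known, each highlighted character can be cropped out and compared pixel-for-pixel against previously seen characters (since every A looks like every other A).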

And rather than explicate that further here, I’ll point at my junkcode page for the explanation of that programming and the prettified source instead. All in all, pretty neat, I thought.

Anyway, submitting that (sitting on the pavement outside the venue; a security guard questioned me about stealing the conference wifi, and a group of party girls walked past and commented “He’s facebooking!” — sadly, they were right) then allowed us access to the final problem they’d prepared, which was a midi file disguised as a spreadsheet where the answer was encoded as errors in a repeating tune. Beyond listening to it for a bit, I didn’t make an attempt at that point: graphics and OCR were scary enough; audio? Please no. I see now that a couple of other teams actually got that one out too eventually. Kudos!

(Tridge and Jelmer had a go at solving it independently too; they did it in C, loading each PNM into a structure in memory directly, scanning that for the red square, copying it into a separate structure, writing that to a new file, and running gocr over the file; then doing the next frame. PNM is a particularly great format for doing that in C.)