Markovian madness
This is an experiment about an experiment: I wrote a silly script that does fun stuff, then wrote up the experience on Storify. Since it didn’t give me the option to post to my blog in draft mode, I don’t think I’ll be doing things this way again. But it was fun, for a once-off.
-
Brendan Adkins has a problem: he has (at time of writing) 17 alter egos on Twitter, only one of which he’s actually responsible for. The rest range from hilarious cock jokes through dead-on-arrival throwaways to haiku. Recently I joked about contributing to the madness:
-
@BrendanAdkins So @MarkovBrendan can hope for your blessing after all? (No.)
-
@tTikitu I sometimes wonder if I already am MarkovBrendan.
-
Well, who could resist that?! So I whipped up a simple little Twitter-driven Markov chain language model in Python (it’s on bitbucket).
Markov chains are simple statistical models that probably have lots of more appropriate real applications, but I know them as a tool for generating nonsense. A Markov model takes a bunch of documents (here tweets), splits them into tokens (words, more or less), and constructs a statistical model of which tokens follow which other tokens, and how frequently. Using the statistical model you can generate nonsense that uses the same words with a similar statistical structure, so that it has the flavour of the original documents: train a Markov model on Carroll and you’ll get Carrollian nonsense, train it on a Bible and it generates biblical nonsense. (Yes, from the King James you’ll get begats, and no, they don’t come in order.) A lovely thing about Markov chains is how incredibly simple they are. My implementation is 32 lines of Python and took me less than two hours to write (including another 65 lines or so for hitting Twitter and storing the model in a file, but also including a bug in the installation script which meant no one except me could use it).
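If you want a feel for just how simple, here is a minimal sketch of the bigram idea (not the actual bitbucket code, and minus all the Twitter plumbing and file handling; the two example tweets are just stand-ins for real training data). The only trick is that storing every observed continuation in a list means random.choice automatically picks frequent continuations more often, which is the whole statistical model.

```python
import random
from collections import defaultdict

def tokenize(text):
    """Split a tweet into tokens (words, more or less)."""
    return text.split()

def train(tweets):
    """Bigram model: for each token, record every token observed to follow it."""
    model = defaultdict(list)
    for tweet in tweets:
        tokens = ["<start>"] + tokenize(tweet) + ["<end>"]
        for current, following in zip(tokens, tokens[1:]):
            model[current].append(following)
    return model

def generate(model, max_tokens=30):
    """Walk the chain from <start>, choosing each next token in proportion to how often it was seen."""
    token = "<start>"
    output = []
    while len(output) < max_tokens:
        token = random.choice(model[token])
        if token == "<end>":
            break
        output.append(token)
    return " ".join(output)

# Example: train on a couple of tweets and emit some nonsense.
tweets = [
    "I sometimes wonder if I already am MarkovBrendan.",
    "So MarkovBrendan can hope for your blessing after all?",
]
model = train(tweets)
print(generate(model))
```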
Of course I trained it on some recent Brendan tweets, and to my great delight:
-
@BrendanAdkins So I went ahead and made MarkovBrendan. It said, among copious nonsense, “Okay, I already am MarkovBrendan.” #notkidding
-
(Look at the top of this story for some prominent statistical source material.) And on the strength of that, someone went ahead and hooked it up to a real account:
-
WHO eventually learn to THERE WAS right: this is an ELABORATE PRANK
-
Shit Just Got Real. @brendan_ebooks, Season 2. Instant View Now.
-
(The @brendan_ebooks account won’t make much sense unless you know about @horse_ebooks: a spam account that tweeted nonsensical fragments of ebooks about horses, and became bizarrely popular among the same kinds of people that making altBrendans appeals to. (Admittedly it doesn’t make much more sense when you do know the context, but I suppose that’s rather the point.) I’m guessing the original @brendan_ebooks author discovered that writing high-quality nonsense is just as demanding as any other kind of high-quality writing, whereas picking the gems from a few hundred lines of Markov chain output is both easy and fun.)
Here’s the moment when I realised my little algorithm was all grown up: when it started meeting people I don’t know.
-
@endoftheshow Twitter brendans
-
@brendanadkins @brendan_ebooks WHAT JUST HAPPENED
-
But like any proud parent, I can’t keep myself from meddling. The simple model I built uses bigrams (two-word sequences), which produces extremely high-grade nonsense; trigram models (built on three-word sequences, in which the third word is predicted from the preceding word pair rather than the preceding single word) produce more natural-seeming output, but they also remove some of the sharp-angled course changes. Which makes for funnier output depends heavily on the corpus, and for Twitter (and the BrendanSphere) I really don’t know. Finding out should be fun, though.
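For concreteness, here’s a sketch of how the trigram variant differs from the bigram one (again, just an illustration rather than anything from the bitbucket repository): the model’s keys become word pairs instead of single words, so every choice is conditioned on the last two tokens.

```python
import random
from collections import defaultdict

def train_trigrams(tweets):
    """Trigram model: the next token is predicted from the preceding *pair* of tokens."""
    model = defaultdict(list)
    for tweet in tweets:
        tokens = ["<start>", "<start>"] + tweet.split() + ["<end>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            model[(a, b)].append(c)
    return model

def generate_trigrams(model, max_tokens=30):
    """Walk the chain, conditioning each choice on the last two tokens."""
    pair = ("<start>", "<start>")
    output = []
    while len(output) < max_tokens:
        nxt = random.choice(model[pair])
        if nxt == "<end>":
            break
        output.append(nxt)
        pair = (pair[1], nxt)
    return " ".join(output)
```

The trade-off is exactly the one described above: conditioning on two words of context copies longer stretches of the originals verbatim, so the output reads more smoothly but swerves less often.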