Sasha Laundy

In Search of Funny Gibberish

David and I paired this week to build a bot that analyzes 15MB of old Hacker News headlines and generates completely new ones that really sound like they came from HN. You can follow the Twitter bot or check out our code.

Inspiration

I was inspired by this Hacker News parody and this Twitter bot to build a Markov chain generator as my warm-up project in Python.

A lightweight introduction to Markov chain generators

Markov chain generators are an interesting and dead simple way to generate a chain of anything - words, weather predictions, market simulations - anything that has happened before and can happen again.

The chain is generated link by link: the generator looks at the current link and picks the next one based on what typically followed it in the training corpus. Then it completely forgets about that first link and chooses the third link based only on what generally comes after the second. And so forth.

So there is no history, no memory. The generator can switch between training sentences stochastically, so it can generate sentences that sound kind of like the original source, but don’t necessarily make any sense, and are hopefully funny.
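
In code, the simplest version (one word of history - the bigram case discussed in the next section) is just a lookup table plus a loop. Here is a minimal sketch in Python, not our actual bot code; the tiny corpus and function names are only for illustration:

import random
from collections import defaultdict

def build_table(headlines):
    # Map each word to every word that followed it somewhere in the corpus.
    table = defaultdict(list)
    for headline in headlines:
        words = headline.lower().split()
        for current, following in zip(words, words[1:]):
            table[current].append(following)
    return table

def generate(table, seed, max_words=12):
    # Walk the chain: each new word depends only on the word before it.
    word = seed
    output = [word]
    while len(output) < max_words:
        followers = table.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return ' '.join(output).title()

corpus = ["why javascript is broken", "why do organic eggs come in bunches"]
print(generate(build_table(corpus), "why"))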

Bigrams vs trigrams

The first design decision to make is how much history to examine when choosing a new link in your chain. If you only go one word back, you are looking at pairs of words, or bigrams.
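
Concretely, a single headline breaks down into these overlapping pairs (just a sketch):

words = "why javascript is broken".split()
bigrams = list(zip(words, words[1:]))
# [('why', 'javascript'), ('javascript', 'is'), ('is', 'broken')]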

Some sentences generated with bigrams: (please pardon that they’re all seeded with ‘why’ - this is from an early branch)

1. Why I Am I Talk About To Be Designing For The Daily Check Out Of Braid Creating False
2. Why Machines Is Right Fit
3. Why Nokia Partners
4. Why We Doing In Siberia
5. Why Are Literally Amp Chrome Opera Singer Is Coming Soon
6. Why The Twitter Besides Buying Groupon Will Not Good Freely Available To Effectively Off
7. Why Objective C Safer In Gears
8. Why Some Sleep Deprived Brains
9. Why I Like Instapaper Redesigns Foursquare Checkin Offers Readers Cause Problems And Should Set Theorists
10. Why Dropbox S More Music Gear Online Teaching Ror Developers

They are nice and random, and clearly contain the right buzzwords, but they aren’t very grammatical and therefore can’t be funny. Humor relies on surprise: setting up an expectation, then delivering something different. With bigrams, you never get a coherent enough sentence to generate an expectation, so no shot at being funny.

Let’s smooth out our chains by moving from bigrams to trigrams. So instead of looking at what follows a given individual word in our training corpus, we’ll look at what follows a pair of words. Here are some examples - note that the grammar is significantly better and some are worth a chuckle. I particularly like 7 and 10.

1. My Year Of Experience Is A Big Twist
2. Engine Yard
3. Mini-microsoft: Compensatory Arrangements Of Certain (microsoft) Officers
4. The 12-step Landing Page
5. More Webmaster Questions - Answered
6. Scripting Gnu Screen
7. The Full Social Network Buttons To Operate On Your Terrorist Neighbor
8. Typical Tech Entrepreneur?
9. The Lost Lesson Of 'free'
10. Contact Lenses Are Curing The Founder's Syndrome
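
Under the hood, the only change from the bigram sketch above is that the lookup table is keyed on the previous pair of words rather than a single word. Again a rough sketch, not our exact implementation:

import random
from collections import defaultdict

def build_trigram_table(headlines):
    # Key on the two previous words; record every word that followed that pair.
    table = defaultdict(list)
    for headline in headlines:
        words = headline.lower().split()
        for a, b, c in zip(words, words[1:], words[2:]):
            table[(a, b)].append(c)
    return table

def generate(table, seed_pair, max_words=12):
    # Same walk as before, but the "current link" is now a pair of words.
    a, b = seed_pair
    output = [a, b]
    while len(output) < max_words:
        followers = table.get((a, b))
        if not followers:
            break
        a, b = b, random.choice(followers)
        output.append(b)
    return ' '.join(output).title()

# e.g. generate(build_trigram_table(corpus), ("how", "to"))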

Which seed?

HN is at its least self-conscious and most easily lampooned when doling out advice. Consider these ‘how’- and ‘why’-seeded sentences:

1. How To Choose The Right People And The Chance To Present
2. Why Javascript Is Broken
3. How Do You Manage Your Startup’s Pr At Tech Startups Are Moving To Rackspace
4. Why Do Organic Eggs Come In Bunches
5. How Apple Is The Prevalence Of Qwerty
6. Why Google Wants To Magically Transfer Gov Debt To Darwin
7. How To Finance Your App From The Lhc Will See Global History Of Governments And Geeks Parse The World
8. Why Are Bank Security Questions On Agile
9. How To Hack The Us - So Stock Up 879.55%
10. Why Computer Displays Suck For You

When we choose a first word randomly from all the words that have ever been in headlines, we get a bigger assortment, but I think they’re less funny:

1. Canonical Contributes Only 1% Of Profit
2. Google Uneveils New Search Results With Google's Closure Of Paid Prioritization
3. What Do You Deal With Worldnow, Adds 19 Million Potential Users
4. Diminishing Dead-tree Media And Mobile Computing Is A Beautiful Monster
5. Ask Hn Yahoos: What Yahoo Should Do To Excel
6. China Demands New Pcs Is Ruined By A Thousand Years
7. Freemium: A Business Plan Competition
8. Buy My Blog, Please
9. Why I Am In Your Field?
10. 8 Tips To Considerate When Planning To Move Themselves (neural Network)

The other drawback is that random seeds are more likely to hit on a phrase with only one possible resulting sentence, like “Buy My Blog, Please,” above.

$ grep -i "buy my blog" ../hnfull.txt
Gawker media boss Nick Denton: Buy my blog, please

If you think of the bot as walking through the possibilities, a common seed like ‘how’ will branch off in lots of different ways, so there are many paths for the bot to walk and thus many possible sentence outcomes. A less common seed, like "scripting", results in fewer possible paths and is more likely to just return a real headline verbatim. There were only 10 ways to finish "buy my" in our corpus of 350,000 training headlines, compared to 7,710 ways to finish "how to."
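
One way to see this is to count how many distinct continuations the corpus offers for a given two-word seed. A sketch, assuming the same headline file as the grep example above, with one headline per line:

from collections import defaultdict

continuations = defaultdict(set)
with open('../hnfull.txt') as corpus:
    for line in corpus:
        words = line.lower().split()
        for a, b, c in zip(words, words[1:], words[2:]):
            continuations[(a, b)].add(c)

print(len(continuations[('how', 'to')]))   # thousands of branches
print(len(continuations[('buy', 'my')]))   # only a handful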

Raising the stakes

But how to make it as funny as possible? Some possible improvements:

  • Crowdsource funniness ratings (with Mechanical Turk or a ‘hot or not’ app, etc.). Only tweet out the funniest headlines.
  • Feed the funniness ratings back into the algorithm. For example, only use the funniest seeds.
  • Do semantic analysis of parts of speech in the training corpus and use them with templates. This would improve grammar but decrease spontaneity.
  • Hire Nick and Dave to crank out more of these :)

Any other ideas?