A big part of life at Hacker School is code reviews. Sometimes facilitators review code, but for the most part it’s our super talented peers.
Getting reviewed is really valuable for accelerating learning. It points out unclear parts of your code and blind spots you didn’t know you had. In some of my projects I’ve implemented things in a brute-force way, and my reviewers have pointed me to data structures or libraries I hadn’t yet discovered that allowed me to cut out lots of code.
Today Allison (a facilitator) offered a demo code review, so we could see how she thinks about a code review. We went over a Lisp implemented in Python written by Nick.
The crowd watching was pretty diverse, so everyone took something different away. Some learned about Python, some about the Git workflow, and some about Lisps.
TL;DR. I noticed a bug in Octopress. In fixing it, I found a separate, truly spectacular error, and learned a lot of interesting things about bash. I would never have learned so much outside of Hacker School, since it gives me the time and space to open up the box.
The original bug.
Octopress’ rake preview lets you preview your post in the browser before deploying. It spawns Rack to serve local requests and Guard to watch for file changes. Guard in turn spawns Fsevent to do the actual watching.
But when you ctrl-C, Fsevent stays running in the background until you kill it or modify a file in the watched directory. This throws a TCPServer error if you try to run rake preview a second time. Sad pandas.
Initially I had no idea what was going on here. I’ve dabbled in Ruby but not modified a Rakefile before. I learned ps and kill ten years ago, but didn’t know much about what was going on under the hood. Ripe conditions for learning a ton.
This is what I love about Hacker School. I have the time and space to go down these rabbit holes. At a startup, priority goes to shipping code. Here, shipping code is the means, not the end.
In the end, the actual bug was pretty small: when you interrupt rake, it sends signal 9 (KILL) to Guard. Guard ends itself but doesn’t properly terminate its child process, fsevent.
This can be duct-taped together by sending signal 3 (QUIT) instead of 9 (KILL) to Guard, but we’re asking the Guard team if this is a known issue. But it gets better!
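The difference between the two signals is easy to see in miniature. Here’s a hedged Python sketch (not Octopress’ actual code) of why QUIT can work where KILL can’t: a process can trap QUIT and clean up after itself, but KILL never reaches the process at all.

```python
import os
import signal
import subprocess
import time

# A stand-in for Guard: a process that traps QUIT and exits cleanly.
CHILD = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGQUIT, lambda *_: sys.exit(0))  # clean up, then exit\n"
    "time.sleep(30)\n"
)

def run_and_kill(signum):
    proc = subprocess.Popen(["python3", "-c", CHILD])
    time.sleep(0.5)           # give the child time to install its handler
    os.kill(proc.pid, signum)
    return proc.wait()

print(run_and_kill(signal.SIGQUIT))  # 0: the handler ran, so cleanup could happen
print(run_and_kill(signal.SIGKILL))  # -9: the process never got a say
```

A KILLed process can’t run any cleanup code, which is exactly how orphaned fsevent processes get left behind.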
Curiouser and curiouser
Things got really weird when I was playing around with the different exit codes. The Ruby documentation says that if you pass kill a negative argument, it will kill the entire group of processes, not just the child process. Promising!
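Here’s a small Python equivalent of that idea (the processes here are invented stand-ins, not the real rake/Guard/fsevent trio): put the parent in its own process group, then signal the whole group at once.

```python
import os
import signal
import subprocess
import time

# A parent that spawns its own child, mimicking rake -> guard -> fsevent.
parent = subprocess.Popen(
    ["python3", "-c",
     "import subprocess, time\n"
     "subprocess.Popen(['sleep', '30'])\n"
     "time.sleep(30)\n"],
    start_new_session=True,  # setsid(): fresh session, fresh process group
)
time.sleep(0.5)

# os.killpg signals every process in the group -- the same effect as
# passing a negative pid to kill in Ruby or at the shell.
os.killpg(os.getpgid(parent.pid), signal.SIGTERM)
print(parent.wait())  # -15: the parent (and its child) got SIGTERM
```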
However, it broke very spectacularly. If you want to follow along, grab the development branch of Octopress, currently 2.1.
Weird! Press any key that sends input. PRY EXPLODES. Press enter and you get a bash prompt back, but now you can’t see anything you type. It’s still getting to bash (try ls <enter>), but some things, like control-L, no longer work.
Down the rabbit hole
So I started reading about TTY and POSIX signals and using stty. Interesting stuff, particularly the history of our terminal evolving from ticker tape outputs.
You can also change all sorts of wacky things about your terminal with stty. Try stty -echo (and stty echo to undo it). This explains why I wasn’t able to see my own typing after the Pry explosion: when control was reluctantly handed back to bash, the flags on my terminal weren’t properly reset, including the flag for raw (non-canonical) input processing, which is why the terminal won’t process things like control-L until you hit enter.
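Those flags are just bits in the terminal’s termios settings, and you can poke at them programmatically too. A sketch in Python, using a throwaway pseudo-terminal so your real terminal stays safe:

```python
import pty
import termios

# Open a scratch pseudo-terminal instead of touching the real one.
master_fd, slave_fd = pty.openpty()

# attrs[3] holds the local-mode flags, including ECHO.
attrs = termios.tcgetattr(slave_fd)
print(bool(attrs[3] & termios.ECHO))  # True: echo starts out on

# This is what `stty -echo` does under the hood: clear the ECHO bit.
attrs[3] &= ~termios.ECHO
termios.tcsetattr(slave_fd, termios.TCSANOW, attrs)

print(bool(termios.tcgetattr(slave_fd)[3] & termios.ECHO))  # False: typing would now be invisible
```

A program that fiddles with these bits and then crashes without restoring them leaves you with exactly the broken terminal described above.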
I didn’t find all the answers in my reading, but I’m asking significantly better questions:
How process groups are being used is unclear. I expected the process group to inherit its gid from the parent process and to be killed cleanly by passing a negative argument to kill, but that didn’t work: passing -3 kills Rack but only mortally wounds Guard.
One remaining mystery: why is the bash prompt printed to the terminal after the Guard prompt when Guard is still running? Maybe control is passed from Guard to bash and then back to Guard?
It seems that Pry is getting something unexpected on its stdin, which triggers its explosion; that input may or may not be coming from bash. But how to intercept it?
The Readline library defines some bash shortcuts, like mapping ctrl-L to clear the screen, but lets you override them with a .inputrc file. So you can do awesome things to your prompt, like adjusting how ctrl-w works. There are also vim and emacs modes for bash.
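For instance, a .inputrc along these lines (the particular bindings are just examples, not anything this setup depends on) gives you vi keys and a friendlier ctrl-w:

```
# ~/.inputrc -- read by Readline, and therefore by bash
set editing-mode vi            # vi keybindings at the prompt
"\C-w": backward-kill-word     # stop at punctuation, not just whitespace
```

One caveat: ctrl-w is usually grabbed by the terminal driver’s werase function before Readline ever sees it, so you may also need stty werase undef for that binding to take effect.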
Stuff I learned: a little Ruby, Rakefiles, POSIX signals, the history of TTY, more about stdin, stdout, and stderr, stty, working with pids, gids, and the Process module in Ruby. And my first accepted (albeit one-character) pull request of Hacker School, which is certainly the highest learning-to-LOC ratio I can think of.
This blog is powered by Octopress, which is basically a set of rake tasks, themes, and other add-ons that generate a blog from Markdown posts. It’s in turn powered by Jekyll.
The documentation is generally pretty good, but it doesn’t really explain one fundamental thing. It’s pretty simple once you dig into the Rakefile, but here’s a quick explanation if you just want to get up and running.
WTF is going on with the branches?
On github, you will have two branches: source and master. But locally:
$ git branch
* source
Huh. Interesting. Locally, you only have one branch: source. Wat?
Basically, source holds your posts in Markdown and other files before they are transmogrified into HTML. Once you run rake generate, Octopress will generate all the HTML & CSS and put it all in /public.
And when you run rake deploy, Octopress pushes the contents of /public (on the local source branch) to the top level of the remote master branch.
So: each time you finish a post, run BOTH rake deploy to deploy and git push origin source to back up your source files to github.
Sublime Text <3 Markdown
So you’re writing your blog posts in Markdown like a boss. There are a few things you can do to make Sublime Text 2 a lean, mean blogging machine:
Custom themes. For example, MarkdownEditing, a series of custom themes and shortcuts for Markdown. (Hint: installing Package Control first will make this easier)
Sublime Text 2 uses Hunspell for spell checking, the same library used in OpenOffice and Firefox. Getting extra dictionaries set up was a bit more convoluted:
Command-shift-p in ST2 to open Package Control
Choose “Add Repo”. Paste in http://www.sublimetext.com/docs/2/spell_checking.html
Command-shift-p again. Choose “Install Package,” then “Dictionaries”
Find your preferences in Preferences > Package Settings > Markdown Editing > Markdown Settings - User
Last time, we asked your advice on how to handle all possible trigrams in 15 MB of seed text - a task that took up about 1 GB of memory in our first implementation, rapidly blasting through Heroku’s limit of 512 MB. The culprit? Dictionaries.
Our initial data structure was a dict of dicts. The key was a bigram, or word pair, and the value was a dict of all of the words that followed that bigram in the training text and the number of times they occurred. Wat? Here’s an example:
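Here’s a minimal sketch of that structure, with an invented toy headline standing in for the real corpus:

```python
from collections import defaultdict

def build_matrix(words):
    """Map each bigram (word pair) to a dict of follower -> count."""
    matrix = defaultdict(lambda: defaultdict(int))
    for a, b, c in zip(words, words[1:], words[2:]):
        matrix[(a, b)][c] += 1
    return matrix

# Toy training text, invented for illustration:
headline = "why i love why i hate hacker news".split()
matrix = build_matrix(headline)
print(dict(matrix[("why", "i")]))  # {'love': 1, 'hate': 1}
```

Every word is stored as a full string, once per bigram it appears in, which is where the redundancy comes from.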
With 15 MB of training text and so much redundancy, this can get out of hand quickly! Every word appears at least 3 times in the matrix, and some appear thousands of times.
Generators, for iterating over the headlines. This helped a lot - it got us from 1.3 GB down to about 850 MB.
Compression. Tossing out common values, like a count of 1 for the long tail of words.
Creative use of data types, e.g. storing words as ints instead of strings.
Lists. They take up less memory than dicts. One variation that we tried (and is in this version of our code) is creating two dicts of lists rather than one dict of dicts.
Trade memory for speed. Darius, a Hacker School alum, pointed us to his Stack Overflow answer, where he laid out a few other approaches that mainly trade memory usage for speed. Also, David dug up some recent papers that focus entirely on this problem. Researchers are currently devising some very clever algorithms to get the most out of their data structures.
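The generator trick is worth spelling out: instead of reading every headline into one big list, yield them one at a time, so only a single line is ever in memory. A sketch (the file contents here are invented):

```python
import tempfile

def headlines(path):
    """Yield one tokenized headline at a time instead of building a list."""
    with open(path) as f:
        for line in f:
            yield line.strip().split()

# A tiny stand-in for the 15 MB corpus file:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("show hn my side project\nwhy i quit my job\n")
    corpus = f.name

gen = headlines(corpus)
print(next(gen))  # ['show', 'hn', 'my', 'side', 'project']
print(next(gen))  # ['why', 'i', 'quit', 'my', 'job']
```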
So where did we leave it?
Darius’ suggestions and the approaches in the literature will certainly help reduce memory usage. But we wanted to maximize learning per unit of time at Hacker School, and we didn’t think implementing these approaches would teach us proportionally more than reading about them. We shipped it by deploying on a machine with more memory. Follow the Twitter bot here!
Stuff I learned along the way:
Generators, classes and methods in Python, pickle, regex, defaultdict, Tweepy, Buffer’s API, how bad the built-in titlecase methods are in Python, some clever ways to handle default arguments, and some fundamental principles of NLP.
David and I are running our app on Heroku, but it uses too much memory to run! Got any clever optimization suggestions?
We’re building a Markov chain generator that is trained on a corpus of 15 MB of old Hacker News headlines. The app currently indexes the training corpus and constructs a matrix of all the possibilities, then queries that matrix to generate a hopefully entertaining new headline. Preview it here.
However, as we first wrote it, it used over 1 GB of memory when we deployed to Heroku, where the limit is 512 MB! We wrangled it down to 570 MB today and will tackle it again tomorrow morning, but we need more ideas.
Some things we’ve tried:
Pickling the matrix. Not pickling the matrix. (didn’t help)
Storing the words as ints in our matrix rather than strings (helped)
Tweaked the data structure (helped, could probably do more with this)
We ran out of time, but want to try some sort of data store. Redis?
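The words-as-ints idea is simple interning: store each distinct word once, and refer to it everywhere else by a small integer index. A hedged sketch of the idea (the names here are ours, not necessarily what the app uses):

```python
word_to_id = {}  # word -> index
id_to_word = []  # index -> word, for decoding generated headlines

def intern(word):
    """Return a small int that stands in for this word everywhere."""
    if word not in word_to_id:
        word_to_id[word] = len(id_to_word)
        id_to_word.append(word)
    return word_to_id[word]

ids = [intern(w) for w in "show hn show hn why".split()]
print(ids)                           # [0, 1, 0, 1, 2]
print([id_to_word[i] for i in ids])  # ['show', 'hn', 'show', 'hn', 'why']
```

The matrix then stores small ints instead of thousands of copies of the same strings.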
A lightweight introduction to Markov chain generators
Markov chain generators are an interesting and dead simple way to generate a chain of anything - words, weather predictions, market simulations - anything that has happened before and can happen again.
The chain is generated by examining the current link and picking the next link based on what typically followed the current one in the training corpus. Then it completely forgets about that first link: it chooses the third link based only on what generally comes after the second. And so forth.
So there is no history, no memory. The generator can switch between training sentences stochastically, so it generates sentences that sound kind of like the original source but don’t necessarily make any sense - and are hopefully funny.
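In code, the walk looks something like this - a hedged sketch with a toy bigram table, not our production bot:

```python
import random

def generate(matrix, seed, max_words=10):
    """Walk the chain: each next word depends only on the current word."""
    out = [seed]
    while len(out) < max_words and out[-1] in matrix:
        followers = matrix[out[-1]]  # word -> how often it followed
        words = list(followers)
        counts = [followers[w] for w in words]
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

# Toy transition table, invented for illustration:
matrix = {"why": {"i": 2, "google": 1}, "i": {"quit": 1}}
print(generate(matrix, "why"))  # e.g. "why i quit" or "why google"
```

Note that weighting by count means common continuations in the training text stay common in the output.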
Bigrams vs trigrams
The first design decision to make is how much history to examine when choosing a new link in your chain. If you only go one word back, you are looking at pairs of words, or bigrams.
Some sentences generated with bigrams (please pardon that they’re all seeded with ‘why’ - this is from an early branch):
Why I Am I Talk About To Be Designing For The Daily Check Out Of Braid Creating False
Why Machines Is Right Fit
Why Nokia Partners
Why We Doing In Siberia
Why Are Literally Amp Chrome Opera Singer Is Coming Soon
Why The Twitter Besides Buying Groupon Will Not Good Freely Available To Effectively Off
Why Objective C Safer In Gears
Why Some Sleep Deprived Brains
Why I Like Instapaper Redesigns Foursquare Checkin Offers Readers Cause Problems And Should Set Theorists
Why Dropbox S More Music Gear Online Teaching Ror Developers
They are nice and random, and clearly contain the right buzzwords, but they aren’t very grammatical and therefore can’t be funny. Humor relies on surprise: setting up an expectation, then delivering something different. With bigrams, you never get a coherent enough sentence to generate an expectation, so no shot at being funny.
Let’s smooth out our chains by moving from bigrams to trigrams. So instead of looking at what follows a given individual word in our training corpus, we’ll look at what follows a pair of words. Here are some examples - note that the grammar is significantly better and some are worth a chuckle. I particularly like 7 and 10.
My Year Of Experience Is A Big Twist
Mini-microsoft: Compensatory Arrangements Of Certain (microsoft) Officers
The 12-step Landing Page
More Webmaster Questions - Answered
Scripting Gnu Screen
The Full Social Network Buttons To Operate On Your Terrorist Neighbor
Typical Tech Entrepreneur?
The Lost Lesson Of 'free'
Contact Lenses Are Curing The Founder's Syndrome
HN is at its least self-conscious and most easily lampooned when doling out advice. Consider these ‘how’- and ‘why’-seeded sentences:
How To Choose The Right People And The Chance To Present
How Do You Manage Your Startup’s Pr At Tech Startups Are Moving To Rackspace
Why Do Organic Eggs Come In Bunches
How Apple Is The Prevalence Of Qwerty
Why Google Wants To Magically Transfer Gov Debt To Darwin
How To Finance Your App From The Lhc Will See Global History Of Governments And Geeks Parse The World
Why Are Bank Security Questions On Agile
How To Hack The Us - So Stock Up 879.55%
Why Computer Displays Suck For You
When we choose a first word randomly from all the words that have ever been in headlines, we get a bigger assortment, but I think they’re less funny:
Canonical Contributes Only 1% Of Profit
Google Uneveils New Search Results With Google's Closure Of Paid Prioritization
What Do You Deal With Worldnow, Adds 19 Million Potential Users
Diminishing Dead-tree Media And Mobile Computing Is A Beautiful Monster
Ask Hn Yahoos: What Yahoo Should Do To Excel
China Demands New Pcs Is Ruined By A Thousand Years
Freemium: A Business Plan Competition
Buy My Blog, Please
Why I Am In Your Field?
8 Tips To Considerate When Planning To Move Themselves (neural Network)
The other drawback is that they’re more likely to hit on a seed with only one possible resulting sentence, like “Buy My Blog, Please,” above.
$ grep -i "buy my blog" ../hnfull.txt
Gawker media boss Nick Denton: Buy my blog, please
If you think of the bot as walking through the possibilities, a common seed like ‘how’ will branch off in lots of different ways, so there are many paths for the bot to walk and thus many possible sentence outcomes. A less common seed, like ‘scripting’, will result in fewer possible paths and is more likely to just return a real headline verbatim. There were only 10 ways to finish “buy my” in our corpus of 350,000 training headlines, compared to 7,710 ways to finish “how to.”
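You can measure that branchiness directly by counting distinct continuations per seed. A sketch against a toy matrix (the real numbers above came from the full corpus):

```python
def continuations(matrix, bigram):
    """How many distinct words can follow this seed bigram?"""
    return len(matrix.get(bigram, {}))

# Toy matrix, invented for illustration:
matrix = {
    ("buy", "my"): {"blog,": 1},
    ("how", "to"): {"hack": 3, "finance": 1, "choose": 2},
}
print(continuations(matrix, ("buy", "my")))  # 1: the walk is forced
print(continuations(matrix, ("how", "to")))  # 3: room to surprise
```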
Raising the stakes
But how to make it as funny as possible? Some possible improvements:
Crowdsource funniness ratings (with mechanical turk or a ‘hot or not’ app, etc). Only tweet out the funniest headlines.
Feed the funniness ratings back into the algorithm. For example, only use the funniest seeds.
Do semantic analysis of parts of speech in the training corpus and use them with templates. This would improve grammar but decrease spontaneity.