Roblog

Mini-Munging, Brighton Ruby conference, July 2015

21 July 2015

At Brighton Ruby conference 2015, I gave a talk entitled Mini-Munging: the Joys of Small Data. In it, I explained the toolkit that I use for data processing tasks, and attempted to convince the audience that it represents a saner way of processing data than “Big Data” tools — at least for that class of data that’s too big for Excel but too small to warrant processing on a huge cluster of machines.

The slides from the talk are below, along with a transcript.

If you enjoyed this talk, either in person or here, you’d probably also like my book Text Processing with Ruby. It’s all about manipulating textual data using Ruby, including things like regular expressions, writing parsers, natural language processing, and more.



Hello, I’m Rob!

This is my first conference talk, made doubly intimidating by being alongside such a brilliant lineup, so please do excuse any manifestations of pure terror — like fainting, or just mumbling incomprehensibly for 20 minutes. I’m also aware that I have the key “last spot before lunch”, so I’ll do my best not to keep you from Brighton’s various culinary delights for too long.


I’m a partner in a design and marketing agency in London called Big Fish. We do branding, design, marketing, and lots of digital stuff, mainly for food brands but for some others too. You might have seen some of our work. Probably, in fairness, tumbling from the shopping bag of someone chasing a child named Tarquin through your local Waitrose, but it does look pretty, right?


The work we do as a marketing company, like many of your jobs I’m sure, generates lots and lots of interesting data. Interactions with websites, entries into competitions, redemptions of coupons, exports from lots of different systems. And part of my job is to analyse that data, to try to make sense of it, create reporting systems and glue between different sources of data.

A lot of what I do is what you might call data munging.


Mung is a word that’s tragically underused, in my opinion, but it’s one that I really love, one of those old hacker words.

I’ve always liked the folk etymology that it’s an acronym for “mash until no good”, because that seems to sum up a lot of the work that you tend to do in this area.


So, this is the toolkit I use for all my data wrangling.


Not really.


The thing is, the data I’m working with isn’t “Big Data”. Even the biggest datasets I’m working with are gigabytes in size, not terabytes or petabytes.

In other words, they’re well within the capabilities of an ordinary laptop and a single developer.


And the other thing is… your data is probably small too. I think there’s this assumption in some circles that data that’s too big for Excel is big data, which just isn’t the case. There’s an enormous middle ground between Excel and Hadoop. And an enormous amount of interesting data sits within this middle ground. This is true of virtually all the data I need to process, and I’d wager that it’s true of much of your data, too.


I do the overwhelming majority of this data wrangling in the shell. Lots of the tools I use are over 40 years old. At 22, Ruby is a comparative spring chicken.

But that doesn’t make them any less powerful or any less suitable for the job, and in fact they’re uniquely capable.

So my job today is to convince you that this represents a saner, yet for some reason underrated, toolkit for most data munging jobs. There are a few reasons why I think this is the case.


Your shell, the command line, isn’t just somewhere that you go to type commands and see their output. It’s a fully-fledged programming environment that exposes the entire plumbing of Unix to you without you needing to write C, which is always a nice thing. And yet it’s fairly common among developers to ignore this power, and use only a tiny fraction of what their shell is capable of.


The foundation of the shell’s power is that its tools don’t speak complex, inscrutable binary protocols over hidden interfaces.

They output text and they accept text. Text that humans like you can read, inspect, and create yourselves.

Doug McIlroy, inventor of the Unix pipe, called text “the universal interface”.


If there is a universal interface, and this interface is adopted by all tools, then you gain the ability to compose together any tools, even ones that aren’t designed specifically to work together.

The Unix shell offers a way to do this that’s unparalleled in its ease of use: the pipeline. It lets us hook up the output of one process to the input of the next, without either program having to know of the other’s existence, let alone what it’s doing. We can chain together as many of these processes as we need to get the job done, each one performing a different transformation on the data that flows through the pipeline.


Let’s imagine we’ve got this log file, stored in CSV format. It stores actions performed on a website — the user’s email address, what the person did, and when they did it.

Let’s also imagine we were digging into this file and wanted to find out how many actions have been performed by each email address.

We can break this problem down into discrete steps: first extracting the email address, which is the first field in the file; then taking a count of how many times each unique email address occurs.

Writing this as a pipeline involves composing together processes that each perform one of those steps. So in this case we’re using cut to take the first field, sort to make identical email addresses appear together, and uniq to produce a count of the unique values. The final output is the neat summary that we were after, each email address along with how many log entries it has.
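A minimal sketch of that pipeline, with invented sample rows in the email, action, timestamp format described above (the file name is mine, not the slide's):

```shell
# Invented sample data matching the log format from the talk.
cat > log.csv <<'CSV'
bob@example.com,login,2015-06-01T09:15:00
alice@example.com,login,2015-06-01T09:20:00
bob@example.com,redeem_coupon,2015-06-01T10:05:00
CSV

# cut takes field 1, sort groups identical addresses together,
# and uniq -c counts each run of identical lines.
cut -d, -f1 log.csv | sort | uniq -c
```

Each output line is a count followed by an email address, which is exactly the summary described.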


Let’s imagine that we’ve looked at the output of that first command and said “hmm, Bob’s done an awful lot of things. When did he do them?”. In other words, we’d like to look at just the entries performed by Bob, grouped and counted by day.

With another pipeline, quickly typed, we can grab this information too.

Here we use grep to filter to only those lines which start with Bob’s email address. Then we use awk to output the first ten characters of the third field, which gives us the date portion of the date and time. Then it’s the same sorting and counting operation as before to see the summaries by day.
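The Bob-by-day pipeline might look like this. The sample rows are again invented, and I'm assuming a timestamp format whose first ten characters are the date:

```shell
# Invented sample data: Bob was busy on the 1st.
cat > log.csv <<'CSV'
bob@example.com,login,2015-06-01T09:15:00
bob@example.com,logout,2015-06-01T17:00:00
bob@example.com,login,2015-06-02T09:10:00
alice@example.com,login,2015-06-01T09:20:00
CSV

# grep keeps only Bob's lines; awk prints the date part of field 3;
# sort and uniq -c then count the entries per day.
grep '^bob@example.com,' log.csv \
  | awk -F, '{ print substr($3, 1, 10) }' \
  | sort | uniq -c
```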

This illustrates the power of these sorts of pipelines. We can do this sort of ad hoc, exploratory processing really easily. As a bonus, each step of the process generates output too. This makes it easy to gradually build up the different parts of your processing, checking each one as you go and thinking about what transformation has to happen next. The feedback loop is really tight.

At this point, it’s often not about getting an answer to a question: it’s about figuring out what the question is.


The nice thing about those previous examples was that in no meaningful sense did I create new functionality. I just composed together existing things. I got so much for free; I just needed to slot things together.

But for all that, the final result wasn’t in any way a compromise solution. These are utilities written in C, with decades of optimisation, that the shell helpfully runs for us in parallel. They’re fast, and potentially very fast.

There was an article that did the rounds last year by Adam Drake, a data scientist. It was a response to an article in which someone had used a cluster of 7 AWS instances running Hadoop to process a 2GB file containing the results of chess games. The Hadoop job took about 26 minutes to complete the processing. Adam set about writing the same processing in a shell pipeline, to see how long that would take. His pipeline finished in seconds on a single laptop, around 235 times faster than the cluster.
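Drake's actual pipeline leaned on awk; this is just a minimal sketch in the same spirit, counting the result lines of a PGN chess file (the sample games are invented):

```shell
# Invented PGN fragments: each game records its outcome on a [Result ...] line.
cat > games.pgn <<'PGN'
[Result "1-0"]
[Result "0-1"]
[Result "1-0"]
[Result "1/2-1/2"]
PGN

# Count how many games ended in each outcome.
grep '^\[Result' games.pgn | sort | uniq -c
```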


This isn’t about doing something that’s inferior to Big Data but simpler. It’s genuinely powerful in its own right.

The lesson here is to pick the right tool for the job. If we were talking about a petabyte of data, or something that was bound by raw computing power, then the Hadoop job would wipe the floor with the shell; but in most cases we’re not.


The other nice thing about this sort of processing is that as Rubyists we don’t need to throw out any of our Ruby knowledge.

Ruby is an unashamedly Unix-centric language, and inherits much from Perl in its approach to and ability to perform text processing. It comes with so many things that make these sorts of tasks easy, and you get the benefit of the wider Ruby ecosystem too, all of the gems and libraries that exist.


Most people who program Ruby know that you can invoke it from the command-line using the ruby command. You pass it the filename of a script, and it executes that script for you. You probably did this for your first “hello world” script.

But you can also fairly radically alter the interpreter’s behaviour to allow you to write powerful one-liners, using these command line switches among others. These allow you to:

  • Pass code on the command line
  • Loop over lines of standard input automatically
  • Loop over and print standard input, easily creating a filter program
  • Loop over standard input, splitting lines into fields on a given character automatically
  • Require a library or gem

Let’s revisit that log example. Let’s imagine we wanted to count log entries not by email address, as we did before, but by the IP addresses of the hostnames of the email addresses. (It sounds weird, but I’ve had to do literally this before.)

Here’s where we can slot some Ruby in. We use cut to get the first field, same as we did before, then cut again, separating on the @ sign, to get the hostname portion of the email address. So the output here is a list of hostnames, one per log entry. Then we use Ruby, requiring the Resolv library from the standard library and telling Ruby to loop over the lines in standard input. For each line, we output the IP address of that hostname. Then it’s the now-familiar sort-and-count dance, and we see the output — the unique IPs in the log, along with how many log entries they have.


The last example illustrates a general principle. Get as far as you can with the commands your operating system gives you, and then use Ruby for the tough bits. You’ll end up writing much less code, it’ll likely be more performant, and it’ll probably be less buggy too.

But that doesn’t mean you have to use coreutils at all, if you don’t want to. It’s possible, and often desirable, to write one-liners entirely in Ruby.


Here’s one I wrote recently. It takes a CSV file, looks for all the URLs in all of the fields it can find, and replaces them with Bitly-shortened versions of those URLs. It uses CSV from the standard library and the Bitly gem.

It’s not something I wanted to commit to a script; it’s just a one-off. In this case, we were producing tens of thousands of pieces of packaging for one of our clients, each one with a unique URL printed on it. But the URLs, with their long unique tokens, were too long to fit on pack. So we needed to shorten all of them, and doing it by hand wasn’t going to be possible. But equally it’s not something I’m going to need to do regularly, to the point where I’d commit it to a script. So I wrote this one liner.

CSV, like lots of other parts of the Ruby standard library, anticipates this sort of usage and offers a filter method that takes standard input by default and outputs to standard output by default. So all I need to do is specify the transformation of the fields; I don’t need to think about input or output.
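The shape of that one-liner, sketched with a placeholder lambda standing in for the bitly gem (the real call needs an API token; the sample CSV is invented). Without headers, CSV.filter yields each row as a plain array, and whatever the block leaves in the row gets written back out:

```shell
printf 'foo,http://example.com/x?token=abc123\nbar,hello\n' \
  | ruby -rcsv -e '
      shorten = ->(url) { "http://bit.ly/stub" }  # placeholder for the Bitly call
      CSV.filter do |row|
        row.map! { |f| f.to_s.start_with?("http") ? shorten.(f) : f }
      end'
```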

The end result isn’t particularly pretty, but it doesn’t have to be: it’s a write-once, run-once thing.


And it’s not just the standard library that facilitates this sort of usage. Other parts of the Ruby ecosystem play nicely with this sort of one-liner too.

In this case, Nokogiri. Not only does it come with a command-line interface, but that interface accepts an -e option just like Ruby itself does, which allows us to pass code to be evaluated. Nokogiri takes care of fetching and parsing the document for us; we just need to do something with that document.

Here we’re extracting all of the source URLs of all of the images in a Wikipedia article. We’re then piping these into wget to download all of those images. Although these two steps are already parallelised for us, because we’re using a pipeline, we’re going a step further and downloading four images at once — something that’s incredibly easy to do in the shell — we just pass -P4 to xargs to tell it to fire up four processes at once.
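The parallel half of that pipeline can be tried in isolation. Here echo stands in for wget so nothing is actually downloaded, and the URLs are invented:

```shell
# xargs -n1 runs the command once per URL; -P4 keeps up to four of those
# processes running at the same time.
printf 'http://example.com/a.png\nhttp://example.com/b.png\n' \
  | xargs -n1 -P4 echo would-fetch
```

Swap `echo would-fetch` for `wget` and you have the download step from the slide.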

The shell, coreutils, and Ruby, all working in harmony. Best of all, it took 2 minutes to write, I didn’t have to commit anything to a file, and I didn’t have to do any of my own plumbing. There’s nothing there about downloading images, or making the original HTTP request to the article. I definitely didn’t have to think about parallelising the downloads of the images. I just had to write the bit that’s unique to my problem — extracting the image URLs.


To try and wrap this up: your shell is a programming environment that exposes the full power of Unix to you. It’s not just somewhere you go to type commands, it’s somewhere that allows you to construct elaborate and lightning-quick data processing pipelines with very little work.


So much of Ruby’s design fits seamlessly and joyfully into this world. It’s built for this stuff, you don’t need to fight it to get things done.


And the vast majority of the data that you will likely encounter in the real world suits this type of processing far more than it suits the typical big data toolkit.


So yeah, go out and try this stuff. Write some glue that fits between two systems that wouldn’t otherwise talk to one another. Dig into some data that you’ve got lying around. Go mung stuff and see what you find. And the next time you’re tempted to fire up Hadoop, think about whether you could get things done in less time on your laptop.


The slides will be up on Slideshare, if you want to use this as a reference. I think learning these 15 or so commands will give you the power to solve 90% of the data processing problems you’ll encounter, and throwing Ruby in will let you solve the remaining 10%.


And finally, some shameless self-promotion.

If you want to get into this sort of data wrangling a bit more, I’ve got a book out with Pragmatic that covers a lot of this stuff plus things like screen scraping websites, writing parsers, natural language processing, regular expressions, and other things that come in handy when trying to chew through text.

So if that sounds like something interesting to you, then check it out! It’s definitely aimed at the beginner or intermediate developer more than the expert, but there’s hopefully something in it for everyone.

And… that’s that!