Nokogiri as a command-line tool
Most Rubyists are familiar with Nokogiri. It’s a combination XML and HTML parsing tool most commonly used for “screen scraping”: that is, fetching a web page, and searching through it to extract information of interest. When a website you’re interested in doesn’t offer an API, it’s often the only way to extract information from it.
Nokogiri offers both XPath and CSS interfaces to the documents you load into it; both are enormously powerful, and allow you to quickly home in on the areas of the page that you want to retrieve. Its CSS selectors will be familiar to anyone who’s written advanced CSS or used jQuery, and are a quick and clear way of zipping down into the page hierarchy to fetch a specific element or set of elements.
As an example, we might want to fetch the name of the currently featured article on the English-language Wikipedia. If we look at the source of the page, the relevant section of markup looks like this:
There are some obvious hooks we can grab onto there: the title is in an <a> tag, and that <a> is inside the featured article div, which helpfully has an id of mp-tfa.
Taking this to Nokogiri, we could extract the information we’re after with a simple script:
Run this, and it should output the name and URL of today’s featured article.
So far, so standard; if you’ve used Nokogiri at all, you’ve written something like this. What fewer people know is that Nokogiri can also be used from the command line, without writing a script at all. This lets you harness its power even in small, throwaway shell one-liners, and regular readers of the blog will know that, if there’s one thing I like, it’s writing shell one-liners in Ruby.
Once you’ve installed the Nokogiri gem, its command-line interface is available using the nokogiri command.
Typically, we want to work with a remote website; that means we need to fetch the content of that remote site. For that, it’s probably easiest just to use curl, and pipe the output into Nokogiri (that’s the Unix way, after all — small tools interacting).
We can then pass some Ruby code to Nokogiri using the -e option (just like we can with the ruby executable), and it will execute it. It helpfully parses the document for us automatically and assigns it to $_, so we can get started right away.
Let’s stick with our Wikipedia example: how would it work on the command line? It’s quite straightforward, really:
As you can see, we fetch the page with curl, then pipe the output into Nokogiri. Operating on the document, stored in $_, we use a CSS selector to pick out the same element we were selecting before, and output the title of the article. Run this command, and you should see the name of the article printed to your terminal. Neat, eh?
Of course, we don’t have to limit ourselves to piping input from curl. We could just as easily cat the contents of a local HTML file into Nokogiri; in this case Nokogiri neither knows nor cares where the HTML it’s parsing comes from.
Piping input is useful, but sometimes we just want to have a play with the document and get a feel for it — to try out some selectors before actually using them in a script for example. For this, we can use the Nokogiri script’s interactive mode, passing it the URL to fetch:
Voila! Nokogiri has dropped us at an IRB prompt, and has fetched and parsed the document for us; as you can see, it’s stored in the @doc variable. We can now try out some operations on the document and get instant feedback on what’s being selected. This is exactly what I did when writing the first script in this post; I started with a selector that got me roughly where I wanted to be (div#mp-tfa i a) and refined it until it selected the element I wanted (and nothing else).
Whether you want to quickly snatch something out of a document in a shell script, are constructing an elaborate one-liner, or want to explore a document before actually committing to writing a script that scrapes it, Nokogiri’s command-line interface is a really useful tool. Give it a whirl!