Nokogiri as a command-line tool

6 January 2014

Most Rubyists are familiar with Nokogiri. It’s a combined XML and HTML parser, most commonly used for “screen scraping”: that is, fetching a web page and searching through it to extract information of interest. When a website you’re interested in doesn’t offer an API, scraping is often the only way to extract information from it.

Nokogiri offers both XPath and CSS interfaces to the documents you load into it; both are enormously powerful, and allow you to quickly home in on the areas of the page that you want to retrieve. Its CSS selectors will be familiar to anyone who’s written advanced CSS or used jQuery, and are a quick and clear way of drilling down into the page hierarchy to fetch a specific element or set of elements.

As an example, we might want to fetch the name of the currently featured article on the English language Wikipedia. If we look at the source of the page, the relevant section of markup looks like this:

<div id="mp-tfa" style="padding:2px 5px">
  <div style="float: left; margin: 0.5em 0.9em 0.4em 0;"><!-- article image --></div>
  <p>
    <i>
      <b>
        <a href="/wiki/Weather_Machine" title="Weather Machine">Weather Machine</a>
      </b>
    </i>
    <!-- rest of description snipped -->

There are some obvious hooks we can grab onto there: the title is in an <a> tag, and that <a> is inside the featured article <div> which helpfully has an id of mp-tfa.

Taking this to Nokogiri, we could extract the information we’re after with a simple script:

require "nokogiri"
require "open-uri"

wiki_url = "https://en.wikipedia.org/wiki/Main_Page"

# Fetch the page and load its HTML into a Nokogiri document
doc = Nokogiri::HTML(URI.open(wiki_url))

# Select the <a> we're after using a CSS selector, and
# then extract its title and URL
article = doc.at_css("div#mp-tfa i:first-child a")
title   = article.text
url     = URI.join(wiki_url, article["href"])

puts "Today's featured article is: #{title} <#{url}>"

Run this, and it should output the name and URL of today’s featured article.

So far, so standard; if you’ve used Nokogiri at all, you’ve written something like this. What fewer people know is that Nokogiri can also be used from the command line, without having to write a script at all. This allows you to harness the power of Nokogiri even when writing small, throwaway one-liners in the shell. Regular readers of the blog will know that, if there’s one thing I like, it’s writing shell one-liners in Ruby.

Piping input

Once you’ve installed the Nokogiri gem, its command-line interface is available using the nokogiri command.

Typically, we want to work with a remote website; that means we need to fetch the content of that remote site. For that, it’s probably easiest just to use curl, and pipe the output into Nokogiri (that’s the Unix way, after all — small tools interacting).

We can then pass some Ruby code to Nokogiri using the -e option (just like we can with the ruby executable), and it will execute it. It helpfully parses the document for us automatically and assigns it to $_, so we can get started right away.

Let’s stick with our Wikipedia example: how would it work on the command line? Quite straightforwardly, as it turns out:

curl -s https://en.wikipedia.org/wiki/Main_Page | nokogiri -e 'puts $_.at_css("div#mp-tfa i:first-child a").text'

As you can see, we fetch the page with curl, then pipe the output into Nokogiri. Operating on the document, stored in $_, we use at_css to select the same element we were selecting before, and output the title of the article. Run this command, and you should see the name of the article printed to your terminal. Neat, eh?

Of course, we don’t have to limit ourselves to piping input from curl. We could just as easily cat the contents of a local HTML file into Nokogiri; in this case Nokogiri neither knows nor cares where the HTML it’s parsing comes from.

Interactive mode

Piping input is useful, but sometimes we just want to have a play with the document and get a feel for it: to try out some selectors before actually using them in a script, for example. For this, we can use the Nokogiri script’s interactive mode, passing it the URL to fetch:

$ nokogiri https://en.wikipedia.org/wiki/Main_Page
Your document is stored in @doc...
2.1.0 :001 >

Voilà! Nokogiri has dropped us at an IRB prompt, and has fetched and parsed the document for us; as you can see, it’s stored in the @doc variable. We can now try out some operations on the document and get instant feedback on what’s being selected. This is exactly what I did when writing the first script in this post; I started with a selector that got me roughly where I wanted to be (div#mp-tfa i a) and refined it until it selected the element I wanted (and nothing else).

Whether you want to quickly snatch something out of a document in a shell script, are constructing an elaborate one-liner, or want to explore a document before actually committing to writing a script that scrapes it, Nokogiri’s command-line interface is a really useful tool. Give it a whirl!

