Roblog


Ruby’s $_ variable

31 October 2015

Ruby can be invoked from the command line in order to create powerful text-processing one-liners. I wrote about this a while back, in “Ruby’s -e, -n, and -p switches”.

These one-liners are powerful, concise, and expressive. For an example that verges on the magical, how about outputting the third field in a CSV file, but only if the line contains a URL?

$ ruby -F, -ane 'puts $F[2] if /http:/' file.csv

Or outputting the contents of a Markdown file, with curly quotes switched to straight ones?

$ ruby -pe 'gsub(/“|”/, "\"")' foo.markdown

In both of these examples, and indeed in all Ruby one-liners, an oddly named global variable is at work even when we don’t see it used explicitly. Its name is $_ (dollar underscore).

Ruby has many global variables like this; there’s a complete list of them here. But $_ is one of the most useful. Indeed, along with the globals relating to regular expressions, it’s the only one I use with genuine regularity.

There are five key places where $_ is used. In each one, it’s likely that we won’t actually see the variable itself; instead, it’s used by Ruby internally. But knowing that it’s there can help to explain what’s going on; it helps us thread a connection through several different areas of the Ruby language, allowing us to peer behind the curtain and understand Ruby’s magic a little better. Let’s dig in.

1. It’s set to the content of the current line

There are two scenarios in which we loop over lines of input. The first is when, as in the examples we saw earlier, we use the -n or -p switches when invoking Ruby.

When we do this, the Ruby interpreter will loop over the lines of input for us, running the code that we pass to it once for each line of input. In doing so, it sets the value of the $_ variable to the contents of the current line. For example:

$ printf "foo\nbar" | ruby -ne 'puts $_'
foo
bar

This happens because using the -n and -p switches is essentially like wrapping your code in the following:

while gets
  # your code here
end

It’s actually gets that sets the $_ variable, which means it’s also accessible in regular Ruby scripts, not only in one-liners. Wherever you call gets, $_ will be set to the line that it just read.
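To see this outside a one-liner, here’s a minimal sketch of an ordinary script that reads $_ after each call to gets. The collect_lines helper is hypothetical, and it reads from a StringIO (whose gets also sets $_) rather than from real standard input, purely so that the example runs on its own.

```ruby
require "stringio"

# Hypothetical helper: gathers every line of `io` by reading $_
# after each call to gets, instead of using gets' return value.
def collect_lines(io)
  lines = []
  while io.gets      # gets returns nil at end of input, which ends the loop
    lines << $_      # $_ holds the line that gets just read
  end
  lines
end

p collect_lines(StringIO.new("foo\nbar"))  # prints ["foo\n", "bar"]
```

Note that the lines keep their trailing newline, just as they do under -n and -p.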

2. It’s outputted automatically when using -p

If we use the -p option when starting Ruby, it’s not necessary for us to write a puts or print statement to generate some output; Ruby will do it for us. It still executes our code once per line of input, but after each line it will output something too.

But what does Ruby actually output? You guessed it: the $_ variable.

This means that, if we pass the -p option to Ruby, we can affect the output of our script by manipulating the content of the $_ variable:

$ printf "foo\nbar" | ruby -pe '$_ = "baz\n"'
baz
baz

In that case we reassigned the variable entirely, but we can also mutate it:

$ printf "foo\nbar" | ruby -pe '$_.upcase!'
FOO
BAR

In this case, we transform the line of input from lowercase to uppercase. (We can tell that the method mutates the string, rather than returning a new one, because of the !.)
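As a rough sketch of what -p is doing for us, the loop below writes out both the implicit gets and the implicit print by hand. The upcase_filter name and the StringIO objects are stand-ins chosen for illustration; a real one-liner would read from standard input or the named files and write to standard output.

```ruby
require "stringio"

# Hypothetical stand-in for: ruby -pe '$_.upcase!'
def upcase_filter(input, output)
  while input.gets       # what -n and -p both do: read a line into $_
    $_.upcase!           # our code: mutate the current line in place
    output.print $_      # what -p adds: print $_ at the end of each iteration
  end
end

out = StringIO.new
upcase_filter(StringIO.new("foo\nbar"), out)
p out.string  # prints "FOO\nBAR"
```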

3. It’s an implicit argument to print

Some of Ruby’s core methods fall back to $_ when we don’t give them anything else to work with. One of them is print: called without arguments, it outputs the value of $_.

In a one-liner without -n or -p, nothing has called gets, so $_ is nil and print without any arguments outputs nothing:

$ ruby -e 'print'

If we invoke Ruby with -n or -p, though, $_ holds the current line, so that’s what print outputs when we call it without arguments:

$ printf "foo\nbar" | ruby -ne 'print'
foo
bar

This makes it really easy to write filters that only output lines that meet certain conditions. For example:

$ printf "foo\nbar" | ruby -ne 'print if $_.start_with? "f"'
foo

This one-liner outputs only those lines that start with the letter f.
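Here’s a minimal sketch of the same filter with the loop that -n supplies written out by hand. The print call still takes no arguments and picks up $_, because gets has just set it. The method name and the StringIO objects are hypothetical; the sketch calls print on the output object, on the assumption that it falls back to $_ in the same way as the global print.

```ruby
require "stringio"

# Hypothetical stand-in for: ruby -ne 'print if $_.start_with? "f"'
def print_lines_starting_with_f(input, output)
  while input.gets
    # print is given no arguments here, so it writes the contents of $_
    output.print if $_.start_with?("f")
  end
end

out = StringIO.new
print_lines_starting_with_f(StringIO.new("foo\nbar"), out)
p out.string  # prints "foo\n"
```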

4. It’s the implicit receiver of some global string methods

Another behaviour that changes when Ruby is invoked with either the -n or -p options is that some global methods are defined. They are:

  • sub
  • gsub
  • chop
  • chomp

They’re defined in the Kernel module, the same place as print and puts, which means that we don’t call them with a receiver — they’re global methods.

But how can we call, say, gsub in this way? Normally, the receiver of the gsub method is the string that we want to perform a substitution within. If there’s no receiver, what string will be used instead?

There are no prizes for guessing that the answer is $_. In this way, these global methods allow us to perform operations on each line of input without having to refer to that input explicitly. For example:

$ printf "foo\nbar" | ruby -ne 'puts gsub(/[aeiou]/, "_")'
f__
b_r

In this case, we output each line of input, except with all vowels replaced with underscores.

This behaviour is even more useful when used with -p, since we can skip the output step:

$ printf "foo\nbar" | ruby -pe 'gsub(/[aeiou]/, "_")'
f__
b_r

This works because these global methods don’t just return the modified string; they also assign it back to $_. For our purposes that makes them behave like the !-suffixed methods on String, and so the above example is equivalent to:

$ printf "foo\nbar" | ruby -pe '$_.gsub!(/[aeiou]/, "_")'
f__
b_r

Particularly if you’re not comfortable with using sed, but even if you are, this is a really powerful way to perform find-and-replace operations from the command line.

These global methods are otherwise identical to their counterparts from the String class; they’re just a useful shortcut for a common operation.
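One way to picture the global gsub is as a call to the ordinary String#gsub on $_, with the result stored back in $_. The sketch below writes that out by hand in script form, since the global version is only available under -n or -p. The replace_vowels name and the StringIO objects are hypothetical.

```ruby
require "stringio"

# Hypothetical stand-in for: ruby -pe 'gsub(/[aeiou]/, "_")'
def replace_vowels(input, output)
  while input.gets
    # Roughly what the global gsub does: substitute within $_,
    # then store the result back in $_.
    $_ = $_.gsub(/[aeiou]/, "_")
    output.print $_      # the implicit print that -p adds
  end
end

out = StringIO.new
replace_vowels(StringIO.new("foo\nbar"), out)
p out.string  # prints "f__\nb_r"
```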

5. It’s the implicit matcher of regular expressions

The final place that $_ is used is as the implicit subject of regular expression matches. It’s this behaviour that I exploited in the very first example in this post, and it’s this behaviour that’s perhaps most obscure (or magical, depending on your viewpoint).

This behaviour is triggered either when we use a regular expression literal in a conditional context, or when we apply the unary ~ operator to a regular expression. For example:

$ printf "foo\nbar" | ruby -ne 'p ~ /^f/'
0
nil
$ printf "foo\nbar" | ruby -ne 'print if /^f/'
foo

In the former case, we see that an integer is returned if the expression matched the current line of input (in this case 0, since the expression matched at the very first character). If the expression didn’t match, the method returns nil.
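The ~ operator isn’t tied to one-liners; it matches against whatever $_ currently holds. As a small sketch, the hypothetical tilde_results helper below records what ~ returns for each line read from a StringIO, which stands in for real input.

```ruby
require "stringio"

# Hypothetical helper: records the result of the unary ~ operator
# for each line of input.
def tilde_results(input)
  results = []
  while input.gets
    results << (~/^f/)   # matches /^f/ against $_, like $_ =~ /^f/
  end
  results
end

p tilde_results(StringIO.new("foo\nbar"))  # prints [0, nil]
```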

It’s the latter, conditional form that’s most useful, since it allows us to do something based on whether a line matches a given expression — an incredibly common requirement for filter scripts.

Behind the scenes, this translates to the following:

$ printf "foo\nbar" | ruby -ne 'print $_ if $_ =~ /^f/'

The implicit version is much more magical, but it’s also much shorter and easier to read, and with one-liners, every character counts.
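The CSV one-liner from the start of the post can be unpacked in the same way. This is only a sketch of roughly what the -n, -a and -F switches do for us: the method name is hypothetical, the sample data is made up, and the fields go into a local variable rather than $F.

```ruby
require "stringio"

# Hypothetical stand-in for: ruby -F, -ane 'puts $F[2] if /http:/' file.csv
def third_field_of_url_lines(input, output)
  while input.gets                   # -n: read each line into $_
    fields = $_.split(",")           # -a with -F,: split $_ on commas
    # The bare /http:/ in the one-liner is matched against $_
    output.puts fields[2] if $_ =~ /http:/
  end
end

out = StringIO.new
third_field_of_url_lines(StringIO.new("a,b,c\nd,http://example.com,f\n"), out)
print out.string  # prints "f" followed by a newline
```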

Summing up

Ruby has lots of cryptic globals, but one that crops up in lots of different places is $_. It’s always connected to the idea of processing input line-by-line, which is a really common requirement. Getting to know it can help you write nicely concise text processing scripts — and concision is particularly helpful when you’re writing one-liners.
