Roblog


Ruby regular expressions: the /o modifier

30 March 2014

Ruby’s regular expressions, like those of most other languages, allow you to pass so-called “pattern modifiers” when creating regular expressions; these modifiers then change the way the regular expression behaves.

Most people are familiar with things like /i, which makes the regular expression case-insensitive, or perhaps /m, which allows the . character to match newlines as well.

But less commonly used is the /o modifier. That’s perhaps with good reason; its role is much more of a niche one than those of the everyday /i, /m, and /x. But it comes in handy occasionally and, like most nuggets of programming-language trivia, is worth having stored away in the back of your mind.

In a nutshell, the /o modifier causes any interpolation in a regular expression to happen only once; the final regular expression is then cached, and repeated execution won’t incur the penalties of interpolation.

This isn’t normally too useful, but there are two conditions that, when met, make it worth remembering: if you’re interpolating the result of a method call that might have some costly calculation involved; and if you’re matching lots of values in a loop.

If that’s lost you slightly, don’t worry. Let’s look at an example (admittedly a slightly contrived one) that shows how the /o modifier can result in a significant increase in performance under the right conditions.


First, let’s define a method that returns part of a regex. We’ll also include a call to sleep, to simulate the effects of performing some complex calculation, and output a message to show that the method has been called:

def letters
  puts "letters() called"
  sleep 0.5
  "A-Za-z"
end

Let’s imagine we call this method when creating a regular expression — to end up with something like this, which matches a string consisting only of letters:

/\A[#{letters}]+\z/

Finally, let’s match this regular expression in a loop, so that it’s created repeatedly:

words = %w[the quick brown fox jumped over the lazy dog]

words.each do |word|
  puts "Matches!" if word.match(/\A[#{letters}]+\z/)
end

If we were to run this script, we’d see something like the following:

letters() called
Matches!
letters() called
Matches!
letters() called
Matches!
letters() called
Matches!
letters() called
Matches!
letters() called
Matches!
letters() called
Matches!
letters() called
Matches!
letters() called
Matches!

We can see from the output that every time the regex literal is evaluated and passed to match, the letters method is called, in this case incurring a half-second penalty each time. As a result, the script is really slow: it spends half a second on each of the nine words in our array, or about 4.5 seconds in total.

The /o modifier lets us avoid this. It will perform the interpolation once, and then cache the resulting regular expression; future executions of the same regex literal will use this cached expression, and won’t perform the interpolation again.

To take advantage of this, all we need to do is pass /o when defining our regular expression:

words.each do |word|
  puts "Matches!" if word.match(/\A[#{letters}]+\z/o)
end

If we modify our script to use the /o modifier we see the following output, showing that the letters method is only called once:

letters() called
Matches!
Matches!
Matches!
Matches!
Matches!
Matches!
Matches!
Matches!
Matches!
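The caching cuts both ways, and it's worth seeing the flip side before relying on it. Because the interpolation runs only once, a later change to the interpolated value is never picked up. Here's a minimal sketch of that behaviour; the starts_with? helper is made up for illustration and isn't part of the script above:

```ruby
# Hypothetical helper (not from the script above): builds a regex
# from its argument using interpolation plus the /o modifier.
def starts_with?(prefix, text)
  # With /o, the interpolation is evaluated on the first call only;
  # every later call reuses the regex built from that first prefix.
  text.match?(/\A#{prefix}/o)
end

first  = starts_with?("qu", "quick") # regex is built as /\Aqu/ and cached
second = starts_with?("br", "quick") # still uses /\Aqu/, so this matches too

puts first
puts second
```

Both calls print true, even though "quick" doesn't start with "br". So /o is only a safe choice when the interpolated value is the same every time that line runs, as it is with our letters method.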

We also see a significant increase in speed. Let’s benchmark the two to show the comparison:

require "benchmark"

def letters
  sleep 0.5
  "A-Za-z"
end

words = %w[the quick brown fox jumped over the lazy dog]

Benchmark.bm do |bm|
  bm.report("without /o:") do
    words.each do |word|
      word.match(/\A[#{letters}]+\z/)
    end
  end

  bm.report("with /o:   ") do
    words.each do |word|
      word.match(/\A[#{letters}]+\z/o)
    end
  end
end

The result is, predictably, not even a contest:

             user       system     total       real
without /o:  0.000000   0.000000   0.000000 (  4.508294)
with /o:     0.000000   0.000000   0.000000 (  0.501238)

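That’s 4.5 seconds vs. 0.5 seconds in this admittedly entirely contrived example. The difference shows up in the real (wall-clock) column rather than the user and system columns, because sleep consumes no CPU time.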
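For comparison, you can often get the same single interpolation without /o at all, by building the regex once before the loop and reusing the resulting object. This is a sketch under a couple of assumptions: the $letters_calls counter and the only_letters variable are names I've added for illustration, and the counter stands in for the puts and sleep calls so that the example runs instantly.

```ruby
# Counter added for illustration; the version above prints and sleeps instead.
$letters_calls = 0

def letters
  $letters_calls += 1
  "A-Za-z"
end

words = %w[the quick brown fox jumped over the lazy dog]

# Interpolate once, before the loop, and keep the resulting Regexp object.
only_letters = /\A[#{letters}]+\z/

# Reuse the same Regexp for every word; no interpolation happens in here.
matches = words.count { |word| word.match?(only_letters) }

puts "letters() calls: #{$letters_calls}"
puts "matches: #{matches}"
```

This reports one call to letters and nine matches. Hoisting the regex like this makes the "only once" explicit in the code, whereas /o is handy when the literal has to stay inline, for example inside a method that's called repeatedly.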

As I said, this isn’t something you’re going to find yourself using every day, or perhaps even every year. But when you find yourself in a situation that calls for it, it’s useful to know about. And hey, sometimes knowledge is worthwhile in and of itself, right?
