Roblog


Real progress in long-running command-line scripts

4 March 2014

Earlier in the week I wrote a post on how, when executing long-running sub-processes in your Ruby scripts, it might be useful to show some kind of progress to your user.

But the progress displayed there was fake: it bore no relation to the actual progress of the sub-process. In some respects this is better than nothing, but in others it’s worse. Because the progress bar is indeterminate, we’re not actually lying to the user, but we still can’t distinguish between an import that is running and one that has hung.

Surely there must be a way to improve on this, and show real progress to the user? Turns out, there is. With the application of a little more Unix knowledge, we can do just that.

For this example, I’m going to revisit the use case of importing a database dump into MySQL that I discussed in my first post. For reference, it was equivalent to a system call of:

mysql some_database < dump.sql

Showing real progress

This solution, while perhaps involving more conceptual understanding of Unix fundamentals, is actually simpler than our previous “fake” solution.

Here’s the code in full. I’ll then break it down and discuss it.

require "ruby-progressbar"

total_lines = `wc -l dump.sql`.to_i

File.open("dump.sql", "r") do |dump|
  progress = ProgressBar.create(title: "Importing", total: total_lines)

  IO.popen("mysql some_database", "w") do |stdin|
    dump.each do |line|
      progress.increment
      stdin.puts(line)
    end
  end
end

What’s going on here, then?

total_lines = `wc -l dump.sql`.to_i

First, we use the Unix wc command to get the total number of lines in the file, which is typically much quicker than counting them ourselves in Ruby. wc prints the count followed by the filename; to_i reads the leading number and ignores the rest.
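One detail worth knowing here: wc -l counts newline characters, so a file whose last line has no trailing newline will report one line fewer than Ruby yields when iterating. The sketch below compares the two counts; the helper names count_lines_with_wc and count_lines_in_ruby are my own, purely for illustration.

```ruby
require "tempfile"

# Hypothetical helper: counts newline characters by delegating to wc.
# The array form of IO.popen runs wc directly, without going through a shell.
def count_lines_with_wc(path)
  IO.popen(["wc", "-l", path], &:read).to_i
end

# Hypothetical helper: counts lines the way File#each would yield them.
def count_lines_in_ruby(path)
  File.foreach(path).count
end

# A three-line file whose final line has no trailing newline.
sample = Tempfile.new("sample")
sample.write("a\nb\nc")
sample.flush

puts "wc:   #{count_lines_with_wc(sample.path)}"
puts "ruby: #{count_lines_in_ruby(sample.path)}"
```

Dumps written by tools usually end with a newline, so the two counts normally agree; if yours might not, counting in Ruby keeps the total in step with the loop, at the cost of reading the file twice.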

File.open("dump.sql", "r") do |dump|
  progress = ProgressBar.create(title: "Importing", total: total_lines)

Next, we open the dump.sql file for reading, and create a new progress bar. We set the total value of the progress bar to the number of lines in the dump; this will allow us to increment once per line and have the progress bar finish when we’ve finished reading the file.

IO.popen("mysql some_database", "w") do |stdin|

This is where the magic happens. We use Ruby’s IO.popen method to execute our MySQL command as a sub-process, connected to our script by a pipe. We tell popen that we’re interested in writing to it by passing a mode of w (just like when we open a file for writing).

This is just like opening a file: if we passed a block to File.open, the block would be passed a handle that would allow it to write to the file. We’re doing the same here, except the handle we have will write to the mysql command’s standard input stream, rather than to a file.
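To see the parallel side by side, here’s a small sketch that writes the same line to a file and to a sub-process. It assumes a Unix-like system; tr stands in for mysql so that it runs anywhere, and the out: option (one of the redirection options that IO.popen passes on to spawn) sends the child’s output to a temporary file so we can inspect it. The variable names are made up for the example.

```ruby
require "tempfile"

file_target = Tempfile.new("file_target")
pipe_target = Tempfile.new("pipe_target")

# Writing to a file: the block receives a handle opened for writing.
File.open(file_target.path, "w") do |handle|
  handle.puts("hello")
end

# Writing to a sub-process: the block receives a handle connected to the
# child's standard input. Here tr upper-cases whatever it is sent.
IO.popen(["tr", "a-z", "A-Z"], "w", out: pipe_target.path) do |handle|
  handle.puts("hello")
end

# When the block form of IO.popen returns, the pipe has been closed, the
# child has been waited for, and $? holds its exit status.
child_status = $?

puts File.read(file_target.path)
puts File.read(pipe_target.path)
puts "child succeeded: #{child_status.success?}"
```

Checking $? after the block is a cheap way to notice that the command failed, which the progress bar alone won’t tell you.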

dump.each do |line|
  progress.increment
  stdin.puts(line)
end

Now we loop over the lines in the dump. For each one, we increment the progress bar by one, and then pass the line to the MySQL process. stdin here refers to the handle that we opened with popen.

Now, this code might be a bit hard to wrap your head around, especially if you’re not familiar with Unix pipelines. But if you’ve ever used a shell, you’ve used this idiom: it’s what happens under the hood when we pipe between two processes in our shell. In essence, what we’re doing here is:

cat dump.sql | mysql some_database

The only difference is that our Ruby script sits in the middle, orchestrating the pipeline and displaying progress to the user in their shell.
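If you’d like to try the pattern without MySQL or the ruby-progressbar gem to hand, here’s a standard-library-only sketch of the same idea. The method name pipe_with_progress is hypothetical, and it differs from the code above in two ways: it counts lines in Ruby rather than with wc, and it yields the running count to a block instead of drawing a bar itself.

```ruby
# Hypothetical helper: streams the file at `path` line by line into `command`,
# yielding (lines_done, total_lines) after each line so the caller can draw
# whatever progress display it likes. Extra keyword arguments are passed to
# IO.popen as spawn options (for example out: to redirect the child's output).
# Returns true if the child process exited successfully.
def pipe_with_progress(path, command, **spawn_opts)
  total = File.foreach(path).count
  done = 0

  IO.popen(command, "w", **spawn_opts) do |child_stdin|
    File.foreach(path) do |line|
      child_stdin.write(line)
      done += 1
      yield done, total
    end
  end

  $?.success?
end

# Example usage, assuming mysql is installed and some_database exists:
#
#   pipe_with_progress("dump.sql", ["mysql", "some_database"]) do |done, total|
#     print "\rImporting: #{done * 100 / total}%"
#   end
```

Yielding the counts keeps the piping separate from the display, so the same method could drive a progress bar, a log line, or nothing at all.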

This illustrates neatly one of the foundational principles of the Unix philosophy: everything is a file. Swap out popen for open, and we’re doing exactly the same things that we’d do if we wanted to write the database dump to another file; it’s also the same as we’d do if we wanted to write to a network socket. Across all of these disparate interfaces, Unix presents the same, consistent abstraction: the file.
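As a rough illustration of that consistency, the sketch below defines one copying method and leaves the choice of destination to the caller. The name copy_lines is invented for the example; all the method relies on is that the source responds to each_line and the sink responds to write.

```ruby
# Hypothetical helper: copies every line from `source` to `sink` and returns
# the number of lines copied. It neither knows nor cares whether the sink is
# a file on disk, a pipe to a sub-process, or a socket.
def copy_lines(source, sink)
  count = 0
  source.each_line do |line|
    sink.write(line)
    count += 1
  end
  count
end

# The same call works against different kinds of destination, e.g.:
#
#   File.open("dump.sql") do |dump|
#     File.open("copy.sql", "w") { |file| copy_lines(dump, file) }
#   end
#
#   File.open("dump.sql") do |dump|
#     IO.popen("mysql some_database", "w") { |pipe| copy_lines(dump, pipe) }
#   end
```

A connected socket would slot into the same place, since it too is an IO that accepts write.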


That’s it! We should now see a proper progress bar on the command line that increments as the file is imported into MySQL.

Since the file is never read into memory in its entirety (it’s processed line by line), memory use should stay roughly constant however large an input file you throw at it.

There aren’t really many downsides to this approach. I’ve never found it to be measurably slower than simply shelling out to the command. That’s unsurprising, really, given that the shell makes much the same system calls. And even if there were a performance hit, there are many use cases where a slightly slower import with accurate feedback would be the preferable option.

Ta-da; another way that a little use of Unix processes can help us out in our day-to-day development.