November 01, 2005 - Domain Specific Language with a lifespan of 2 hours -- or basic data munging

Last night I needed to categorize all of the JVM opcodes according to their effect on the stack. Since there are around 200 of them, it seemed like it would be a tedious task. Fortunately, the JVM spec is online in an editable format. I thought, "Maybe I can parse the opcodes out of the spec and then put them into a format that I can use to build my categorization automatically."

I am content with my solution, but I'd like to hear how others approach this type of problem. I am particularly eager to see where my Ruby skills are lacking.

Since the JVM spec is in HTML, I initially tried using one of the Ruby HTML libraries to parse it. It didn't really do what I wanted, though, so I wrote my own code. The parse is quick and dirty, but it only has to run once, so I can live with it being dirty.

First, I saved each opcode index page so that I could process it locally.

I figured out from the format of the opcode pages that each opcode was identified in an <h2> tag, and that each opcode's Operand Stack manipulation description began with "...".

So, here is how I got those lines out of the file:

def parse_file(filename)
  str = nil
  File.open("/Users/bobevans/Documents/papers/jvmopcodes/#{filename}.html") { |file|
    str = file.read
  }
  lines = []
  str.each_line {|line|
    if line =~ /h2/ # opcode declaration line
      lines << line.gsub(/<hr>/,"").gsub(/<h2>/,"").gsub(/<\/h2>/,"")
    elsif line =~ /^\.\.\./  # operand stack effect line
      lines << line.gsub(/\.\.\.,/,"").
          gsub(/<img src="chars\/arrwdbrt\.gif">/,"=>")
    end
  }
  lines
end

%w[a b c d f g i j l m n p r s t w].each{|e| parse_file(e).each {|l| puts l }}; nil

The parse_file function reads a file ('a.html', 'b.html', ...) and scans for the two kinds of lines I care about: opcode declarations and operand stack descriptions. It then strips the extraneous HTML and replaces the arrow image with "=>", which turns the markup into useful data about the opcode's stack manipulation.

The output is a little mini-language. It has the form:
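To try the two line patterns without the saved spec pages, here is a minimal sketch that runs the same kind of scan over an inline string. The sample HTML is made up for illustration (it is not copied from the spec), and extract_lines is a hypothetical name, not part of the script above.

```ruby
# Hypothetical sample input; the real pages are the saved JVM spec files.
SAMPLE = <<HTML
<hr><h2>aaload</h2>
<p>Load reference from array</p>
..., arrayref, index <img src="chars/arrwdbrt.gif"> ..., value
HTML

# Collect opcode names and cleaned-up stack effect lines from an HTML string.
def extract_lines(html)
  lines = []
  html.each_line do |line|
    if line =~ /<h2>/                 # opcode declaration line
      lines << line.gsub(/<\/?h2>|<hr>/, "").strip
    elsif line =~ /^\.\.\./           # operand stack effect line
      lines << line.gsub(/\.\.\.,/, "").
                    gsub(/<img src="chars\/arrwdbrt\.gif">/, "=>").
                    strip.squeeze(" ")
    end
  end
  lines
end

puts extract_lines(SAMPLE)
```

On this sample it yields the opcode name followed by its stack effect, which is the raw material for the mini-language.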

opcode: pop_off [, pop_off] => push_on

For example,

aaload: arrayref, index => value

The last line of the script parses each file and dumps the result to standard output. I then just pasted all of that into a Ruby variable by hand. I could have automated that step in Ruby too, but I won't ever need to repeat it.

Finally, I need to be able to use this information about the opcodes. I have another bit of code that reads the opcodes into a map.

# define all the opcodes I parsed out of the spec pages
opcodes = <<OPCODES
aaload .....
OPCODES

opcode_map = {}
opcodes.each_line { |l|
  opcode, stack_effect = l.split(":")
  pop_off = nil
  push_on = nil
  if stack_effect.include?("=>")
    pop_off, push_on = stack_effect.split("=>")
  end
  if !push_on.nil? && push_on.include?(",")
    push_on = push_on.split(",")
  end
  if !pop_off.nil? && pop_off.include?(",")
    pop_off = pop_off.split(",")
  end
  opcode_map[opcode] = [pop_off, push_on]
}

Now I can ask for the stack manipulation of any opcode with a lookup such as

opcode_map["aaload"]

This will return a two element array where the first item is the values popped from the stack, and the second element is the values pushed onto the stack.
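One wrinkle in the version above is that each element can be nil, a string, or an array, depending on how many values are involved. As a rough sketch of an alternative (parse_effect is a hypothetical helper, and the "nop" line is an assumed example of an opcode with no stack effect), each side could always come back as an array:

```ruby
# Parse one mini-language line into [opcode, pops, pushes].
# pops and pushes are always arrays, possibly empty.
def parse_effect(line)
  opcode, effect = line.split(":", 2)
  pops, pushes = effect.to_s.split("=>", 2)
  to_list = ->(part) { part.to_s.split(",").map(&:strip).reject(&:empty?) }
  [opcode.strip, to_list.call(pops), to_list.call(pushes)]
end

opcode_map = {}
["aaload: arrayref, index => value", "nop"].each do |l|
  name, pops, pushes = parse_effect(l)
  opcode_map[name] = [pops, pushes]
end

p opcode_map["aaload"]
```

With uniform arrays, later code can call size or each on either side without checking for nil first.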

Not perfect, but it took only an hour or so, and saved me many hours of rote typing.

Note: this code has not been cleaned up. It could be made much more concise, like so:

def parse_file(filename)
  str = nil
  File.open("/Users/bobevans/Documents/papers/jvmopcodes/#{filename}.html") { |file|
    str = file.read
  }
  lines = []
  str.each_line {|line|
    if is_opcode_decl? line
      lines << opcode_from(line)
    elsif is_operandstack? line
      lines << operand_effect_from(line)
    end
  }
  lines
end

Posted by Bob Evans at November 1, 2005 10:38 AM

def parse_file(filename)
  lines = []
  IO.foreach("/Users/bobevans/Documents/papers/jvmopcodes/#{filename}.html") {|line|
    # ... per-line handling as in the post ...
  }
  lines
end
Posted by: Chess Player on November 3, 2005 02:03 PM


%w[a b c d f g i j l m n p r s t w].each

is longwinded. Take a look at


Posted by: Lyndon on November 3, 2005 03:04 PM

Ah, foreach. Duh. I guess I should have thought of that one. Thanks.

On the [a..z] suggestion, it turns out that the list of filenames is sparse and doesn't follow the alphabet. Unless of course you are suggesting another way to use Ruby's ranges that I don't understand. (Very possible). Thanks for the insights.
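For what it's worth, a range can still express the list if the gaps are subtracted explicitly. This is only a sketch; the missing letters below are worked out from the %w list in the post, not from the spec itself:

```ruby
# The letters that have an opcode index page, as listed in the post.
pages = %w[a b c d f g i j l m n p r s t w]

# A range covers the span; the letters with no page are removed by hand.
from_range = ("a".."w").to_a - %w[e h k o q u v]

puts from_range == pages
```

Whether that reads better than spelling out the sixteen letters is debatable.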

Posted by: Bob Evans on November 3, 2005 04:07 PM
