Last night I needed to categorize all of the JVM opcodes according to their effect on the stack. Since there are around 200 of them, it seemed like it would be a tedious task. Fortunately, the JVM spec is online in an editable format. I thought, "Maybe I can parse the opcodes out of the spec and then put them into a format that I can use to build my categorization automatically."
I am content with my solution, but I'd like to hear how others approach this type of problem. I am particularly eager to see where my Ruby skills are lacking.
Since the JVM spec is in HTML, I initially tried using one of the Ruby HTML libraries to parse it. It didn't really do what I wanted though, so I wrote my own code to do it. The parse is quick and dirty, but it only has to run once, so I can live with it being dirty.
First, I saved each opcode index page so that I could process it locally.
I figured out from the format of the opcode pages that each opcode was identified in an <h2> tag, and that each opcode's Operand Stack manipulation description began with "...".
So, here is how I got those lines out of the file:
def parse_file(filename)
  str = nil
  File.open("/Users/bobevans/Documents/papers/jvmopcodes/#{filename}.html") { |file| str = file.read }
  lines = []
  str.each_line { |line|
    if line =~ /h2/ # opcode declaration line
      lines << line.gsub(/<hr>/, "").gsub(/<h2>/, "").gsub(/<\/h2>/, "")
    elsif line =~ /^\.\.\./ # operand stack effect line
      lines << line.gsub(/\.\.\.,/, "").
                    gsub(/\.\.\./, "").
                    gsub(/<\/code><p>/, "").
                    gsub(/<i>/, "").gsub(/<\/i>/, "").
                    gsub(/<img src="chars\/arrwdbrt\.gif">/, "=>").
                    gsub(/<em>/, "").gsub(/<\/em>/, "")
    end
  }
  lines
end

%w[a b c d f g i j l m n p r s t w].each { |e| parse_file(e).each { |l| puts l } }; nil
The parse function reads a file ('a.html', 'b.html', ...) and scans for the two types of lines I care about: opcode declarations and operand stack descriptions. It then strips the extraneous HTML from each matching line and replaces the arrow image with "=>", which turns the line into useful data about the opcode's stack manipulation.
The output form is a little mini language. It has the form:
opcode: pop_off [, pop_off] => push_on
For example,
aaload: arrayref, index => value
The last line of the script (the %w[...] loop) parses each file and dumps the result to output. Then I just copied all of that into a Ruby variable. I could've done that step in Ruby too, but I won't ever need to repeat it.
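As a rough sketch of how one line of this mini language could be taken apart, here is a small helper. The name `parse_effect` is made up for illustration and is not part of the code in this post. Unlike the loop shown later, it strips whitespace and always returns arrays, and it treats text without a "=>" as values popped, which is an assumption about the format.

```ruby
# Hypothetical helper: split "opcode: pops => pushes" into its three parts.
def parse_effect(line)
  opcode, effect = line.split(":", 2)
  pops, pushes = (effect || "").split("=>", 2)
  # Turn "a, b" into ["a", "b"]; nil or blank text becomes [].
  to_list = ->(s) { s.to_s.split(",").map(&:strip).reject(&:empty?) }
  [opcode.strip, to_list.call(pops), to_list.call(pushes)]
end

p parse_effect("aaload: arrayref, index => value")
# prints ["aaload", ["arrayref", "index"], ["value"]]
```

Returning arrays on both sides means callers never have to check whether a side is nil, a string, or an array.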
Finally, I need to be able to use this information about the opcodes. I have another function that reads the opcodes.
# define all the opcodes I parsed out of the spec pages
opcodes = <<OPCODES
aaload .....
OPCODES

opcode_map = Hash.new
opcodes.each_line { |l|
  opcode, stack_effect = l.split(":")
  pop_off = nil
  push_on = nil
  if stack_effect.include?("=>")
    pop_off, push_on = stack_effect.split("=>")
  end
  if !push_on.nil? && push_on.include?(",")
    push_on = push_on.split(",")
  end
  if !pop_off.nil? && pop_off.include?(",")
    pop_off = pop_off.split(",")
  end
  opcode_map[opcode] = [pop_off, push_on]
}
Now I can ask for the stack manipulations of any opcode with
opcode_map[my_favorite_opcode]
This returns a two-element array where the first element is the values popped from the stack and the second element is the values pushed onto the stack. Each element is nil when there is nothing on that side, a string when there is a single value, and an array of strings when there are several.
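To connect this back to the original goal of categorizing opcodes by their effect on the stack, here is a minimal sketch that groups opcodes by net change in the number of named values. The helper names and the sample entries are hypothetical. The sketch assumes each side is nil, a single string, or an array of strings, and it counts named values rather than stack slots, so it says nothing about wide types.

```ruby
# Count the values on one side of a stack effect (nil, String, or Array).
def count_values(side)
  case side
  when nil   then 0
  when Array then side.size
  else side.strip.empty? ? 0 : 1
  end
end

# Net change in named values: pushes minus pops.
def stack_delta(effect)
  pop_off, push_on = effect
  count_values(push_on) - count_values(pop_off)
end

# Hypothetical entries in the [pop_off, push_on] shape described above.
sample_map = {
  "aaload" => [[" arrayref", " index "], " value\n"],
  "dup"    => [" value ", [" value", " value\n"]],
  "pop"    => [" value ", "\n"]
}

categories = Hash.new { |hash, delta| hash[delta] = [] }
sample_map.each { |opcode, effect| categories[stack_delta(effect)] << opcode }
p categories
```

Here `aaload` and `pop` land in the -1 group and `dup` in the +1 group.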
Not perfect, but it took only an hour or so, and saved me many hours of rote typing.
Note: this code has not been cleaned up. It could be made much more readable like so, given small helper methods for the matching and stripping:
def parse_file(filename)
  str = nil
  File.open("/Users/bobevans/Documents/papers/jvmopcodes/#{filename}.html") { |file| str = file.read }
  lines = []
  str.each_line { |line|
    if is_opcode_decl? line
      lines << opcode_from(line)
    elsif is_operandstack? line
      lines << operand_effect_from(line)
    end
  }
  lines
end
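The four helpers used above are not defined in the post. One possible sketch, based on the substitutions in the first version of `parse_file`, is below. The two regex constants and the sample lines are assumptions made for illustration, and the tag pattern is slightly broader than the original chain of `gsub` calls.

```ruby
# Assumed patterns: tags to drop, and the arrow image that becomes "=>".
STRIP_TAGS = /<\/?(?:hr|h2|i|em)>|<\/code><p>/
ARROW_IMG  = /<img src="chars\/arrwdbrt\.gif">/

def is_opcode_decl?(line)
  line.match?(/h2/)
end

def opcode_from(line)
  line.gsub(STRIP_TAGS, "")
end

def is_operandstack?(line)
  line.match?(/^\.\.\./)
end

def operand_effect_from(line)
  # Drop the "..." placeholders first, then convert the arrow, then strip tags.
  line.gsub(/\.\.\.,?/, "").gsub(ARROW_IMG, "=>").gsub(STRIP_TAGS, "")
end

# Hypothetical sample lines, not copied from the spec.
decl   = "<hr><h2>aaload</h2>"
effect = '..., <i>arrayref</i>, <i>index</i> <img src="chars/arrwdbrt.gif"> ..., <i>value</i></code><p>'

puts opcode_from(decl) if is_opcode_decl?(decl)
puts operand_effect_from(effect).strip.squeeze(" ") if is_operandstack?(effect)
```

Keeping the patterns in constants puts every assumption about the page markup in one place, which helps if the HTML turns out to differ.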
Posted by Bob Evans at November 1, 2005 10:38 AM
def parse_file(filename)
lines = []
IO.foreach( "/Users/bobevans/Documents/papers/jvmopcodes/#{filename}.html"
) {|line|
Posted by: Chess Player on November 3, 2005 02:03 PM
this
%w[a b c d f g i j l m n p r s t w].each
is longwinded. Take a look at
("a".."z").each
Posted by: Lyndon on November 3, 2005 03:04 PM
Ah, foreach. D'uh. I guess I should have thought of that one. Thanks.
On the [a..z] suggestion, it turns out that the list of filenames is sparse and doesn't follow the alphabet. Unless of course you are suggesting another way to use Ruby's ranges that I don't understand. (Very possible). Thanks for the insights.
Posted by: Bob Evans on November 3, 2005 04:07 PM