[QUIZ] Parsing JSON (#155)

Eric Mahurin


They are a convenient solution for a wide range of problems.


You are taking my statement out of context. I was referring to
performance. If you are using a bunch of little regex's, you won't see much
performance benefit compared to matching a character at a time with a 1
character lookahead (LL(1) parsing). I think you can write a pure ruby
LL(1) lexer/parser that usually meets or beats a regex version.
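
(For illustration only: a rough sketch, not Eric's Grammar code, of what
"matching a character at a time with a 1-character lookahead" looks like
in pure ruby. The TinyLL1 class and its tiny grammar -- nested arrays of
non-negative integers -- are made up for this example, and it assumes 1.9
String semantics, where s[i] returns a 1-character String.)

class TinyLL1
  def parse(str)
    @s, @i = str, 0
    result = value
    raise "trailing data" if @i < @s.size
    result
  end

  private

  # the single character of lookahead
  def peek
    @s[@i]
  end

  # consume and return the current character
  def advance
    c = @s[@i]
    @i += 1
    c
  end

  def value
    peek == "[" ? array : number
  end

  def array
    advance                               # "["
    items = []
    until peek == "]"
      items << value
      advance if peek == ","              # ","
    end
    advance                               # "]"
    items
  end

  def number
    digits = ""
    digits << advance while peek && peek.between?("0", "9")
    raise "digit expected" if digits.empty?
    digits.to_i
  end
end

p TinyLL1.new.parse("[1,[23,4]]")         #=> [1, [23, 4]]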

But yes, regex's are convenient for a wide range of problems.

With respect to extending ruby's regexp, I'd rather be interested in
embedded code snippets that are evaluated when a clause matches. I
didn't know perl (http://perldoc.perl.org/perlre.html) does this until
I read about it in the ragel manual. It seems there are two kinds of
embedded code: (1) evaluate the code and match with null width; (2)
use the return value of the code as a regexp. This sounds quite
interesting to me. It's currently listed under "A-3. Lacked features
compare with perl 5.8.0" in the ruby RE manual.

Yep. I used this in perl 5+ years ago (before coming to ruby) to write
recursive regex's (i.e. paren matching).
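
(Ruby's bundled regexp engine can't do the embedded-code trick, but for
the paren-matching case specifically, the Oniguruma engine in ruby 1.9
lets a regexp call one of its own named groups with \g<name>. A small
illustration, not from the thread:)

balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/
p "(a(b)(c(d)))" =~ balanced   #=> 0
p "(a(b"         =~ balanced   #=> nil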

But, this doesn't address the issue I was referring to - applying a Regexp
to something other than a String.
 
James Gray

  ch/s  F  E  gzip  author/gem
------  -  -  ----  ----------
186042  5  0   685  James Edward Gray II (RE, recursive descent)

Here's a revised version of this parser that passes all of the tests:

#!/usr/bin/env ruby -wKU

require "strscan"

class JSONParser
  AST = Struct.new(:value)

  def parse(input)
    @input = StringScanner.new(input.strip)
    parse_value.value
  ensure
    @input.eos? or error("Unexpected data")
  end

  private

  def parse_value
    trim_space
    parse_object or
      parse_array or
      parse_string or
      parse_number or
      parse_keyword or
      error("Illegal JSON value")
  ensure
    trim_space
  end

  def parse_object
    if @input.scan(/\{\s*/)
      object = Hash.new
      more_pairs = false
      while key = parse_string
        @input.scan(/\s*:\s*/) or error("Expecting object separator")
        object[key.value] = parse_value.value
        more_pairs = @input.scan(/\s*,\s*/) or break
      end
      error("Missing object pair") if more_pairs
      @input.scan(/\s*\}\s*/) or error("Unclosed object")
      AST.new(object)
    else
      false
    end
  end

  def parse_array
    if @input.scan(/\[\s*/)
      array = Array.new
      more_values = false
      while contents = parse_value rescue nil
        array << contents.value
        more_values = @input.scan(/\s*,\s*/) or break
      end
      error("Missing value") if more_values
      @input.scan(/\s*\]\s*/) or error("Unclosed array")
      AST.new(array)
    else
      false
    end
  end

  def parse_string
    if @input.scan(/"/)
      string = String.new
      while contents = parse_string_content || parse_string_escape
        string << contents.value
      end
      @input.scan(/"\s*/) or error("Unclosed string")
      AST.new(string)
    else
      false
    end
  end

  def parse_string_content
    @input.scan(/[^\\"]+/) and AST.new(@input.matched)
  end

  def parse_string_escape
    if @input.scan(%r{\\["\\/]})
      AST.new(@input.matched[-1])
    elsif @input.scan(/\\[bfnrt]/)
      AST.new(eval(%Q{"#{@input.matched}"}))
    elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
      AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
    else
      false
    end
  end

  def parse_number
    @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
      AST.new(eval(@input.matched))
  end

  def parse_keyword
    @input.scan(/\b(?:true|false|null)\b/) and
      AST.new(eval(@input.matched.sub("null", "nil")))
  end

  def trim_space
    @input.scan(/\s+/)
  end

  def error(message)
    if @input.eos?
      raise "Unexpected end of input."
    else
      raise "#{message}: #{@input.peek(@input.string.length)}"
    end
  end
end

__END__

James Edward Gray II
 
Eric Mahurin


Here's a revised version of this parser that passes all of the tests:


Hi James,

I applied these fixes to my optimization of your StringScanner
recursive-descent parser and added a few more. I also got it working in
ruby 1.9; it is at the end of this message. I couldn't figure out why
yours wasn't working with 1.9.

Also, I did some more benchmarking. This time I ran 4 passes for each
dataset and picked the best bandwidth (ch/s) among the 2 longest-running
datasets. I limited the depth to 10 because only the fastest 2 parsers
need deeper data (to keep each parse above 1 second), and ruby seems to do
strange things with garbage collection. All of the fastest parsers are
shown with results from the depth=10 dataset.

I ran everything I could easily on 1.9 (didn't load any gems for 1.9).

Here are the results with this benchmarking approach, which I think is more reliable:

     1.8       1.9  F  E  gzip  author/gem
 -------   -------  -  -  ----  ----------
       -         -  5  0   545  Pawel Radecki (RE, recursive descent)
      NL         -  6  0   796  James Koppel (RE+StringIO, recursive descent)
      NL        NL  5  1   683  Justin Ethier (RE lexer, ruby eval, fixes)
      NL         -  3  2  1074  James Edward Gray II (peggy)
    4662        SF  0  0   608  Eric Mahurin (Grammar0, no code-gen)
    5264         -  6  2   588  ghostwheel (ghostwheel, fixes)
    5764         -  2  0   706  Eric I (Treetop, unicode broken)
    8246         -  2  0   631  Steve (Treetop, mismatches in benchmark)
   11856         -  1  1   545  Clifford Heath (Treetop)
   25841         -  0  0   842  Alexander Stedile (RE, recursive descent)
   57217         -  0  0   723  Eric Mahurin (Grammar, v0.5)
  152205        NL  2  1   660  Paolo Bonzini (RE, recursive descent)
       -    191529  2  1   445  Thomas Link (RE lexer, ruby eval)
  202090         -  0  0   774  James Edward Gray II (RE, recursive descent)
  233646         -  0  0   653  Eric Mahurin (Grammar, unreleased)
  270508    211266  1  0     -  json
  299259         -  6  0     -  fjson (w/ C ext)
  351163    235726  0  0   607  James & Eric M (RE, recursive)
  359665    279539  3  0   405  Thomas & Paolo (RE + eval)
  430251    106285  0  0   837  Eric Mahurin (LL(1), recursive descent)
 2079190         -  0  0   653  Eric Mahurin (Grammar, unreleased, ruby2cext)
 8534416   3217453  0  0     -  json (w/ C ext)

NL: non-linear, SF: seg-fault

Here are some things to note:

* on closer inspection, some of the slower solutions are nonlinear (usually
quadratic runtime). For those, ch/s is not meaningful since it just gets
worse with larger data sets.

* ruby 1.9 killed my LL(1) parser performance. The reason is that
characters in 1.9 are full objects, whereas in 1.8 they are immediate
Fixnum's. 4X slower is not good. I need to go ask if we can get some
immediate 1-character Strings in ruby 1.9 to solve this. Anything that
worked with characters in 1.8 is going to see the same problem (see the
short example after these notes). My Grammar classes generate similar
LL(1) parsers and will be affected in the same way. It works so well in
1.8 because there is so little object creation (using Regexp's creates
more objects).

* Dominik Bathon's ruby2cext is giving my dev Grammar class a 10X
performance boost. Woohoo! Only 4X away from the best hand-crafted C (not
bad considering the ruby, and then the C, were autogenerated). I generate
low-level stuff that optimizes well. High-level Regexp's will see little
benefit from ruby2cext.
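
(The 1.8/1.9 character difference mentioned above is easy to see from irb;
this is the crux of the 4X slowdown:)

s = "abc"
s[0]         # ruby 1.8 => 97  (an immediate Fixnum, no allocation)
             # ruby 1.9 => "a" (a freshly allocated 1-character String)
?a           # 97 in ruby 1.8, "a" in ruby 1.9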

Here is the benchmark code I used (along with the previously posted
RandomJSON class):

require "benchmark"   # Benchmark.bm below needs this (add if not already loaded)

parser = JSONParser.new
generator = RandomJSON.new((ARGV[1]||0).to_i)
bandwidth = 0
bandwidth0 = 0
t0 = 0
l = nil
t = nil
max_depth = 10
(max_depth+1).times { |depth|
  tree = generator.value(depth, depth%2)
  s = tree.inspect
  s.gsub!(/=>/, ':')
  s.gsub!(/nil/, 'null')
  tree2 = nil
  l = s.length
  t = nil
  4.times {
    Benchmark.bm { |b|
      GC.start
      t1 = b.report("#{depth} #{l} ") { tree2 = parser.parse(s) }
      GC.start
      raise("#{tree2.inspect}!=#{tree.inspect}") if tree2!=tree
      GC.start
      # keep the fastest of the 4 passes
      if (!t or t1.real<t.real)
        t = t1
      end
    }
  }
  bandwidth = l/t.real
  puts "#{bandwidth.to_i} chars/second"
  break if (t.real>=(ARGV[0]||1).to_f or depth>=max_depth)
  if (t.real>t0)
    bandwidth0 = bandwidth
    t0 = t.real
  end
}
# report the better bandwidth of the two longest-running datasets
bandwidth = bandwidth0 if (bandwidth0>bandwidth)

puts "\n#{bandwidth.to_i} chars/second"


Here is another optimization of James' solution:


require "strscan"

class JSONParser

  def parse(input)
    @input = StringScanner.new(input)
    @input.scan(/\s*/)
    parse_value(out=[])
    @input.eos? or error("Unexpected data")
    out[0]
  end

  private

  # Appends the parsed value to +out+ rather than returning an AST wrapper
  # object as the original version did.
  def parse_value(out)
    if @input.scan(/\{\s*/)
      object = {}
      kv = []
      until @input.scan(/\}\s*/)
        object.empty? or @input.scan(/,\s*/) or error("Expected ,")
        kv.clear
        @input.scan(/"/) or error("Expected string")
        parse_string(kv)
        @input.scan(/:\s*/) or error("Expecting object separator")
        parse_value(kv)
        object[kv[0]] = kv[1]
      end
      out << object
    elsif @input.scan(/\[\s*/)
      array = []
      until @input.scan(/\]\s*/)
        array.empty? or @input.scan(/,\s*/) or error("Expected ,")
        parse_value(array)
      end
      out << array
    elsif @input.scan(/"/)
      parse_string(out)
    elsif @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\s*/)
      out << eval(@input.matched)
    elsif @input.scan(/true\s*/)
      out << true
    elsif @input.scan(/false\s*/)
      out << false
    elsif @input.scan(/null\s*/)
      out << nil
    else
      error("Illegal JSON value")
    end
  end

  def parse_string(out)
    string = ""
    while true
      if @input.scan(/[^\\"]+/)
        string.concat(@input.matched)
      elsif @input.scan(%r{\\["\\/]})
        string << @input.matched[-1]
      elsif @input.scan(/\\[bfnrt]/)
        string << eval(%Q{"#{@input.matched}"})
      elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
        string << @input.matched[2..-1].to_i(16)
      else
        break
      end
    end
    @input.scan(/"\s*/) or error("Unclosed string")
    out << string
  end

  def error(message)
    if @input.eos?
      raise "Unexpected end of input."
    else
      raise "#{message}: #{@input.peek(@input.string.length)}"
    end
  end
end
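
(A quick check of the parser above on a small document; not one of the
quiz tests, just an illustration:)

parser = JSONParser.new
p parser.parse('{"a": [1, 2.5e3, true, null, "\u0041\n"]}')
#=> {"a"=>[1, 2500.0, true, nil, "A\n"]}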
 
tho_mica_l

Hi,
- 191529 2 1 445 Thomas Link (RE lexer, ruby eval)
359665 279539 3 0 405 Thomas & Paolo (RE + eval)

May I ask which tests are failing here, since it passes all of my tests.
2079190 - 0 0 653 Eric Mahurin (Grammar, unreleased, ruby2cext)

This made me quite curious. Unfortunately it seems ruby2cext doesn't
support ruby 1.9 yet. Anyway, will this upcoming version of grammar be
a drop-in replacement/update for grammar 0.5, i.e. is it safe to start
using grammar 0.5 now?

BTW I took Paolo's hacked version of my humble submission and converted
it to use StringScanner, which seems to perform better. Since Paolo uses
numbered groups, StringScanner is fine. I slightly modified the regexp to
catch something like '"a" "b"'.

Regards,
Thomas.


require 'strscan'

class JSONParser

  def initialize
    # Capture groups: 1 = ':', 2 = trailing comma before '}' or ']',
    # 3 = string, 4 = null, 5 = anything else (i.e. invalid input).
    @rxe = /
      \[|\]|
      \{|\}|
      (:)|
      (,[ \t\r\n]*[}\]])|
      ,|
      ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*"(?![ \t\r\n]"))|
      -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?!\d)|
      true|
      false|
      (null)|
      (?>[ \t\r\n]+)|
      ((?>.+))
    /xmu
  end

  def parse(json)
    scanner = StringScanner.new(json)
    out = []
    until (scanner.skip(/[[:space:][:cntrl:]]*/); scanner.eos?)
      scan = scanner.scan(@rxe)
      if scanner[5] or scanner[2]
        invalid(scanner[2] || scanner[5])
      elsif scanner[1]
        out << '=>'
      elsif scanner[3]
        out << scanner[3].gsub(/#/, '\\\\#')
      elsif scanner[4]
        out << 'nil'
      else
        out << scan
      end
    end
    begin
      return eval(out.join(' '))
    rescue Exception => e
      invalid(json)
    end
  end

  def invalid(string)
    raise RuntimeError, 'Invalid JSON: %s' % string
  end

end
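
(Likewise, a quick check of the tokenize-and-eval parser above, including
the '"a" "b"' case the regexp was adjusted to catch; not one of the quiz
tests:)

parser = JSONParser.new
p parser.parse('{"a": [1, null, true]}')    #=> {"a"=>[1, nil, true]}

begin
  parser.parse('"a" "b"')                   # adjacent strings are not valid JSON
rescue RuntimeError => e
  puts e.message                            #=> Invalid JSON: "a" "b"
end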
 
Eric Mahurin


Hi,


May I ask which tests are failing here, since it passes all of my tests.

I'm running all tests from this thread. Here are the ones that the above
solutions failed:

assert_raise(RuntimeError) { @parser.parse(%Q{"a" "b"}) }
assert_equal("a", @parser.parse(%Q{"\\u#{"%04X" % 97}"}))  # 97 was ?a; changed for 1.9 compat
assert_equal('#{p 123}', @parser.parse(%q{"\\u0023{p 123}"}))

Maybe the versions of these I have are old.

This made me quite curious. Unfortunately it seems ruby2cext doesn't
support ruby 1.9 yet. Anyway, will this upcoming version of grammar be
a drop-in replacement/update for grammar 0.5, i.e. is it safe to start
using grammar 0.5 now?


I'll have a compatibility layer that will use the new stuff. You'll still
get all of the performance benefits, but this will ease any transition.
Most of the API changes will be minor. It will be a cleaner API.
 
