[SUMMARY] Parsing JSON (#155)


Ruby Quiz

We saw a large variety of solutions for this week's problem. Many of them used
a parser generator to construct their parser. You do that by defining a grammar
that describes the syntax you need to read. The parser generator then
translates your grammar into parsing code that will match the described syntax.
Of the generators used, Treetop was definitely the most popular and is surely
worth a look if you want to do some grammar based parsing.

I'm not going to show a grammar based solution here, not because I don't like
them, but because I want to show a few of the ideas behind simple parsing. To
do that, we will need to examine a hand-rolled solution. Just to be clear
though, using grammar based parsers can often be a more robust choice if you can
find an official grammar for the content you need to parse.

Eric Mahurin sent in a hand-rolled solution that has quite a few advantages.
First, it is trivial to adapt so that the entire content to be parsed doesn't
need to be read into memory. It's also some very efficient code.
Unfortunately, that makes it a touch less obvious if you aren't already familiar
with parsing techniques.

Given that, I'm going to show my own hand-rolled recursive descent parser. It's
not as cool as Eric's but it's a little easier to take in. We will say it's a
good introduction to Eric's code, which you should be able to follow after I
break this down.

Here's the beginning of my parser:

#!/usr/bin/env ruby -wKU

require "strscan"

class JSONParser
  AST = Struct.new(:value)

  def parse(input)
    @input = StringScanner.new(input)
    parse_value.value
  ensure
    @input.eos? or error("Unexpected data")
  end

  private

  # ...

One of the first concerns when parsing is the need to manage where you currently
are in the input. If you treat the input as an IO object, you can read input
piece by piece and the IO object itself naturally keeps track of where you are.
For String input though, it's often easier to use Ruby's standard StringScanner
class. It wraps the input and allows you to test regular expression matches
against the head of that input. It will tell you when they match or don't and
when they do, your position automatically advances forward beyond that match.
You can see that I set this up in the code above.
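
If you haven't used StringScanner before, here's a quick illustration of that
advancing behavior (the input is just made up for the example):

require "strscan"

scanner = StringScanner.new('{"key": 42}')
scanner.scan(/\{\s*/)  # => "{"        match found, position advances past it
scanner.scan(/\d+/)    # => nil        no digits at the current position
scanner.scan(/"\w+"/)  # => "\"key\""
scanner.rest           # => ": 42}"    what still remains to be consumed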

The only public method for my class is parse(). It prepares the StringScanner
as I just described, tries to parse a JSON value out of the input, then makes
sure we consumed all of the input. Note that my use of ensure here isn't very
standard. I'm just using it to run some code at the end of the method without
changing the return value of the method.

The AST definition also merits a bit of discussion. It would have been nice to
just have each method return the Ruby objects for the JSON it parsed. However,
false and nil (null in JSON) are legal JSON parses in some places. If I return
those, I won't be able to tell if my parse succeeded or failed. To get around
that, all parsed JSON values are wrapped in a trivial abstract syntax tree
object. Returning this object is always true and, after I've verified that the
parse worked, it's just one more method call to retrieve the actual value it
parsed.
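
A tiny example makes the problem concrete (just an illustration, not part of
the parser):

AST = Struct.new(:value)

# If the parser handed back raw Ruby values, a successful parse of JSON null
# would be indistinguishable from a failed parse:
nil || "fallback"                  # => "fallback" -- the real result is lost

# Wrapped in an AST node, even null and false are truthy results:
result = AST.new(nil)
result ? result.value : :no_match  # => nil, and we know the parse succeeded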

It's worth mentioning that this parser is based on the not quite correct
definition of JSON I put forth in the quiz tests. Only objects and arrays are
allowed to be top-level JSON values. An easy fix is to replace this line

# ...

parse_value.value

# ...

with:

# ...

if top_level = parse_object || parse_array
  top_level.value
else
  error("Illegal top-level JSON object")
end

# ...

Now let's look at the main parser:

# ...

  def parse_value
    trim_space
    parse_object  or
    parse_array   or
    parse_string  or
    parse_number  or
    parse_keyword or
    error("Illegal JSON value")
  ensure
    trim_space
  end

# ...

This method really sums up the basic strategy of recursive descent parsing. At
each point, try to read out one of the legal values that can occur there. You
can do that just by drilling down into more specialized methods that know how to
read that one thing. If at any time you can't read a legal value, you have an
error.

Let's dig into the specialized parsers a bit more to see how this works:

# ...

  def parse_object
    if @input.scan(/\{\s*/)
      object     = Hash.new
      more_pairs = false
      while key = parse_string
        @input.scan(/\s*:\s*/) or error("Expecting object separator")
        object[key.value] = parse_value.value
        more_pairs = @input.scan(/\s*,\s*/) or break
      end
      error("Missing object pair") if more_pairs
      @input.scan(/\s*\}/) or error("Unclosed object")
      AST.new(object)
    else
      false
    end
  end

# ...

This code reads JSON objects. It's pretty linear, so let's digest it in order.
First, we have to have an opening brace or we don't have an object at all. We
can see here that I try a regular expression on the StringScanner to see if that
is indeed what's up next. If it is, scan() will return true and @input will
advance past that brace for our future matches. If not, we can't read an
object and the whole attempt is a bust.

When we know we're inside an object, we create the Ruby equivalent (a Hash),
fill it with all of the string/value pairs we can read, then make sure we find a
closing brace. Reading the pairs is the trickiest part because we have to match
a string, followed by a colon, and finally a value. Then, if we find a comma,
we know another pair is expected. If not, we've read the whole object. Note
that I verify these assumptions at every step and toss error()s if any of them
fail. For parsing strings and values, we just reuse the parse_string() method
we first saw called in parse_value() and parse_value() itself.

You can see that I'm constantly trimming space around the JSON syntax. I could
have also done that with repeated calls to the trim_space() helper we saw used
in parse_value(), but that fattens up the code a bit and slows things down with
more tests. For these simple cases, I opted for the shortcut.
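
For comparison, here's roughly how parse_object() would read with explicit
trim_space() calls instead of the inlined \s* patterns (just a sketch of the
alternative; the behavior is the same):

  def parse_object
    if @input.scan(/\{/)
      trim_space
      object     = Hash.new
      more_pairs = false
      while key = parse_string
        trim_space
        @input.scan(/:/) or error("Expecting object separator")
        trim_space
        object[key.value] = parse_value.value
        more_pairs = @input.scan(/,/) or break
        trim_space
      end
      error("Missing object pair") if more_pairs
      trim_space
      @input.scan(/\}/) or error("Unclosed object")
      AST.new(object)
    else
      false
    end
  end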

Having deciphered parse_object(), parse_array() is trivial:

# ...

  def parse_array
    if @input.scan(/\[\s*/)
      array       = Array.new
      more_values = false
      while contents = parse_value rescue nil
        array << contents.value
        more_values = @input.scan(/\s*,\s*/) or break
      end
      error("Missing value") if more_values
      @input.scan(/\s*\]/) or error("Unclosed array")
      AST.new(array)
    else
      false
    end
  end

# ...

This is identical to the process we just examined save that it's pulling
individual values in the middle instead of string/value pairs. This simplifies
the code a bit, as you can see. We also throw these objects into a Ruby Array
instead of a Hash.

Now we are ready for parse_string() and it has a couple of helpers:

# ...

  def parse_string
    if @input.scan(/"/)
      string = String.new
      while contents = parse_string_content || parse_string_escape
        string << contents.value
      end
      @input.scan(/"/) or error("Unclosed string")
      AST.new(string)
    else
      false
    end
  end

  def parse_string_content
    @input.scan(/[^\\"]+/) and AST.new(@input.matched)
  end

  def parse_string_escape
    if @input.scan(%r{\\["\\/]})
      AST.new(@input.matched[-1])
    elsif @input.scan(/\\[bfnrt]/)
      AST.new(eval(%Q{"#{@input.matched}"}))
    elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
      AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
    else
      false
    end
  end

# ...

Whenever a structure you need to read gets more complicated, you want to break
it down into smaller parsers that read more specialized pieces. Some probably
would have broken the string/value pairs out of parse_object() (into a
parse_object_pair()), but you don't gain much for that and it was just simple
enough that I opted for the easier approach. Here though we need to handle
content and escapes differently and there may be any combination of them in any
order inside the string. Now the gain is worth it.
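
For what it's worth, that hypothetical parse_object_pair() could look something
like this (my sketch, not part of the solution), with the while loop in
parse_object() calling it instead of matching the pieces itself:

  # Hypothetical helper: read a single "string": value pair, or return false.
  def parse_object_pair
    key = parse_string or return false
    @input.scan(/\s*:\s*/) or error("Expecting object separator")
    AST.new([key.value, parse_value.value])
  end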

Content is easy enough to handle, since we can pass it through unaltered. It's
already content in a Ruby String object. Escapes we have to work on a little
more. Some we just unescape, but others need to be converted. I used pack() to
handle Unicode, but you can see that I was lazy and used eval() on the special
string escapes. All of these have the same meaning in Ruby and thanks to the
match I know it's safe to eval() the contents without worrying about embedded
Ruby nastiness.
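
If the eval() still bothers you, a small lookup table handles those escapes
just as well. This is only an alternative sketch; the solution above sticks
with eval():

  # Alternative: map the two-character JSON escapes straight to their Ruby
  # characters instead of eval()ing them.
  ESCAPES = { '\b' => "\b", '\f' => "\f", '\n' => "\n",
              '\r' => "\r", '\t' => "\t" }

  def parse_string_escape
    if @input.scan(%r{\\["\\/]})
      AST.new(@input.matched[-1])
    elsif @input.scan(/\\[bfnrt]/)
      AST.new(ESCAPES[@input.matched])
    elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
      AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
    else
      false
    end
  end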

With those defined, parse_string() is similar to parse_array(). Find the start
of a JSON string, pull content and escapes as long as we can, then find the end
of the string.

The last two parsers are the easiest of all:

# ...

  def parse_number
    @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
      AST.new(eval(@input.matched))
  end

  def parse_keyword
    @input.scan(/\b(?:true|false|null)\b/) and
      AST.new(eval(@input.matched.sub("null", "nil")))
  end

# ...

These are just match and eval() as you can plainly see. Again, the eval()s are
safe because the match ensures we aren't facing any dangerous content.

Some feel that using regular expressions like this isn't true parsing. We
certainly could have chopped the number rule down into a bunch of smaller rules.
However, the number definition is squarely in the domain of what regular
expressions do well and I'm more of a practical kind of guy. I have access to
regular expressions with this setup, the needed expression isn't really all that
complex, and I like easy jobs. Thus I use them.
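
As an aside, the same match also converts cleanly without eval(), using
Integer() and Float() (again, just an alternative sketch):

  def parse_number
    @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
      AST.new(@input.matched[/[.eE]/] ? Float(@input.matched) :
                                        Integer(@input.matched))
  end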

All that is left are the two helpers I used, though the implementations won't be
any surprise:

# ...

  def trim_space
    @input.scan(/\s+/)
  end

  def error(message)
    if @input.eos?
      raise "Unexpected end of input."
    else
      raise "#{message}: #{@input.peek(@input.string.length)}"
    end
  end
end

First, trim_space() can just try a match to advance us past any whitespace. It
may fail, because there wasn't any whitespace to skip, but that doesn't affect
us any. We know that we aren't about to read whitespace after it is called,
either way.

My error() wrapper just raise()s exceptions. It adds the content left to parse
so you can see where you had trouble or replaces the message altogether to warn
you that all of the content was exhausted.

That's all it takes to build a simple JSON parser. Take some time to go look
through the other hand-rolled solutions now and I bet you'll be surprised by how
similarly they work. Then you can look into grammars and how they simplify the
process of defining new parsers.

The final Ruby Quiz will take us back into the world of finance...
 

Eric Mahurin


We saw a large variety of solutions for this week's problem. Many of them used
a parser generator to construct their parser. You do that by defining a grammar
that describes the syntax you need to read. The parser generator then
translates your grammar into parsing code that will match the described syntax.
Of the generators used, Treetop was definitely the most popular and is surely
worth a look if you want to do some grammar based parsing.


But not nearly the fastest... My very old Grammar class already generates
parsers 5-10X faster than treetop and my next release will give another
5-10X plus using ruby2cext will give another 5-10X (generates code that
optimizes well with ruby2cext). Only a pure-C parser would beat it. I've
had this stuff sitting around locally for too long and need to release.

... i'll get off my (biased) soapbox now ...

<snip>

Eric Mahurin sent in a hand-rolled solution that has quite a few advantages.
First, it is trivial to adapt so that the entire content to be parsed doesn't
need to be read into memory. It's also some very efficient code.

Computationally, I believe this is the simplest (most efficient) type -
LL(1). It parses "L"eft-to-right and derives the "L"eftmost part of a grammar
using (1) token/character. When the token is a character it becomes very easy
to parse from a file (you just need a variable to hold the next character in
the file, previously read with IO.getc).
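
A rough sketch of what that single character of lookahead over an IO can look
like (just an illustration of the idea):

# One character of lookahead over an IO is all an LL(1) parser needs.
class CharScanner
  def initialize(io)
    @io   = io
    @char = @io.getc  # the single lookahead character
  end

  # Look at the next character without consuming it.
  def peek
    @char
  end

  # Consume the lookahead character and refill it from the IO.
  def next!
    current = @char
    @char   = @io.getc
    current
  end
end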

Unfortunately, that makes it a touch less obvious if you aren't already
familiar with parsing techniques.


Yep. When rolling by hand, Regexp makes it a bit easier to do recursive
descent parsers since you can easily do more than just a character at a
time. If no Regexp has unlimited match length, you'll have an LL(k) parser
where k is that max match length. You could adapt reading from a file by
keeping a string buffer of the next (k) characters in the file. If you have
a Regexp that can have unlimited match length, I think you might call that
an LL(*) parser. You'll have more problems parsing from a file in this
case, unless possibly you can keep all matches contained in a line
(IO.gets would act as a mini-parser looking for newline).

Look here if you want to see more about LL parsers:

http://en.wikipedia.org/wiki/LL_parser

The AST definition also merits a bit of discussion. It would have been nice to
just have each method return the Ruby objects for the JSON it parsed. However,
false and nil (null in JSON) are legal JSON parses in some places. If I return
those, I won't be able to tell if my parse succeeded or failed. To get around
that, all parsed JSON values are wrapped in a trivial abstract syntax tree
object. Returning this object is always true and, after I've verified that the
parse worked, it's just one more method call to retrieve the actual value it
parsed.


The approach I like to take is to have the caller pass the AST to be modified.
I just use a sequence (Array or even String) to represent the sequence of
branches at that level. Then each grammar method can put anywhere from zero to
an arbitrary number of branches in the AST. There is much more flexibility.
The grammar method just returns true or false for whether it matches,
independent of the AST. Some parsers don't even need an AST (or only a sparse
one) since things might be done on the fly as parsing occurs. Passing the AST
separately works in all these cases.
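
In code, the convention looks roughly like this (a simplified sketch, borrowing
the StringScanner in @input from the summary's parser): the caller hands in the
sequence and the grammar method only reports whether it matched.

# Sketch: the grammar method appends zero or more branches to the caller's
# AST and returns true/false for match/no-match.
def parse_keyword(ast)
  if @input.scan(/\b(?:true|false|null)\b/)
    ast << { "true" => true, "false" => false, "null" => nil }[@input.matched]
    true
  else
    false
  end
end

branches = []
parse_keyword(branches)  # => true, and branches is now [true] for input "true"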
 

James Gray

But not nearly the fastest...

That's true, but I think the need for raw speed in parsers is seldom
the top priority. Large XML can often cause speed issues, but outside
of that I don't personally need many super fast parsers. Maybe I just
don't run into those problems a lot though.

James Edward Gray II
 

Clifford Heath

James said:
That's true, but I think the need for raw speed in parsers is seldom
the top priority.

I was frankly amazed how much of the discussion was about speed.
Besides, there's an awful lot of useful languages that you can't
do with LL(1). Including Ruby and C...

That said, Treetop is very slow, and we need to improve that.
Fortunately, it's not very hard to.
 

Eric Mahurin


I was frankly amazed how much of the discussion was about speed.
Besides, there's an awful lot of useful languages that you can't
do with LL(1). Including Ruby and C...


There are techniques to convert LL(k) and LR(k) grammars to LL(1) and LR(1)
(factoring, left-recursion removal), but they can make the grammar get large
and can make the actions difficult. C shouldn't be an issue. Ruby is
definitely a beast to parse. Starting with the YACC spec, (almost) any
LL/recursive-descent/packrat parser will have a hard time dealing with the
left-recursion. Not to mention all of the lexer states. Starting from scratch
is probably a better choice for many LL/recursive-descent/packrat parsers when
handling LR grammars (with left-recursion). That being said, I have a way of
handling left-recursion directly in my development LL(1/*) parser generator.
I haven't seen any other LL parsers do it.
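
As a small illustration of that factoring (the textbook rewrite, assuming a
term() parser and a StringScanner in @input like the summary's): the
left-recursive rule  expr : expr "-" term | term  sends a naive recursive
descent parser into infinite recursion, but removing the left-recursion turns
it into iteration,  expr : term ("-" term)* :

def expr
  result = term
  while @input.scan(/\s*-\s*/)
    result = [:subtract, result, term]  # keeps the left-associative shape
  end
  result
end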
 

Clifford Heath

Eric said:
There are techniques to convert LL(k) and LR(k) grammars to LL(1) and LR(1)
(factoring, left-recursion removal), but they can make the grammar get large

That's an understatement! They must magnify the number of rules
by something like M^(k-1), where M is the number of terminals...
or something like that. It's not reasonable to consider, and still
doesn't do LL(*) like a packrat does.

C labels require 2 levels of lookahead.
 

James Gray

I was frankly amazed how much of the discussion was about speed.

Don't be. It's a common pastime for little challenges like this.
It's the easiest metric to quantify, so it makes for fun
comparisons. :)

James Edward Gray II
 

ThoML

I was frankly amazed how much of the discussion was about speed.

I personally found it quite interesting to see how well a hand-crafted
parser could perform. I initially assumed the hackish RE-based
solution would be fastest.

Another aspect would of course be how long it takes to create such a
parser in comparison to the other solutions. Unfortunately, we don't
have timings for that.
That said, Treetop is very slow, and we need to improve that.

My main concern with treetop isn't so much speed but rather the
polyglot approach. While the idea per se is pretty cool, it seems to
preclude a programmatic generation/extension of a grammar definition.
Eg store the rules as an array of lambdas, programmatically add a new
rule on demand, recreate the parser etc. If I understand it right,
this isn't easily possible with treetop. I scanned the source code a
little bit and it seems treetop requires the parser definition to be
saved on disk. I assume you're now going to tell me I'm wrong? Please
do so. I'd be glad to hear it's otherwise.

Regards,
Thomas.
 

Clifford Heath

ThoML said:
I personally found it quite interesting to see how well a hand-crafted
parser could perform. I initially assumed the hackish RE-based
solution would be fastest.

Yes, it was interesting. I just thought that a few other things
were interesting too ;-)
My main concern with treetop isn't so much speed but rather the
polyglot approach.

Well, I'm the author of the Polyglot gem, and that doesn't preclude
the idea of executing Ruby blocks during the parse. I'd be much in
favor of that actually, but it's done Nathan's way as he created
Treetop. I like the way ANTLR and grammar let you build the parse
tree you want. Or not. But remember, packrat needs parse tree nodes
that contain enough to enable the backtracking.

Since Treetop has the ability to request a custom node (using the
<my_syntax_node_class> annotation) perhaps you can do what you want
by adding code to the constructor in that class? Just remember though,
the node might get discarded during backtracking, so don't do anything
too permanent :).
Eg store the rules as an array of lambdas, programmatically add a new
rule on demand, recreate the parser etc. If I understand it right,
this isn't easily possible with treetop.

I think it would be a sin against the gods of good taste to do that,
especially when the thing that did it might get discarded!
I scanned the source code a
little bit and it seems treetop requires the parser definition to be
saved on disk. I assume you're now going to tell me I'm wrong? Please
do so. I'd be glad to hear it's otherwise.

It's otherwise. The compiler uses a Builder pattern, and then either
saves that or eval's it. When Polyglot requires a grammar, it first
passes through the normal (or rubygems+normal) require, which will
load a .rb file if one exists in the path. Otherwise it compiles
the grammar file and eval's it.

I think, without testing it, that multiple require's in the 2nd
situation will compile the grammar multiple times, which would be
a bug... I'll test for that tomorrow... bed now ;-).

Clifford Heath.
 

Eric Mahurin


I personally found it quite interesting to see how well a hand-crafted
parser could perform. I initially assumed the hackish RE-based
solution would be fastest.

I would have thought that too. I knew my LL(1) hand solution would be near
the top but was surprised that it came out on top (at least on my
machine/benchmark) especially considering the RE+eval (especially yours)
solutions. I think object creation was a big factor. My LL(1) parser
created basically nothing but the AST. The match objects likely slowed down
the RE solutions (StringScanner was probably best off since the match object
is just a string). When you turn off GC, some of the RE solutions beat
mine.

Another aspect would of course be how long it takes to create such a
parser in comparison to the other solutions. Unfortunately, we don't
have timings for that.

I'm guessing that you could crank out an initial RE solution the fastest.
But, those solutions were also the ones that failed as more tests were
added, since they were hacky. The solutions that took no shortcuts and
followed the spec to a T from the beginning may have taken longer initially,
but they had fewer issues.

My main concern with treetop isn't so much speed but rather the
polyglot approach. While the idea per se is pretty cool, it seems to
preclude a programmatic generation/extension of a grammar definition.
Eg store the rules as an array of lambdas, programmatically add a new
rule on demand, recreate the parser etc.


My Grammar classes definitely give you that ability. One of the many powers
of using Ruby as the DSL.

For example, if you deal with a "," separated list terminated by ";" for a
variety of item types, you could do something like this:

comma_list = lambda { |item| item.list1(E(?;), E(?,)) }

then use it later like this

string_list = comma_list[string] # string defined earlier
number_list = comma_list[number] # number defined earlier

One of the powers is creating your own generic grammar constructs on the
fly.
 

Benjohn Barnes

Clifford said:
I was frankly amazed how much of the discussion was about speed.
Besides, there's an awful lot of useful languages that you can't
do with LL(1). Including Ruby and C...

That said, Treetop is very slow, and we need to improve that.
Fortunately, it's not very hard to.

I was reading about PEG parsers on the wikipedia page [1], and wandered
along to a site about packrat parsers [2].

One article there [3] seems to imply that while a packrat parser's
memoization gives linear theoretical time, for almost all real grammars
(and inputs, I presume) it's actually not necessary.

They suggest that for the cases where there is an exponential time
problem, it's sensible to rely on user supplied memoization directives
in the grammar, or to add memoization selectively and automatically,
based on profiling information.
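
For a hand-rolled recursive descent parser like the one in the summary, that
memoization is essentially a cache keyed on the rule and the input position.
A rough sketch of the idea (my own illustration, assuming the summary's
StringScanner in @input):

# Packrat-style memoization: remember each rule's result (and where it left
# the scanner) per starting position, so no rule is re-run at a position.
def memoize(rule)
  @memo ||= {}
  key = [rule, @input.pos]
  if @memo.key?(key)
    result, end_pos = @memo[key]
    @input.pos = end_pos
    result
  else
    result = yield
    @memo[key] = [result, @input.pos]
    result
  end
end

def parse_array
  memoize(:array) do
    # ... the array-parsing body from the summary ...
  end
end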

I'm not sure if that's any help, or if you've already come across these
articles. Hopefully they're of some interest though.

Cheers,
Benjohn


[1]http://en.wikipedia.org/wiki/Parsing_expression_grammar
[2]http://pdos.csail.mit.edu/~baford/packrat/
[3]http://www.mercury.csse.unimelb.edu.au/information/papers/packrat.pdf
 

Clifford Heath

Benjohn said:
One article there [3] seems to imply that while a packrat parser's
memoization gives linear theoretical time, for almost all real grammars
(and inputs, I presume) it's actually not necessary.

There's a bit of a chicken/egg problem here.

Existing computer languages have been created to be easy to parse,
which means they require minimum lookahead- one or two tokens.

In part, that's because such languages are easier for humans to
parse as well, and in part because parser generators couldn't
produce efficient parsers for them... until now. Now that packrat
is on the scene, we're free to create more natural looking grammars.
They suggest that for the cases where there is an exponential time
problem, it's sensible to rely on user supplied memoization directives
in the grammar,

That requires a lot more user understanding (of where to memoize),
or a whole lot more analysis. ANTLR takes the 2nd approach, but it's
a difficult task that still causes many surprises. Treetop is slow,
but it's easily accessible to Joe Average without special skills and
without major surprises. I think it fills an important niche.
or to add memoization selectively and automatically,
based on profiling information.

That would be cool. Write an exponential-time parser, run it over
your test cases in learning mode, and then regenerate it to get a
fast parser.

Clifford Heath.
 

Benjohn Barnes

Clifford said:
Benjohn said:
One article there [3] seems to imply that while a packrat parser's
memoization gives linear theoretical time, for almost all real
grammars (and inputs, I presume) it's actually not necessary.

There's a bit of a chicken/egg problem here.

Existing computer languages have been created to be easy to parse,
which means they require minimum lookahead - one or two tokens.

In part, that's because such languages are easier for humans to
parse as well, and in part because parser generators couldn't
produce efficient parsers for them... until now. Now that packrat
is on the scene, we're free to create more natural looking grammars.

That's an interesting suggestion - that languages are as they are to a
great extent because of the difficulty of implementing their parsers, so
existing languages don't need memoized parsing. Certainly a compelling
point. Do you see newer parsing opening up language design then?

I bumped into a few suggestions of pluggable language syntax while
reading about PEG, and I definitely like that idea.

Cheers,
B
 
