[QUIZ] Parsing JSON (#155)

Clifford Heath

Joel said:
Clifford said:
Your example parsers are much harder to read than Treetop however,
space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
Surely you can sugar that up a bit, something like
space = Grammar.charset(" \t").discard

Well, maybe. I just grabbed the first incomprehensible line from
Eric's Tcl example grammar, in order to point out a difference in
our meanings for "clean and pure" :).

Clifford Heath.
 
tho_mica_l

What do you mean it mismatches? Also what do you mean by self-checking?
It passes all the unit tests for me.

I think he meant that it doesn't throw exceptions on error. This is the
wrapper I used to make it conform with the provided tests:

class JSONParser
  def initialize
    @parser = JsonParser.new
  end

  def parse(text)
    rv = @parser.parse(text)
    if rv
      return rv.value
    else
      raise RuntimeError
    end
  end
end

Regards,
Thomas.
 
tho_mica_l

rule string
  '"' char* '"' {
    def obj
      eval(
        # This should be handled in the "char" rule.
        text_value.gsub(/\\u..../) { |unicode|
          eval("0x" + unicode[2..-1]).chr
        }
      )
    end
  }
end

I was slightly surprised to find this in a full-blown parser. I think
this shortcut is a really bad choice here since it makes the parser
execute code on input like ["#{`ls -r /`}"]. This should definitely be
handled in the char rule. Justin Ethier's solution suffers from the
same problem, but he included a comment on this problem in his code.
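
For reference, the \uXXXX escapes can be decoded without eval, e.g. with
the pack("U") approach that James Gray's solution below uses. A minimal
sketch (decode_unicode is a made-up helper name):

def decode_unicode(text)
  # turn each \uXXXX escape into the corresponding UTF-8 character
  text.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack("U") }
end

decode_unicode('\\u0022Busted\\u0022')   # => "\"Busted\""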

Here are some random test cases to check this:

assert_raise(RuntimeError) { @parser.parse(%{p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[], p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[p "Busted"]}) }
assert_raise(RuntimeError) { @parser.parse(%{{1 => STDOUT.puts("Busted")}}) }
assert_raise(RuntimeError) { @parser.parse(%{"\u0022; p 123; \u0022Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{"" p 123; ""}) }
assert_equal("\u0022; p 123; \u0022Busted", @parser.parse(%{"\\u0022; p 123; \\u0022Busted"}))
assert_equal('#{p 123}', @parser.parse(%q{"#{p 123}"}))
assert_equal('["#{`ls -r`}"]', @parser.parse(%q{["#{`ls -r`}"]}))
assert_equal('#{p 123}', @parser.parse(%q{"\\u0023{p 123}"}))
assert_equal('#{p 123}', @parser.parse(%q{"\u0023{p 123}"}))


Regards,
Thomas.
 
tho_mica_l

assert_equal('["#{`ls -r`}"]', @parser.parse(%q{["#{`ls -r`}"]}))

People should really test their test cases. At least I should test mine
before sending them to the list. This line should be:

assert_equal(['#{`ls -r`}'], @parser.parse(%q{["#{`ls -r`}"]}))
 
Paolo Bonzini

Here is my try using regexes. I use the "copy-on-write trick" from
the suffix tree quiz: the regex is always anchored to the beginning of
the string using \A, and the matched text is discarded using
post_match. In some places where I don't want to discard I use (?=...).

Using Eric's benchmark I get 36kb/sec, but I haven't benchmarked any
other solution.

http://pastie.caboo.se/147201
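
A minimal sketch of the trick (take and @input are made-up names, not
the code in the pastie):

def take(rx)
  # \A anchors the match to the front of the remaining input;
  # post_match is a cheap copy-on-write substring in ruby 1.8
  m = /\A#{rx}/.match(@input) or return nil
  @input = m.post_match
  m[0]
end

A (?=...) lookahead matches the same way but consumes nothing, so the
text stays in place.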

Paolo
 
Clifford Heath

tho_mica_l said:
I think he meant that it doesn't throw exceptions on error. This is the
wrapper I used to make it conform with the provided tests:

class JSONParser
  def initialize
    @parser = JsonParser.new
  end

  def parse(text)
    rv = @parser.parse(text)
    if rv
      return rv.value
    else
      raise RuntimeError
    end
  end
end

If you check my version, you'll see that you can raise a meaningful
RuntimeError by raising @parser.failure_reason - this produces an
error message that says how far the parse progressed, and which tokens
might have allowed it to progress further.
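
That is, a version of the wrapper above using it (a sketch, assuming the
Treetop-generated class is JsonParser as in Thomas's code):

class JSONParser
  def initialize
    @parser = JsonParser.new
  end

  def parse(text)
    result = @parser.parse(text)
    # raising a String raises a RuntimeError with that message
    raise @parser.failure_reason unless result
    result.value
  end
end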

Clifford Heath.
 
Clifford Heath

tho_mica_l said:
rule string
  '"' char* '"' {
    def obj
      eval(
        # This should be handled in the "char" rule.
        text_value.gsub(/\\u..../) { |unicode|
          eval("0x" + unicode[2..-1]).chr
        }
      )
    end
  }
end

I was slightly surprised to find this in a full-blown parser. I think
this shortcut is a really bad choice here since it makes the parser
execute code on input like ["#{`ls -r /`}"].

At least I *admitted* to cheating :).
 
Paolo Bonzini

Here is my try using regexes. I use the "copy-on-write trick" from
the suffix tree quiz: the regex is always anchored to the beginning of
the string using \A, and the matched text is discarded using
post_match. In some places where I don't want to discard I use (?=...).

Using Eric's benchmark I get 36kb/sec, but I haven't benchmarked any
other solution.

http://pastie.caboo.se/147201

For what it's worth, I get 144kb/sec from tho_mica_l's solution, after
converting it to Ruby 1.8 like this:

class JSONParser

  RXE = /
    \[|\]|
    \{|\}|
    (:)|
    (,\s*[}\]])|
    ,|
    ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
    -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?=\D|$)|
    true|
    false|
    (null)|
    (?>[[:space:][:cntrl:]]+)|
    ((?>.+))
  /xmu

  def parse(json)
    ruby = json.gsub(RXE) do |t|
      if !$5.nil? || !$2.nil? then invalid($5.nil? ? $2 : $5)
      elsif !$4.nil? then 'nil'
      elsif !$1.nil? then '=>'
      elsif !$3.nil? then $3.gsub(/#/, '\\\\#')
      else
        t
      end
    end
    begin
      return eval(ruby)
    rescue Exception => e
      invalid(json)
    end
  end

  def invalid(string)
    raise RuntimeError, 'Invalid JSON: %s' % string
  end

end
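
To illustrate the technique: the gsub pass rewrites the JSON into Ruby
source, which eval then builds directly. A quick sketch:

# '{"a": [1, null]}' is rewritten to roughly '{"a"=> [1, nil]}' before eval:
p JSONParser.new.parse('{"a": [1, null]}')   # => {"a"=>[1, nil]}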
 
tho_mica_l

Just an FYI solution, JSON is a subset of YAML. So
data = YAML.load(json)

Interesting suggestion, but ruby seems to disagree:

irb(main):006:0> YAML.load_file 'test.json'
ArgumentError: syntax error on line 0, col 99: `{"a":2,"b":
3.141,"TIME":"2007-03-14T11:52:40","c":"c","d":[1,"b",3.14],"COUNT":
666,"e":{"foo":"bar"},"foo":"B\u00e4r","g":"\u677e\u672c\u884c
\u5f18","h":1000.0,"bar":"\u00a9 \u2260 \u20ac!","i":
0.001,"j":"\ud840\udc01"}'
 
Eric Mahurin

Here is my try using regexes. I use the "copy-on-write trick" from
the suffix tree quiz: the regex is always anchored to the beginning of
the string using \A, and the matched text is discarded using
post_match. In some places where I don't want to discard I use (?=...).

Using Eric's benchmark I get 36kb/sec, but I haven't benchmarked any
other solution.

http://pastie.caboo.se/147201

For what it's worth, I get 144kb/sec from tho_mica_l's solution, after
converting it to Ruby 1.8 like this:

class JSONParser

  RXE = /
    \[|\]|
    \{|\}|
    (:)|
    (,\s*[}\]])|
    ,|
    ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
    -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?=\D|$)|
    true|
    false|
    (null)|
    (?>[[:space:][:cntrl:]]+)|
    ((?>.+))
  /xmu

  def parse(json)
    ruby = json.gsub(RXE) do |t|
      if !$5.nil? || !$2.nil? then invalid($5.nil? ? $2 : $5)
      elsif !$4.nil? then 'nil'
      elsif !$1.nil? then '=>'
      elsif !$3.nil? then $3.gsub(/#/, '\\\\#')
      else
        t
      end
    end
    begin
      return eval(ruby)
    rescue Exception => e
      invalid(json)
    end
  end

  def invalid(string)
    raise RuntimeError, 'Invalid JSON: %s' % string
  end

end


Thank you. Now we can compare apples to apples with this on 1.8.6. Here
are the results with these two new benchmarks on my machine:

   ch/s  author/gem
   ----  ----------
      -  Pawel Radecki (RE, mismatch)
   3214  Justin Ethier (RE lexer + ruby eval, fixed number parsing)
   4054  Eric Mahurin (Grammar0, no lexer, no parser generation)
   4078  Eric I (Treetop, unicode broken)
   6534  oksteev (Treetop, mismatches in benchmark)
   8313  Clifford Heath (Treetop, had to remove handling of "\/")
  54586  Eric Mahurin (Grammar, no lexer, v0.5)
 137989  Paolo Bonzini (RE)
 166041  Thomas Link (RE lexer + ruby eval, ruby 1.9 results)
 220289  json
 223486  Eric Mahurin (Grammar, no lexer, unreleased)
 224823  fjson (uses C extensions)
 333368  Thomas Link & Paolo Bonzini (RE + eval, unicode broken)
 553081  Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250  json (w/ C extensions)


Looks like the above solution is the fastest pure-ruby solution out there.
It's interesting that the performance is better in 1.8.6 than in 1.9. The
"eval" solution is a nice trick, as it uses ruby's C parser to act as the
parser. But this isn't viable for most other languages you might parse;
maybe it's fine for this JSON case. The "eval" does seem a bit dangerous
though.
 
Eric Mahurin

Eric said:
I'm biased,

First up, it sounds like your package is a significant achievement,
and I don't want anything here to imply I'm not commending you on
it. But different approaches suit different goals...

Thanks!

but I don't think Treetop is super-clean and pure. Take a look
at my rubyforge grammar package if you want something really clean and
pure:

:) To me, to use a separate grammar is *more pure* than to use
a Ruby DSL. I can see how you would argue the opposite however...
Your example parsers are much harder to read than Treetop however,
compare:

space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard

with

skip space
  [ \t]
end

I agree. It is ugly. The ugliness can be easily fixed though:
1. when making a parser, derive from Grammar so that namespace is available
2. E = Element, so that you can have a shorthand for the most common atomic
Grammar
3. Performance-wise, I found that using a Set for matching was slower than
just having a few (ranged) alternatives.
4. Grammar::Element.[] is simply an alias for Grammar::Element.new. My plan
is that in the next release, I'll make all the classes "callable" by
defining methods of their name that call new. This is better when .new can
take a block (like Recurse).
5. I made a bad choice of #== for the matching method for the pattern
argument. #=== would be better and is what I'm using in my dev code (and
the Grammar0 I posted). This would allow it to work out of the box with
Ranges too. So, you could do this: alpha = E(?a..?z) | E(?A..?Z)

So, doing 1..3, in the released grammar v0.5, you'd have this instead:

space = (E[?\s] | E[?\t]).discard

(ok, so TT doesn't have a skip keyword yet, soon...)
That's what I mean by clean and pure anyhow. Not pure Ruby,
but pure grammar.

Nathan's concept is that "grammar" should become a Ruby keyword,
and should introduce the grammar meta-grammar, rather than using
existing Ruby syntax. I think that polyglot approach is much better
than the DSL approach, since the Ruby grammar isn't well-suited to
many important languages.


I don't think ruby needs any new keywords. It already has more than it
needs in my opinion. There is enough power with just classes, methods,
blocks, and operator overloading.

The point is, if/when Ruby uses a parser whose grammar is composable
with a DSL grammar you define, the result is truly seamless and
the syntax is unrestrained. Existing DSL syntax imposes too many
restraints. A language with a composable grammar would be truly
excellent!


I think I've argued why that is a bad thing :), though it definitely
has good things about it also - like your filters etc...

Ruby DSL is far too awkward to express CQL - which is my primary
project at present.


Don't know anything about CQL to answer. I would like to make what I have
general enough.
- integration of lexer/parser with other code is seamless since everything

Nice, but it won't help me when I need to target C# or Java.
Most important grammars need to be compilable to more than
one target language. Treetop's not there yet, but it's easy
to see how to do it, whereas it won't ever happen with yours
AFAICS.


It definitely could happen with mine. I do ruby code generation now, but
could have another Grammar-like class (or another mechanism) to generate
code for another language. I would have to add a new method for dumping out
the generated code (right now it is only eval'ed immediately). In my dev
code, I'm also using ruby2cext to compile it on the fly. I'll try to have a
general language infrastructure in the next release.
- (sub)grammars are objects and you combine them using operators and
lines

Can't get much easier than just "include otherGrammar", which
works in Treetop (mix the generated modules together).
Similarly to Treetop, AFAICS.


Does this memoize, or give exponential performance? Or is it optional?


In general, the parsers it generates are LL(1). It makes decisions one
character/token at a time. But, if you use the #lookahead method on any
grammar, it will give a new grammar that handles the case when the grammar
fails in the middle (backtracks to the position where the grammar started).
For most languages I've seen this is rarely needed, but it is in your back
pocket to be used when needed. I haven't done anything to try to optimize
it as it isn't needed that often.
* can make any kind of pipeline of lexers/parsers and multi-thread them.

Very cool stuff.


Yes, this is excellent, and something I've always wanted from a parser
generator. You should add your json parsers to the examples in the gem!


Yes, next release.
 
Paolo Bonzini

The "eval" does seem a bit dangerous though.

Especially if you take it a bit to the edge like this (plug into the
other solution):

RXE = /
  [\[\]\{\}[:space:][:cntrl:]truefals]+|
  (:)|
  (?:,(?>\s*)(?![}\]]))|
  ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
  -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?=\D|$)|
  (null)|
  (.+)
/xmu

def parse(json)
  ruby = json.gsub(RXE) do |t|
    if !$4.nil? then invalid($4)
    elsif !$3.nil? then 'nil'
    elsif !$1.nil? then '=>'
    elsif !$2.nil? then $2.gsub(/#/, '\\\\#')
    else
      t
    end
  end
  begin
    return eval(ruby)
  rescue Exception => e
    invalid(json)
  end
end

This works because the "null" token does not *start* with any letter
in "true" or "false". "fnull" would be happily converted to "fnil",
but eval catches that luckily.

As ugly as it is, it cranks in 200kb/sec on my machine, which should
be more than 400kb/sec on yours. 25-30% of C speed is quite
impressive (I had to say this since I was bitching about Ruby's speed
last week...).

Paolo
 
James Gray

This doc is missing two things: 1) exactly what is allowed for the
top-level json,

I guess it does hint at this when it says:

"JSON is built on two structures:"

and goes on to describe object and array. I just didn't interpret it
well.

Thanks for clarifying.
and 2) where can whitespace appear.

Good point.

Good resource. Thanks for making me aware of it.

James Edward Gray II
 
tho_mica_l

This works because the "null" token does not *start* with any letter
in "true" or "false". "fnull" would be happily converted to "fnil",
but eval catches that luckily.

I'm not sure I understand your argument. f will always be considered
illegal. It will never reach the eval(). Only text matching a clause
not defined as illegal in the rx gets fed to eval(). You could also wrap
the atoms in \b (e.g. \bnull\b), but it seemed unnecessary. Show me an
example of malicious code that passes through.
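
As a quick illustration of the \b anchoring:

/null/     =~ "fnull"   # => 1   (matches inside the word)
/\bnull\b/ =~ "fnull"   # => nil (the word boundary rejects it)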
 
tho_mica_l

This works because the "null" token does not *start* with any letter
Well, I missed the changes in the regexp and thus missed your point.

The differences between ruby18 and ruby19 are quite interesting
though. Can somebody with deeper knowledge of ruby19 affirm that the
"copy on write" strategy is still in use?
 
Clifford Heath

Eric said:
compare:
space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
So, doing 1..3, in the released grammar v0.5, you'd have this instead:
space = (E[?\s] | E[?\t]).discard

Definitely an improvement!
I don't think ruby needs any new keywords. It already has more than it
needs in my opinion. There is enough power with just classes, methods,
blocks, and operator overloading.

Part of the point of using a packrat-style parser is that there are
no true keywords, meaning words that must only be used in their KW
places. Many of Ruby's words are like that, of course. But my point
was that once you say "grammar", you're now talking a different
language, which can use as much and *only* as much of the Ruby
grammar as it needs. And when inside a rule you say {, you're back
in Ruby grammar... etc.

The normal lexer/parser split makes that fairly hard to do, as the
lexer needs to know whether to return the KW meaning or a general
identifier.
Don't know anything about CQL to answer. I would like to make what I have
general enough.

Take a look at the ORM diagrams under the images directory, and
then at the CQL version of the same models, under
<http://activefacts.rubyforge.org/svn/examples/>. You'll see a lot
of what looks like unparsable plain English... but it's not. In
fact, the language is much harder to parse than any of those examples;
I don't have any published examples of queries (these are just
declarations). The CQL parser is in
<http://activefacts.rubyforge.org/svn/lib/activefacts/cql/> - you
might be able to see where it's hard. You can be many tokens into
recognising a fact_clause before realising you're looking at a
"condition".

It's possible that with your lookahead it would be possible though.
It definitely could happen with mine. I do ruby code generation now, but
could have another Grammar-like class

Cool! I didn't realize you were generating code from this.
But, if you use the #lookahead method on any
grammar, it will give a new grammar that handles the case when the grammar
fails in the middle (backtracks to the position where the grammar started).

You mean it backtracks to the start of the file? Or to where you called
lookahead?

Clifford Heath.
 
Eric Mahurin

Eric said:
compare:
space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
So, doing 1..3, in the released grammar v0.5, you'd have this instead:
space = (E[?\s] | E[?\t]).discard

Definitely an improvement!
I don't think ruby needs any new keywords. It already has more than it
needs in my opinion. There is enough power with just classes, methods,
blocks, and operator overloading.

Part of the point of using a packrat-style parser is that there are
no true keywords, meaning words that must only be used in their KW
places. Many of Ruby's words are like that, of course. But my point
was that once you say "grammar", you're now talking a different
language, which can use as much and *only* as much of the Ruby
grammar as it needs. And when inside a rule you say {, you're back
in Ruby grammar... etc.

I definitely need to go learn about packrat/PEG stuff. It sounds
interesting after looking at Wikipedia. I still don't really understand
LR/LALR parsers. My main focus has been LL/recursive-descent, and PEG
seems to be recursive descent.

The normal lexer/parser split makes that fairly hard to do, as the
lexer needs to know whether to return the KW meaning or a general
identifier.


Fortunately in my grammar package the lexer and parser can be specified in
exactly the same way. Also, you can use anything for tokens. If you are
using a pattern to match them, it will just use the expression
(pattern===next_token) to see if it is a match. So, in the lexer, you might
generate a String for any identifier/keyword. The lexer wouldn't care
whether it is a keyword or not. In the parser, you then could use this to
match an arbitrary identifier:

E(String) # (String===next_token) is true if next_token is a String

or if you wanted to match a keyword, you'd have this type of Grammar:

E("begin") # ("begin"===next_token) is true if next_token=="begin"

You could also have some arbitrary matching function by defining your own
pattern class (with a #== (v0.5) or #=== (dev) method).
Take a look at the ORM diagrams under the images directory, and
then at the CQL version of the same models, under
<http://activefacts.rubyforge.org/svn/examples/>. You'll see a lot
of what looks like unparsable plain English... but it's not. In
fact, the language is much harder to parse than any of those examples;
I don't have any published examples of queries (these are just
declarations). The CQL parser is in
<http://activefacts.rubyforge.org/svn/lib/activefacts/cql/> - you
might be able to see where it's hard. You can be many tokens into
recognising a fact_clause before realising you're looking at a
"condition".

It's possible that with your lookahead it would be possible though.


At one time I made a mini English parser, so this seems quite doable.
Cool! I didn't realize you were generating code from this.

You mean it backtracks to the start of the file? Or to where you called
lookahead?


Just back to where that lookahead Grammar started parsing. For other
Grammar's, there is only a one char/token lookahead. If it matches the
first char/token, it is committed to that and will give an exception if a
mismatch occurs later. When you apply lookahead to a Grammar, it starts
buffering the input and will rescue these exceptions. It simply rewinds the
input to the point where this lookahead Grammar started parsing when an
exception is found. That's all there is to it.

When a lexer is used, I don't think this backtracking is needed that often.
I guess that is why this memoization is so important with packrat/PEG
parsers - no lexer.

I have to think about whether memoization would apply or not. It may work
since my stuff carries around very little state (mainly just with regards to
the input and output streams).
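
A generic sketch of the packrat idea (not the 'grammar' gem's API): cache
the outcome of each (rule, start position) pair so that backtracking never
re-parses the same span.

def apply(rule, pos)
  @memo ||= {}
  key = [rule, pos]
  return @memo[key] if @memo.key?(key)
  # each rule method returns [value, next_pos] on success, nil on failure
  @memo[key] = send(rule, pos)
end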
 
Eric Mahurin

In honor of that, this week's Ruby Quiz is to write a parser for JSON.


Here is another solution of mine:

http://pastie.caboo.se/147505

In this one, I just made a fast hand-built recursive-descent/LL(1) parser.
This is the kind of parser that I'm trying to get my 'grammar' package to
approach (using lots of optimizations). It uses no Regexp or ruby eval
(both of which have compiled C to help speed). And yet, it is the fastest
pure-ruby JSON parser we've seen (see the recursive descent line below):

   ch/s  author/gem
   ----  ----------
      -  Pawel Radecki (RE, mismatch)
   3214  Justin Ethier (RE lexer + ruby eval, fixed number parsing)
   4054  Eric Mahurin (Grammar0, no lexer, no parser generation)
   4078  Eric I (Treetop, unicode broken)
   6534  oksteev (Treetop, mismatches in benchmark)
   8313  Clifford Heath (Treetop, had to remove handling of "\/")
  17320  Alexander Stedile (RE)
  54586  Eric Mahurin (Grammar, no lexer, v0.5)
 137989  Paolo Bonzini (RE)
 166041  Thomas Link (RE lexer + ruby eval, ruby 1.9 results)
 220289  json
 223486  Eric Mahurin (Grammar, no lexer, unreleased)
 224823  fjson (uses C extensions)
 333368  Thomas Link & Paolo Bonzini (RE + eval, unicode broken)
 388670  Eric Mahurin (hand-built recursive descent)
 553081  Eric Mahurin (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250  json (w/ C extensions)


#
# JSON hand-built recursive descent/LL(1) parser, by Eric Mahurin
#
require 'stringio'

class JSONParser

  def parse(s)
    @next = (@io = StringIO.new(s)).getc
    ws
    value(out = [])
    ws
    raise("EOF expected") if @next
    raise(out.inspect) unless out.length == 1
    out[0]
  end

  def error(expected, found)
    raise("expected #{expected}, found #{found ? ("'" << found << ?\') : 'EOF'}")
  end

  def value(out)
    if ?\[.equal?(@next)
      # array
      @next = @io.getc
      ws
      a = []
      unless ?\].equal?(@next)
        value(a)
        ws
        until ?\].equal?(@next)
          ?\,.equal?(@next) ? (@next = @io.getc) : error("','", @next)
          ws
          value(a)
          ws
        end
      end
      @next = @io.getc
      out << a
    elsif ?\{.equal?(@next)
      # object
      @next = @io.getc
      ws
      h = {}
      unless ?\}.equal?(@next)
        ?\".equal?(@next) ? string(kv = []) : error("a string", @next)
        ws
        ?\:.equal?(@next) ? (@next = @io.getc) : error("':'", @next)
        ws
        value(kv)
        ws
        h[kv[0]] = kv[1]
        until ?\}.equal?(@next)
          ?,.equal?(@next) ? (@next = @io.getc) : error("','", @next)
          ws
          ?\".equal?(@next) ? string(kv.clear) : error("a string", @next)
          ws
          ?\:.equal?(@next) ? (@next = @io.getc) : error("':'", @next)
          ws
          value(kv)
          ws
          h[kv[0]] = kv[1]
        end
      end
      @next = @io.getc
      out << h
    elsif (?a..?z) === @next
      # boolean
      (s = "") << @next
      @next = @io.getc
      while (?a..?z) === @next
        s << @next
        @next = @io.getc
      end
      out << case s
             when "true" then true
             when "false" then false
             when "null" then nil
             else error("'true' or 'false' or 'null'", s)
             end
    elsif ?\".equal?(@next)
      string(out)
    else
      # number
      n = ""
      (n << @next; @next = @io.getc) if ?-.equal?(@next)
      ?0.equal?(@next) ? (n << @next; @next = @io.getc) : digits(n)
      (?..equal?(@next) ?
        (n << @next; @next = @io.getc; digits(n); exp(n); true) :
        exp(n)) ?
        (out << n.to_f) :
        (out << n.to_i)
    end
  end

  # Flattening any of the methods below will improve performance further

  def ws
    @next = @io.getc while (case @next; when ?\s, ?\t, ?\n, ?\r; true; end)
  end

  def digits(out)
    (?0..?9) === @next ? (out << @next; @next = @io.getc) : error("a digit", @next)
    while (?0..?9) === @next; (out << @next; @next = @io.getc); end
    true
  end

  def exp(out)
    (case @next; when ?e, ?E; true; end) ? (out << @next; @next = @io.getc) : return
    (out << @next; @next = @io.getc) if (case @next; when ?-, ?+; true; end)
    digits(out)
  end

  def string(out)
    # we've already verified the starting "
    @next = @io.getc
    s = ""
    until ?\".equal?(@next)
      if ?\\.equal?(@next)
        @next = @io.getc
        case @next
        when ?\", ?\\, ?\/ then (s << @next; @next = @io.getc)
        when ?b then (s << ?\b; @next = @io.getc)
        when ?f then (s << ?\f; @next = @io.getc)
        when ?n then (s << ?\n; @next = @io.getc)
        when ?r then (s << ?\r; @next = @io.getc)
        when ?t then (s << ?\t; @next = @io.getc)
        when ?u
          @next = @io.getc
          u = ""
          4.times {
            case @next
            when ?0..?9, ?a..?f, ?A..?F
              u << @next
              @next = @io.getc
            else
              error("a hex character", @next)
            end
          }
          s << u.to_i(16)
        else
          error("a valid escape", @next)
        end
      else
        error("a character", @next) unless @next
        s << @next
        @next = @io.getc
      end
    end
    @next = @io.getc
    out << s
  end

end
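
Example usage, under ruby 1.8 where ?x literals and StringIO#getc return
Integer character codes:

parser = JSONParser.new
p parser.parse('{"a": [1, 2.5, true, null]}')
# => {"a"=>[1, 2.5, true, nil]}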
 
James Gray

Here's my own recursive descent parser (based on the not-quite-correct
quiz tests):

#!/usr/bin/env ruby -wKU

require "strscan"

# http://json.org/
class JSONParser
  AST = Struct.new(:value)

  def parse(input)
    @input = StringScanner.new(input.strip)
    parse_value.value
  end

  private

  def parse_value
    parse_object or
    parse_array or
    parse_string or
    parse_number or
    parse_keyword or
    error("Illegal JSON value")
  end

  def parse_object
    if @input.scan(/\{\s*/)
      object = Hash.new
      while key = parse_string
        @input.scan(/\s*:\s*/) or error("Expecting object separator")
        object[key.value] = parse_value.value
        @input.scan(/\s*,\s*/) or break
      end
      @input.scan(/\s*\}\s*/) or error("Unclosed object")
      AST.new(object)
    else
      false
    end
  end

  def parse_array
    if @input.scan(/\[\s*/)
      array = Array.new
      while contents = parse_value rescue nil
        array << contents.value
        @input.scan(/\s*,\s*/) or break
      end
      @input.scan(/\s*\]\s*/) or error("Unclosed array")
      AST.new(array)
    else
      false
    end
  end

  def parse_string
    if @input.scan(/"/)
      string = String.new
      while contents = parse_string_content || parse_string_escape
        string << contents.value
      end
      @input.scan(/"\s*/) or error("Unclosed string")
      AST.new(string)
    else
      false
    end
  end

  def parse_string_content
    @input.scan(/[^\\"]+/) and AST.new(@input.matched)
  end

  def parse_string_escape
    if @input.scan(%r{\\["\\/]})
      AST.new(@input.matched[-1])
    elsif @input.scan(/\\[bfnrt]/)
      AST.new(eval(%Q{"#{@input.matched}"}))
    elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
      AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
    else
      false
    end
  end

  def parse_number
    @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+(?:[eE][+-]?\d+)?)?\b/) and
      AST.new(eval(@input.matched))
  end

  def parse_keyword
    @input.scan(/\b(?:true|false|null)\b/) and
      AST.new(eval(@input.matched.sub("null", "nil")))
  end

  def error(message)
    if @input.eos?
      raise "Unexpected end of input."
    else
      raise "#{message}: #{@input.peek(@input.string.length)}"
    end
  end
end

__END__

James Edward Gray II
 
