Generic Parsing Library

Adam Sanderson

I was wondering if anyone would be interested in, or knows of, a generic
parsing library. I am continually faced with reading in bizarre text
files and parsing them. They tend to have regular structures, though
(at the whim of the researcher who made them). I'd like to write up some
sort of declarative code to parse these files. There's a lot of room
for reuse.

The data tends to be structured, but not rigorously, and it changes
whenever someone feels like it. It's not hard to parse manually, but
wouldn't it be nice to do a little metaprogramming at the top of a
class and say something like this? (Not a rigorous example.)

class LoopDetector
  one  :header, :hash, :start_after => /^\*+$/,
       :end_before => /^\*+$/, :split => /:\s+/
  many :days, LoopData, :start_after => :header,
       :end_before => /\n\n\n/
end
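
For anyone wondering how declarations like these could work, here is a minimal sketch of one possible approach: class-level methods that simply record each section's spec for a parser to walk later. The Declarative module, the Section struct, and the empty LoopData class are made up for illustration and do not come from any existing library. The sketch only records the declarations; it does not parse anything.

```ruby
# Hypothetical sketch: class-level macros that record section specs.
# None of these names come from an existing library.
module Declarative
  Section = Struct.new(:name, :kind, :type, :options)

  # Sections declared so far, in declaration order.
  def sections
    @sections ||= []
  end

  # Declare a section that appears exactly once.
  def one(name, type, options = {})
    sections << Section.new(name, :one, type, options)
  end

  # Declare a section that may repeat.
  def many(name, type, options = {})
    sections << Section.new(name, :many, type, options)
  end
end

# Placeholder for whatever would model one day's table.
class LoopData; end

class LoopDetector
  extend Declarative

  one  :header, :hash, :start_after => /^\*+$/,
       :end_before => /^\*+$/, :split => /:\s+/
  many :days, LoopData, :start_after => :header,
       :end_before => /\n\n\n/
end

# A real parser would walk this list; here we just print it.
LoopDetector.sections.each do |s|
  puts "#{s.kind} #{s.name}: #{s.options.keys.join(', ')}"
end
```

Keeping the declarations as plain data means the code that actually reads a file can be written once and reused across file formats.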

Most of the data can be broken down into:
- Spacer lines
- Hashes
- Tables
- Garbage (no, seriously: a lot of these files contain completely
pointless information)

Any ideas folks?
.adam sanderson


Here's one example of the type of data I get to play with (in reality
each set of Loop Data runs from 00:00 to 23:55, and there are about 200
sets of Raw Loop Data). For anyone who's interested, this is loop
detector data, which measures the amount of traffic on freeways.

***********************************
Filename: 0076ON04.cdl
Extracted by: CDR_Auto version 3.31 BETA g
Creation Date: Mar27/05 (Sun)
Creation Time: 20:23:09
File Type: TEXT
***********************************


ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/01/04 (Thu)

---Raw Loop Data Listing---

Time Vol Occ Flg nPds
00:00 5 0.4% 1 15
00:05 11 1.2% 1 15
00:10 14 1.2% 1 15
23:50 3 0.5% 2 15
23:55 3 0.4% 1 15



ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
01/02/04 (Fri)

---Raw Loop Data Listing---

Time Vol Occ Flg nPds
00:00 0 0.0% 0 0
00:05 0 0.0% 0 0
00:10 0 0.0% 0 0
00:15 0 0.0% 0 0
23:50 0 0.0% 0 0
23:55 26 3.8% 2 10
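
For comparison, this is roughly what parsing the sample above by hand could look like. It is only a sketch: parse_loop_file, Row, and the trimmed SAMPLE text are made up for illustration. It assumes the header sits between the first two lines of asterisks, and that each day starts with a date line such as "01/01/04 (Thu)". The station line is ignored to keep it short.

```ruby
# Hand-rolled sketch, not a library API: all names here are illustrative.
SAMPLE = <<~DATA
  ***********************************
  Filename: 0076ON04.cdl
  Creation Date: Mar27/05 (Sun)
  ***********************************


  ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
  01/01/04 (Thu)

  ---Raw Loop Data Listing---

  Time Vol Occ Flg nPds
  00:00 5 0.4% 1 15
  00:05 11 1.2% 1 15



  ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
  01/02/04 (Fri)

  ---Raw Loop Data Listing---

  Time Vol Occ Flg nPds
  00:00 0 0.0% 0 0
  23:55 26 3.8% 2 10
DATA

Row = Struct.new(:time, :vol, :occ, :flg, :npds)

def parse_loop_file(text)
  lines = text.lines.map(&:chomp)

  # Hash section: "Key: value" lines between the first two star lines.
  stars = lines.each_index.select { |i| lines[i] =~ /\A\*+\z/ }
  first, last = stars.first(2)
  header = lines[(first + 1)...last].to_h { |l| l.split(/:\s+/, 2) }

  # Table sections: a date line opens a day, data rows are appended to it.
  days = []
  lines[(last + 1)..].each do |line|
    case line
    when %r{\A(\d\d/\d\d/\d\d) \((\w+)\)\z}
      days << { date: $1, weekday: $2, rows: [] }
    when /\A(\d\d:\d\d)\s+(\d+)\s+([\d.]+)%\s+(\d+)\s+(\d+)\z/
      days.last[:rows] << Row.new($1, $2.to_i, $3.to_f, $4.to_i, $5.to_i)
    end
    # Anything else (spacers, column headings, garbage) is skipped.
  end

  [header, days]
end

header, days = parse_loop_file(SAMPLE)
puts header["Filename"]
days.each { |d| puts "#{d[:date]} (#{d[:weekday]}): #{d[:rows].size} rows" }
```

Most of the code is about deciding where each section starts and stops, which is the part a declarative library would take over.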
 
Kirk Haines

Adam Sanderson wrote:
I was wondering if anyone would be interested in, or knows of a generic
parsing library. I am continually faced with reading in bizarre text
files and parsing them. They tend to have regular structures though
(at the whim of researcher who made them). I'd like to write up some
sort of declarative code to parse these files. There's a lot of room
for reuse.

Once upon a time, in a job far, far away, I wrote an ETL (Extract, Transform,
Load) system in Perl. It had a lot of bells and whistles, but the core of it
was an XML language that described data sources, transformations, and data
destinations. The code was "compiled" to Perl to execute it.

It was neat. Ever since I started using Ruby, I've thought that it could be
done better with Ruby.

For starters, the whole notion of an XML language that is parsed and turned
into executable code could be replaced by a domain specific language. It'd
be a beautiful thing, and I bet it could be done in a LOT fewer lines than
when I did it in Perl.


Kirk Haines
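
To make the idea concrete, here is a minimal sketch of an internal ETL DSL in Ruby. It assumes the usual instance_eval approach, where a block is evaluated against a builder object so that bare method calls become the "language". The Pipeline class and its source, transform, and destination methods are hypothetical and are not taken from the Perl system described above.

```ruby
# Hypothetical sketch of an internal ETL DSL; every name is illustrative.
class Pipeline
  def initialize(&block)
    @steps = []
    # Evaluate the block with self set to this pipeline, so that
    # source/transform/destination can be called without a receiver.
    instance_eval(&block)
  end

  # Extract: any object that responds to #each.
  def source(enumerable)
    @source = enumerable
  end

  # Transform: blocks are applied to each record in declaration order.
  def transform(&block)
    @steps << block
  end

  # Load: a block that receives each finished record.
  def destination(&block)
    @sink = block
  end

  def run
    @source.each do |record|
      @sink.call(@steps.inject(record) { |rec, step| step.call(rec) })
    end
  end
end

loaded = []
pipeline = Pipeline.new do
  source(["00:00 5", "00:05 11"])
  transform { |line| line.split }
  transform { |time, vol| { time: time, vol: vol.to_i } }
  destination { |row| loaded << row }
end
pipeline.run

p loaded
```

Because the description is ordinary Ruby, there is no separate parse-and-compile step. The block is the program.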
 
James Edward Gray II

Adam Sanderson wrote:
I was wondering if anyone would be interested in, or knows of a generic
parsing library.

I've just recently been throwing together my own tool for this. I
just got done using it in a real-world (paid) project. It's small
and really just a chainsaw tool for data mining, but it seems to be a
good start. I haven't documented it yet, but here are a couple of
examples from my unit tests:

def test_complex
  path = File.join(File.dirname(__FILE__), "ross_report.txt")
  test = self

  input(path) do
    @state = :skip
    start_skipping_at("\f")
    stop_skipping_at(/\A-[- ]+-\Z/)
    skip(/\A\s*\Z/)
    skip(/--\Z/)

    find_in_skipped(/((?:period|Week)\s+\d.+?)\s*\Z/) do |period|
      test.assert_equal("Period 02/2002", period)
    end

    stop_at("*** Selection Criteria ***")

    read do |line|
      test.assert_match(/\A\s+(?:Sales|Cust|SA)|\A[-\w]+\s+/, line)
    end
  end

  path = File.join(File.dirname(__FILE__), "car_ads.txt")

  data = input(path, "") do
    @state = :skip
    stop_skipping_at("Save Ad")
    skip(/\A\s*\Z/)

    pre { @price = @miles = nil }
    read(/\$([\d,]+\d)/) { |price| @price = price.delete(",").to_i }
    read(/([\d,]*\d)\s*m/) { |miles| @miles = miles.delete(",").to_i }

    read do |ad|
      if @price and @price < 20_000 and @miles and @miles < 40_000
        (@ads ||= Array.new) << ad.strip
      end
    end
  end

  assert_equal([<<END_AD.strip], data.ads)
2003 Chrysler Town & Country LX
$16,990, green, 21,488 mi, air, pw, power locks, ps, power mirrors,
dual air bags, keyless entry, intermittent wipers, rear defroster, alloy,
pb, abs, cruise, am/fm stereo, CD, cassette, tinted glass
VIN:2C4GP44363R153238, Stock No:C153238, CALL DAN PERKINS AT 1-800-432-6326
END_AD
end

__END__

The first half of that is parsing the report from Ruby Quiz #17
(http://www.rubyquiz.com/quiz17.html). The second half is parsing a
listing of car ads (very unstructured data) looking for cars below a
certain price and mileage.

If people think this looks promising, I'll be happy to make it
available.

James Edward Gray II
 
Adam Sanderson

It would be interesting to look at one way or another. It looks like
it could be useful for controlling some of the parsing. My old code
for parsing these types of files was in Java, and as with most of my
Java code, I realized that I've been trying to write in Ruby all along
;)
.adam sanderson
 
Trans

Hi Adam,

Actually I have written something similar to what you describe, though
it is token-based. It may be adaptable to what you describe. Certainly
it could use some tweaking, more testing, and any improvements you might
offer. Here's an example of parsing something like XML.

require 'yaml'

s = %Q{
[p]
This is plain paragraph.
[t]This bold.[b.]This tee'd off.[t.]&tm;
[p.]
}

tokens = []

t = TokenParser::Token.new( :ONE )
t.start = lambda { |match| %r{ \[ (.*?) \] }mx }
t.stop = lambda { |match| %r{ \[ [ ]* (#{resc(match[1])}) (.*?) \. \] }mx }
tokens << t

t = TokenParser::UnitToken.new( :TWO )
t.start = lambda { |match| ; %r{ \& (.*?) \; }x }
tokens << t

cp = TokenParser.new( *tokens )
d = cp.parse( s )
y d

outputs (don't let this scare you; it's easy to traverse the content):

--- &id004 !ruby/array:TokenParser::Main
- "
"
- &id002 !ruby/object:TokenParser::Marker
content:
- >

This is plain paragraph.

- &id001 !ruby/object:TokenParser::Marker
content:
- !ruby/object:TokenParser::Marker
content:
- This bold.
inner_range: !ruby/range '36...46'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '33...50'
parent: *id001
token: &id003 !ruby/object:TokenParser::Token
key: :ONE
parser:
start: !ruby/object:proc {}
stop: !ruby/object:proc {}
- "This tee'd off."
inner_range: !ruby/range '33...65'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '30...69'
parent: *id002
token: *id003
- !ruby/object:TokenParser::Marker
content: []
match: !ruby/object:MatchData {}
outer_range: !ruby/range '69...73'
parent: *id002
token: !ruby/object:TokenParser::UnitToken
key: :TWO
parser:
start: !ruby/object:proc {}
inner_range: !ruby/range '4...74'
match: !ruby/object:MatchData {}
outer_range: !ruby/range '1...78'
parent: *id004
token: *id003

Let me know if you'd like a copy to play with.

T.
 
Gavin Kistner

Adam Sanderson wrote:
I was wondering if anyone would be interested in, or knows of a generic
parsing library. I am continually faced with reading in bizarre text
files and parsing them. They tend to have regular structures though
(at the whim of researcher who made them). I'd like to write up some
sort of declarative code to parse these files. There's a lot of room
for reuse.

I wrote TagTreeScanner, which can be used to parse text files when
the desired output is a hierarchy of nodes and text (i.e. an XML type
file).

http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html
 
James Edward Gray II

Adam Sanderson wrote:
It would be interesting to look at one way or another.

I don't want to send out a big announcement message until I get
documentation in there, but my parsing library is on RubyForge now:

http://rubyforge.org/projects/input/

It should be easy to figure out how to use it from the unit tests in
CVS. I did release a gem, if you want to install it.

James Edward Gray II
 
Adam Sanderson

Great.
This looks very much like what I was imagining, or at least some part
of it. I think I'll play with it a little bit today or tonight and see
what I can do. By the way, I never knew you could write ?u or ?a to get
the integer code for a letter. How odd.
.adam sanderson
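
A note for anyone trying this on a current interpreter: the behaviour described above is from the Ruby 1.8 era, when ?a evaluated to the integer 97. Since Ruby 1.9 the same literal is a one-character String, and String#ord returns the code. A quick check:

```ruby
# In Ruby 1.9 and later, a ?x literal is a one-character String.
puts ?a.inspect   # prints "a" (with the quotes)
puts ?a.ord       # prints 97
puts ?u.ord       # prints 117

# Integer#chr goes the other way.
puts 97.chr       # prints a
```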
 
James Edward Gray II

Adam Sanderson wrote:
Great.
This looks very much like what I was imagining, or at least some part
of it. I think I'll play with it a little bit today or tonight and see
what I can do.

I have ideas for more features and I will document the next release,
so hopefully it will be more approachable. I'm using it to do
real-world tasks now though, so I think it has potential.

Adam Sanderson wrote:
By the way, I never knew you could do:
?u or ?a
to get the int codes for a letter, how odd.

Ruby's just full of surprises. ;)

James Edward Gray II
 
