Generic Parsing Library

Discussion in 'Ruby' started by Adam Sanderson, Aug 16, 2005.

  1. I was wondering if anyone would be interested in, or knows of a generic
    parsing library. I am continually faced with reading in bizarre text
    files and parsing them. They tend to have regular structures though
    (at the whim of the researcher who made them). I'd like to write up some
    sort of declarative code to parse these files. There's a lot of room
    for reuse.

    The data tends to be structured, but not rigorously, and it changes
    whenever someone feels like it. It's not hard to parse manually, but
    wouldn't it be nice to do a little metaprogramming at the top of a
    class and say something like this? (not a rigorous example)

    class LoopDetector
      one  :header, :hash, :start_after=>/^\*+$/,
           :end_before=>/^\*+$/, :split=>/:\s+/
      many :days, LoopData, :start_after=>:header,
           :end_before=>/\n\n\n/
    end
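
One way class-level declarations like these could be collected is with a module of class macros. This is only a rough sketch: the `Declarative` module, the `Section` struct, and the `sections` accessor are hypothetical names made up for illustration, and nothing here parses a file yet. It just records what was declared, in order, so that a parser could walk the list later.

```ruby
# Hypothetical sketch: Declarative, Section, and sections are invented
# names for illustration, not part of any existing library.
module Declarative
  Section = Struct.new(:arity, :name, :type, :options)

  # Ordered list of the sections declared on the extending class.
  def sections
    @sections ||= []
  end

  # Declare a section that occurs exactly once.
  def one(name, type, options = {})
    sections << Section.new(:one, name, type, options)
  end

  # Declare a section that may repeat.
  def many(name, type, options = {})
    sections << Section.new(:many, name, type, options)
  end
end

# Placeholder for the per-day record type named in the example.
class LoopData; end

class LoopDetector
  extend Declarative

  one  :header, :hash, :start_after => /^\*+$/,
       :end_before => /^\*+$/, :split => /:\s+/
  many :days, LoopData, :start_after => :header,
       :end_before => /\n\n\n/
end

LoopDetector.sections.each do |section|
  puts "#{section.arity} #{section.name} (#{section.type.inspect})"
end
```

Because `one` and `many` are ordinary methods on the class object, the declarations run when the class body is evaluated, which is what makes the "metaprogramming at the top of a class" style possible.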

    Most of the data can be broken down into:
    - Spacer lines
    - Hashes
    - Tables
    - Garbage (no, seriously: a lot of these files contain completely
    pointless information)

    Any ideas folks?
    .adam sanderson


    Here's one example of the type of data I get to play with (in reality
    it goes from 00:00 -> 23:55 for each set of Loop Data, and there are
    about 200 sets of Raw Loop Data). For anyone who's interested this is
    loop detector data, which measures the amount of traffic on freeways.

    ***********************************
    Filename: 0076ON04.cdl
    Extracted by: CDR_Auto version 3.31 BETA g
    Creation Date: Mar27/05 (Sun)
    Creation Time: 20:23:09
    File Type: TEXT
    ***********************************


    ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
    01/01/04 (Thu)

    ---Raw Loop Data Listing---

    Time Vol Occ Flg nPds
    00:00 5 0.4% 1 15
    00:05 11 1.2% 1 15
    00:10 14 1.2% 1 15
    23:50 3 0.5% 2 15
    23:55 3 0.4% 1 15



    ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
    01/02/04 (Fri)

    ---Raw Loop Data Listing---

    Time Vol Occ Flg nPds
    00:00 0 0.0% 0 0
    00:05 0 0.0% 0 0
    00:10 0 0.0% 0 0
    00:15 0 0.0% 0 0
    23:50 0 0.0% 0 0
    23:55 26 3.8% 2 10
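
For comparison, a hand-written parser for this layout might look roughly like the sketch below. It assumes the header is fenced by lines of asterisks, that header entries are `Key: value` pairs, and that every data row has the five columns shown above; everything else (station line, date, column titles) is treated as garbage and skipped. `parse_loop_file` and `Row` are illustrative names only.

```ruby
# Sketch only: the layout assumptions are taken from the sample above,
# not from any specification of the .cdl format.
SAMPLE = <<~DATA
  ***********************************
  Filename: 0076ON04.cdl
  Creation Time: 20:23:09
  ***********************************

  ES-076R:_CN_O_1 I-5 MLK Jr Way-NB 157.13
  01/01/04 (Thu)

  ---Raw Loop Data Listing---

  Time Vol Occ Flg nPds
  00:00 5 0.4% 1 15
  00:05 11 1.2% 1 15
DATA

Row = Struct.new(:time, :vol, :occ, :flag, :periods)
ROW_PATTERN = /\A(\d\d:\d\d)\s+(\d+)\s+([\d.]+)%\s+(\d+)\s+(\d+)\z/

def parse_loop_file(text)
  header = {}
  rows = []
  in_header = false
  text.each_line do |line|
    line = line.chomp
    if line =~ /\A\*+\z/
      in_header = !in_header            # asterisk lines open and close the header
    elsif in_header
      key, value = line.split(/:\s+/, 2)
      header[key] = value
    elsif (m = ROW_PATTERN.match(line))
      rows << Row.new(m[1], m[2].to_i, m[3].to_f, m[4].to_i, m[5].to_i)
    end                                 # anything else is ignored
  end
  [header, rows]
end

header, rows = parse_loop_file(SAMPLE)
puts header["Filename"]
rows.each { |row| puts "#{row.time} vol=#{row.vol} occ=#{row.occ}" }
```

Most of this function is bookkeeping that would be repeated for every new file format, which is the part a declarative library could absorb.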
    Adam Sanderson, Aug 16, 2005
    #1

  2. Kirk Haines Guest

    On Tuesday 16 August 2005 3:46 pm, Adam Sanderson wrote:
    > I was wondering if anyone would be interested in, or knows of a generic
    > parsing library. I am continually faced with reading in bizarre text
    > files and parsing them. They tend to have regular structures though
    > (at the whim of researcher who made them). I'd like to write up some
    > sort of declarative code to parse these files. There's a lot of room
    > for reuse.


    Once upon a time, in a job far, far away, I wrote an ETL (Extract, Transform,
    Load) system in Perl. It had a lot of bells and whistles, but the core of it
    was an XML language that described data sources, transformations, and data
    destinations. The code was "compiled" to Perl to execute it.

    It was neat. Ever since I started using Ruby, I've thought that it could be
    done better with Ruby.

    For starters, the whole notion of an XML language that is parsed and turned
    into executable code could be replaced by a domain specific language. It'd
    be a beautiful thing, and I bet it could be done in a LOT fewer lines than
    when I did it in Perl.
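
To give a feel for the idea, here is a very small sketch of what an internal Ruby DSL for extract/transform/load could look like. It is not the Perl system described above; `Pipeline`, `source`, `transform`, and `destination` are names invented for this illustration, and the source is just an in-memory array rather than a real data source.

```ruby
# Hypothetical mini-DSL; every name here is made up for illustration.
class Pipeline
  # Evaluate the block with the new pipeline as self, so the DSL
  # methods can be called without an explicit receiver.
  def self.define(&block)
    pipeline = new
    pipeline.instance_eval(&block)
    pipeline
  end

  def initialize
    @transforms = []
  end

  def source(enumerable)
    @source = enumerable
  end

  def transform(&block)
    @transforms << block
  end

  def destination(&block)
    @destination = block
  end

  # Push every record through the transforms in declaration order.
  def run
    @source.each do |record|
      result = @transforms.inject(record) { |value, step| step.call(value) }
      @destination.call(result)
    end
  end
end

loaded = []
etl = Pipeline.define do
  source(["00:00 5", "00:05 11"])
  transform { |line| line.split }
  transform { |time, vol| { :time => time, :vol => vol.to_i } }
  destination { |row| loaded << row }
end
etl.run
p loaded
```

The description is plain Ruby, so there is no separate parse-and-compile step: the block is the program.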


    Kirk Haines
    Kirk Haines, Aug 16, 2005
    #2

  3. On Aug 16, 2005, at 4:46 PM, Adam Sanderson wrote:

    > I was wondering if anyone would be interested in, or knows of a
    > generic
    > parsing library.


    I've just recently been throwing together my own tool for this. I
    just got done using it in a real-world (paid) project. It's small
    and really just a chainsaw tool for data mining, but it seems to be a
    good start. I haven't documented it yet, but here are a couple of
    examples from my unit tests:

    def test_complex
      path = File.join(File.dirname(__FILE__), "ross_report.txt")
      test = self

      input(path) do
        @state = :skip
        start_skipping_at("\f")
        stop_skipping_at(/\A-[- ]+-\Z/)
        skip(/\A\s*\Z/)
        skip(/--\Z/)

        find_in_skipped(/((?:period|Week)\s+\d.+?)\s*\Z/) do |period|
          test.assert_equal("Period 02/2002", period)
        end

        stop_at("*** Selection Criteria ***")

        read do |line|
          test.assert_match(/\A\s+(?:Sales|Cust|SA)|\A[-\w]+\s+/, line)
        end
      end

      path = File.join(File.dirname(__FILE__), "car_ads.txt")

      data = input(path, "") do
        @state = :skip
        stop_skipping_at("Save Ad")
        skip(/\A\s*\Z/)

        pre { @price = @miles = nil }
        read(/\$([\d,]+\d)/) { |price| @price = price.delete(",").to_i }
        read(/([\d,]*\d)\s*m/) { |miles| @miles = miles.delete(",").to_i }

        read do |ad|
          if @price and @price < 20_000 and @miles and @miles < 40_000
            (@ads ||= Array.new) << ad.strip
          end
        end
      end

      assert_equal([<<END_AD.strip], data.ads)
    2003 Chrysler Town & Country LX
    $16,990, green, 21,488 mi, air, pw, power locks, ps, power
    mirrors,
    dual air bags, keyless entry, intermittent wipers, rear defroster,
    alloy,
    pb, abs, cruise, am/fm stereo, CD, cassette, tinted glass
    VIN:2C4GP44363R153238, Stock No:C153238, CALL DAN PERKINS AT
    1-800-432-6326
    END_AD
    end

    __END__

    The first half of that is parsing the report from Ruby Quiz #17
    (http://www.rubyquiz.com/quiz17.html). The second half is parsing a
    listing of car ads (very unstructured data) looking for cars below a
    certain price and mileage.
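
For readers without the library, the car-ad filter can be approximated in plain Ruby. The sketch below reuses the two regular expressions from the test but makes its own assumptions: ads are separated by blank lines, and the first price and first mileage found in an ad are the ones that count. `cheap_low_mileage` is a made-up helper name, and the second ad in the sample text is invented test data.

```ruby
# Sketch under stated assumptions; this is not how the library itself works.
ADS = <<~TEXT
  2003 Chrysler Town & Country LX
  $16,990, green, 21,488 mi, air, pw

  2004 Example Roadster
  $31,500, red, 9,200 mi, leather
TEXT

def cheap_low_mileage(text, max_price = 20_000, max_miles = 40_000)
  ads = text.split(/\n{2,}/)                 # one chunk per blank-line-separated ad
  matching = ads.select do |ad|
    price = ad[/\$([\d,]+\d)/, 1]            # first dollar amount, e.g. "16,990"
    miles = ad[/([\d,]*\d)\s*m/, 1]          # first number followed by "m"
    next false unless price && miles
    price.delete(",").to_i < max_price && miles.delete(",").to_i < max_miles
  end
  matching.map(&:strip)
end

puts cheap_low_mileage(ADS)
```

With the sample text, only the Chrysler ad passes both limits; the invented roadster is rejected on price.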

    If people think this looks promising, I'll be happy to make it
    available.

    James Edward Gray II
    James Edward Gray II, Aug 16, 2005
    #3
  4. It would be interesting to look at one way or another. It looks like
    it could be useful for controlling some of the parsing. My old code
    for parsing these types of files was in Java, and as with most of my
    Java code, I realized that I've been trying to write in ruby all along
    ;)
    .adam sanderson
    Adam Sanderson, Aug 16, 2005
    #4
  5. Trans Guest

    Hi Adam,

    Actually I have written something similar to what you describe, though
    it is token based. It may be adaptable to what you describe. Certainly
    it could use some tweaking, more testing and any improvements you might
    offer. Here's an example of parsing something like XML.

    require 'yaml'

    s = %Q{
    [p]
    This is plain paragraph.
    [t][b]This bold.[b.]This tee'd off.[t.]&tm;
    [p.]
    }

    tokens = []

    t = TokenParser::Token.new( :ONE )
    t.start = lambda { |match| %r{ \[ (.*?) \] }mx }
    t.stop = lambda { |match| %r{ \[ [ ]* (#{resc(match[1])}) (.*?) \. \] }mx }
    tokens << t

    t = TokenParser::UnitToken.new( :TWO )
    t.start = lambda { |match| ; %r{ \& (.*?) \; }x }
    tokens << t

    cp = TokenParser.new( *tokens )
    d = cp.parse( s )
    y d

    This outputs (don't let it scare you; it's easy to traverse the content):

    --- &id004 !ruby/array:TokenParser::Main
    - "
      "
    - &id002 !ruby/object:TokenParser::Marker
      content:
      - >

        This is plain paragraph.

      - &id001 !ruby/object:TokenParser::Marker
        content:
        - !ruby/object:TokenParser::Marker
          content:
          - This bold.
          inner_range: !ruby/range '36...46'
          match: !ruby/object:MatchData {}
          outer_range: !ruby/range '33...50'
          parent: *id001
          token: &id003 !ruby/object:TokenParser::Token
            key: :ONE
            parser:
            start: !ruby/object:proc {}
            stop: !ruby/object:proc {}
        - "This tee'd off."
        inner_range: !ruby/range '33...65'
        match: !ruby/object:MatchData {}
        outer_range: !ruby/range '30...69'
        parent: *id002
        token: *id003
      - !ruby/object:TokenParser::Marker
        content: []
        match: !ruby/object:MatchData {}
        outer_range: !ruby/range '69...73'
        parent: *id002
        token: !ruby/object:TokenParser::UnitToken
          key: :TWO
          parser:
          start: !ruby/object:proc {}
      inner_range: !ruby/range '4...74'
      match: !ruby/object:MatchData {}
      outer_range: !ruby/range '1...78'
      parent: *id004
      token: *id003
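
For anyone who wants to experiment before getting a copy, the same kind of nesting can be approximated with Ruby's standard `StringScanner`. This is not TokenParser and does not use its API; it is a stripped-down sketch that assumes `[name]` opens a tag, `[name.]` closes it, and `&name;` is a stand-alone unit token. `Node` and `parse_tags` are illustrative names.

```ruby
require "strscan"

# Sketch only, not Trans's TokenParser.
Node = Struct.new(:name, :children)

def parse_tags(text)
  root = Node.new(:root, [])
  stack = [root]                               # innermost open tag is last
  scanner = StringScanner.new(text)
  until scanner.eos?
    if scanner.scan(/\[(\w+)\.\]/)             # closing tag, e.g. [b.]
      raise "unexpected [#{scanner[1]}.]" unless stack.last.name == scanner[1]
      stack.pop
    elsif scanner.scan(/\[(\w+)\]/)            # opening tag, e.g. [b]
      node = Node.new(scanner[1], [])
      stack.last.children << node
      stack.push(node)
    elsif scanner.scan(/&(\w+);/)              # unit token, e.g. &tm;
      stack.last.children << Node.new(scanner[1], [])
    else                                       # plain text, or a stray [ or &
      stack.last.children << scanner.scan(/[^\[&]+|./m)
    end
  end
  raise "unclosed [#{stack.last.name}]" unless stack.size == 1
  root
end

text = "[p]\nThis is plain paragraph.\n" \
       "[t][b]This bold.[b.]This tee'd off.[t.]&tm;\n[p.]"
tree = parse_tags(text)
paragraph = tree.children.first
puts paragraph.children.map { |c| c.is_a?(Node) ? "<#{c.name}>" : c.inspect }
```

Unlike the token-based parser above, this sketch hard-codes the tag syntax and keeps no source ranges; it only shows the stack discipline that produces the nested structure.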

    Let me know if you'd like a copy to play with.

    T.
    Trans, Aug 17, 2005
    #5
  6. On Aug 16, 2005, at 3:46 PM, Adam Sanderson wrote:
    > I was wondering if anyone would be interested in, or knows of a
    > generic
    > parsing library. I am continually faced with reading in bizarre text
    > files and parsing them. They tend to have regular structures though
    > (at the whim of researcher who made them). I'd like to write up some
    > sort of declarative code to parse these files. There's a lot of room
    > for reuse.


    I wrote TagTreeScanner, which can be used to parse text files when
    the desired output is a hierarchy of nodes and text (i.e. an XML type
    file).

    http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html
    Gavin Kistner, Aug 17, 2005
    #6
  7. On Aug 16, 2005, at 5:36 PM, Adam Sanderson wrote:

    > It would be interesting to look at one way or another.


    I don't want to send out a big announcement message until I get
    documentation in there, but my parsing library is on RubyForge now:

    http://rubyforge.org/projects/input/

    It should be easy to figure out how to use it from the unit tests in
    CVS. I did release a gem, if you want to install it.

    James Edward Gray II
    James Edward Gray II, Aug 19, 2005
    #7
  8. Great.
    This looks very much like what I was imagining, or at least some part
    of it. I think I'll play with it a little bit today or tonight and see
    what I can do. By the way, I never knew you could do:
    ?u or ?a
    to get the int codes for a letter, how odd.
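
A note for anyone trying this on a newer interpreter: in Ruby 1.8, which was current at the time of this thread, `?a` evaluated to the integer 97. Since Ruby 1.9 the same literal evaluates to the one-character string "a", and the integer code is obtained with `ord`. A small sketch:

```ruby
# On Ruby 1.9 and later, ?a is the String "a" (on 1.8 it was the Integer 97).
char = ?a
code = char.ord        # Integer code point: 97
back = code.chr        # and back to a one-character String

puts "#{char.inspect} #{code} #{back.inspect}"   # prints: "a" 97 "a" (Ruby 1.9+)
puts ?u.ord                                       # prints: 117
```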
    .adam sanderson
    Adam Sanderson, Aug 19, 2005
    #8
  9. On Aug 19, 2005, at 12:41 PM, Adam Sanderson wrote:

    > Great.
    > This looks very much like what I was imagining, or at least some part
    > of it. I think I'll play with it a little bit today or tonight and
    > see
    > what I can do.


    I have ideas for more features and I will document the next release,
    so hopefully it will be more approachable. I'm using it to do
    real-world tasks now though, so I think it has potential.

    > By the way, I never knew you could do:
    > ?u or ?a
    > to get the int codes for a letter, how odd.


    Ruby's just full of surprises. ;)

    James Edward Gray II
    James Edward Gray II, Aug 19, 2005
    #9