[ANN][SoC] Ariel 0.1.0 released

Discussion in 'Ruby' started by A. S. Bradbury, Aug 22, 2006.

  1. = Ariel release 0.1.0

    == About - Ariel: A Ruby Information Extraction Library
    Ariel is a library that allows you to extract information from semi-structured
    documents (such as websites). It is different to existing tools because rather
    than expecting the developer to write rules to extract the desired
    information, Ariel will use a small number of labeled examples to generate
    and learn effective extraction rules. It is developed by Alex Bradbury and
    released under the MIT license. Ariel was started as a Google Summer of Code
    project mentored by Austin Ziegler in 2006.

    == Install
    gem install ariel

    == Announcement

    I'm happy to announce the release of Ariel 0.1.0, the result of my Summer of
    Code work. This release should be easy to use, very functional, and hopefully
    useful - so it's worth trying out. I've put a lot of effort into writing
    clear and straightforward documentation to get you started, so take a look
    at the docs available at http://ariel.rubyforge.org. In particular, flick
    through the tutorial and quick start guide. If you're interested, you may
    also want to take a look at the theory page where I've made a good start on
    describing the method Ariel uses to learn extraction rules. If you have any
    problems or find any bugs, just send me an email or add it to the issue
    tracker (see link below). Enjoy. See the FAQ for a vim snippet to make
    labeling examples a little easier.

    == Quickstart/Basic usage

    * @require 'ariel'@
    * Define a structure for the information you wish to extract:
    structure = Ariel::Node::Structure.new do |r|
      r.item :title
      r.item :body
      r.list :comments do |c|
        c.list_item :comment do |d|
          d.item :author
          d.item :body
        end
      end
    end
    * Collect a few examples of the sort of document you wish to extract
    information from (pages from the same website for instance).
    * Label each example with tags such as <l:title>, <l:comment> and so on in the
    relevant places.
    * Ariel.learn structure, labeled_file1, labeled_file2, labeled_file3
    * Find the documents you want to extract information from.
    * extractions = Ariel.extract structure, unlabeled_file1,
    unlabeled_file2
    * extractions[0].search('comments/*/body').each {|e| puts e.extracted_text}
    => "Great stuff, loving it", "I love life", .....
    * extractions[0].at('comments/34') => nil (there is no 34th comment; #at
    returns the first result rather than an array of matches).


    == Credits
    Ariel is developed by Alex Bradbury as a Google Summer of Code project under
    the mentoring of Austin Ziegler.

    == Links
    SVN Repository: http://rubyforge.org/projects/ariel
    Issue tracker: http://code.google.com/p/ariel/issues/
    Documentation/homepage: http://ariel.rubyforge.org
    RDoc: http://ariel.rubyforge.org/rdoc/
    A. S. Bradbury, Aug 22, 2006
    #1

  2. Very impressive library! I remember when you posted about this at the
    beginning of the summer. I took the library and pointed it at the
    USCCB's online version of the Bible and got some very impressive
    results! It was able to identify book, chapter and verse with only a
    few examples.

    With only 3 sample pages and the following structure, I was able to
    get very reliable results:

    structure = Ariel::Node::Structure.new do |r|
      r.item :book do |b|
        b.item :title
        b.item :chapter do |c|
          c.item :title
          c.list :verses do |v|
            v.list_item :verse
          end
        end
      end
    end

    I was particularly impressed that it understood how I re-used the
    title tag in different contexts (i.e. for the book and chapter title).

    If you'd like me to email you my structure files, my examples, and the
    tests I use them in I'd be glad to. It's three small files but
    probably too much for the list.

    Comments and questions I jotted down while playing with this:

    * Most of the chapter pages have footnotes interspersed throughout the
    text. These are hyperlinks to anchors below the main body of the
    chapter. Can Ariel correctly identify footnotes and pull in the text
    for them?

    * Ariel gets confused if you have tags in the example document that
    are not in the structure, but look like Ariel tags. Example: I had a
    <l:verses> tag which contained <l:verse> tags. I realized 'verses'
    was not needed so removed the item definition from the structure but
    not the example file. Ariel was not able to find the verse items until
    I removed the <l:verses> tag.

    * Typing "extracted_text" to get the text of each node is cumbersome.
    If it's not already, maybe overload to_s on nodes to display the text?

    * Dealing with items is a little cumbersome. To get the number of
    verses in a chapter, I have to type
    e[:book][:chapter][:verses].children.length. Since I am already
    treating the nodes like arrays, having a 'length' method would be
    nice: e[:book][:chapter][:verses].length

    * Better progress indication during learning phase. Hard to tell if
    program is hung or if it is managing to do something. The CPU is
    pegged but it's hard to tell what progress is being made.

    * More info about the search/at methods and expressions they can take.
    RDOC and the tutorial only hint at what you can do.

    * Falls apart if tags entered are not well formed and gives little
    indication why. For example, I had missed an end tag on a list_item.
    The program didn't use the examples provided (said "learning node X
    with 2 examples" when I had 3) and then would quit with the error "No
    examples are suitable for exhaustive rule learning"
    Justin Bailey, Aug 23, 2006
    #2

  3. On Wednesday 23 August 2006 00:25, Justin Bailey wrote:
    > Very impressive library! I remember when you posted about this at the
    > beginning of the summer. I took the library and pointed it at the
    > USCCB's online version of the Bible and got some very impressive
    > results! It was able to identify book, chapter and verse with only a
    > few examples.


    Thank you so much for taking the time to write this detailed email.

    > With only 3 sample pages, and the following structure I was able to
    > get very reliable results
    >
    > structure = Ariel::Node::Structure.new do |r|
    >   r.item :book do |b|
    >     b.item :title
    >     b.item :chapter do |c|
    >       c.item :title
    >       c.list :verses do |v|
    >         v.list_item :verse
    >       end
    >     end
    >   end
    > end
    >
    > I was particularly impressed that it understood how I re-used the
    > title tag in different contexts (i.e. for the book and chapter title).


    This is because of the way it checks for nesting when extracting label tags,
    which is why it gets confused when you have an extra tag.
    <l:item><l:title>....</l:title>...</l:item><l:title>...</l:title>
    When it encounters the first <l:item>, it increments the nesting level by one,
    and again when it encounters the first <l:title>. The two closing tags
    decrement it, and then when <l:title> (which we're searching for in this
    example) is encountered and the nesting level is 0, we know it's the right
    one.
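
    The nesting check described above can be sketched roughly as follows. This
    is a minimal illustration, not Ariel's actual code: the helper name
    top_level_labels and the regular expression are assumptions made for the
    example, and it assumes well-formed label tags.

```ruby
# Hypothetical helper (not part of Ariel): returns the text inside every
# <l:NAME> element that opens at nesting level 0 of the labeled string.
def top_level_labels(text, name)
  depth = 0
  start = nil
  found = []
  text.scan(%r{<(/?)l:(\w+)>}) do |slash, tag|
    match = Regexp.last_match
    if slash.empty?
      # Opening tag: note where the content begins if this is the label we
      # want and we are not nested inside any other label.
      start = match.end(0) if depth.zero? && tag == name
      depth += 1
    else
      # Closing tag: back out one level, and collect the text if this closes
      # the top-level label we were tracking.
      depth -= 1
      if depth.zero? && tag == name && start
        found << text[start...match.begin(0)]
        start = nil
      end
    end
  end
  found
end

labeled = '<l:item><l:title>Inner</l:title> text</l:item><l:title>Outer</l:title>'
p top_level_labels(labeled, 'title')  # => ["Outer"]
```

    The title nested inside <l:item> is skipped because the nesting level is
    1 when it opens. A stray wrapper tag raises the level in exactly the same
    way, which is why an extra tag hides everything inside it.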

    > If you'd like me to email you my structure files, my examples, and the
    > tests I use them in I'd be glad to. It's three small files but
    > probably too much for the list.


    Yes, please do email them to me.

    > Comments and questions I jotted down while playing with this:
    >
    > * Most of the chapter pages have footnotes interspersed throughout the
    > text. These are hyperlinks to anchors below the main body of the
    > chapter. Can Ariel correctly identify footnotes and pull in the text
    > for them?


    I'll have to take a look at your examples, but if I understand correctly, not
    really. Perhaps you can extract footnote references (as in #footnote34) from
    the relevant page section, and separately extract all footnotes (with a
    footnote.reference). Then you can match them up. Is this what you're trying
    to do? I haven't thought about having linked items like that
    before... interesting.
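
    Matching the two extractions up afterwards could look something like the
    sketch below. The data shapes are assumptions for illustration only (plain
    hashes with made-up keys, not what Ariel returns).

```ruby
# Assumed shapes (not Ariel output): references found in the chapter text,
# and footnotes extracted separately, each carrying the anchor it belongs to.
references = [
  { anchor: '#footnote1', verse: 3 },
  { anchor: '#footnote2', verse: 7 }
]
footnotes = [
  { reference: '#footnote2', text: 'Second note' },
  { reference: '#footnote1', text: 'First note' }
]

# Index the footnotes by anchor, then look each reference up in the index.
by_anchor = footnotes.to_h { |f| [f[:reference], f[:text]] }
linked = references.map { |r| r.merge(note: by_anchor[r[:anchor]]) }

linked.each { |r| puts "verse #{r[:verse]}: #{r[:note]}" }
```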

    > * Ariel gets confused if you have tags in the example document that
    > are not in the structure, but look like Ariel tags. Example: I had a
    > <l:verses> tag which contained <l:verse> tags. I realized 'verses'
    > was not needed so removed the item definition from the structure but
    > not the example file. Ariel was not able to find the verse items until
    > I removed the <l:verses> tag.


    The checking and error reporting when parsing labeled documents isn't that
    great at the moment. I'll have to rework it a bit to make it easier to work
    out where the errors are, if any exist. I'm not sure I follow here, though: you
    have :verses in your example above. Putting list items in a container is the
    recommended way of doing things:

    <ul>
    <l:verses><li><l:verse>Verse 1</l:verse></li>
    <li><l:verse>Verse 2</l:verse></li>
    <li><l:verse>Verse 3</l:verse></li></l:verses>
    </ul>

    You could put the <l:verses> right next to the first <l:verse>, and likewise
    the </l:verses> right next to the last </l:verse>, if you wanted.

    When defining structure, you should really only put a list_item as a single
    child of a list (internally a list is just an item... I mean, if you think
    about it, extracting the whole list above is the same as extracting any other
    piece of text that occurs once). If you have multiple list_items at the same
    level I think you'd get a lot of things going wrong. I'll add a check for
    this - a list_item should have no siblings.

    > * Typing "extracted_text" to get the text of each node is cumbersome.
    > If it's not already, maybe overload to_s on nodes to display the text?


    Will do this.

    > * Dealing with items is a little cumbersome. To get the number of
    > verses in a chapter, I have to type
    > e[:book][:chapter][:verses].children.length. Since I am already
    > treating the nodes like arrays, having a 'length' method would be
    > nice: e[:book][:chapter][:verses].length


    This is a case where I'd like you to use #search. It's easier for you too -
    what if no value for chapter was extracted for whatever reason? You'd get an
    error with the code above (because you'd be using [] on the nil value
    returned by e[:book][:chapter]), but e.search('book/chapter/verses/*').length
    would just return 0. (e/'book/chapter/verses/*').length is equivalent. I
    guess I haven't defined #size/#length because it makes sense when you're
    talking about a list, but means little when you're talking about
    e.chapter.size I think. #size = number of children seems reasonable enough
    though, I'll add that.
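
    The difference between chaining [] and searching by path can be seen with
    a small stand-in. This is only a sketch under stated assumptions: it uses
    plain hashes and arrays rather than Ariel's node classes, and the search
    and children methods here are hypothetical helpers, not Ariel's own.

```ruby
# Children of a stand-in node: hash values, array elements, or nothing.
def children(node)
  case node
  when Hash then node.values
  when Array then node
  else []
  end
end

# Hypothetical glob-style path walker. A missing step simply yields no
# matches, so the result is always an array (possibly empty).
def search(node, path)
  path.split('/').reduce([node]) do |nodes, key|
    nodes.flat_map do |n|
      if key == '*'
        children(n)
      elsif n.is_a?(Hash) && n.key?(key.to_sym)
        [n[key.to_sym]]
      else
        []
      end
    end
  end
end

# An extraction in which no chapter was found.
e = { book: { title: 'Genesis' } }

begin
  e[:book][:chapter][:verses]
rescue NoMethodError => err
  puts "chained []: #{err.class}"   # [] was called on nil
end

puts search(e, 'book/chapter/verses/*').length   # prints 0

full = { book: { chapter: { verses: %w[v1 v2 v3] } } }
puts search(full, 'book/chapter/verses/*').length  # prints 3
```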

    > * Better progress indication during learning phase. Hard to tell if
    > program is hung or if it is managing to do something. The CPU is
    > pegged but it's hard to tell what progress is being made.


    Are you at least seeing messages like this?
    info: Learning rules for node version_history with 2 examples
    info: Learnt start rules [#<Ariel::Rule:0xb79d7c64 @exhaustive=false,
    @direction=:forward, @landmarks=[["<td>"], ["Versions"], ["<td>"]]>]

    You can fill your screen with status updates by using the -D switch if using
    the command line script, by setting $DEBUG or by Ariel::Log.set_level :debug

    It's hard to know what status information to output. Other than printing the
    name of the item we're learning rules for and the rules as they're learnt,
    I'm not sure what else would mean something to the user who isn't familiar
    with Ariel internals and wouldn't be too excessively verbose. If you just
    want to know something's going on behind the scenes, then try one of the
    switches above.

    > * More info about the search/at methods and expressions they can take.
    > RDOC and the tutorial only hint at what you can do.


    They're very limited at the moment: there's nothing more to them than listing
    parameters between /, and * is supported much like directory globbing.
    There's no way to specify certain parameters (like to select only verses
    lists with more than 5 children). But then Ruby has powerful array operations
    like #select and #reject for this sort of querying. I made this interface as
    basic as possible, not being sure what people would need/use. What sort of
    queries would you like to perform? I was planning on adding range selection,
    so you could do e.search 'book/chapter/verses/[0..5]/whatever'. Clearly in
    your structure it would be as easy to just slice the result array.
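
    As a rough illustration of doing that filtering in plain Ruby, the sketch
    below assumes the search results have already been turned into ordinary
    arrays (the sample data is made up; real results would be Ariel nodes).

```ruby
# Made-up stand-in for several extracted verses lists.
verses_lists = [
  %w[v1 v2 v3],
  %w[v1 v2 v3 v4 v5 v6 v7],
  %w[v1]
]

# Keep only the lists with more than 5 children.
long_lists = verses_lists.select { |list| list.length > 5 }

# Drop them instead.
short_lists = verses_lists.reject { |list| list.length > 5 }

# Range selection is just an array slice on the result.
first_six = long_lists.first[0..5]

puts long_lists.length    # prints 1
puts short_lists.length   # prints 2
puts first_six.inspect    # prints ["v1", "v2", "v3", "v4", "v5", "v6"]
```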

    This is where I could really use some practical examples to flesh out the
    documentation. Maybe some of the functionality people might want is easily
    provided by Ruby's standard library, but the documentation should give
    pointers on where to look and suggest useful techniques.

    > * Falls apart if tags entered are not well formed and gives little
    > indication why. For example, I had missed an end tag on a list_item.
    > The program didn't use the examples provided (said "learning node X
    > with 2 examples" when I had 3) and then would quit with the error "No
    > examples are suitable for exhaustive rule learning"


    I mentioned this problem with error reporting above. I've added it to the issue
    tracker; this is definitely something that makes Ariel less user friendly.

    Can you recreate this with one of your labeled files? The message "learning
    node X with 2 examples" when there are 3 seems a little odd. The "No
    examples are suitable for exhaustive rule learning" error takes a little bit of
    explaining that I probably don't have time to do properly. But basically,
    taking the example I used above, I could have labeled it like this:

    <ul>
    <li><l:verses><l:verse>Verse 1</l:verse></li>
    <li><l:verse>Verse 2</l:verse></li>
    <li><l:verse>Verse 3</l:verse></l:verses></li>
    </ul>

    Remember how Ariel learns rules - it finds a rule that consumes all the tokens
    up to the one that is labeled. Assume we're finding start rules (end rules
    have the same issue), there are no tokens between the beginning of the
    extracted verses list and the label. So the only possible rule is an empty
    rule with no landmarks, which of course can't be applied exhaustively to
    iterate over the whole list. This is why this example must be ignored in the
    current (somewhat naive) implementation. I find it works pretty well, it just
    could be better. Looking into this is one of my post-SoC aims. A problem
    with lists is that you don't want to make users label every item, or even
    count them. The good thing is that lists are generally very regular and have
    simple rules to split them.
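
    To make the problem with an empty rule concrete, here is a heavily
    simplified sketch. It is not Ariel's implementation: a rule is assumed to
    be a flat list of single-token landmarks, and apply_rule and
    apply_exhaustively are hypothetical names chosen for the example.

```ruby
# Applying a rule skips forward to each landmark in turn and returns the
# index just past the last one, or nil if a landmark cannot be found.
def apply_rule(landmarks, tokens, from = 0)
  landmarks.reduce(from) do |pos, landmark|
    offset = tokens[pos..].index(landmark)
    return nil if offset.nil?
    pos + offset + 1
  end
end

# Exhaustive application: keep re-applying the rule from the previous match
# to find the start of every list item. An empty rule consumes no tokens and
# would never advance, so it is rejected here.
def apply_exhaustively(landmarks, tokens)
  raise ArgumentError, 'empty rule cannot be applied exhaustively' if landmarks.empty?
  starts = []
  pos = 0
  while (pos = apply_rule(landmarks, tokens, pos))
    starts << pos
  end
  starts
end

tokens = %w[<ul> <li> Verse 1 </li> <li> Verse 2 </li> <li> Verse 3 </li> </ul>]
p apply_exhaustively(['<li>'], tokens)  # => [2, 6, 10]
p apply_rule([], tokens)                # => 0, i.e. no progress at all
```

    With <li> as the landmark, each application lands on the first token of
    the next verse. The empty rule returns the position it started from, so
    repeating it can never walk down the list.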

    Returning to the example, if we can't make a start rule that locates the start
    of the first verse, then how do we extract it? The answer is the end rule.
    Say we have an end rule that has </li> as a landmark: the lowest end location
    will have a position less than the first start location, so Ariel assumes
    that all tokens from the first to the lowest end location are a list item.
    Hope that makes a little sense. This isn't something I've explained much, if at
    all, in the documentation, because it requires quite a lot of understanding of
    how Ariel works, and I'm hoping to look at ways to change the way this works.

    Thanks so much again for taking the time to look through my project and share
    your experiences, hope my response has been some help. Apologies if it's a
    little long.

    Regards,

    Alex
    A. S. Bradbury, Aug 23, 2006
    #3
  4. Hi,

    > = Ariel release 0.1.0
    >
    > == About - Ariel: A Ruby Information Extraction Library


    _WAY_ cool!

    Only one reply post just wasn't enough I thought, very cool project indeed
    :)

    Kash

    --

    Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
    Kashia Buch, Aug 23, 2006
    #4
  5. A. S. Bradbury wrote:
    > They're very limited at the moment, there's nothing more to them than listing
    > parameters between /, and * are supported much like directory globbing.
    > There's no way to specify certain parameters (like to select only verses
    > lists with more than 5 children). But then Ruby has powerful array operations
    > like #select and #reject for this sort of querying. I made this interface as
    > basic as possible, not being sure what people would need/use. What sort of
    > queries would you like to perform? I was planning on adding range selection,
    > so you could do e.search 'book/chapter/verses/[0..5]/whatever'. Clearly in
    > your structure it would be as easy to just slice the result array.


    If I look at the queries that can be performed now they look somewhat
    similar to XPath expressions. The structure you define also looks a lot
    like an XML structure, so maybe the data extracted can be easily
    converted to XML? This would allow you to use REXML's powerful XPath
    support to query the results. Next to this the results can easily be
    exported as XML too.

    Regards,

    Peter
    Peter C. Verhage, Aug 23, 2006
    #5
  6. On Wednesday 23 August 2006 22:35, Peter C. Verhage wrote:
    > A. S. Bradbury wrote:
    > > They're very limited at the moment, there's nothing more to them than
    > > listing parameters between /, and * are supported much like directory
    > > globbing. There's no way to specify certain parameters (like to select
    > > only verses lists with more than 5 children). But then Ruby has powerful
    > > array operations like #select and #reject for this sort of querying. I
    > > made this interface as basic as possible, not being sure what people
    > > would need/use. What sort of queries would you like to perform? I was
    > > planning on adding range selection, so you could do e.search
    > > 'book/chapter/verses/[0..5]/whatever'. Clearly in your structure it would
    > > be as easy to just slice the result array.

    >
    > If I look at the queries that can be performed now they look somewhat
    > similar to XPath expressions. The structure you define also looks a lot
    > like an XML structure, so maybe the data extracted can be easily
    > converted to XML? This would allow you to use REXML's powerful XPath
    > support to query the results. Next to this the results can easily be
    > exported as XML too.


    I hadn't got round to XML export yet, but this seems like an excellent reason
    to bump it up higher on the todo list. I wanted to avoid reimplementing
    XPath, which is why I chose a very simple query interface that is quite
    similar to globbing directories (no ** at the moment). XML export of course
    means I don't have to reimplement this myself, excellent. I'll add #to_xml to
    the extracted data structures.
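
    Since #to_xml doesn't exist yet, the idea can only be sketched with a
    stand-in. The example below assumes the extracted data is available as
    nested hashes and arrays (not Ariel's node classes) and uses a hypothetical
    add_children helper to build a REXML document, which is then queried with
    REXML's XPath support.

```ruby
require 'rexml/document'

# Hypothetical converter: each hash key becomes an element, array values
# become repeated elements, and anything else becomes element text.
def add_children(parent, data)
  data.each do |name, value|
    items = value.is_a?(Array) ? value : [value]
    items.each do |item|
      element = parent.add_element(name.to_s)
      if item.is_a?(Hash)
        add_children(element, item)
      else
        element.text = item.to_s
      end
    end
  end
end

# Made-up stand-in for extracted data.
extracted = {
  book: {
    title: 'Genesis',
    chapter: {
      title: 'Chapter 1',
      verse: ['In the beginning', 'And the earth', 'And God said']
    }
  }
}

doc = REXML::Document.new
add_children(doc, extracted)

# Query the result with XPath instead of a hand-rolled path language.
puts REXML::XPath.match(doc, '//verse').length
puts REXML::XPath.first(doc, '/book/chapter/title').text

# The same document can be written out as XML.
puts doc.to_s
```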

    Alex
    A. S. Bradbury, Aug 23, 2006
    #6
