[ANN][SOC] Ariel 0.0.1 released

Discussion in 'Ruby' started by A. S. Bradbury, Aug 9, 2006.

    = Ariel release 0.0.1

    == Install
    gem install ariel
    (If the gem has not yet propagated, either wait or grab the .gem file from
    my RubyForge page and install that.)

    == Announcement
    This is the first public release of Ariel - A Ruby Information Extraction
    Library. See my previous post, ruby-talk:200140
    [http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/200140] for more
    background information. This release supports defining a tree document
    structure and learning rules to extract each node of this tree. Handling of
    list extraction and learning is not yet implemented, and is the next
    immediate priority. The examples directory included in this release is
    discussed in the Examples section below. Rule learning is functional
    and appears to work well, but many refinements are possible. Look out for
    more updates and a new release shortly.

    == About Ariel
    Ariel intends to assist in extracting information from semi-structured
    documents including (but not in any way limited to) web pages. Although you
    may use libraries such as Hpricot or Rubyful Soup, or even plain Regular
    Expressions to achieve the same goal, Ariel approaches the problem very
    differently. Ariel relies on the user labeling examples of the data they want
    to extract, and then finds patterns across several such labeled examples in
    order to produce a set of general rules for extracting this information from
    any similar document. Ariel is released under the MIT license.

    == Examples
    This release includes two examples in the examples directory (which should
    now be in the directory to which RubyGems installed Ariel). The first is the
    google_calculator directory (inspired by Justin Bailey's post to my Ariel
    progress report). The structure is very simple: a calculation is extracted
    from the page, and then the actual result is extracted from that calculation.
    Three labeled examples are included. Ariel reads each of these, tokenizes
    them, and extracts each label. Four sets of rules are learnt:
    1. Rules to locate the start of the calculation in the original document.
    2. Rules to locate the end of the calculation in the original document
    (applied from the end of the document).
    3. Rules to locate the start of the result of the calculation from the
    extracted calculation.
    4. Rules to locate the end of the result of the calculation from the extracted
    calculation (applied from the end of the calculation).
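
    As a rough illustration of how these four rule sets nest, here is a minimal
    Ruby sketch. It is not Ariel's API: the extract_between helper, the sample
    page, and the use of plain landmark strings as "rules" are all assumptions
    made for the example (Ariel's learnt rules work on token streams).

```ruby
# Hypothetical sketch, NOT Ariel's API. A "rule" here is just a landmark
# string; the start rule scans forward, the end rule scans backward.

# Returns the text between the first occurrence of start_landmark and the
# last occurrence of end_landmark, or nil if either landmark is missing.
def extract_between(text, start_landmark, end_landmark)
  start_at = text.index(start_landmark)
  return nil unless start_at
  start_at += start_landmark.length
  end_at = text.rindex(end_landmark) # applied from the end of the text
  return nil unless end_at && end_at >= start_at
  text[start_at...end_at]
end

# Made-up sample page for illustration only.
page = "<html><body><h2><b>3.5 U.S. dollars = 1.8486241 British pounds" \
       "</b></h2></body></html>"

# Rule sets 1 and 2: locate the calculation within the whole page.
calculation = extract_between(page, "<b>", "</b>")

# Rule sets 3 and 4: locate the result within the extracted calculation only.
# An empty end landmark means "run to the end of the parent node".
result = extract_between(calculation, "= ", "")

puts "calculation: #{calculation}"
puts "result: #{result}"
```

    The second call never sees the surrounding page, so its landmarks stay
    simple however deeply the node is nested.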

    Take note of 3 and 4 - this is the advantage of treating a document as a tree
    in this way. Deeply nested elements can be located by generating a series of
    simple rules, rather than generating a rule with complexity that increases at
    each level. Sets of rules are generated because it may not be possible to
    generate a single rule that will catch all cases. First, a rule is found that
    matches as many of the examples as possible (and fails on the rest). The
    matched examples are then removed, a rule is found that matches as many of
    the remaining examples as possible, and so on. When the learnt rules are
    applied, they are tried in order until one matches.
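
    The greedy procedure described above can be sketched in a few lines of Ruby.
    This is only an illustration under stated assumptions, not Ariel's
    implementation: learn_rule_set, apply_rule_set, and after_landmark are
    hypothetical names, and a "rule" is assumed to be any callable that returns
    a match or nil.

```ruby
# Hypothetical sketch, not Ariel's implementation. A "rule" is any callable
# that returns a match (truthy) or nil for a given example.

# Greedily builds an ordered rule list: pick the candidate that matches the
# most remaining examples, drop the examples it covers, and repeat.
def learn_rule_set(candidates, examples)
  return [] if candidates.empty?
  remaining = examples.dup
  rule_set = []
  until remaining.empty?
    best = candidates.max_by { |rule| remaining.count { |ex| rule.call(ex) } }
    covered = remaining.select { |ex| best.call(ex) }
    break if covered.empty? # no candidate matches anything that is left
    rule_set << best
    remaining -= covered
  end
  rule_set
end

# Applies the rules in learnt order and returns the first successful match.
def apply_rule_set(rule_set, document)
  rule_set.each do |rule|
    match = rule.call(document)
    return match if match
  end
  nil
end

# Toy candidate rule: returns the text after a landmark, or nil.
def after_landmark(landmark)
  lambda do |doc|
    i = doc.index(landmark)
    i && doc[(i + landmark.length)..-1]
  end
end

candidates = [after_landmark("<b>"), after_landmark("<i>"), after_landmark("<u>")]
examples   = ["<b>one", "x <b>two", "<i>three"]

rule_set = learn_rule_set(candidates, examples)
puts rule_set.length                             # two rules cover all examples
puts apply_rule_set(rule_set, "<i>four").inspect # second rule matches
```

    Because the rule covering the most examples is learnt first, it is also the
    first one tried at extraction time.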

    To see this example for yourself just execute structure.rb in the
    examples/google_calculator directory to create a locally writable
    structure.yaml. Then do:
    ariel -D -m learn -s structure.yaml -d /examplepath/labeled

    You'll have to wait a while (see my note about performance below). At the end,
    the learnt rules will be printed in YAML format, and structure.yaml will be
    updated to include these rules. Apply these learnt rules to some unlabeled
    documents by doing:
    ariel -D -m extract -s structure.yaml -d /examplepath/unlabeled

    You should see the results of a successful extraction printed to your
    terminal, such as this one:

    Results for unlabeled/2:
    calculation: 3.5 U.S. dollars = 1.8486241 British pounds
    result: 1.8486241 British pounds

    The second example (raa) learns rules using just 2 labeled examples. This is
    probably fewer than I'd recommend in most cases, but as it works... This
    example consists of project entries in the Ruby Application Archive. The
    structure of the page is very flat, so all rules are applied to the full
    page. Rules are learnt and applied as shown above. The structure.yaml files
    included in the examples directories already include rules generated by
    Ariel; use these if you just want to see extraction working.

    Note: The command-line interface shown above is not very flexible or
    friendly; for the moment it serves only as a demonstration.

    == Performance
    Generating rules takes quite a long time. It is always going to be an
    intensive operation, but there are some very simple and obvious improvements
    in efficiency that can be made. For a start, the rule candidate refining
    process currently re-applies the same rules over and over every time the
    remaining rule candidates are ranked. This is where most time is spent, and
    caching the results should make a big difference. This will definitely be
    implemented. Other performance improvements are bound to be possible, but my
    focus at this time is to get something that works.
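
    The caching idea can be sketched as a small memoizing wrapper. This is a
    minimal sketch under assumptions, not code from Ariel: the RuleCache class
    is hypothetical, and rules are again assumed to be callables.

```ruby
# Hypothetical sketch, not Ariel's code: memoize rule applications so that
# re-ranking candidates never re-applies a rule to the same example.
class RuleCache
  attr_reader :applications

  def initialize
    @results = {}
    @applications = 0 # how many times a rule was really applied
  end

  # Returns the cached result for (rule, example), applying the rule only on
  # the first request. key? is used so that cached nil results (failed
  # matches) are not recomputed either.
  def apply(rule, example)
    key = [rule, example]
    return @results[key] if @results.key?(key)
    @applications += 1
    @results[key] = rule.call(example)
  end
end

cache = RuleCache.new
rule  = lambda { |doc| doc.index("=") }

# Ranking the same candidate three times only applies the rule once.
3.times { cache.apply(rule, "1 + 1 = 2") }
puts cache.applications
```

    The cache key pairs the rule object with the example, so a refined rule
    (a new object) is always applied afresh.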

    == Credits
    Ariel is developed by Alex Bradbury as a Google Summer of Code project under
    the mentoring of Austin Ziegler.

    == Links
    Watch my development through the subversion repository at
    I've also just started using the tracker at http://code.google.com/p/ariel/