ANN: NUCULAR B3 Full text indexing (now on Win32 too)

Discussion in 'Python' started by Aaron Watters, Feb 13, 2008.

  1. ANNOUNCE: NUCULAR Fielded Full Text Indexing, BETA 3

    Nucular is a system for creating full text
    indices for fielded data. It can be accessed
    via a Python API or via a suite of command line
    interfaces.

    NEWS

    Nucular now supports WIN32. Current releases
    of Nucular abstract the file system in order
    to emulate file system features that are missing
    on NT file systems; their absence prevented older
    versions from running correctly on Windows NT
    based systems.

    Proximity search added: Current versions of
    Nucular allow queries to search for a sequence
    of words near each other, separated by
    no more than a specified number of other words.

    Faceted suggestions: Nucular queries now
    support faceted suggestions for values for
    fields which are related to a query.

    Faster index builds: Current releases of
    Nucular have completely revamped internal
    data structures which build indices much faster
    (and queries run a bit faster too). For example, some
    builds run more than 8 times faster than
    previously.

    Read more and download at:

    http://nucular.sourceforge.net

    ONLINE DEMOS:

    Python source search:
    http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=malodorous parrot
    Mondial geographic text search:
    http://www.xfeedme.com/nucular/mondial.py/go?attr_name=ono
    Gutenberg book search:
    http://www.xfeedme.com/nucular/gut.py/go?attr_Comments=immoral english

    BACKGROUND:

    Nucular is intended to help store and retrieve
    searchable information in a manner somewhat similar
    to the way that "www.hotjobs.com" stores and
    retrieves job descriptions, for example.

    Nucular archives fielded documents and retrieves
    them based on field value, field prefix, field
    word prefix, or full text word prefix,
    word proximity or combinations of these.
    Nucular also includes features for determining
    values related to a query, often called query facets.

    FEATURES

    Nucular is very lightweight. Updates and accesses
    do not require any server process or other system
    support such as shared memory locking.

    Nucular supports concurrency. Arbitrary concurrent
    updates and accesses by multiple processes or threads
    are supported, with no possible locking issues.

    Nucular supports document threading in the
    manner of USENET replies. Built-in semantics allow
    "follow ups" to messages to match patterns that
    match the "original" messages.

    Nucular indexes and retrieves data quickly.

    I hope you like it.
    -- Aaron Watters

    ===
    It's humbling to think that when Mozart was my age
    he'd been dead for 5 years. -- Tom Lehrer
     
    Aaron Watters, Feb 13, 2008
    #1

  2. Paul Rubin (Guest)

    Aaron Watters writes:
    > ANNOUNCE: NUCULAR Fielded Full Text Indexing, BETA 3


    Oh cool, I wondered if anything was going on with this. I'm still
    using Solr/Lucene while Nucular matures, which sounds like it is
    moving along nicely.

    A while back we chatted about flash drives. I did a little bit of
    testing with a consumer CF card on an IDE adapter, and with a Corsair
    Voyager GT usb pen drive on a USB port, and got "seek" times of about
    0.7 to 1 msec, compared with about 7 msec for a hard drive. I haven't
    yet tried a serious SSD or a high speed (SLC) CF card.
     
    Paul Rubin, Feb 13, 2008
    #2

  3. Jarek Zgoda (Guest)

    Paul Rubin wrote:

    >> ANNOUNCE: NUCULAR Fielded Full Text Indexing, BETA 3

    >
    > Oh cool, I wondered if anything was going on with this. I'm still
    > using Solr/Lucene while Nucular matures, which sounds like it is
    > moving along nicely.


    Did you (or anyone else) compare Nucular with Solr and Sphinx
    feature-by-feature?

    --
    Jarek Zgoda
    Skype: jzgoda | GTalk: | voice: +48228430101

    "We read Knuth so you don't have to." (Tim Peters)
     
    Jarek Zgoda, Feb 14, 2008
    #3
  4. Paul Rubin (Guest)

    Jarek Zgoda writes:
    > Did you (or anyone else) compare Nucular with Solr and Sphinx
    > feature-by-feature?


    Nucular when I looked at it was in an early alpha release and looked
    interesting and promising, but was nowhere near as built-out as Solr.
    It may be closer now; I haven't yet had a chance to look at the new
    release.

    I don't know what Sphinx is.
     
    Paul Rubin, Feb 14, 2008
    #4
  5. Jarek Zgoda (Guest)

    Paul Rubin wrote:

    >> Did you (or anyone else) compare Nucular with Solr and Sphinx
    >> feature-by-feature?

    >
    > Nucular when I looked at it was in an early alpha release and looked
    > interesting and promising, but was nowhere near as built-out as Solr.
    > It may be closer now; I haven't yet had a chance to look at the new
    > release.
    >
    > I don't know what Sphinx is.


    http://www.sphinxsearch.com/

    --
    Jarek Zgoda
    Skype: jzgoda | GTalk: | voice: +48228430101

    "We read Knuth so you don't have to." (Tim Peters)
     
    Jarek Zgoda, Feb 14, 2008
    #5
  6. Paul Rubin (Guest)

    Jarek Zgoda writes:
    > > I don't know what Sphinx is.

    > http://www.sphinxsearch.com/


    Thanks, looks interesting, maybe not so good for what I'm doing, but
    worth looking into. There is also Xapian which again I haven't looked
    at much, but which has fancier PIR (probabilistic information
    retrieval) capabilities than Lucene or the version of Nucular that I
    looked at.

    The main thing killing most of the search apps that I'm involved with
    is disk latency. If Aaron is listening, I might suggest offering a
    config option to redundantly record the stored search fields with
    every search term in the index. That will bloat the indexes by a
    nontrivial constant factor (maybe 5x-10x), but even terabyte disks are
    dirt cheap these days, so you can still index a lot of data and present
    large result sets without having to do a disk seek for every result in
    the set. I've been meaning to crunch some numbers to see if this
    actually makes sense.
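
    A rough back-of-the-envelope version of that number crunching, using
    only the seek times quoted earlier in the thread (about 7 msec for a
    hard drive, up to about 1 msec for the flash devices tested). The
    function name, the 1000-result figure, and the assumption of exactly
    one seek per result are illustrative; transfer time, caching, and the
    cost of reading a larger index are all ignored.

```python
# Seek times quoted earlier in this thread; everything else is an assumption.
HARD_DISK_SEEK_MS = 7.0
FLASH_SEEK_MS = 1.0  # upper end of the 0.7 to 1 msec measured


def seek_cost_ms(num_results, seek_ms, fields_stored_in_index=False):
    """Estimate the seek time spent presenting one result set.

    Without stored fields: one seek for the posting list plus one per result.
    With stored fields: one seek for the posting list only.
    """
    seeks = 1 if fields_stored_in_index else 1 + num_results
    return seeks * seek_ms


for label, seek_ms in [("hard disk", HARD_DISK_SEEK_MS), ("flash", FLASH_SEEK_MS)]:
    plain = seek_cost_ms(1000, seek_ms)
    stored = seek_cost_ms(1000, seek_ms, fields_stored_in_index=True)
    print(f"{label}: {plain:.0f} ms with per-result seeks, "
          f"{stored:.0f} ms with fields stored in the index")
```

    Under these assumptions, the seek cost stops growing with the size of
    the result set once the fields ride along in the index, which is the
    saving being weighed against the 5x-10x index growth.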

    Unfortunately, the concept of the large add-on memory card seems to
    have vanished. It would be very useful to have a cheap x86 box with a
    buttload of RAM (say 64 GB), using commodity desktop memory and extra
    modules for ECC. It would be ok if it went over some slow interface
    so that it was 10x slower than regular RAM. That's still 100x faster
    than a flash disk and 1000x faster than a hard disk.
     
    Paul Rubin, Feb 14, 2008
    #6
  7. On Feb 14, 3:50 am, Paul Rubin wrote:
    > The main thing killing most of the search apps that I'm involved with
    > is disk latency. If Aaron is listening, I might suggest offering a
    > config option to redundantly record the stored search fields with
    > every search term in the index.


    I'm not sure what you mean, but if I understand correctly, I think
    Nucular already does this. The signatures of the primary indices are

    Description: DocumentId x AttributeIndex x FullValue
        "Given a document Id, find attributes and their values"
    AttributeIndex: AttributeIndex x TruncatedValue x DocumentId
        "Given an attribute and a value (prefix), find document Ids"
    AttributeWord: AttributeIndex x Word x DocumentId
        "Given an attribute and a word, find documents containing
        that word in that attribute"
    WordIndex: Word x DocumentId
        "Given a word, find documents containing that word anywhere"
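
    To make the four signatures concrete, here is a minimal in-memory
    sketch using plain Python dictionaries. It is not Nucular's code or
    API: the variable and function names are hypothetical, the real
    indices live on disk, and full lowercased values stand in for the
    truncated values mentioned above.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the four index signatures.
description = defaultdict(dict)      # doc_id -> {attribute: full_value}
attribute_index = defaultdict(set)   # (attribute, value) -> {doc_id}
attribute_word = defaultdict(set)    # (attribute, word) -> {doc_id}
word_index = defaultdict(set)        # word -> {doc_id}


def index_document(doc_id, fields):
    """Add one fielded document (attribute -> text) to all four maps."""
    for attribute, value in fields.items():
        description[doc_id][attribute] = value
        attribute_index[(attribute, value.lower())].add(doc_id)
        for word in value.lower().split():
            attribute_word[(attribute, word)].add(doc_id)
            word_index[word].add(doc_id)


def ids_with_value_prefix(attribute, prefix):
    """Attribute plus value prefix -> document ids (linear scan, for clarity)."""
    prefix = prefix.lower()
    found = set()
    for (attr, value), doc_ids in attribute_index.items():
        if attr == attribute and value.startswith(prefix):
            found |= doc_ids
    return found


index_document("d1", {"title": "Dead Parrot Sketch", "author": "Cleese"})
index_document("d2", {"title": "Parrot Care", "author": "Palin"})

print(word_index["parrot"])                      # documents with the word anywhere
print(attribute_word[("title", "dead")])         # word within one attribute
print(ids_with_value_prefix("title", "parrot"))  # attribute value prefix
print(description["d1"])                         # all attributes of one document
```

    Every extra map answers one kind of query directly, at the price of
    more work and more storage on each call to index_document.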

    There are a lot of other possibilities which could be added fairly
    easily (and I'd like to work out an abstraction layer to make it
    even easier -- so you don't need to directly modify the
    library code).

    For instance, you might want to make proximity searching faster by
    indexing words in a document with their locations. Currently,
    proximity searches that must filter thousands of documents containing
    all the relevant words are noticeably slower than other queries.
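
    One way such a positional index could look is sketched below. This
    is an illustration of the general technique, not a description of
    how Nucular works or would implement it; the names and the
    whitespace tokenizer are assumptions. Each word maps to the
    positions at which it occurs in each document, so a proximity query
    can be answered from the index without rescanning document text.

```python
from collections import defaultdict

# word -> doc_id -> list of word positions (hypothetical layout)
positions = defaultdict(lambda: defaultdict(list))


def index_text(doc_id, text):
    """Record the position of every word of `text` for `doc_id`."""
    for pos, word in enumerate(text.lower().split()):
        positions[word][doc_id].append(pos)


def proximity_match(words, max_gap):
    """Return ids of documents where `words` occur in order, with at most
    `max_gap` other words between each consecutive pair."""
    words = [w.lower() for w in words]
    if not words:
        return set()
    postings = [positions.get(w, {}) for w in words]
    # Only documents containing every word can possibly match.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    matches = set()
    for doc_id in candidates:
        # `ends` holds positions where a partial match can currently end.
        ends = set(postings[0][doc_id])
        for posting in postings[1:]:
            ends = {
                p for p in posting[doc_id]
                if any(0 < p - e <= max_gap + 1 for e in ends)
            }
            if not ends:
                break
        if ends:
            matches.add(doc_id)
    return matches


index_text("a", "the parrot is no more it has ceased to be")
index_text("b", "parrot food and more parrot toys")
print(proximity_match(["parrot", "more"], max_gap=2))
```

    The position lists make the index larger and slower to build, which
    is exactly the trade-off between query speed and index cost.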

    It's a hard problem: every additional index and index column
    makes some queries faster, but it may make other queries slower
    sometimes and it always makes index builds and index files
    more expensive.

    It has also occurred to me that the underlying
    index implementations and related data structures may be of
    interest to Python programmers for all sorts of other purposes
    too.

    As far as how Nucular compares to Sphinx or anything else:
    I don't know and I'm not the right person to evaluate that.
    I'd encourage people to try out Nucular and see if it is
    easy enough to use and fast enough
    and feature rich enough for the intended use. If
    it isn't, maybe you should find something else. Suggestions
    and criticism are always welcome.

    -- Aaron Watters

    ===
    "Visit New Jersey: It's not as bad as you think!"
    -- suggested New Jersey tourism slogan

    http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=frighten away evil spirits
     
    Aaron Watters, Feb 14, 2008
    #7
  8. [apologies to the list: I would have done this offline,
    but I can't figure out Paul's email address.]

    1) Paul please forward your email address

    3) Since you seem to know about these things: I was thinking
    of adding an optional feature to Nucular which would allow
    a look-up like "given a word, find all attributes that contain
    that word anywhere and give a count of the number of times it
    is found in that attribute, as well as the entry id for an example
    instance (arbitrarily chosen)". I was thinking about calling
    this "inverted faceting", but you probably know a
    better/standard name, yes? What is it, please? Thanks!
    Answers from anyone else are welcome too.

    [Nucular: http://nucular.sourceforge.net/ ]

    -- Aaron Watters

    ===
    There are 3 kinds of people: those who can count, and those who can't.
    http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=shit

    On 14 Feb, 02:59, Paul Rubin wrote:
    > Jarek Zgoda writes:
    > > Did you (or anyone else) compare Nucular with Solr and Sphinx
    > > feature-by-feature?

    >
    > Nucular when I looked at it was in an early alpha release and looked
    > interesting and promising, but was nowhere near as built-out as Solr.
    > It may be closer now; I haven't yet had a chance to look at the new
    > release.
    >
    > I don't know what Sphinx is.
     
    Aaron Watters, Feb 22, 2008
    #8
  9. Paul Rubin (Guest)

    Aaron Watters writes:
    > [apologies to the list: I would have done this offline,
    > but I can't figure out Paul's email address.]
    >
    > 1) Paul please forward your email address


    Will send it privately. I don't have a public email address any more
    (death to spam!!!). My general purpose online contact point is
    http://paulrubin.com which currently has an expired certificate that
    I'll get around to renewing someday. Meanwhile you have to click
    "accept" to connect using the expired cert.

    > 3) Since you seem to know about these things: I was thinking
    > of adding an optional feature to Nucular which would allow
    > a look-up like "given a word find all attributes that contain
    > that word anywhere and give a count of the number of times it
    > is found in that attribute as well as the entry id for an example
    > instance (arbitrarily chosen). I was thinking about calling
    > this "inverted faceting", but you probably know a
    > better/standard name, yes? What is it please? Thanks!
    > Answers from anyone else welcomed also.


    In Solr this is called the DisMax (disjunction maximum) handler, I
    think. I tried it and it doesn't work very well, and ended up using a
    script written by a co-worker that expands such queries into more
    complex queries that put user-supplied weights on each field. It is a
    somewhat messy problem. Otis Gospodnetic's book "Lucene in Action"
    talks about it some, I believe. Manning, Raghavan and Schütze are
    working on a new book at http://informationretrieval.org that discusses
    fancier methods. I think these are worth looking into, but I haven't
    had the bandwidth to spend time on it so far.
     
    Paul Rubin, Feb 22, 2008
    #9
  10. On Feb 22, 5:31 pm, Paul Rubin wrote:
    > Aaron Watters writes:
    > > 3) ...I was thinking
    > > of adding an optional feature to Nucular which would allow
    > > a look-up like "given a word find all attributes that contain
    > > that word anywhere and give a count of the number of times it
    > > is found in that attribute as well as the entry id for an example
    > > instance (arbitrarily chosen). I was thinking about calling
    > > this "inverted faceting....

    >
    > In Solr this is called the DisMax (disjunction maximum) handler,


    I can't find much documentation on this, but I think this is not
    what I was thinking of. In fact I think Nucular already supports
    "disjunction maximum".

    I was thinking of a situation that would
    support interactions like this (quickly and cheaply):

    User: I'm thinking of "Denver"
    System: I see the value "Denver" in the following contexts:
    City: Denver [100000 entries]
    (for example in "Colorado Trombone Players Association")
    Surname: Denver [100000 entries]
    (for example "Denver, John, songwriter")
    Title: Denver [1000 entries]
    (for example in "Stuck in Denver Again, by Albert Smiley")
    ... and also some other contexts
    Which do you mean?
    User: I'm actually looking for the surname...

    In other words you don't get "documents" containing
    the search term(s) but statistics on how many documents
    contain each search term in a given context.
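
    A minimal sketch of the kind of look-up described here, assuming a
    hypothetical extra index that maps each word to the attributes it
    appears in, together with an entry count and one arbitrarily chosen
    example entry id. None of these names come from Nucular; this only
    illustrates the data structure being asked about.

```python
from collections import defaultdict

# word -> attribute -> [entry_count, example_entry_id]  (hypothetical layout)
contexts = defaultdict(dict)


def index_entry(entry_id, fields):
    """Count each entry once per (word, attribute) and keep the first
    entry id seen as the example instance."""
    for attribute, value in fields.items():
        for word in set(value.lower().replace(",", " ").split()):
            slot = contexts[word].setdefault(attribute, [0, entry_id])
            slot[0] += 1


def where_is(word):
    """Return [(attribute, count, example_id)], most frequent context first."""
    found = contexts.get(word.lower(), {})
    return sorted(
        ((attr, count, example) for attr, (count, example) in found.items()),
        key=lambda row: (-row[1], row[0]),
    )


index_entry("e1", {"City": "Denver",
                   "Title": "Colorado Trombone Players Association"})
index_entry("e2", {"Surname": "Denver", "Name": "Denver, John, songwriter"})
index_entry("e3", {"City": "Denver"})

for attribute, count, example in where_is("Denver"):
    print(f"{attribute}: Denver [{count} entries] (for example {example})")
```

    The answer is a per-context summary instead of a list of matching
    documents, and it is read from a single index entry for the word.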

    I'm pretty sure there must be a standard name for this kind
    of thing, anybody? Thanks!
    -- Aaron Watters

    ===
    http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=hackery
    http://nucular.sourceforge.net/
     
    Aaron Watters, Feb 25, 2008
    #10