ANN: NUCULAR B3 Full text indexing (now on Win32 too)

Aaron Watters

ANNOUNCE: NUCULAR Fielded Full Text Indexing, BETA 3

Nucular is a system for creating full text
indices for fielded data. It can be accessed
via a Python API or via a suite of command line
interfaces.

NEWS

Nucular now supports WIN32. Current releases
of Nucular abstract the file system in order
to emulate file system features that are missing
on NT file systems. The absence of those features
prevented older versions from running correctly
on Windows NT based systems.

Proximity search added: Current versions of
Nucular allow queries to search for a sequence
of words near each other, separated by
no more than a specified number of other words.
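
As an illustration of what such a proximity test means, here is a minimal sketch in plain Python. It is not Nucular's code or API: the function names are hypothetical, and it assumes the limit applies to the gap between each adjacent pair of query words, which the announcement does not spell out.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _match_rest(tokens, targets, pos, k, max_gap):
    # tokens[pos] matched targets[k - 1]; try to place targets[k:] after it.
    if k == len(targets):
        return True
    # The next word may sit at most max_gap other words past pos.
    window_end = min(len(tokens), pos + 2 + max_gap)
    for nxt in range(pos + 1, window_end):
        if tokens[nxt] == targets[k] and _match_rest(
            tokens, targets, nxt, k + 1, max_gap
        ):
            return True
    return False


def has_proximate_sequence(text, words, max_gap):
    """True if the words occur in order, each pair at most max_gap words apart."""
    tokens = tokenize(text)
    targets = [w.lower() for w in words]
    if not targets:
        return True
    return any(
        tok == targets[0] and _match_rest(tokens, targets, i, 1, max_gap)
        for i, tok in enumerate(tokens)
    )
```

For example, has_proximate_sequence("the malodorous green parrot", ["malodorous", "parrot"], 1) is true because only one other word separates the pair, while the same call with a limit of 0 is false.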

Faceted suggestions: Nucular queries now
support faceted suggestions for values for
fields which are related to a query.

Faster index builds: Current releases of
Nucular have completely revamped internal
data structures which build indices much faster
(and answer queries a bit faster too). For example,
some builds run more than 8 times faster than
before.

Read more and download at:

http://nucular.sourceforge.net

ONLINE DEMOS:

Python source search:
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=malodorous+parrot
Mondial geographic text search:
http://www.xfeedme.com/nucular/mondial.py/go?attr_name=ono
Gutenberg book search:
http://www.xfeedme.com/nucular/gut.py/go?attr_Comments=immoral+english

BACKGROUND:

Nucular is intended to help store and retrieve
searchable information in a manner somewhat similar
to the way that "www.hotjobs.com" stores and
retrieves job descriptions, for example.

Nucular archives fielded documents and retrieves
them based on field value, field prefix, field
word prefix, full text word prefix,
word proximity, or combinations of these.
Nucular also includes features for determining
values related to a query, often called query facets.

FEATURES

Nucular is very lightweight. Updates and accesses
do not require any server process or other system
support such as shared memory locking.

Nucular supports concurrency. Arbitrary concurrent
updates and accesses by multiple processes or threads
are supported, with no possible locking issues.

Nucular supports document threading in the
manner of USENET replies. Built-in semantics allow
"follow ups" to messages to match patterns that
match the "original" messages.

Nucular indexes and retrieves data quickly.

I hope you like it.
-- Aaron Watters

===
It's humbling to think that when Mozart was my age
he'd been dead for 5 years. -- Tom Lehrer
 
Paul Rubin

Aaron Watters said:
ANNOUNCE: NUCULAR Fielded Full Text Indexing, BETA 3

Oh cool, I wondered if anything was going on with this. I'm still
using Solr/Lucene while Nucular matures, which sounds to be moving
along nicely.

A while back we chatted about flash drives. I did a little bit of
testing with a consumer CF card on an IDE adapter, and with a Corsair
Voyager GT usb pen drive on a USB port, and got "seek" times of about
0.7 to 1 msec, compared with about 7 msec for a hard drive. I haven't
yet tried a serious SSD or a high speed (SLC) CF card.
 
Jarek Zgoda

Paul Rubin said:
Oh cool, I wondered if anything was going on with this. I'm still
using Solr/Lucene while Nucular matures, which sounds to be moving
along nicely.

Did you (or anyone else) compare Nucular with Solr and Sphinx
feature-by-feature?
 
Paul Rubin

Jarek Zgoda said:
Did you (or anyone else) compare Nucular with Solr and Sphinx
feature-by-feature?

Nucular when I looked at it was in an early alpha release and looked
interesting and promising, but was nowhere near as built-out as Solr.
It may be closer now; I haven't yet had a chance to look at the new
release.

I don't know what Sphinx is.
 
Jarek Zgoda

Paul Rubin said:
Nucular when I looked at it was in an early alpha release and looked
interesting and promising, but was nowhere near as built-out as Solr.
It may be closer now; I haven't yet had a chance to look at the new
release.

I don't know what Sphinx is.

http://www.sphinxsearch.com/
 
Paul Rubin

Jarek Zgoda said:

Thanks, looks interesting, maybe not so good for what I'm doing, but
worth looking into. There is also Xapian which again I haven't looked
at much, but which has fancier PIR (probabilistic information
retrieval) capabilities than Lucene or the version of Nucular that I
looked at.

The main thing killing most of the search apps that I'm involved with
is disk latency. If Aaron is listening, I might suggest offering a
config option to redundantly record the stored search fields with
every search term in the index. That will bloat the indexes by a
nontrivial constant factor (maybe 5x-10x) but even terabyte disks are
dirt cheap these days, so you can still index a lot of data, and present
large result sets without having to do a disk seek for every result in
the set. I've been meaning to crunch some numbers to see if this
actually makes sense.
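
To make the trade-off concrete, here is a small sketch in plain Python of the idea as described above. It does not reflect how Nucular, Solr, or Lucene store anything: the names CountingStore and build_postings are hypothetical, and a counter stands in for real disk seeks.

```python
from collections import defaultdict


class CountingStore:
    """Stands in for on-disk document storage; each fetch counts as one seek."""

    def __init__(self, docs):
        self._docs = docs
        self.seeks = 0

    def fetch(self, doc_id):
        self.seeks += 1
        return self._docs[doc_id]


def build_postings(docs, stored_fields):
    """Build two word indexes: ids only, and ids plus a copy of stored fields."""
    plain = defaultdict(list)  # word -> [doc_id]
    fat = defaultdict(list)    # word -> [(doc_id, {field: value})]
    for doc_id, fields in docs.items():
        summary = {name: fields[name] for name in stored_fields}
        words = set(" ".join(fields.values()).lower().split())
        for word in sorted(words):
            plain[word].append(doc_id)
            # The redundant copy is what bloats the index.
            fat[word].append((doc_id, summary))
    return plain, fat


docs = {
    1: {"title": "Parrot care", "body": "a malodorous parrot"},
    2: {"title": "Spam", "body": "parrot sketch"},
}
plain, fat = build_postings(docs, ["title"])
store = CountingStore(docs)

# Plain postings: one simulated seek per result to get the title.
titles_plain = [store.fetch(doc_id)["title"] for doc_id in plain["parrot"]]
# Fat postings: the title travels with the posting, so no extra seeks.
titles_fat = [summary["title"] for _, summary in fat["parrot"]]
```

With the seek times quoted earlier in the thread (about 7 msec for a hard drive), a result page of 1000 hits would cost roughly 7 seconds of seeking with the plain layout, which is the cost the redundant layout trades disk space to avoid.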

Unfortunately, the concept of the large add-on memory card seems to
have vanished. It would be very useful to have a cheap x86 box with a
buttload of ram (say 64gb), using commodity desktop memory and extra
modules for ECC. It would be ok if it went over some slow interface
so that it was 10x slower than regular ram. That's still 100x faster
than a flash disk and 1000x faster than a hard disk.
 
Aaron Watters

The main thing killing most of the search apps that I'm involved with
is disk latency. If Aaron is listening, I might suggest offering a
config option to redundantly record the stored search fields with
every search term in the index.

I'm not sure what you mean, but if I understand I think Nucular
already does this. The signatures of the primary indices are

Description: DocumentId x AttributeIndex x FullValue
    "Given a document Id find attributes and their values"

AttributeIndex: AttributeIndex x TruncatedValue x DocumentId
    "Given an attribute and a value (prefix) find document Id's"

AttributeWord: AttributeIndex x Word x DocumentId
    "Given an attribute and a word find documents containing
    that word in that attribute"

WordIndex: Word x DocumentId
    "Given a word find documents containing that word anywhere"
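
The four signatures above can be pictured with a toy in-memory model. This is only a sketch for illustration, not Nucular's on-disk implementation: the class name ToyIndices, the prefix_len parameter, and the one-value-per-attribute simplification are all assumptions.

```python
from collections import defaultdict
import re


class ToyIndices:
    """In-memory stand-ins for the four index signatures."""

    def __init__(self, prefix_len=20):
        self.prefix_len = prefix_len             # hypothetical truncation length
        self.description = defaultdict(dict)     # doc_id -> {attr: full value}
        self.attribute_index = defaultdict(set)  # (attr, truncated value) -> doc ids
        self.attribute_word = defaultdict(set)   # (attr, word) -> doc ids
        self.word_index = defaultdict(set)       # word -> doc ids

    def add(self, doc_id, fields):
        for attr, value in fields.items():
            self.description[doc_id][attr] = value
            self.attribute_index[(attr, value[: self.prefix_len])].add(doc_id)
            for word in re.findall(r"[a-z0-9]+", value.lower()):
                self.attribute_word[(attr, word)].add(doc_id)
                self.word_index[word].add(doc_id)

    def by_value_prefix(self, attr, prefix):
        """Doc ids whose value for attr starts with prefix (linear scan)."""
        hits = set()
        for (key_attr, truncated), doc_ids in self.attribute_index.items():
            if key_attr == attr and truncated.startswith(prefix):
                hits |= doc_ids
        return hits


idx = ToyIndices()
idx.add("d1", {"city": "Denver", "title": "Stuck in Denver Again"})
idx.add("d2", {"surname": "Denver"})
```

A real index would keep keys sorted on disk so that prefix lookups do not scan every key; the linear scan here only keeps the sketch short.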

There are a lot of other possibilities which could be added fairly
easily (and I'd like to work out an abstraction layer to make it
even easier -- so you don't need to directly modify the
library code).

For instance you might want to make proximity searching faster by
indexing words in a document with their locations. Currently
proximity searches that must filter thousands of documents containing
all the relevant words are noticeably slower than other queries.
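
A rough sketch of what indexing words with their locations could look like, in plain Python. This is an illustration of a generic positional index, not something Nucular provides: the function names are hypothetical, and the query handles only an ordered pair of words.

```python
import bisect
from collections import defaultdict
import re


def build_positional_index(docs):
    """Map word -> doc_id -> ascending list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][doc_id].append(pos)
    return index


def near(index, first, second, max_gap):
    """Doc ids where `second` follows `first` with at most max_gap words between."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id in first_postings.keys() & second_postings.keys():
        seconds = second_postings[doc_id]
        for p in first_postings[doc_id]:
            # Closest occurrence of `second` strictly after position p.
            i = bisect.bisect_right(seconds, p)
            if i < len(seconds) and seconds[i] - p - 1 <= max_gap:
                hits.add(doc_id)
                break
    return hits


docs = {
    "a": "frighten away evil spirits",
    "b": "evil things frighten spirits away",
    "c": "frighten the very small and evil",
}
index = build_positional_index(docs)
```

Here the proximity test reads only the position lists of the two words, so it never has to rescan the text of each candidate document. The cost is one stored position per word occurrence, which is the kind of build and file size expense weighed in the next paragraph.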

It's a hard problem: every additional index and index column
makes some queries faster, but it may make other queries slower
sometimes and it always makes index builds and index files
more expensive.

It has also occurred to me that the underlying
index implementations and related data structures may be of
interest to Python programmers for all sorts of other purposes
too.

As far as how Nucular compares to Sphinx or anything else:
I don't know and I'm not the right person to evaluate that.
I'd encourage people to try out Nucular and see if it is
easy enough to use and fast enough
and feature rich enough for the intended use. If
it isn't, maybe you should find something else. Suggestions
and criticism are always welcome.

-- Aaron Watters

===
"Visit New Jersey: It's not as bad as you think!"
-- suggested New Jersey tourism slogan

http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=frighten+away+evil+spirits
 
Aaron Watters

[apologies to the list: I would have done this offline,
but I can't figure out Paul's email address.]

1) Paul please forward your email address

3) Since you seem to know about these things: I was thinking
of adding an optional feature to Nucular which would allow
a look-up like "given a word find all attributes that contain
that word anywhere and give a count of the number of times it
is found in that attribute as well as the entry id for an example
instance (arbitrarily chosen)". I was thinking about calling
this "inverted faceting", but you probably know a
better/standard name, yes? What is it please? Thanks!
Answers from anyone else welcomed also.

[Nucular: http://nucular.sourceforge.net/ ]

-- Aaron Watters

===
There are 3 kinds of people: those who can count, and those who can't.
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=shit
 
Paul Rubin

Aaron Watters said:
[apologies to the list: I would have done this offline,
but I can't figure out Paul's email address.]

1) Paul please forward your email address

Will send it privately. I don't have a public email address any more
(death to spam!!!). My general purpose online contact point is
http://paulrubin.com which currently has an expired certificate that
I'll get around to renewing someday. Meanwhile you have to click
"accept" to connect using the expired cert.

3) Since you seem to know about these things: I was thinking
of adding an optional feature to Nucular which would allow
a look-up like "given a word find all attributes that contain
that word anywhere and give a count of the number of times it
is found in that attribute as well as the entry id for an example
instance (arbitrarily chosen)". I was thinking about calling
this "inverted faceting", but you probably know a
better/standard name, yes? What is it please? Thanks!
Answers from anyone else welcomed also.

In Solr this is called the DisMax (disjunction maximum) handler, I
think. I tried it and it doesn't work very well, and ended up using a
script written by a co-worker, that expands such queries to more
complex queries that put user-supplied weights on each field. It is a
somewhat messy problem. Otis Gospodnetic's book "Lucene in Action"
talks about it some, I believe. Manning, Raghavan, and Schütze are working on a
new book at http://informationretrieval.org that discusses fancier
methods. I think these are worth looking into, but I haven't had the
bandwidth to spend time on it so far.
 
Aaron Watters

In Solr this is called the DisMax (disjunction maximum) handler,

I can't find much documentation on this, but I think this is not
what I was thinking of. In fact I think Nucular already supports
"disjunction maximum".

I was thinking of a situation that would
support interactions like this (quickly and cheaply):

User: I'm thinking of "Denver"
System: I see the value "Denver" in the following contexts:
    City: Denver [100000 entries]
        (for example in "Colorado Trombone Players Association")
    Surname: Denver [100000 entries]
        (for example "Denver, John, songwriter")
    Title: Denver [1000 entries]
        (for example in "Stuck in Denver Again, by Albert Smiley")
    ... and also some other contexts
    Which do you mean?
User: I'm actually looking for the surname...

In other words you don't get "documents" containing
the search term(s) but statistics on how many documents
contain each search term in a given context.
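
The interaction above could be backed by an index from each word to per-attribute statistics. The following is a minimal sketch in plain Python under stated assumptions: it is not part of Nucular, the name build_context_index is hypothetical, the count is the number of entries (not occurrences), and the example entry is simply the first one seen.

```python
from collections import defaultdict
import re


def build_context_index(entries):
    """Map word -> attribute -> (number of entries, id of one example entry)."""
    index = defaultdict(dict)
    for entry_id, fields in entries.items():
        for attr, value in fields.items():
            # set() so an entry is counted once per attribute, however
            # many times the word repeats inside the value.
            for word in set(re.findall(r"[a-z0-9]+", value.lower())):
                count, example = index[word].get(attr, (0, entry_id))
                index[word][attr] = (count + 1, example)
    return index


entries = {
    "e1": {"City": "Denver", "Name": "Colorado Trombone Players Association"},
    "e2": {"Surname": "Denver", "GivenName": "John"},
    "e3": {"Title": "Stuck in Denver Again"},
    "e4": {"City": "Denver"},
}
contexts = build_context_index(entries)
```

Looking up contexts["denver"] then yields one small record per attribute, so the "which do you mean?" prompt can be answered without retrieving any of the matching documents.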

I'm pretty sure there must be a standard name for this kind
of thing, anybody? Thanks!
-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=hackery
http://nucular.sourceforge.net/
 
