Is there any library for indexing binary data?

Discussion in 'Python' started by 甜瓜, Mar 25, 2010.

  1. 甜瓜

    甜瓜 Guest

    Howdy,

    Recently, I have been looking for a good library for building an index
    over binary data. The Python bindings for Xapian and Lucene focus on
    text digestion rather than binary data. Could anyone give me some
    recommendations? Is there any library for indexing binary data, no
    matter what language it is written in?

    In my case, there is a very big data table which stores structured
    binary data, e.g.:

    struct Item
    {
        long id;       // used as key
        double value;
    };

    I want to build an index on the "id" field to speed up searching.
    Since this data table is not constant, the library should support
    incremental indexing. If there is no suitable library, I will have to
    build the index myself...
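    For concreteness, a minimal Python sketch of scanning such a record
    file (the "<qd" layout, an 8-byte long followed by a double with no
    padding, is an assumption about the writing platform, not a given):

    ```python
    import struct

    # Assumed on-disk layout: little-endian 8-byte id + 8-byte double.
    RECORD = struct.Struct("<qd")

    def iter_items(path):
        """Yield (id, value) pairs from a file of fixed-size Item records."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(RECORD.size)
                if len(chunk) < RECORD.size:
                    break  # end of file (or a truncated trailing record)
                yield RECORD.unpack(chunk)
    ```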

    Thank you in advance.

    --
    ShenLei
     
    甜瓜, Mar 25, 2010
    #1

  2. On 3/25/10 4:28 AM, 甜瓜 wrote:
    > [original question quoted in full; snipped]


    Put it into an SQLite database? Or something else from
    http://docs.python.org/library/persistence.html.
    Or maybe http://www.pytables.org/ is more suitable to your needs (never
    used that one myself though).
    Or install a bank or two of memory in your box and read everything into
    memory in one big hash table.
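    To illustrate the SQLite suggestion, a minimal sketch with the stdlib
    sqlite3 module (table and column names here are made up; the PRIMARY
    KEY gives you the id index and incremental inserts for free):

    ```python
    import sqlite3

    def build_index(db_path, items):
        """Load (id, value) pairs; the PRIMARY KEY is the lookup index."""
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS item (id INTEGER PRIMARY KEY, value REAL)"
        )
        # INSERT OR REPLACE makes repeated loads / incremental updates safe.
        con.executemany("INSERT OR REPLACE INTO item VALUES (?, ?)", items)
        con.commit()
        return con

    def lookup(con, item_id):
        row = con.execute(
            "SELECT value FROM item WHERE id = ?", (item_id,)
        ).fetchone()
        return row[0] if row else None
    ```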

    Btw, if you already have a big data table in which the data is stored,
    I'm guessing it is already in some form of database format. Can't you
    write something that understands that database format?

    But I think you need to provide some more details about your data set.

    -irmen
     
    Irmen de Jong, Mar 25, 2010
    #2

  3. 甜瓜

    甜瓜 Guest

    Thank you, Irmen. I will take a look at PyTables.
    FYI, let me explain my case more clearly.

    Originally, my big data table is simply an array of Item:

    struct Item
    {
        long id;            // used as key
        BYTE payload[LEN];  // corresponding value, with fixed length
    };

    All items are stored in one file using the "stdio.h" function:

    fwrite(itemarray, sizeof(Item), num_of_items, fp);

    Note that "id" is unique but randomly ordered. To speed up searching,
    I regrouped / sorted the items into two-level hash tables (in the form
    of files). I want to employ a library to help me index this table.

    Since the table contains about 10^9 items and LEN is about 2KB, it is
    impossible to hold all the data in memory. Furthermore, new items may
    be inserted into the array, so an incremental indexing feature is
    needed.
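    Since the records are fixed-size, a given record number can be read
    without touching the rest of the file. A sketch in Python (LEN and the
    8-byte id size are assumptions about the writing platform):

    ```python
    import struct

    LEN = 2048               # assumed payload length
    ID_SIZE = 8              # assumed sizeof(long) on the writing platform
    ITEM_SIZE = ID_SIZE + LEN

    def read_item(f, index):
        """Return (id, payload) for the index-th record in an open binary file."""
        f.seek(index * ITEM_SIZE)
        raw = f.read(ITEM_SIZE)
        (item_id,) = struct.unpack_from("<q", raw)
        return item_id, raw[ID_SIZE:]
    ```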

    Hope this helps you understand my case.

    --
    ShenLei


    2010/3/25 Irmen de Jong <>:
    > [earlier messages quoted in full; snipped]
     
    甜瓜, Mar 25, 2010
    #3
  4. On 25-3-2010 10:55, 甜瓜 wrote:
    > [previous message quoted in full; snipped]


    I see; I thought the payload data was small as well. What about this
    idea: build a hash table where the keys are the ids from your Item
    structs and the values are the file seek offsets of the Item 'records'
    in your original data file (although that might generate values of type
    long, which take more memory than int, so maybe we should use
    file_offset / sizeof(Item) instead). This way you can just keep your
    original data file (you only have to scan it once to build the hash
    table) and you will avoid a lengthy conversion process.

    If this hash table still doesn't fit in memory, use a sparse array
    implementation of some sort that is more efficient at storing simple
    integers, or just put it into one of the database solutions mentioned
    in earlier responses.
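    One way to realize the idea above: a single scan that maps each id to
    its record number (file_offset / sizeof(Item)), after which every
    lookup is one seek. Record sizes are assumptions, as before:

    ```python
    import struct

    ITEM_SIZE = 8 + 2048  # assumed: 8-byte id followed by a 2 KB payload

    def build_offset_index(f):
        """Scan the data file once, returning {id: record_number}."""
        index = {}
        recno = 0
        f.seek(0)
        while True:
            raw = f.read(ITEM_SIZE)
            if len(raw) < ITEM_SIZE:
                break
            (item_id,) = struct.unpack_from("<q", raw)
            index[item_id] = recno
            recno += 1
        return index

    def fetch(f, index, item_id):
        """Seek straight to the payload of the record with this id."""
        f.seek(index[item_id] * ITEM_SIZE + 8)
        return f.read(ITEM_SIZE - 8)
    ```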

    Another thing: I think your requirement of 1e7 lookups per second is a
    bit steep for any solution where the dataset is not in core memory all
    at once, though.

    Irmen.
     
    Irmen de Jong, Mar 25, 2010
    #4
  5. 甜瓜

    甜瓜 Guest

    Many thanks for your kind reply. As you mentioned, a sparse array may
    be the best choice. Storing the offset rather than the payload itself
    can greatly save memory space.

    1e7 queries per second is my ideal aim, but 1e6 must be achieved.
    Currently I have achieved 5e6 on one PC (without incremental indexing,
    and with all incoming queries coming from a local data stream). Since
    the table is very big and response time is critical, the final system
    will definitely use distributed computing. I hope that the Judy array
    algorithm can simplify indexing, so I can focus on implementing data
    persistence and the distributed computing side.

    --
    ShenLei

    On Mar 26, 2010 at 2:55 AM, Irmen de Jong <> wrote:
    > [earlier messages quoted in full; snipped]
     
    甜瓜, Mar 26, 2010
    #5
