look up very large table

Discussion in 'Perl Misc' started by ela, Feb 10, 2010.

  1. ela

    ela Guest

    I have some large data in pieces, e.g.

    asia.gz.tar 300M

    or

    roads1.gz.tar 100M
    roads2.gz.tar 100M
    roads3.gz.tar 100M
    roads4.gz.tar 100M

    I wonder whether I should concatenate them all into a single ultra-large
    file and then parse them into one large table (I don't know whether Perl
    can handle that...).

    The final table should look like this:

    ID1 ID2 INFO
    X1 Y9 san diego; california; West Coast; America; North America; Earth
    X2.3 H9 Beijing; China; Asia
    .....

    Each row may come from a big file of >100M (as mentioned above):

    CITY Beijing
    NOTE Capital
    RACE Chinese
    ....

    And then I have another, much smaller table which contains all the IDs
    (either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
    annotate this 20M file with the INFO. Hashing does not seem to be a
    feasible solution even on my 32G, 8-core machine...

    Any advice? Or should I resort to some other language?
     
    ela, Feb 10, 2010
    #1

  2. ela

    Jim Gibson Guest

    In article <hku3e0$3fs$>, ela
    <> wrote:

    > I have some large data in pieces, e.g.
    >
    > asia.gz.tar 300M
    >
    > or
    >
    > roads1.gz.tar 100M
    > roads2.gz.tar 100M
    > roads3.gz.tar 100M
    > roads4.gz.tar 100M
    >
    > I wonder whether I should concatenate them all into a single ultra large
    > file and then perform parsing them into a large table (I don't know whether
    > perl can handle that...).


    There is no benefit that I can see to concatenating the files. Use the
    File::Find module to find all files matching a certain naming convention,
    read each one, and process the information in each file. As for the amount
    of information that Perl can handle, that is mostly determined by the
    available memory and by how smart you are at condensing the data, keeping
    only what you need and throwing away what you don't.
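
    A minimal sketch of that approach, assuming the archives have already been
    extracted to plain key/value text files; the naming pattern, the directory,
    and the ID line are only guesses at your format:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Find;

        my %info;    # ID => accumulated INFO string

        find(sub {
            # Only look at regular files matching the (assumed) naming pattern.
            return unless -f && /^roads\d+\.txt$/;
            open my $fh, '<', $_ or die "Cannot open $File::Find::name: $!";
            my ($id, @fields);
            while (my $line = <$fh>) {
                chomp $line;
                my ($key, $value) = split /\s+/, $line, 2;
                next unless defined $value;
                if ($key eq 'ID') { $id = $value }      # hypothetical ID record
                else              { push @fields, $value }
            }
            close $fh;
            $info{$id} = join '; ', @fields if defined $id;
        }, '/path/to/extracted/files');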

    >
    > The final table should look like this:
    >
    > ID1 ID2 INFO
    > X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
    > X2.3 H9 Beijing; China; Asia


    Perl does not have tables. It has arrays and hashes. You can nest
    arrays and hashes to store complex datasets in memory by using
    references.

    > ....
    >
    > each row may come from a big file of >100M (as aforementioned):
    >
    > CITY Beijing
    > NOTE Capital
    > RACE Chinese
    > ...
    >
    > And then I have another much smaller table which contains all the ID's
    > (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
    > this 20M file annotated with the INFO. Hashing seems not to be a solution
    > for my 32G, 8-core machine...
    >
    > Any advice? or should i resort to some other languages?


    Try reading all the files and saving the data you want. If you run out
    of memory, then think about a different approach. 32GB of memory is
    quite a lot.

    If you can't fit all of your data into memory at one time, you might
    consider using a database that will store your data in files. Perl has
    support for many databases. But I would first determine whether or not
    you can fit everything in memory.
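
    If it comes to that, even a disk-backed hash can get you quite far before
    you need a full database. A minimal sketch with the DB_File module (the
    file name and the sample record are only placeholders, and DB_File needs
    Berkeley DB available on the system):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie a hash to a file on disk; entries live in the file, not in RAM.
        my %info;
        tie %info, 'DB_File', 'info.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie info.db: $!";

        $info{'X2.3'} = 'Beijing; China; Asia';   # written to disk
        print $info{'X2.3'}, "\n";

        untie %info;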

    --
    Jim Gibson
     
    Jim Gibson, Feb 10, 2010
    #2

  3. ela

    ccc31807 Guest

    On Feb 10, 5:57 am, "ela" <> wrote:
    > Any advice? or should i resort to some other languages?


    Perl is probably your best bet for this task.

    > I have some large data in pieces, e.g.
    > asia.gz.tar 300M
    > or
    > roads1.gz.tar 100M


    It might be helpful for you to give a sample of your data format. You
    don't mention untarring and unzipping your files, so I assume that you
    are dealing with ASCII text. If not, then some of the following might
    not work well.

    > I wonder whether I should concatenate them all into a single ultra large
    > file and then perform parsing them into a large table (I don't know whether
    > perl can handle that...).


    Irrelevant question. Ordinarily you process files one line at a time,
    so it doesn't make any difference how large a particular file is, as
    long as each line can be manipulated. In cases where I have to deal
    with a number of files, I find it easier to glob the files, or open
    and read a directory, to automate the process of opening, reading, and
    closing a number of files. You might gain something in particular
    cases by combining files, but I don't see any general advantage in
    doing so.
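
    For example, once the archives are unpacked, globbing the data files and
    reading them line by line might look like this (the directory, naming
    pattern, and field layout are assumptions):

        use strict;
        use warnings;

        # Process every extracted data file without holding a whole file in memory.
        for my $file (glob 'data/roads*.txt') {      # hypothetical naming pattern
            open my $fh, '<', $file or die "Cannot open $file: $!";
            while (my $line = <$fh>) {
                chomp $line;
                my ($key, $value) = split /\s+/, $line, 2;
                # ... accumulate or print whatever you need here ...
            }
            close $fh;
        }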

    > each row may come from a big file of >100M (as aforementioned):
    > CITY    Beijing
    > NOTE    Capital
    > RACE    Chinese
    > ...


    Typical data munging. Depending on whether you have duplicates, I
    would probably build a hash and write the hash to your output file.
    You then have the ability to sort on different fields, e.g., cities,
    notes, races, etc.

    > And then I have another much smaller table which contains all the ID's
    > (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
    > this 20M file annotated with the INFO. Hashing seems not to be a solution
    > for my 32G, 8-core machine...


    Hashing is ideal, provided you can link the two files by a common
    key. The general technique is to open the ID file first, build a
    hash keyed on the record IDs, then open your data file and populate
    the hash entries with the data that shares each key. Then open your
    output file and print to it.
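
    A rough sketch of that technique, where all the file names and column
    layouts are guesses at your actual format:

        use strict;
        use warnings;

        # Pass 1: load the small ID table (<20M) into a hash.
        my %annotation;
        open my $ids, '<', 'ids.txt' or die "Cannot open ids.txt: $!";
        while (my $line = <$ids>) {
            my ($id1, $id2) = split ' ', $line;       # assumed two-column layout
            $annotation{$id1} = '' if defined $id1;
            $annotation{$id2} = '' if defined $id2;
        }
        close $ids;

        # Pass 2: walk the big data files and keep INFO only for known IDs.
        for my $file (glob 'data/*.txt') {
            open my $fh, '<', $file or die "Cannot open $file: $!";
            while (my $line = <$fh>) {
                chomp $line;
                my ($id, $info) = split /\t/, $line, 2;   # assumed tab-separated
                next unless defined $info;
                $annotation{$id} = $info if exists $annotation{$id};
            }
            close $fh;
        }

        # Finally, write the annotated table.
        open my $out, '>', 'annotated.txt' or die "Cannot open annotated.txt: $!";
        print $out "$_\t$annotation{$_}\n" for sort keys %annotation;
        close $out;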

    If you will use the data frequently, you might want to stuff the data
    into a database so you can query it conveniently.

    If you want help, please be sure to furnish both sample data from each
    file and your attempts at writing the script.

    CC.
     
    ccc31807, Feb 10, 2010
    #3
  4. ela

    John Bokma Guest

    "ela" <> writes:

    > I have some large data in pieces, e.g.
    >
    > asia.gz.tar 300M
    >
    > or
    >
    > roads1.gz.tar 100M
    > roads2.gz.tar 100M
    > roads3.gz.tar 100M
    > roads4.gz.tar 100M
    >
    > I wonder whether I should concatenate them all into a single ultra large
    > file and then perform parsing them into a large table (I don't know whether
    > perl can handle that...).
    >
    > The final table should look like this:
    >
    > ID1 ID2 INFO
    > X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
    > X2.3 H9 Beijing; China; Asia
    > ....
    >
    > each row may come from a big file of >100M (as aforementioned):
    >
    > CITY Beijing
    > NOTE Capital
    > RACE Chinese
    > ...
    >
    > And then I have another much smaller table which contains all the ID's
    > (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
    > this 20M file annotated with the INFO. Hashing seems not to be a solution
    > for my 32G, 8-core machine...
    >
    > Any advice? or should i resort to some other languages?


    How about importing all your data into a database and using SQL to
    extract what you want? Depending on the format of your input files,
    some parsing might be required, which can be done with a small Perl
    program.
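
    A small sketch of that route with DBI and DBD::SQLite (the database file,
    table layout, and sample row are just one possible choice, and the parsing
    of your archives is left out):

        use strict;
        use warnings;
        use DBI;

        # A file-backed SQLite database needs no separate server.
        my $dbh = DBI->connect('dbi:SQLite:dbname=geo.db', '', '',
                               { RaiseError => 1, AutoCommit => 1 });

        $dbh->do('CREATE TABLE IF NOT EXISTS info (id TEXT PRIMARY KEY, details TEXT)');

        # Insert parsed records (here just one hard-coded example row).
        my $ins = $dbh->prepare('INSERT OR REPLACE INTO info (id, details) VALUES (?, ?)');
        $ins->execute('X2.3', 'Beijing; China; Asia');

        # Later, annotate the small ID list with simple lookups or a join.
        my ($details) = $dbh->selectrow_array(
            'SELECT details FROM info WHERE id = ?', undef, 'X2.3');
        print "$details\n";

        $dbh->disconnect;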

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
     
    John Bokma, Feb 10, 2010
    #4
  5. "ela" <> wrote:
    >I have some large data in pieces, e.g.
    >
    >asia.gz.tar 300M
    >
    >or
    >
    >roads1.gz.tar 100M
    >roads2.gz.tar 100M
    >roads3.gz.tar 100M
    >roads4.gz.tar 100M
    >
    >I wonder whether I should concatenate them all into a single ultra large
    >file


    I may be mistaken, but isn't that a prerequisite for actually extracting
    any data from a compressed (.gz) file anyway?

    >and then perform parsing them into a large table (I don't know whether
    >perl can handle that...).


    The hardware is the limit.

    >The final table should look like this:
    >
    >ID1 ID2 INFO
    >X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
    >X2.3 H9 Beijing; China; Asia
    >....
    >
    >each row may come from a big file of >100M (as aforementioned):
    >
    >CITY Beijing
    >NOTE Capital
    >RACE Chinese
    >...
    >
    >And then I have another much smaller table which contains all the ID's
    >(either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
    >this 20M file annotated with the INFO. Hashing seems not to be a solution
    >for my 32G, 8-core machine...


    It depends. It's easy enough to do, so you can just try it and see
    whether it works.

    >Any advice? or should i resort to some other languages?


    If you are hardware limited at all, then eventually the system will begin
    swapping, and that will happen in any language if you try to keep too much
    data in RAM. If that happens you will have to fall back on time-proven
    techniques from the dark ages: trade HD space and time for RAM by keeping
    only one set of data in RAM and annotating that set while processing the
    second set of data from the HD line by line.
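
    In Perl that would amount to something like the following, where the file
    names and the tab-separated layout are made up for the sake of the example:

        use strict;
        use warnings;

        # Keep only the small ID set in RAM ...
        my %wanted;
        open my $ids, '<', 'ids.txt' or die "Cannot open ids.txt: $!";
        while (my $line = <$ids>) {
            my ($id) = split ' ', $line;
            $wanted{$id} = 1 if defined $id;
        }
        close $ids;

        # ... and stream the big data from disk, writing matches out immediately.
        open my $big, '<', 'bigdata.txt'   or die "Cannot open bigdata.txt: $!";
        open my $out, '>', 'annotated.txt' or die "Cannot open annotated.txt: $!";
        while (my $line = <$big>) {
            my ($id) = split /\t/, $line, 2;
            print $out $line if defined $id && $wanted{$id};
        }
        close $big;
        close $out;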

    However, the real solution would be to load the whole enchilada into a
    database and then do whatever join you want to do. There is a reason why
    database systems have been created and optimized for exactly such tasks.

    jue
     
    Jürgen Exner, Feb 10, 2010
    #5
  6. ela

    Guest

    On Wed, 10 Feb 2010 18:57:02 +0800, "ela" <> wrote:

    >I have some large data in pieces, e.g.
    >
    >asia.gz.tar 300M
    >
    >or
    >
    >roads1.gz.tar 100M
    >roads2.gz.tar 100M
    >roads3.gz.tar 100M
    >roads4.gz.tar 100M
    >
    >I wonder whether I should concatenate them all into a single ultra large
    >file and then perform parsing them into a large table (I don't know whether
    >perl can handle that...).
    >
    >The final table should look like this:
    >

    [snip examples that don't tell us much]

    >Any advice? or should i resort to some other languages?
    >


    Yes, go back to the database that produced these files
    and run a different query to get the info you need.

    If you're no longer with the company that owns this information,
    I suggest you contact them for permission to use it.

    -sln
     
    , Feb 10, 2010
    #6
  7. ela wrote:
    > I have some large data in pieces, e.g.
    >
    > asia.gz.tar 300M
    >
    > or
    >
    > roads1.gz.tar 100M
    > roads2.gz.tar 100M
    > roads3.gz.tar 100M
    > roads4.gz.tar 100M


    The data is first gzipped and then tarred? That is an odd way of doing
    things.

    > I wonder whether I should concatenate them all into a single ultra large
    > file


    I see no reason to do that, especially as I don't think the tar format
    supports that cleanly, does it?

    > and then perform parsing them into a large table (I don't know whether
    > perl can handle that...).


    I bet it can.

    >
    > The final table should look like this:
    >
    > ID1 ID2 INFO
    > X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
    > X2.3 H9 Beijing; China; Asia
    > .....
    >
    > each row may come from a big file of >100M (as aforementioned):
    >
    > CITY Beijing
    > NOTE Capital
    > RACE Chinese
    > ....


    What is the "...." hiding? 100M is an awful lot of "....".

    Each file is turned into only one row? And each file is 100M? So how
    many rows do you anticipate having?


    > And then I have another much smaller table which contains all the ID's
    > (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
    > this 20M file annotated with the INFO. Hashing seems not to be a solution
    > for my 32G, 8-core machine...


    Why not?

    > Any advice? or should i resort to some other languages?


    Your description is too vague to give any reasonable advice.


    Xho
     
    Xho Jingleheimerschmidt, Feb 11, 2010
    #7
  8. Jürgen Exner wrote:
    >
    > However the real solution would be to load the whole enchilada into a
    > database and then do whatever join you want to do. There is a reason why
    > database system have been created and optimized for exactly such tasks.


    Database systems are generally created for atomicity, consistency,
    isolation, and durability, which is quite a bit more than this task
    seems to require. It is my general experience that in this type of
    task, a Perl script could be written and have completed its job while
    the database system is still tying its shoes.


    Xho
     
    Xho Jingleheimerschmidt, Feb 11, 2010
    #8
  9. Xho Jingleheimerschmidt <> wrote:
    >Jürgen Exner wrote:
    >>
    >> However the real solution would be to load the whole enchilada into a
    >> database and then do whatever join you want to do. There is a reason why
    >> database system have been created and optimized for exactly such tasks.

    >
    >Database systems are generally created for atomicity, concurrency,
    >isolation, and durability, which is quite a bit more than this task
    >seems to consist of. It is my general experience that in this type of
    >task, a Perl script could be written and have completed its job while
    >the database system is still tying its shoes.


    Certainly true. But they are also designed to handle vast amounts of
    data efficiently. And if the OP indeed runs into space issues, then a DB
    system may (just may!) provide an easier-to-use and even faster
    alternative to looping through files over and over again.

    Whether it actually is an advantage or not is hard to tell. I agree, the
    OP's task seems to be easy enough to be solved in a single pass. But the
    description was rather cryptic, too, so there might be more
    cross-references going on than either of us is expecting at this time.

    jue
     
    Jürgen Exner, Feb 11, 2010
    #9
  10. ela

    ccc31807 Guest

    On Feb 11, 1:32 pm, (Jamie) wrote:
    > I'd only add that often databases aren't a very good choice for data
    > that doesn't change. Databases are way over used, IMO.
    >
    > Often, an index structure built out of flat files is superior (doesn't have all
    > that extra baggage of servers, code to carefully insert/update and such)
    >
    > Sometimes, you just can't beat a set of well formed text files on
    > top of a carefully planned directory. :)


    Amen, amen, and amen. In my line of work, I see Access almost
    exclusively used as a productivity tool (e.g., to produce reports) and
    never (almost) as a database. Personally, I have created hundreds if
    not a thousand or two of Access databases, and I can't recall using a
    Primary Key in Access.

    The same thing can also be said for Excel.

    I totally agree that many times using some kind of delimited file is
    much easier, simpler, and faster than using a database. For one-time
    processing, building a data structure in memory is also much easier,
    simpler, and faster.

    CC.
     
    ccc31807, Feb 11, 2010
    #10
  11. On 2010-02-11 18:32, Jamie <> wrote:
    > In <>,
    > ccc31807 <> mentions:
    >>If you will use the data frequently, you might want to stuff the data
    >>into a database so you can query it conveniently.

    >
    > I'd only add that often databases aren't a very good choice for data
    > that doesn't change. Databases are way over used, IMO.
    >
    > Often, an index structure built out of flat files is superior (doesn't have all
    > that extra baggage of servers, code to carefully insert/update and such)
    >
    > Sometimes, you just can't beat a set of well formed text files on
    > top of a carefully planned directory. :)


    You are confusing databases with RDBMSs.

    A database is a collection of structured data.

    A relational database is a database where the data is organized into
    "relations", i.e., tables with certain properties (columns are ordered
    and have a type and usually a name, the order of rows doesn't matter, etc.).

    A relational database management system is a program for managing
    relational databases - it lets you insert, update and query the data,
    enforces consistency, maintains indexes, implements transactions, etc.

    hp
     
    Peter J. Holzer, Feb 12, 2010
    #11
