Huge files manipulation

Discussion in 'Perl Misc' started by klashxx, Nov 10, 2008.

  1. klashxx

    klashxx Guest

    Hi, I need a fast way to delete duplicate entries from very huge
    files (>2 GB); these files are in plain text.

    ...To clarify, this is the structure of the file:

    30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    01|F|0207|00|||+0005655,00|||+0000000000000,00
    30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    01|F|0207|00|||+0000000000000,00|||+0000000000000,00
    30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
    1804|
    00|||+0000000000000,00|||+0000000000000,00

    Using a key formed by the first 7 fields, I want to print or delete
    only the duplicates (the delimiter is the pipe).

    I tried all the usual methods (awk / sort / uniq / sed / grep ...) but
    it always ended with the same result (out of memory!).

    I'm using large HP-UX servers.

    I'm very new to Perl, but I read somewhere that the Tie::File module
    can handle very large files; I tried it but cannot get the right code...

    Any advice will be very welcome.

    Thank you in advance.

    Regards

    PS: I do not want to split the files.
     
    klashxx, Nov 10, 2008
    #1

  2. klashxx wrote:
    > Hi , i need a fast way to delete duplicates entrys from very huge
    > files ( >2 Gbs ) , these files are in plain text.
    >
    > ..To clarify, this is the structure of the file:
    >
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0005655,00|||+0000000000000,00
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
    > 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
    > 1804|
    > 00|||+0000000000000,00|||+0000000000000,00
    >
    > Having a key formed by the first 7 fields i want to print or delete
    > only the duplicates( the delimiter is the pipe..).
    >
    > I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
    > it always ended with the same result (out of memory!)
    >
    > In using HP-UX large servers.
    >
    > I 'm very new to perl, but i read somewhere tha Tie::File module can
    > handle very large files , i tried but cannot get the right code...
    >
    > Any advice will be very well come.
    >
    > Thank you in advance.
    >
    > Regards
    >
    > PD:I do not want to split the files.


    When you try the following do you run out of memory?

    perl -n -e '/^(\w*|\w*|\w*|\w*|\w*|\w*|\w*)|/ \
    and print unless $seen{$1}++' \
    hugefilename

    You might trade CPU for RAM by making a hash of the key (in the
    cryptographic-digest sense, not the Perl associative-array sense).

    Tie::File works with files larger than memory, but I'm not sure how you
    would use it for your problem. It's storing the index of seen keys that
    is the problem.

    I'd maybe tie my %seen to a dbm file. See `perldoc -f tie`
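
    Untested, but a minimal sketch of those two ideas combined might look
    like this (the file names are invented, and it assumes one record per
    line):

    #!/usr/bin/perl
    # Keep the "seen" index on disk in a DBM file so memory stays small,
    # and shrink each 7-field key to a 16-byte MD5 digest.
    use strict;
    use warnings;
    use DB_File;                  # any DBM module should do
    use Digest::MD5 qw(md5);

    tie my %seen, 'DB_File', 'seen.db' or die "seen.db: $!";

    open my $in,  '<', 'hugefile'       or die "hugefile: $!";
    open my $out, '>', 'hugefile.dedup' or die "hugefile.dedup: $!";

    while (my $line = <$in>) {
        my $key = join '|', (split /\|/, $line)[0 .. 6];   # first 7 fields
        print {$out} $line unless $seen{ md5($key) }++;    # keep first occurrence
    }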

    --
    RGB
     
    RedGrittyBrick, Nov 10, 2008
    #2

  3. RedGrittyBrick <> wrote:

    > perl -n -e '/^(\w*|\w*|\w*|\w*|\w*|\w*|\w*)|/ \



    ITYM:

    perl -n -e '/^(\w*\|\w*\|\w*\|\w*\|\w*\|\w*\|\w*)\|/ \


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
     
    Tad J McClellan, Nov 10, 2008
    #3
  4. Tad J McClellan wrote:
    > RedGrittyBrick <> wrote:
    >
    >> perl -n -e '/^(\w*|\w*|\w*|\w*|\w*|\w*|\w*)|/ \

    >
    >
    > ITYM:
    >
    > perl -n -e '/^(\w*\|\w*\|\w*\|\w*\|\w*\|\w*\|\w*)\|/ \
    >
    >


    Doh!

    --
    RGB
     
    RedGrittyBrick, Nov 10, 2008
    #4
  5. Guest

    klashxx <> wrote:
    > Hi , i need a fast way to delete duplicates entrys from very huge
    > files ( >2 Gbs ) , these files are in plain text.
    >
    > ..To clarify, this is the structure of the file:
    >
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0005655,00|||+0000000000000,00
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
    > 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
    > 1804|
    > 00|||+0000000000000,00|||+0000000000000,00
    >
    > Having a key formed by the first 7 fields i want to print or delete
    > only the duplicates( the delimiter is the pipe..).


    Given the line wraps, it is hard to figure out what the structure
    of your file is. Every line has from 7 to infinity fields, with the
    first one being 30xx? When you say "print or delete", which one? Do you
    want to do both in a single pass, or have two different programs, one for
    each use-case?

    >
    > I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
    > it always ended with the same result (out of memory!)


    Which of those programs was running out of memory? Can you use sort
    to group lines according to the key without running out of memory?
    That is what I do: first use the system sort to group keys, then Perl
    to finish up.
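
    For instance (untested sketch; the script name is made up, and it
    assumes the input has already been grouped on the first seven fields,
    e.g. by `sort -t'|' -k1,7 hugefile | perl dedup_sorted.pl`, keeping
    the first line of each key group):

    #!/usr/bin/perl
    # dedup_sorted.pl - reads lines already grouped by their 7-field key
    # and prints only the first line of each group, so no index of keys
    # is ever held in memory.
    use strict;
    use warnings;

    my $prev = '';
    while (my $line = <>) {
        my $key = join '|', (split /\|/, $line)[0 .. 6];
        print $line if $key ne $prev;    # first line of this key group
        $prev = $key;
    }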

    How many duplicate keys do you expect there to be? If the number of
    duplicates is pretty small, I'd come up with the list of them:

    cut -d\| -f1-7 big_file | sort | uniq -d > dup_keys

    And then load dup_keys into a Perl hash, then step through big_file
    comparing each line's key to the hash.
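
    Roughly like this (untested; `dup_keys` and `big_file` as above, and
    this variant prints only the duplicate lines):

    #!/usr/bin/perl
    # Load the (hopefully short) dup_keys list produced by cut|sort|uniq -d,
    # then stream big_file and print every line whose 7-field key is listed.
    use strict;
    use warnings;

    my %dup;
    open my $keys, '<', 'dup_keys' or die "dup_keys: $!";
    while (my $k = <$keys>) {
        chomp $k;
        $dup{$k} = 1;
    }
    close $keys;

    open my $in, '<', 'big_file' or die "big_file: $!";
    while (my $line = <$in>) {
        my $key = join '|', (split /\|/, $line)[0 .. 6];
        print $line if $dup{$key};
    }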

    > I 'm very new to perl, but i read somewhere tha Tie::File module can
    > handle very large files ,


    Tie::File has substantial per-line overhead. So unless the lines
    are quite long, Tie::File doesn't increase the size of the file you
    can handle by all that much. Also, it isn't clear how you would use
    it anyway. It doesn't help you keep huge hashes, which is what you need
    to group keys efficiently if you aren't pre-sorting. And while it makes it
    *easy* to delete lines from the middle of large files, it does not make
    it *efficient* to do so.

    > i tried but cannot get the right code...


    We can't very well comment on code we can't see.

    ....
    > PD:I do not want to split the files.


    Why not?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Nov 10, 2008
    #5
  6. Rocco Caputo Guest

    On Mon, 10 Nov 2008 02:24:53 -0800 (PST), klashxx wrote:
    > Hi , i need a fast way to delete duplicates entrys from very huge
    > files ( >2 Gbs ) , these files are in plain text.
    >
    > ..To clarify, this is the structure of the file:
    >
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0005655,00|||+0000000000000,00
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
    > 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
    > 1804|
    > 00|||+0000000000000,00|||+0000000000000,00
    >
    > Having a key formed by the first 7 fields i want to print or delete
    > only the duplicates( the delimiter is the pipe..).
    >
    > I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
    > it always ended with the same result (out of memory!)


    > PD:I do not want to split the files.


    Exactly how are you using awk/sort/uniq/sed/grep? Which part of the
    pipeline is running out of memory?

    Depending on what you're doing, and where you're doing it, you may be
    able to tune sort() to use more memory (much faster) or a faster
    filesystem for temporary files.

    Are the first seven fields always the same width? If so, you needn't
    bother with the pipes.
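
    If they are, something like this untested sketch would do (the 64 is a
    made-up width, substitute the real combined width of the first seven
    fields; %seen would still need one of the on-disk tricks discussed
    elsewhere in this thread if it grows too large):

    #!/usr/bin/perl
    # With fixed-width fields the key is just a fixed-length prefix of the
    # line, so substr() avoids splitting on '|' altogether.
    use strict;
    use warnings;

    use constant KEY_LEN => 64;    # placeholder width

    my %seen;
    while (my $line = <>) {
        my $key = substr($line, 0, KEY_LEN);
        print $line unless $seen{$key}++;
    }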

    Must the order of the files be preserved?

    --
    Rocco Caputo - http://poe.perl.org/
     
    Rocco Caputo, Nov 10, 2008
    #6
  7. klashxx <> wrote:
    >Hi , i need a fast way to delete duplicates entrys from very huge
    >files ( >2 Gbs ) , these files are in plain text.
    >
    >Having a key formed by the first 7 fields i want to print or delete
    >only the duplicates( the delimiter is the pipe..).


    Hmmm, what is the ratio of unique lines to total lines? I.e. are there
    many duplicate lines or only a few?

    If the number of unique lines is small, then a standard approach of
    recording each unique line in a hash may work. Then you can simply check
    whether a line with that content already exists() and delete/print the
    duplicate as you encounter it further down the file.

    If the number of unique lines is large, then that will no longer be
    possible and you will have to trade speed and simplicity for memory.
    For each line I'd compute a checksum and record that checksum together
    with the exact position of each matching line in the hash.
    Then, in a second pass, lines with unique checksums are unique, while
    lines sharing a checksum (more than one line recorded for a given
    checksum) are candidates for duplicates and need to be compared
    individually.
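
    A rough, untested sketch of that second, two-pass idea (the file name
    is invented, the checksum is taken over the 7-field key rather than the
    whole line to match the original problem, and it prints the duplicates):

    #!/usr/bin/perl
    # Pass 1: record, under the MD5 of each line's 7-field key, the byte
    # offsets of the lines that produced it. Pass 2: revisit only the
    # offsets whose checksum occurred more than once and compare the real
    # keys to confirm genuine duplicates.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my %pos;    # checksum => [ byte offsets ]
    open my $in, '<', 'hugefile' or die "hugefile: $!";
    while (1) {
        my $offset = tell $in;
        my $line   = <$in>;
        last unless defined $line;
        my $key = join '|', (split /\|/, $line)[0 .. 6];
        push @{ $pos{ md5($key) } }, $offset;
    }

    for my $offsets (values %pos) {
        next if @$offsets < 2;             # unique checksum => unique line
        my %seen_key;
        for my $off (@$offsets) {
            seek $in, $off, 0 or die "seek: $!";
            my $line = <$in>;
            my $key  = join '|', (split /\|/, $line)[0 .. 6];
            print $line if $seen_key{$key}++;   # a genuine duplicate
        }
    }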

    jue
     
    Jürgen Exner, Nov 10, 2008
    #7
  8. Tim Greer Guest

    klashxx wrote:

    > Hi , i need a fast way to delete duplicates entrys from very huge
    > files ( >2 Gbs ) , these files are in plain text.
    >
    > ..To clarify, this is the structure of the file:
    >
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0005655,00|||+0000000000000,00
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
    > 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
    > 1804|
    > 00|||+0000000000000,00|||+0000000000000,00
    >
    > Having a key formed by the first 7 fields i want to print or delete
    > only the duplicates( the delimiter is the pipe..).
    >
    > I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
    > it always ended with the same result (out of memory!)


    What is the code you're using now?

    >
    > PD:I do not want to split the files.


    Splitting them could potentially help solve the problem in its current
    form, though. Have you considered using a database solution, if nothing
    more than for this type of task, even if you want to continue storing
    the fields/data in the files?
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Nov 10, 2008
    #8
  9. cartercc Guest

    On Nov 10, 5:24 am, klashxx <> wrote:
    > Hi , i need a fast way to delete duplicates entrys from very huge
    > files ( >2 Gbs ) , these files are in plain text.
    >
    > ..To clarify, this is the structure of the file:
    >
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0005655,00|||+0000000000000,00
    > 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    > 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
    > 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
    > 1804|
    > 00|||+0000000000000,00|||+0000000000000,00
    >
    > Having a key formed by the first 7 fields i want to print or delete
    > only the duplicates( the delimiter is the pipe..).
    >
    > I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
    > it always ended with the same result (out of memory!)
    >
    > In using HP-UX large servers.
    >
    > I 'm very new to perl, but i read somewhere tha Tie::File module can
    > handle very large files , i tried but cannot get the right code...
    >
    > Any advice will be very well come.
    >
    > Thank you in advance.
    >
    > Regards
    >
    > PD:I do not want to split the files.


    1. Create a database with a data table that has two columns, ID_PK and
    DATA.
    2. Read the file line by line and insert each row into the database,
    using the first seven fields as the key. This will ensure that you
    have no duplicates in the database: as each PK must be unique, the
    insert will fail for duplicates (a rough sketch follows below).
    3. Do a SELECT from the database and print out all the records
    returned.
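
    A rough sketch of that approach, using DBI with SQLite purely as an
    example back end (untested; file and table names are invented, and
    INSERT OR IGNORE is used so duplicate-key failures are skipped quietly
    instead of aborting the run):

    #!/usr/bin/perl
    # Insert every line keyed on its first seven fields; the PRIMARY KEY
    # constraint keeps one row per key, then a SELECT dumps the
    # de-duplicated data back out.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=dedup.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS rows (id_pk TEXT PRIMARY KEY, data TEXT)');
    my $ins = $dbh->prepare('INSERT OR IGNORE INTO rows (id_pk, data) VALUES (?, ?)');

    open my $in, '<', 'big_file' or die "big_file: $!";
    while (my $line = <$in>) {
        my $key = join '|', (split /\|/, $line)[0 .. 6];
        $ins->execute($key, $line);
    }
    $dbh->commit;

    my $sth = $dbh->prepare('SELECT data FROM rows');
    $sth->execute;
    while (my ($data) = $sth->fetchrow_array) {
        print $data;
    }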

    Splitting the file is irrelevant as virtually 100% of the time you
    have to split all files to take a gander at the data.

    I still don't understand why you can't use a simple hash rather than a
    DB, but my not understanding that point is irrelevant as well.

    CC
     
    cartercc, Nov 10, 2008
    #9
  10. On Mon, 10 Nov 2008 02:24:53 -0800 (PST), klashxx <>
    wrote:

    >Hi , i need a fast way to delete duplicates entrys from very huge
    >files ( >2 Gbs ) , these files are in plain text.

    [cut]
    >Any advice will be very well come.
    >
    >Thank you in advance.


    Wouldn't it have been nice of you to mention that you asked the very
    same question elsewhere? <http://perlmonks.org/?node_id=722634> Did they
    help you there? How did they fail to do so?


    Michele
    --
    {$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
    (($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
    ..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
    256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
     
    Michele Dondi, Nov 11, 2008
    #10
  11. Guest

    On Mon, 10 Nov 2008 02:24:53 -0800 (PST), klashxx <> wrote:

    >Hi , i need a fast way to delete duplicates entrys from very huge
    >files ( >2 Gbs ) , these files are in plain text.
    >
    >..To clarify, this is the structure of the file:
    >
    >30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    >01|F|0207|00|||+0005655,00|||+0000000000000,00
    >30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
    >01|F|0207|00|||+0000000000000,00|||+0000000000000,00
    >30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
    >1804|
    >00|||+0000000000000,00|||+0000000000000,00
    >
    >Having a key formed by the first 7 fields i want to print or delete
    >only the duplicates( the delimiter is the pipe..).
    >
    >I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
    >it always ended with the same result (out of memory!)
    >
    >In using HP-UX large servers.
    >
    >I 'm very new to perl, but i read somewhere tha Tie::File module can
    >handle very large files , i tried but cannot get the right code...
    >
    >Any advice will be very well come.
    >
    >Thank you in advance.
    >
    >Regards
    >
    >PD:I do not want to split the files.


    I can do this for you with custom algorithym's.
    Each file is sent to me and each has an independent
    fee, based on processing time. Or I can license my
    technologoy to you, flat-fee, per usage based.

    Let me know if your interrested, post a contact email address.


    sln
     
    , Nov 16, 2008
    #11
  12. <> wrote:

    > I can do this for you with custom algorithym's.

    ^^
    ^^

    Your algorithym (sic) possesses something?


    > Let me know if your interrested, post a contact email address.

    ^^^^
    ^^^^

    Put in apostrophe's where they are not needed, leave them out
    where theyre needed. Interresting.


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
     
    Tad J McClellan, Nov 16, 2008
    #12
  13. David Combs Guest

    In article <>,
    Tad J McClellan <> wrote:
    > <> wrote:
    >
    >> I can do this for you with custom algorithym's.

    > ^^
    > ^^
    >
    >Your algorithym (sic) possesses something?
    >
    >
    >> Let me know if your interrested, post a contact email address.

    > ^^^^
    > ^^^^
    >
    >Put in apostrophe's where they are not needed, leave them out


    What's with the schools these days?

    On the net, at least, I hardly ever see "you're" any more -- it's
    always "your".



    (I bet the Chinese and Russians don't make that mistake! :-( )


    David
     
    David Combs, Dec 1, 2008
    #13