Huge file manipulation


klashxx

Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.

To clarify, this is the structure of the file:

30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
01|F|0207|00|||+0005655,00|||+0000000000000,00
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
01|F|0207|00|||+0000000000000,00|||+0000000000000,00
30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
1804|
00|||+0000000000000,00|||+0000000000000,00

The key is formed by the first 7 fields, and I want to print or delete
only the duplicates (the delimiter is the pipe).

I tried all the usual methods (awk/sort/uniq/sed/grep...), but it always
ended with the same result: out of memory!

I'm using large HP-UX servers.

I'm very new to Perl, but I read somewhere that the Tie::File module can
handle very large files. I tried it but cannot get the right code...

Any advice will be very welcome.

Thank you in advance.

Regards

PS: I do not want to split the files.
 

RedGrittyBrick

klashxx said:
Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.
[...]
The key is formed by the first 7 fields, and I want to print or delete
only the duplicates (the delimiter is the pipe).
[...]
PS: I do not want to split the files.

When you try the following, do you run out of memory?

perl -ne 'print unless /^(\w*\|\w*\|\w*\|\w*\|\w*\|\w*\|\w*)\|/
          and $seen{$1}++' hugefilename

You might trade CPU for RAM by making a hash of the key (in the
cryptographic-digest sense, not the Perl associative-array sense).

Tie::File works with files larger than memory, but I'm not sure how you
would use it for your problem. It's storing the index of seen keys that
is the problem.

I'd maybe tie my %seen to a DBM file. See `perldoc -f tie`.
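
For what it's worth, a rough sketch of that idea, assuming DB_File is
available and using stand-in names (seen.db for the on-disk index,
hugefilename for the input). Run it as `perl dedup.pl hugefilename`:

#!/usr/bin/perl
# Sketch only: keep the "seen" index in an on-disk DBM file so it
# doesn't have to fit in RAM. seen.db is just a scratch file name.
use strict;
use warnings;
use Fcntl;
use DB_File;

tie my %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Cannot tie seen.db: $!";

while (my $line = <>) {
    # key = the first 7 pipe-delimited fields
    my $key = join '|', (split /\|/, $line, 8)[0 .. 6];
    print $line unless $seen{$key}++;    # keep first occurrence only
}

untie %seen;

If DB_File isn't installed, another DBM backend (SDBM_File, GDBM_File)
could be substituted in the tie; the keys here are short enough for any
of them.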
 
X

xhoster

klashxx said:
Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.

To clarify, this is the structure of the file:

30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
01|F|0207|00|||+0005655,00|||+0000000000000,00
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
01|F|0207|00|||+0000000000000,00|||+0000000000000,00
30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
1804|
00|||+0000000000000,00|||+0000000000000,00

The key is formed by the first 7 fields, and I want to print or delete
only the duplicates (the delimiter is the pipe).

Given the line wraps, it is hard to figure out what the structure
of your file is. Every line has from 7 to infinity fields, with the
first one being 30xx? When you say "print or delete", which one? Do you
want to do both in a single pass, or have two different programs, one for
each use-case?
I tried all the usual methods (awk/sort/uniq/sed/grep...), but it
always ended with the same result: out of memory!

Which of those programs was running out of memory? Can you use sort
to group lines according to the key without running out of memory?
That is what I do: first use the system sort to group keys, then Perl
to finish up.
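
As an illustration of that approach (a sketch only; big_file and
duplicate_lines are stand-in names): sort groups the keys on disk, so
nothing has to fit in memory, and a tiny Perl filter then reports every
line whose key repeats the previous one.

sort -t'|' -k1,7 big_file | perl -ne '
    my $key = join "|", (split /\|/, $_, 8)[0 .. 6];
    print if defined $prev and $key eq $prev;   # second and later occurrences
    $prev = $key;
' > duplicate_lines

Inverting the test (print unless ...) would instead keep one copy of
each key and drop the duplicates, at the cost of losing the original
line order.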

How many duplicate keys do you expect there to be? If the number of
duplicates is pretty small, I'd come up with the list of them:

cut -d\| -f1-7 big_file | sort | uniq -d > dup_keys

And then load dup_keys into a Perl hash and step through big_file,
comparing each line's key to the hash.
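
A minimal sketch of that second step, assuming the dup_keys and
big_file names used above (everything else is illustrative):

#!/usr/bin/perl
use strict;
use warnings;

# Slurp the (small) list of duplicated keys into a hash.
my %dup;
open my $keys, '<', 'dup_keys' or die "dup_keys: $!";
while (my $key = <$keys>) {
    chomp $key;
    $dup{$key} = 1;
}
close $keys;

# Stream the big file once, printing only lines whose key is duplicated.
open my $in, '<', 'big_file' or die "big_file: $!";
while (my $line = <$in>) {
    my $key = join '|', (split /\|/, $line, 8)[0 .. 6];
    print $line if $dup{$key};   # or: next if $dup{$key}; to drop them
}
close $in;
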
I'm very new to Perl, but I read somewhere that the Tie::File module
can handle very large files,

Tie::File has substantial per-line overhead. So unless the lines
are quite long, Tie::File doesn't increase the size of the file you
can handle by all that much. Also, it isn't clear how you would use
it anyway. It doesn't help you keep huge hashes, which is what you need
to group keys efficiently if you aren't pre-sorting. And while it makes it
*easy* to delete lines from the middle of large files, it does not make
it *efficient* to do so.
I tried it but cannot get the right code...

We can't very well comment on code we can't see.

[...]
PS: I do not want to split the files.

Why not?

Xho

 

Rocco Caputo

Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.
[...]
I tried all the usual methods (awk/sort/uniq/sed/grep...), but it
always ended with the same result: out of memory!

PS: I do not want to split the files.

Exactly how are you using awk/sort/uniq/sed/grep? Which part of the
pipeline is running out of memory?

Depending on what you're doing, and where you're doing it, you may be
able to tune sort(1) to use more memory (much faster) or to use a
faster filesystem for its temporary files.
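
For example, with GNU sort (an assumption; HP-UX's own sort(1) spells
these options differently, if it offers them at all):

# -S sets the in-memory buffer size, -T the directory for temporary
# spill files; point -T at the fastest filesystem available.
sort -t'|' -k1,7 -S 1G -T /path/to/fast/tmp big_file > sorted_file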

Are the first seven fields always the same width? If so, you needn't
bother with the pipes.

Must the order of the files be preserved?
 

Jürgen Exner

klashxx said:
Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.

The key is formed by the first 7 fields, and I want to print or delete
only the duplicates (the delimiter is the pipe).

Hmmm, what is the ratio of unique lines to total lines? I.e., are there
many duplicate lines or only a few?

If the number of unique lines is small, then a standard approach of
recording each unique line in a hash may work. Then you can simply check
whether a line with that content already exists() and delete/print the
duplicate as you encounter it further down the file.

If the number of unique lines is large, then that will no longer be
possible and you will have to trade speed and simplicity for memory.
For each line I'd compute a checksum and record that checksum together
with the exact position of each matching line in the hash.
Then, in a second pass, lines with unique checksums are unique, while
lines with the same checksum (more than one line recorded for a given
checksum) are candidates for duplicates and need to be compared
individually.
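
A rough sketch of that idea, simplified to a count per digest rather
than recording line positions; Digest::MD5 and the command-line file
name are the only assumptions here:

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $file = shift or die "usage: $0 file\n";

# Pass 1: count how often each key digest occurs (16 bytes per key).
my %count;
open my $in, '<', $file or die "$file: $!";
while (my $line = <$in>) {
    my $key = join '|', (split /\|/, $line, 8)[0 .. 6];
    $count{ md5($key) }++;
}
close $in;

# Pass 2: print the duplicate candidates; as noted above, they should
# still be compared field by field to rule out digest collisions.
open $in, '<', $file or die "$file: $!";
while (my $line = <$in>) {
    my $key = join '|', (split /\|/, $line, 8)[0 .. 6];
    print $line if $count{ md5($key) } > 1;
}
close $in;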

jue
 

Tim Greer

klashxx said:
Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.
[...]
The key is formed by the first 7 fields, and I want to print or delete
only the duplicates (the delimiter is the pipe).

I tried all the usual methods (awk/sort/uniq/sed/grep...), but it
always ended with the same result: out of memory!

What is the code you're using now?
PS: I do not want to split the files.

Splitting could help solve the problem in its current form, potentially.
Have you considered a database solution, if only for this type of task,
even if you want to continue storing the fields/data in the files?
 

cartercc

Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.
[...]
The key is formed by the first 7 fields, and I want to print or delete
only the duplicates (the delimiter is the pipe).
[...]
PS: I do not want to split the files.

1. Create a database with a data table, in two columns, ID_PK and
DATA.
2. Read the file line by line and insert each row into the database,
using the first seven fields as the key. This will ensure that you
have no duplicates in the database: as each PK must be unique, your
insert statement will fail for duplicates.
3. Do a select statement on the database and print out all the
records returned (see the sketch below).
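
An illustrative sketch of those three steps using DBI with DBD::SQLite;
SQLite, the file names and the table layout are all assumptions here,
and any database that enforces a primary key would do:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=dedup.db', '', '',
                       { RaiseError => 0, PrintError => 0 })
    or die DBI->errstr;
$dbh->do('CREATE TABLE IF NOT EXISTS rows (id_pk TEXT PRIMARY KEY, data TEXT)');

my $ins = $dbh->prepare('INSERT INTO rows (id_pk, data) VALUES (?, ?)');

# Step 2: insert every line, keyed on the first seven fields.
open my $in, '<', 'big_file' or die "big_file: $!";
while (my $line = <$in>) {
    my $key = join '|', (split /\|/, $line, 8)[0 .. 6];
    $ins->execute($key, $line);    # silently fails on a duplicate key
}
close $in;

# Step 3: print the de-duplicated records.
my $sth = $dbh->prepare('SELECT data FROM rows');
$sth->execute;
while (my ($data) = $sth->fetchrow_array) {
    print $data;
}

Note that, unlike the streaming approaches above, this does not
preserve the original line order unless an extra sequence column is
added.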

Splitting the file is irrelevant as virtually 100% of the time you
have to split all files to take a gander at the data.

I still don't understand why you can't use a simple hash rather than a
DB, but my not understanding that point is irrelevant as well.

CC
 

Michele Dondi

Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text. [cut]
Any advice will be very welcome.

Thank you in advance.

Wouldn't it have been nice of you to mention that you asked the very
same question elsewhere? <http://perlmonks.org/?node_id=722634> Did
they help you there? How did they fail to do so?


Michele
 

sln

Hi, I need a fast way to delete duplicate entries from very large
files (>2 GB); the files are plain text.
[...]
The key is formed by the first 7 fields, and I want to print or delete
only the duplicates (the delimiter is the pipe).
[...]
PS: I do not want to split the files.

I can do this for you with custom algorithym's.
Each file is sent to me and each has an independent
fee, based on processing time. Or I can license my
technologoy to you, flat-fee, per usage based.

Let me know if your interrested, post a contact email address.


sln
 

Tad J McClellan

I can do this for you with custom algorithym's.
                                            ^^

Your algorithym (sic) possesses something?

Let me know if your interrested, post a contact email address.
               ^^^^

Put in apostrophe's where they are not needed, leave them out
where theyre needed. Interresting.
 

David Combs


Your algorithym (sic) possesses something?



Put in apostrophe's where they are not needed, leave them out

What's with the schools these days?

On the net, at least, I hardly ever see "you're" any more -- it's
always "your".



(I bet the Chinese and Russians don't make that mistake! :-( )


David
 
