Efficient Text File Copy

Materialised

Hi everyone,

What I want to do is copy a simple plain text file to
another file, but omitting duplicate items.

The way I thought of doing this involved copying all the items into an
array, then looping through that array looking for duplicates,
removing them, and then writing to another file.

This seems a very long and drawn-out way of doing this to me, and I
also do not know what the initial size of the array will need to be.

Could anyone suggest an effective, efficient way of doing this?

Thanks
Mick
 
Chris ( Val )

| Hi everyone,
|
| What I want to do is copy a simple plain text file to
| another file, but omitting duplicate items.
|
| The way I thought of doing this involved copying all the items into an
| array, then looping through that array looking for duplicates,
| removing them, and then writing to another file.
|
| This seems a very long and drawn-out way of doing this to me, and I
| also do not know what the initial size of the array will need to be.
|
| Could anyone suggest an effective, efficient way of doing this?

That depends, because there are a few different ways
that one could approach this. For example:
std::set<>, std::map<>, std::vector<>, std::unique()
are all available for you to come up with an algorithm.

Can you tell us what constitutes an 'item' in the file ?

Can you show us a few lines of the file format,
and what the format of the new file should look like?

Cheers.
Chris Val
 
Jeff Schwab

Chris said:
| Hi everyone,
|
| What I want to do is copy a simple plain text file to
| another file, but omitting duplicate items.
|
| The way I thought of doing this involved copying all the items into an
| array, then looping through that array looking for duplicates,
| removing them, and then writing to another file.
|
| This seems a very long and drawn-out way of doing this to me, and I
| also do not know what the initial size of the array will need to be.
|
| Could anyone suggest an effective, efficient way of doing this?

That depends, because there are a few different ways
that one could approach this. For example:
std::set<>, std::map<>, std::vector<>, std::unique()
are all available for you to come up with an algorithm.

Can you tell us what constitutes an 'item' in the file ?

Can you show us a few lines of the file format,
and what the format of the new file should look like?

Cheers.
Chris Val

What he said. Also helpful would be information about the order of the
records: must it be preserved, and are they sorted?
 
Materialised

Chris said:
| Hi everyone,
|
| What I want to do is copy a simple plain text file to
| another file, but omitting duplicate items.
|
| The way I thought of doing this involved copying all the items into an
| array, then looping through that array looking for duplicates,
| removing them, and then writing to another file.
|
| This seems a very long and drawn-out way of doing this to me, and I
| also do not know what the initial size of the array will need to be.
|
| Could anyone suggest an effective, efficient way of doing this?

That depends, because there are a few different ways
that one could approach this. For example:
std::set<>, std::map<>, std::vector<>, std::unique()
are all available for you to come up with an algorithm.

Can you tell us what constitutes an 'item' in the file ?

Can you show us a few lines of the file format,
and what the format of the new file should look like?

Cheers.
Chris Val

The file is simply a list of items; it could be anything. Let's just say
for this example it's a shopping list. It would be in the format:

eggs
milk
bread
carrots
Jam
...
.....
...... etc

There is also no requirement on the order of items, as the program will
make a system() call to sort at the end.
 
E. Robert Tisdale

Materialised said:
The file is simply a list of items; it could be anything. Let's just say
for this example it's a shopping list. It would be in the format:

eggs
milk
bread
carrots
Jam
..
....
..... etc

There is also no requirement on the order of items, as the program will
make a system() call to sort at the end.

The UNIX [GNU] sort command has an option, -u [--unique], to do this.
 
Simon Biber

Materialised said:
The file is simply a list of items; it could be anything. Let's just say
for this example it's a shopping list. It would be in the format:

eggs
milk
bread
carrots
Jam
..
....
..... etc

There is also no requirement on the order of items, as the program will
make a system() call to sort at the end.

If you're allowed to make a system() call to sort, then you probably
can also use the system() command "uniq" to do the actual work of
eliminating duplicates.

If you have to do the work of uniq yourself, have a look at the free
source code for a version of the uniq program for ideas.

On Windows/DOS/Linux/Unix type systems:
system("sort < input.file | uniq > output.file");

Of course, you must have some version of uniq installed...

C:\>uniq --version
uniq (textutils) 2.0.21
Written by Richard Stallman and David MacKenzie.

Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
Chris ( Val )

| Chris ( Val ) wrote:
| >
[snip]

| > Can you tell us what constitutes an 'item' in the file ?
| >
| > Can you show us a few lines of the file format ?,
| > and what the format of the new file should look like ?
| >
| > Cheers.
| > Chris Val
| >
| >
| The file is simply a list of items; it could be anything. Let's just say
| for this example it's a shopping list. It would be in the format:
|
| eggs
| milk
| bread
| carrots
| Jam
| ..
| ....
| ..... etc
|
| There is also no requirement on the order of items, as the program will
| make a system() call to sort at the end.



# include <iostream>
# include <fstream>
# include <ostream>
# include <string>
# include <vector>
# include <algorithm>
# include <iterator>
# include <cstdlib>

int main()
{
    std::ifstream InFile( "Copies.txt" );
    std::ofstream OutFile( "NewFile.txt" );

    if( !InFile || !OutFile )
        return EXIT_FAILURE;

    std::vector<std::string> V;

    std::copy( std::istream_iterator<std::string>( InFile ),
               std::istream_iterator<std::string>(),
               std::back_inserter( V ) );

    std::sort( V.begin(), V.end() );

    std::unique_copy( V.begin(), V.end(),
                      std::ostream_iterator<std::string>( OutFile, "\n" ) );

    return 0;
}

-- Copies.txt --
eggs
milk
bread
carrots
Jam
eggs
milk
bread
carrots
Jam
eggs
milk
bread
carrots
Jam
eggs
milk
bread
carrots
Jam

-- NewFile.txt --
Jam
bread
carrots
eggs
milk

Cheers.
Chris Val
 
Gianni Mariani

Materialised wrote:
....
There is also no requirement on the order of items, as the program will
make a system() call to sort at the end.

- on unix systems "sort -u" removes dupes. In fact, if you want the
most efficient way to do this i.e. sorting and removing dupes, it's best
to do it at the same time since you need to compare each entry to an
adjacent entry (hence easily identifying dupes).
 
Severian

Materialised said:
The file is simply a list of items; it could be anything. Let's just say
for this example it's a shopping list. It would be in the format:

eggs
milk
bread
carrots
Jam
..
....
..... etc

There is also no requirement on the order of items, as the program will
make a system() call to sort at the end.

I think a more efficient method, if you expect lots of duplicates, and
have enough memory to hold all items, would be:

Insert each word in a binary tree (if you require sorted output) or a
hash table (if you do not). When you find an item that already exists,
just ignore it.

This requires a single pass over the input data and a single output
pass.

- Sev
 
Leor Zolman

Gianni Mariani wrote:
...

- on unix systems "sort -u" removes dupes. In fact, if you want the
most efficient way to do this i.e. sorting and removing dupes, it's best
to do it at the same time since you need to compare each entry to an
adjacent entry (hence easily identifying dupes).

If you're gonna assume unix, it is even better to end the sort command
with
sort-command -o outputfile
instead of
sort-command > outputfile

to avoid a bunch of extra overhead. I think this option was provided
as a consequence of the fact that the sort command has to detect EOF
on its input before it can even begin to start writing output (for
obvious reasons); since the usual concurrency benefits of piping are
lost (i.e., putting a sort command into a pipeline is more of a
notational convenience than an actual pipeline), -o lets sort itself
write the final output so that the disk space required is only once
again the size of sort's input (for the output file), rather than the
twice again it would be (sort's temporary storage plus the actual
output file) using regular I/O redirection.

BTW, I'm not sure if the OP wanted a C or a C++ solution; since he
cross-posted to both the C and Learn C/C++ groups, I'd guess C. That
would make using system() the path of least resistance. If C++ is an
option, I'd think any reasonable STL-based solution (such as using
std::set) would run more efficiently than any way of using system();
plus, using std::set would allow for the creation of a custom
comparison function that could be used to tweak the sort criteria.
-leor

Leor Zolman
BD Software
(e-mail address removed)
www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
C++ users: Download BD Software's free STL Error Message
Decryptor at www.bdsoft.com/tools/stlfilt.html
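As a sketch of that last point, a std::set with a custom comparison can tweak the criteria; for instance, treating items that differ only in case (the list's "Jam" versus a lowercase "jam") as duplicates. The comparator below is a hypothetical illustration, not code from the thread:

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Orders strings without regard to case, so the set treats strings
// differing only in letter case as duplicates of each other.
struct NoCaseLess {
    bool operator()(const std::string& a, const std::string& b) const
    {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// Declared with the custom comparator: "Jam" and "jam" collide here.
std::set<std::string, NoCaseLess> items;
```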
 
Chris ( Val )

[snip]

My apologies to all - I didn't realise this
was cross posted.

Cheers.
Chris Val
 
Richard Heathfield

Materialised said:
Hi everyone,

What I want to do is copy a simple plain text file to
another file, but omitting duplicate items.

The way I thought of doing this involved copying all the items into an
array, then looping through that array looking for duplicates,
removing them, and then writing to another file.

This seems a very long and drawn-out way of doing this to me, and I
also do not know what the initial size of the array will need to be.

Could anyone suggest an effective, efficient way of doing this?

Since you say elsethread that you do not require the output to retain the
ordering given by the input, I suggest a hash table. Make sure you can
handle collisions. K&R2 provides a rudimentary hash table example on p145.

For a hashing algorithm, either use K&R's (p144!) or Chris Torek's (with the
33 multiplier rather than 31).

Pseudocode:

if input file opened okay
    while lines remain
        read a line
        hash it
        make sure the hash value is within bucket range
        if line does not already exist in that bucket
            add line to bucket
        endif
    endwhile
    close input file
endif
if output file opened okay
    for each bucket
        for each line in this bucket
            write line to output file
        endfor
    endfor
    close output file
endif

Turning this into C code is left as an exercise.
 
Malcolm

Materialised said:
eggs
milk
bread
carrots
Jam
..
....
..... etc

There is also no requirement on the order of items, as the program will
make a system() call to sort at the end.
Sounds like homework. The question is, what is your tutor looking for? Is he
looking for basic C skills (in which case he has chosen a bad question),
knowledge of C++ libraries, knowledge of algorithms, or practical skills in
knocking up programs quickly?

If the first, then read the lines into a huge array, then step through it
looking for duplicates. This algorithm is O(N^2), which is why it is a bad
choice for teaching basic C.

The trick is to store the items in some structure where they can be accessed
easily. This is either a hash table or a balanced search tree (eg a red
black tree). C++ stl provides a binary tree (set), so if the tutor is
looking for effective use of stl you should use this.

If the problem is knowledge of the algorithm, then stl is cheating. A hash
table is probably the easiest structure to implement yourself.

If the problem is to come up with a solution, then you want to take the
system tools route. This is the best answer, but cheating if the object is
to understand how to program rather than come up with the goods quickly.
 
CBFalconer

Richard said:
.... snip ...

For a hashing algorithm, either use K&R's (p144!) or Chris Torek's
(with the 33 multiplier rather than 31).

Why do you recommend this? I have used 31 and 37 in the past.
This is to hash strings consisting (in the main) of ascii chars.
Frankly I took Kernighan & Pike's word for it, also noting that 31
and 37 are prime, while 33 is not.
 
Richard Heathfield

CBFalconer said:
| Why do you recommend this?

Why do I recommend Chris Torek's hash? Because Chris says it works well
and I had no reason to disbelieve him; in my experience, he's right - it
does work well.

| I have used 31

So do K&R (and, as you can see above, I recommended that, too).

| and 37 in the past.
| This is to hash strings consisting (in the main) of ascii chars.
| Frankly I took Kernighan & Pike's word for it, also noting that 31
| and 37 are prime, while 33 is not.

Hmmm.

Kernighan and Pike vs Tisdale? K&P.
Kernighan and Pike vs Malbrain? K&P.

But Kernighan and Pike vs Torek is less clear-cut. :)
 
Materialised

Malcolm said:
| Sounds like homework.

Incorrect.

| The question is, what is your tutor looking for? Is he

What tutor? Who mentioned a tutor? There you go assuming again.
 
Richard Heathfield

Materialised said:
| Incorrect

Actually, it *does* sound like homework. It doesn't have to /be/ homework to
/sound/ like homework.

| What tutor? Who mentioned a tutor? There you go assuming again.

If you post in alt.comp.lang.learn.c-c++, it's not a terribly unreasonable
assumption, even if it turns out to be an inaccurate one.
 
Chris Torek

Note, it is not really "my" hash (I got it from James Gosling
many years ago -- this is the same James Gosling who is behind
Java, incidentally, but he was at CMU at the time). It uses the
remarkably simple recurrence:

unsigned int h;
...
h = 0;
for (all bytes in the string)
    h = h * 33 + this_byte;

or more concretely, for a C-style string pointed to by "cp":

for (h = 0; (c = *cp++) != 0;)
    h = h * 33 + c;

| Why do I recommend Chris Torek's hash? Because Chris says it
| works well and I had no reason to disbelieve him; in my experience,
| he's right - it does work well.

I think he meant "why 33" (instead of a prime number, like the
"more obvious" 31 and 37). I wonder the same thing. Gosling used
33, and I tried 31 and 37, and when others were doing larger-scale
experiments (for what eventually became the "Berkeley DB" library)
I suggested trying even more variations. Different datasets had
different outcomes, but on average 33 worked better than 31.

Note that this simple hash is less effective than either a CRC or
a strong cryptographic hash (both of those distribute the input
bits much better), but this one is very fast to compute. As it
turns out, for in-memory hashing, the distribution of the hash is
often less critical than the time required to compute it.
Multiplication by 31 and 33 are both quite quick on most CPUs, and
the additional bucket-search effort that occurs after this "less
effective" hash tends to use up less than the "saved time" as
compared to a more effective hash function.
 
CBFalconer

Chris said:
Note, it is not really "my" hash (I got it from James Gosling
many years ago -- this is the same James Gosling who is behind
Java, incidentally, but he was at CMU at the time). It uses the
remarkably simple recurrence:

unsigned int h;
...
h = 0;
for (all bytes in the string)
    h = h * 33 + this_byte;

or more concretely, for a C-style string pointed to by "cp":

for (h = 0; (c = *cp++) != 0;)
    h = h * 33 + c;



I think he meant "why 33" (instead of a prime number, like the
"more obvious" 31 and 37). I wonder the same thing. Gosling used
33, and I tried 31 and 37, and when others were doing larger-scale
experiments (for what eventually became the "Berkeley DB" library)
I suggested trying even more variations. Different datasets had
different outcomes, but on average 33 worked better than 31.

Note that this simple hash is less effective than either a CRC or
a strong cryptographic hash (both of those distribute the input
bits much better), but this one is very fast to compute. As it
turns out, for in-memory hashing, the distribution of the hash is
often less critical than the time required to compute it.
Multiplication by 31 and 33 are both quite quick on most CPUs, and
the additional bucket-search effort that occurs after this "less
effective" hash tends to use up less than the "saved time" as
compared to a more effective hash function.

All of 31, 33, and 37 make intuitive sense to me. 31 and 33 will
obviously be faster on machines without multiplication.

I have the obvious testbed in hashlib, which was originally built
to test some other hash functions, and then expanded. I shall
have to build a driver to run through those and list efficiencies
Real Soon Now. I could also implement the multiplications with
shift and add, to make a total of 5 useful hashes. I don't expect
the differences to be earth-shaking. The present hashlib
verification suite includes the awful hash function 1. That's one.
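The shift-and-add rewrites mentioned there fall out directly from 33 = 32 + 1 and 31 = 32 - 1; one step of each recurrence might be sketched as follows (helper names are hypothetical):

```cpp
// One step of each hash recurrence without a multiply instruction:
// h * 33 == (h << 5) + h, and h * 31 == (h << 5) - h.
unsigned step33(unsigned h, unsigned char c) { return (h << 5) + h + c; }
unsigned step31(unsigned h, unsigned char c) { return (h << 5) - h + c; }
```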
 
Keith Bostic

While the function works well for strings, it doesn't work well
for numbers (for example, indexing on record numbers), in my
experience. Berkeley DB ended up using this:

/*
 * Fowler/Noll/Vo hash
 *
 * The basis of the hash algorithm was taken from an idea sent by email to the
 * IEEE Posix P1003.2 mailing list from Phong Vo ([email protected]) and
 * Glenn Fowler ([email protected]). Landon Curt Noll ([email protected])
 * later improved on their algorithm.
 *
 * The magic is in the interesting relationship between the special prime
 * 16777619 (2^24 + 403) and 2^32 and 2^8.
 *
 * This hash produces the fewest collisions of any function that we've seen so
 * far, and works well on both numbers and strings.
 *
 * PUBLIC: u_int32_t __ham_func5 __P((DB *, const void *, u_int32_t));
 */
u_int32_t
__ham_func5(dbp, key, len)
    DB *dbp;
    const void *key;
    u_int32_t len;
{
    const u_int8_t *k, *e;
    u_int32_t h;

    if (dbp != NULL)
        COMPQUIET(dbp, NULL);

    k = key;
    e = k + len;
    for (h = 0; k < e; ++k) {
        h *= 16777619;
        h ^= *k;
    }
    return (h);
}

There are a variety of hash functions Berkeley DB has tried
at various times -- the ones we've liked best we've kept:

http://www.opensource.apple.com/darwinsource/10.3/BerkeleyDB-6/db/hash/hash_func.c

Regards,
--keith

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic (e-mail address removed)
Sleepycat Software Inc. keithbosticim (ymsgid)
118 Tower Rd. +1-781-259-3139
Lincoln, MA 01773 http://www.sleepycat.com
 
