Deleting first N lines from a text file


pozz

I want to delete the first N lines from a text file. I imagine two
approaches:
- use a temporary file and copy only the remaining lines into it
- use the same file and move the characters starting at line N+1 to the
beginning

The temporary file approach could be more complex to write (at the end
I have to delete the original file and rename the temporary file), but
at any moment I have a coherent text file on disk. So this approach is
safe if the application crashes during the deleting process. If the
application crashes just after deleting the original text file but
before renaming the temporary file, during initialization I can detect
this situation and proceed with the renaming.

The second approach is simpler, but leaves a malformed text file on
the filesystem if the application crashes during the deleting process.

What do you think about those thoughts? Do you agree with me?

My "deleting first N lines" function is:

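#include <stdio.h>

/* filename and tmpfilename are defined elsewhere in the application. */
int text_delete(unsigned int N) {
    FILE *f;
    FILE *ftmp;
    int c;
    int failed;

    f = fopen(filename, "rt");
    if (f == NULL) return -1;
    ftmp = fopen(tmpfilename, "wt");
    if (ftmp == NULL) {
        fclose(f);    /* don't leak the first stream */
        return -1;
    }
    /* Skip the first N lines; N == 0 skips nothing. */
    while (N > 0 && (c = fgetc(f)) != EOF) {
        if (c == '\n') --N;
    }
    /* Copy the remaining characters to the temporary file. */
    while ((c = fgetc(f)) != EOF) {
        if (fputc(c, ftmp) == EOF) break;
    }
    failed = ferror(f) || ferror(ftmp);
    fclose(f);
    if (fclose(ftmp) == EOF) failed = 1;
    if (failed) {
        remove(tmpfilename);    /* never replace the original with a partial copy */
        return -1;
    }
    if (remove(filename) != 0) return -1;
    if (rename(tmpfilename, filename) != 0) return -1;
    return 0;
}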

At initialization I try to open the text file or the temporary file:

int text_init(void) {
    FILE *f;

    f = fopen(filename, "rt");
    if (f == NULL) {
        /* Does the temporary file exist? */
        f = fopen(tmpfilename, "rt");
        if (f != NULL) {
            /* Yes! Recover the temporary file. */
            fclose(f);
            if (rename(tmpfilename, filename) != 0) return -1;
        } else {
            /* Create an empty log file... */
            f = fopen(filename, "wt");
            if (f == NULL) return -1;
            fclose(f);
        }
    } else {
        fclose(f);
    }
    return 0;
}
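
A side note on the remove-then-rename step, as a minimal sketch rather than a drop-in replacement: on POSIX systems (Linux included) rename() atomically replaces an existing destination, so the separate remove() and the crash window it opens are not needed there. ISO C leaves the behavior implementation-defined when the destination already exists, so this relies on POSIX. The helper name replace_with_tmp is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: publish tmp_path under the name final_path.
 * Assumes POSIX semantics, where rename() atomically replaces an
 * existing final_path. ISO C alone leaves that case
 * implementation-defined, so this is not portable standard C. */
int replace_with_tmp(const char *tmp_path, const char *final_path)
{
    assert(tmp_path != NULL && final_path != NULL);
    if (rename(tmp_path, final_path) != 0) {
        perror("rename");
        return -1;
    }
    return 0;
}
```

With this, text_delete() could end with replace_with_tmp(tmpfilename, filename) in place of the remove()/rename() pair, and a reader would always see either the old file or the new one.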
 
Roberto Waltman

pozz said:
I want to delete the first N lines from a file text.
...
The second approach is simpler,...
...
What do you think about those thoughts?

Only that the second approach is not simpler.
Also, depending on the underlying OS, it may not be possible to read
from and write to the same file as you propose.
 
Ben Pfaff

Acid Washed China Blue Jeans said:
Fopen with "r+". If fopen succeeds, the library has promised
you you are allowed to read and write an existing file.

However, writing in a text file may truncate it, see 7.19.3
"Files":

Whether a write on a text stream causes the associated file
to be truncated beyond that point is implementation-defined.
 
Roberto Waltman

Acid said:
Fopen with "r+". If fopen succeeds, the library has promised you you are allowed
to read and write an existing file.

In the general case, a write may truncate the file at the end of the
written data, so it may be OK to read from a location before the last
location written, but not after it.

And there may be environments in which fopen(..., "r+") always fails.
 
Eric Sosman

I want to delete the first N lines from a file text. I imagine two
approaches:
- use a temporary file to copy the last lines only

Do this.

- use the same file to move characters starting from N+1 line to the
beginning

Don't do this.

The temporary file could be more complex to write (at last I have to
delete the original file and rename the temporary file), but at any
moment I have a coherent text file. So this approach is safe if the
application crashes during the deleting process. If the application
crashes just after deleting the original text file but before renaming
the temporary file, during initialization I can detect this situation
and proceed with the renaming.

The second approach is simpler, but leaves a malformed text file on
the filesystem if the application crashes during the deleting process.

What do you think about those thoughts? Do you agree with me?

No, not at all. One problem with your supposedly simpler
solution: How do you tell subsequent readers of the file that they
should stop before reaching the end? Observe that <stdio.h> offers
no way to shorten an existing file to any length other than zero.
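
To make that point concrete, here is a minimal sketch of the in-place approach. It assumes a POSIX system, because the final step needs ftruncate() and fileno(), which are not part of <stdio.h> or ISO C. The function name delete_lines_in_place and the 4096-byte buffer are arbitrary choices for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>   /* ftruncate: POSIX, not ISO C */

/* Hypothetical sketch: slide everything after the first n lines to the
 * start of the file, then cut off the stale tail. A crash in the middle
 * leaves a corrupted file, which is the risk discussed above. */
int delete_lines_in_place(const char *path, unsigned int n)
{
    char buf[4096];
    long rd, wr = 0;
    size_t got;
    int c, ok = 1;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r+b");
    if (f == NULL) return -1;

    /* Find the offset of the first byte to keep. */
    while (n > 0 && (c = fgetc(f)) != EOF) {
        if (c == '\n') --n;
    }
    rd = ftell(f);
    if (rd < 0) { fclose(f); return -1; }

    /* Move the data block by block. On an update stream a seek is
     * required when switching between reading and writing. */
    for (;;) {
        if (fseek(f, rd, SEEK_SET) != 0) { ok = 0; break; }
        got = fread(buf, 1, sizeof buf, f);
        if (got == 0) { ok = !ferror(f); break; }
        if (fseek(f, wr, SEEK_SET) != 0 ||
            fwrite(buf, 1, got, f) != got) { ok = 0; break; }
        rd += (long)got;
        wr += (long)got;
    }
    /* Without this step the old tail of the file would still be there. */
    if (ok && (fseek(f, wr, SEEK_SET) != 0 || fflush(f) != 0 ||
               ftruncate(fileno(f), wr) != 0)) {
        ok = 0;
    }
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

Everything up to the ftruncate() call is plain stdio. The truncation is the part standard C cannot express, and it is also the point where a crash or an I/O error leaves the file in a state no reader can interpret.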
 
jacob navia

Using the containers library (and if your file fits in memory)

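#include <stdio.h>
#include <stdlib.h>
#include <containers.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        printf("Usage: deletelines <file> <N>\n");
        return -1;
    }
    /* Load every line of the file into a string collection. */
    strCollection *data = istrCollection.CreateFromFile(argv[1]);
    if (data == NULL) return -1;
    /* Drop the first N lines, then write the rest back over the file. */
    istrCollection.RemoveRange(data, 0, atoi(argv[2]));
    istrCollection.WriteToFile(data, argv[1]);
    istrCollection.Finalize(data);
    return 0;
}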
 
Giuseppe

     No, not at all.  One problem with your supposedly simpler
solution: How do you tell subsequent readers of the file that they
should stop before reaching the end?  Observe that <stdio.h> offers
no way to shorten an existing file to any length other than zero.

Ok, I implemented the "temporary file" solution and it works well. The
only disadvantage is time: when the file is big (1000 lines of about
50 bytes each), the time to delete the first line could be very high.

Do you think the process could be sped up by launching an external
script (for example, one based on 'head') with system()? If I redirect
the output to the original filename I could avoid the time-consuming
process of copying the original to the temporary file.
 
Keith Thompson

Giuseppe said:
Ok, I implemented the "temporary file" solution and it works well.
The only disadvantage is time: when the file is big (1000 lines of
about 50 bytes each), the time to delete the first line could be very
high.

A text file of 1000 lines of 50 bytes each really isn't all that big.
The time to copy and rename it probably won't even be noticeable.

Do you think the process could be reduced launching an external script
(for example, 'head' based) with system()? If I redirect the output
to the original filename I could avoid the time consuming process of
copying the original to the temporary file.

The behavior of external programs is outside the scope of the C language.

(But I'll mention that on Unix-like systems, running a command with its
input and output directed to the same file can cause serious problems;
it can easily end up reading a partially modified version of the file
instead of the original. And even if it works, it's likely going to be
doing the same thing you would have done in your program.)
 
Eric Sosman

Ok, I implemented the "temporary file" solution and it works well.
The only disadvantage is time: when the file is big (1000 lines of
about 50 bytes each), the time to delete the first line could be very
high.

Fifty K shouldn't take long. Even on a system from forty years
ago it didn't take long. Even on paper tape, for goodness' sake, it
took less than a minute!

For "really big" files (terabytes) copying most of the file from
one place to another could take an unacceptably long time. Also, the
need to find space for a second nearly complete copy could be
troublesome. In such cases you'd be justified in seeking fancier
solutions -- but I sincerely doubt that "slide all those terabytes
a couple hundred positions leftward" would produce a savings. More
likely it would produce a slowdown, plus the risks you've already
mentioned about data loss in the event of an error. No, the fancier
solution would probably involve some kind of an index external to the
file, describing which parts of the file were "live" and which "dead,"
and fancier routines to read just the live parts.
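
One very small form of such an external index, as a sketch under stated assumptions: a sidecar file holds the byte offset of the first "live" line, deleting lines only advances that offset, and readers seek past the dead part. The names open_live, mark_lines_dead and the sidecar file itself are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read the offset of the first live byte; 0 if no index exists yet. */
static long read_live_offset(const char *idx_path)
{
    long off = 0;
    FILE *idx = fopen(idx_path, "r");
    if (idx != NULL) {
        if (fscanf(idx, "%ld", &off) != 1 || off < 0) off = 0;
        fclose(idx);
    }
    return off;
}

/* Open the data file positioned at the first live line. */
FILE *open_live(const char *path, const char *idx_path)
{
    FILE *f;
    assert(path != NULL && idx_path != NULL);
    f = fopen(path, "rb");
    if (f != NULL && fseek(f, read_live_offset(idx_path), SEEK_SET) != 0) {
        fclose(f);
        f = NULL;
    }
    return f;
}

/* Mark the first n live lines dead by advancing the stored offset.
 * The data file itself is never rewritten. */
int mark_lines_dead(const char *path, const char *idx_path, unsigned int n)
{
    int c;
    long off;
    FILE *idx;
    FILE *f = open_live(path, idx_path);
    if (f == NULL) return -1;
    while (n > 0 && (c = fgetc(f)) != EOF) {
        if (c == '\n') --n;
    }
    off = ftell(f);
    fclose(f);
    if (off < 0) return -1;
    idx = fopen(idx_path, "w");
    if (idx == NULL) return -1;
    fprintf(idx, "%ld\n", off);
    return fclose(idx) == 0 ? 0 : -1;
}
```

The data file never shrinks under this scheme, so a real design would also need an occasional compaction pass, and the index update itself would want the same temporary-file-and-rename care as the original problem.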

Do you think the process could be reduced launching an external script
(for example, 'head' based) with system()? If I redirect the output to
the original filename I could avoid the time consuming process of
copying the original to the temporary file.

First, just what do you imagine the "head" program does, hmmm?

However, on the systems I've encountered that provide a "head"
utility and support "redirection," your solution is likely to run
very quickly indeed. And save a lot of disk space, too! (Hint:
Try it yourself: `head <foo.txt >foo.txt', then `ls -l foo.txt',
and then you get to test your backups ...)

But all this is mostly beside the point. You are worried about
the time to copy 50K bytes: Have you *measured* the time? Have you
actually found it to be a problem for your application? Or are you
just imagining monsters under your bed? The fundamental theorem of
all optimization is There Are No Monsters Until You've Measured Them.
 
pozz

A text file of 1000 lines of 50 bytes each really isn't all that big.
The time to copy and rename it probably won't even be noticeable.

It takes about 100 ms to finish the shrink procedure. That's not a long
time on a desktop PC, but I'm working on an embedded Linux system based
on an ARM9 processor.

This is the slowest part of my application. Anyway, I'm wondering
whether there are some simple improvements that would reduce the time
taken by this task.

The behavior of external program is outside the scope of the C language.

Oh, I know; I was asking for an "off-topic" opinion :)

(But I'll mention that on Unix-like systems, running a command with its
input and output directed to the same file can cause serious problems;
it can easily end up reading a partially modified version of the file
instead of the original.  And even if it works, it's likely going to be
doing the same thing you would have done in your program.)

OK, I won't try it.
 
pozz

     Fifty K shouldn't take long.  Even on a system from forty years
ago it didn't take long.  Even on paper tape, for goodness' sake, it
took less than a minute!

100 ms (see my answer to Keith above). It's not too much, but I was
thinking about improvements.
 
Phil Carmody

Acid Washed China Blue Jeans said:
Fopen with "r+". If fopen succeeds, the library has promised you you are allowed
to read and write an existing file.

Being allowed to write to it at the point that you open the file
doesn't mean that it's possible to write to the file at any point
later in time.

Think wire-cutters.

Phil
 
jgharston

pozz said:
It takes about 100ms to finish the shrink procedure.  It's not a long
time on a desktop PC, but I'm working on embedded Linux based on ARM9
processor.

Are you doing it byte by byte? Try buffering it, even chunks of
16 bytes at a time will speed it up significantly. What's the
biggest chunk of memory you can claim, use, release without
memory fragmentation impacting your program more than acceptably?

JGH
 
-.-

jacob navia was trying to save the world with his stuff:
Using the containers library (and if your file fits in memory)

#include <containers.h>

You self-celebrating fucko. There only exist your things to you:
that silly lcc-win and your funny containers.
Stop making this newsgroup your personal advertisements page.
 
jacob navia

On 16/11/11 14:01, -.- wrote:
jacob navia was trying to save the world with his stuff:


You self-celebrating fucko.

That is why you hide behind a pseudonym, because you have the courage of
your opinions...
 
BartC

(That's a fast paper tape reader. The last one I used would have taken
nearly 3 hours.)

100ms (see my answer to Keith above). It's not too much, but I was
thinking about improvements.

How long for a file containing ten lines instead of 1000? How long for
double the number of lines?

That will tell you the overheads involved and the fastest speed achievable.

While you're at it, how long does it take to create a file, write 50,000
bytes to it (of anything) and close it? And how long to read such a file?

Take care when taking measurements, to eliminate the effects of
disk-caching.
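
A minimal sketch of such a measurement, assuming a POSIX system for clock_gettime(). The helper names make_test_file and time_copy_ms are hypothetical, and the copy loop is the same fgetc/fputc loop as in the original function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>   /* clock_gettime: POSIX */

/* Write `lines` lines of 50 bytes each (49 characters plus newline). */
int make_test_file(const char *path, unsigned int lines)
{
    unsigned int i;
    FILE *f = fopen(path, "w");
    if (f == NULL) return -1;
    for (i = 0; i < lines; i++) {
        fprintf(f, "%049u\n", i);
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Copy src to dst one character at a time and return the elapsed
 * wall-clock time in milliseconds, or -1.0 on error. */
double time_copy_ms(const char *src, const char *dst)
{
    struct timespec t0, t1;
    int c;
    FILE *in, *out;

    assert(src != NULL && dst != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    in = fopen(src, "r");
    out = fopen(dst, "w");
    if (in == NULL || out == NULL) {
        if (in != NULL) fclose(in);
        if (out != NULL) fclose(out);
        return -1.0;
    }
    while ((c = fgetc(in)) != EOF) fputc(c, out);
    fclose(in);
    fclose(out);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Running time_copy_ms() on files made with 10, 1000 and 2000 lines separates the fixed overhead (open, create, close) from the per-byte cost. Note that fclose() only hands the data to the operating system, so the figures include whatever caching the system does.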
 
jgharston

Try replacing:
        while((c = fgetc(f)) != EOF) {
                fputc(c, ftmp);
        }

with:
        bsize = m_free(0);
        buff = m_alloc(bsize);
        numread = -1;

        while (numread) {
                numread = fread(buff, 1, bsize, f);
                fwrite(buff, 1, numread, ftmp);
        }
        m_free(buff);

As with usenet tradition, completely untested.

JGH
 
jgharston

jgharston said:
        bsize=m_free(0);
        buff=m_alloc(bsize);

Following up my own post: that call to m_free(0) is supposed to
return the size of a free block that can subsequently be claimed
with m_alloc(). A quick skim through the web shows that this
functionality isn't in any of the malloc libraries documented
there. All I can say is it worked 25 years ago, and it inspired
me to include that functionality in my own malloc library.

Just replace bsize=m_free(0) with a suitable bsize=(some
method of deciding an amount of memory to claim).
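
For reference, the same loop with only standard calls might look like the sketch below, with malloc()/free() standing in for m_alloc()/m_free() and the buffer size left to the caller. The name copy_rest is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Copy the rest of `in` to `out` through a heap buffer of bufsize bytes.
 * Returns 0 on success, -1 on allocation or I/O failure. */
int copy_rest(FILE *in, FILE *out, size_t bufsize)
{
    size_t numread;
    int ok = 1;
    char *buff;

    assert(in != NULL && out != NULL && bufsize > 0);
    buff = malloc(bufsize);
    if (buff == NULL) return -1;
    while ((numread = fread(buff, 1, bufsize, in)) > 0) {
        if (fwrite(buff, 1, numread, out) != numread) {
            ok = 0;
            break;
        }
    }
    if (ferror(in)) ok = 0;
    free(buff);
    return ok ? 0 : -1;
}
```

In text_delete() this would replace the second while loop, for example copy_rest(f, ftmp, 4096). Whether it is actually faster than the fgetc/fputc loop on a given system is something to measure rather than assume.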

JGH
 
Keith Thompson

jgharston said:
Try replacing:

with:
bsize=m_free(0);
buff=m_alloc(bsize);
numread=-1;

while(numread) {
numread=fread(buff,1,bsize,f);
fwrite(buff,1,numread,ftmp);
}
m_free(buff);

As with usenet tradition, completely untested.

Leaving aside the m_free and m_alloc calls, why do you assume that this
will be significantly faster than the fgetc/fputc loop? stdio does its
own buffering.
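
If the size of stdio's buffer is the concern, it can be set explicitly with setvbuf() while keeping the fgetc/fputc loop. A minimal sketch, with a hypothetical function name and a caller-chosen buffer size; the buffers are supplied explicitly because an implementation is allowed to ignore the size argument when no buffer is passed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Copy src to dst with fgetc/fputc, giving each stream its own fully
 * buffered area of bufsize bytes. Returns 0 on success, -1 on failure. */
int copy_with_stdio_buffers(const char *src, const char *dst, size_t bufsize)
{
    FILE *in, *out;
    char *ibuf, *obuf;
    int c, ok = 0;

    assert(src != NULL && dst != NULL && bufsize > 0);
    in = fopen(src, "r");
    out = fopen(dst, "w");
    ibuf = malloc(bufsize);
    obuf = malloc(bufsize);
    /* setvbuf must be called before any other operation on the stream. */
    if (in != NULL && out != NULL && ibuf != NULL && obuf != NULL &&
        setvbuf(in, ibuf, _IOFBF, bufsize) == 0 &&
        setvbuf(out, obuf, _IOFBF, bufsize) == 0) {
        ok = 1;
        while ((c = fgetc(in)) != EOF) {
            if (fputc(c, out) == EOF) {
                ok = 0;
                break;
            }
        }
        if (ferror(in)) ok = 0;
    }
    if (in != NULL) fclose(in);
    if (out != NULL && fclose(out) != 0) ok = 0;
    /* The buffers must outlive the streams, so free them only now. */
    free(ibuf);
    free(obuf);
    return ok ? 0 : -1;
}
```

Timing this against the default buffering with a few different sizes would show directly how much of the 100 ms is down to buffer size.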
 
jgharston

Keith said:
Leaving aside the m_free and m_alloc calls, why do you assume that this
will be significantly faster than the fgetc/fputc loop?  stdio does its
own buffering.

As I recall, this was a standard exam question back when I worra
litt'un. When doing bulk data copying, a program buffer is likely to be
bigger than stdio's buffer, and bulk read/write/read/write is more
efficient for simply chucking large lumps of data from one place to
another, one gain being that it skips fgetc's unget functionality.

JGH
 
