File Read Progress Indicator

Marcus Kwok

I am working on a program that reads and processes large text files (on
the order of 32 MB, so not too huge), so I wanted to add a progress
indicator so I can estimate when it will finish. I just need an
estimate, so the exact byte count isn't essential.

// reduced code
// assume necessary #include's and using declarations for std
// components

ifstream file(filename.c_str());

// read 2 header lines
for (int i = 0; i != 2; ++i) {
    string header;
    getline(file, header);
}

ifstream::pos_type start_of_data = file.tellg();

file.seekg(0, ios::end);
ifstream::pos_type end_of_data = file.tellg();

file.seekg(start_of_data);
for (string line; getline(file, line); ) {
    do_something_with(line);

    int percent_done =
        static_cast<unsigned long>(file.tellg()) * 100 / end_of_data;

    cout << percent_done << "%\n";
}

This outline seems to work well. My question is: is the cast from the
return type of ifstream::tellg() to unsigned long well-defined? The
reason I am casting to an unsigned type in the first place is that
without the cast, eventually negative percents were being displayed.

Also, are there any other issues with my usage of tellg()? I remember
reading somewhere that the result of tellg() isn't guaranteed to be able
to represent any valid filesize, but I don't know if there is any way
around this issue using only standard components.
 
Michael

Marcus Kwok said:
int percent_done =
    static_cast<unsigned long>(file.tellg()) * 100 / end_of_data;

This outline seems to work well. My question is: is the cast from the
return type of ifstream::tellg() to unsigned long well-defined? The
reason I am casting to an unsigned type in the first place is that
without the cast, eventually negative percents were being displayed.

It seems that the negative values you're seeing appear when the
arithmetic exceeds the largest value the integer type can hold;
specifically, when file.tellg() * 100 exceeds that limit. You should
probably do this calculation in doubles, then convert back to int at
the end.

How many bits does unsigned long have on your system? If it's 64,
then ignore the previous paragraph, as you're very unlikely to be
hitting that limit.
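
To make the arithmetic concrete, here is a small sketch of where the multiplication by 100 stops fitting in 32 bits. It assumes C++11 (for <cstdint>) and positions counted in bytes; the helper names are made up for illustration and do not come from the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helpers (not from the thread): report whether position * 100
// still fits in a 32-bit integer. The product is formed in 64 bits, so the
// check itself cannot overflow for any realistic file position.

inline bool times100_fits_signed32(std::int64_t position) {
    assert(position >= 0);
    return position * 100 <= std::numeric_limits<std::int32_t>::max();
}

inline bool times100_fits_unsigned32(std::int64_t position) {
    assert(position >= 0);
    return position * 100 <= std::numeric_limits<std::uint32_t>::max();
}
```

For a position of 33,554,432 bytes (32 MB) the product is 3,355,443,200. That is more than 2,147,483,647 but less than 4,294,967,295, which would be consistent with the symptom described: a signed 32-bit calculation wraps to a negative value, while a 32-bit unsigned long keeps working until the position passes about 42.9 million bytes.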

Michael
 
James Kanze

Marcus Kwok said:
This outline seems to work well. My question is: is the cast from the
return type of ifstream::tellg() to unsigned long well-defined?

No. First, the return type is a streampos, which may not even
be convertible to an integral type. Second, even when it is
convertible, there is not necessarily a direct relationship
between the numeric value and the number of bytes in the file.
Third, even on systems where there is an exact relationship
(Unix) or a more or less rough relationship (Windows),
unsigned long is generally not large enough. (Unix defines a
special type, off_t, for this; Microsoft uses the 64-bit
LARGE_INTEGER type.) If you're sure that the files can never be more
than, say, 100 MB, then this is not necessarily a consideration.

Marcus Kwok said:
The reason I am casting to an unsigned type in the first place is that
without the cast, eventually negative percents were being displayed.

Overflow. The length of a file often doesn't fit into a long to
begin with, and then you go ahead and multiply it by 100. Since
you're interested in per cent, and exact precision isn't an
issue, I'd cast it to double, and use floating point arithmetic.
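
As a rough illustration of the floating-point approach, the percentage could be computed along these lines. The function name is hypothetical, and the sketch assumes that stream offsets are byte counts, which is not guaranteed by the standard.

```cpp
#include <cassert>
#include <ios>

// Hypothetical helper (not from the thread): percentage of the file read.
// Assumes both arguments are byte offsets, which holds on Unix and
// approximately on Windows.
inline int percent_done(std::streamoff current, std::streamoff end) {
    assert(end > 0);
    if (current < 0) {
        // tellg() signals failure with -1; this sketch treats that as "done".
        return 100;
    }
    // Multiply in double so the factor of 100 cannot overflow an integer.
    double percent = 100.0 * static_cast<double>(current)
                           / static_cast<double>(end);
    return static_cast<int>(percent);
}
```

In the loop from the question this might be called as percent_done(file.tellg(), end_of_data), relying on the streampos-to-streamoff conversion. Treating a failed tellg() as 100% is a design choice of this sketch, not something the thread prescribes.
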

Marcus Kwok said:
Also, are there any other issues with my usage of tellg()? I remember
reading somewhere that the result of tellg() isn't guaranteed to be able
to represent any valid filesize, but I don't know if there is any way
around this issue using only standard components.

There's no real solution if you want to remain 100% standard,
because there are real systems where what you want simply isn't
possible. If you're willing to limit portability to Windows and
Unix, however, converting the results of tellg() to double, and
using it, should work. (The results may be off by a couple of
percent under Windows, but typically, the error will be more or
less the same for each call, so your calculations of per cent
will probably end up more precise than expected. Supposing that
the file has more or less homogeneous contents, at least.)
 
popyart

No. First, the return type is a streampos, which may not even
be convertible to an integral type. Second, even when it is
convertible, there is not necessarily a direct relationship
between the numeric value and the number of bytes in the file.
Third, even on systems where there is an exact relationship
(Unix) or a more or less rough relationship (Windows),
unsigned long is generally not large enough. (Unix defines a
special type, off_t, for this; Microsoft uses the 64-bit
LARGE_INTEGER type.) If you're sure that the files can never be more
than, say, 100 MB, then this is not necessarily a consideration.


Overflow. The length of a file often doesn't fit into a long to
begin with, and then you go ahead and multiply it by 100. Since
you're interested in per cent, and exact precision isn't an
issue, I'd cast it to double, and use floating point arithmetic.


There's no real solution if you want to remain 100% standard,
because there are real systems where what you want simply isn't
possible. If you're willing to limit portability to Windows and
Unix, however, converting the results of tellg() to double, and
using it, should work. (The results may be off by a couple of
percent under Windows, but typically, the error will be more or
less the same for each call, so your calculations of per cent
will probably end up more precise than expected. Supposing that
the file has more or less homogeneous contents, at least.)

--
James Kanze (Gabi Software) email: (e-mail address removed)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
 
Marcus Kwok

Michael said:
It seems that the negative values you're seeing appear when the
arithmetic exceeds the largest value the integer type can hold;
specifically, when file.tellg() * 100 exceeds that limit. You should
probably do this calculation in doubles, then convert back to int at
the end.

Thanks, that's the same advice James Kanze gave as well.

Michael said:
How many bits does unsigned long have on your system? If it's 64,
then ignore the previous paragraph, as you're very unlikely to be
hitting that limit.

sizeof(unsigned long) * CHAR_BIT = 32 on my platform (Windows XP, VS
2005).
 
Marcus Kwok

James Kanze said:
No. First, the return type is a streampos, which may not even
be convertible to an integral type. Second, even when it is
convertible, there is not necessarily a direct relationship
between the numeric value and the number of bytes in the file.
Third, even on systems where there is an exact relationship
(Unix) or a more or less rough relationship (Windows),
unsigned long is generally not large enough. (Unix defines a
special type, off_t, for this; Microsoft uses the 64-bit
LARGE_INTEGER type.) If you're sure that the files can never be more
than, say, 100 MB, then this is not necessarily a consideration.


Overflow. The length of a file often doesn't fit into a long to
begin with, and then you go ahead and multiply it by 100. Since
you're interested in per cent, and exact precision isn't an
issue, I'd cast it to double, and use floating point arithmetic.

Thanks, I think I'll go this route.

As an aside, is the conversion from streampos to double well-defined,
or will it just work in practice? Right now it only needs to work on
Windows, but we may use it on HP-UX in the future.

James Kanze said:
There's no real solution if you want to remain 100% standard,
because there are real systems where what you want simply isn't
possible. If you're willing to limit portability to Windows and
Unix, however, converting the results of tellg() to double, and
using it, should work.

I see, so I guess this answers my above question :)
 
James Kanze

Marcus Kwok said:
Thanks, I think I'll go this route.

As an aside, is the conversion from streampos to double well-defined,
or will it just work in practice? Right now it only needs to work on
Windows, but we may use it on HP-UX in the future.

First, it's not defined at all; there is (in the standard) no
direct conversion from streampos to an arithmetic type. There
is an implicit conversion from streampos to streamoff, however,
and streamoff is required to be convertible to an integral type;
in most implementations, streamoff is in fact a typedef of an
integral type. If streamoff is a typedef of an integral type,
streampos will convert implicitly to any arithmetic type; if it
is a user-defined type, you'll need some explicit conversion in
there somewhere.
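
The two-step conversion described above can be spelled out explicitly, which also documents the assumption being made. This is only a sketch with a made-up function name; whether the second step compiles depends on the implementation's definition of streamoff.

```cpp
#include <cassert>
#include <ios>

// Hypothetical helper (not from the thread): convert a stream position to
// double in two explicit steps.
inline double position_as_double(std::streampos position) {
    // Step 1: streampos -> streamoff, the conversion the standard provides.
    std::streamoff offset = position;
    assert(offset >= 0);  // -1 would mean tellg() failed
    // Step 2: streamoff -> double. The cast compiles when streamoff is an
    // integral typedef, which is an assumption about the implementation.
    return static_cast<double>(offset);
}
```

The numeric result is only meaningful as a byte count on systems where stream offsets are byte counts.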

More significantly, of course, the semantics of the conversion
are more or less undefined; there is a set of operations which
are required to work, but there's nothing to stop the resulting
integral type from being a magic number, or (more likely) some
formatted representation, with different bits having different
meanings.

In practice, of course: under Unix or Windows, streamoff will be
an integral type, and it will represent the number of bytes at
the system level from the start of the file. Under Unix, this
means exactly the number of bytes that you read; under Windows,
the number may be slightly higher, but perfectly adequate for
things like a progress bar. This solution typically won't work
on mainframes, but then, mainframes don't usually have the sort
of terminals attached to them where a running indication of
progress would make sense. (And they're different enough from
Unix/Windows that there are probably other things in your code
which would require fixing.)
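
Putting the pieces together, a progress loop restricted to Unix and Windows might look roughly like the following. The function names are hypothetical, percentages are collected in a vector rather than printed so the sketch is easy to try, and the byte-count interpretation of stream offsets is an assumption.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch (not from the thread): read every remaining line of
// `in` and record the percentage reached after each one. Assumes stream
// offsets are byte counts, as on Unix and (approximately) Windows.
inline std::vector<int> read_with_progress(std::istream& in) {
    const std::streamoff start = in.tellg();
    in.seekg(0, std::ios::end);
    const std::streamoff end = in.tellg();
    in.seekg(start);
    assert(start >= 0 && end >= start);

    std::vector<int> percents;
    for (std::string line; std::getline(in, line); ) {
        // do_something_with(line) would go here.
        std::streamoff here = in.tellg();
        if (here < 0) {
            here = end;  // tellg() can fail once the stream has hit end of file
        }
        percents.push_back(static_cast<int>(100.0 * here / end));
    }
    return percents;
}

// Convenience wrapper so the sketch can be tried on an in-memory string.
inline std::vector<int> progress_for_text(const std::string& text) {
    std::istringstream in(text);
    return read_with_progress(in);
}
```

The check for a negative position matters for files whose last line has no trailing newline: getline() then sets eofbit, and a following tellg() may report failure rather than the end position.
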

Marcus Kwok said:
I see, so I guess this answers my above question :)

Yes. And Windows and Unix (which includes Mac) is a pretty
large world; I'd say that if you're concerned about a user
sitting in front of a terminal, they pretty much cover that
environment. (Today---this wasn't always true, and even today,
you might run into a legacy system here and there. But if you
don't already have one, your company isn't going to go out and
acquire one in the future.)
 
Marcus Kwok

James Kanze said:
First, it's not defined at all; there is (in the standard) no
direct conversion from streampos to an arithmetic type. There
is an implicit conversion from streampos to streamoff, however,
and streamoff is required to be convertible to an integral type;
in most implementations, streamoff is in fact a typedef of an
integral type. If streamoff is a typedef of an integral type,
streampos will convert implicitly to any arithmetic type; if it
is a user-defined type, you'll need some explicit conversion in
there somewhere.

Thanks. The conversion from streampos to double works for me, today, on
my current platform :)

[snip rest]
 
