Handling large text streams of integers

Comp1597

Suppose infile is an ifstream bound to a text file containing
integers. The operation infile >> x1; reads integers into the
stream. How many times can this operation be done before there are
possible problems? I know this question isn't exactly platform
independent but are there guidelines that can be given for writing
portable code?

How does the number of such operations (before using another stream)
compare with std::numeric_limits<std::streamsize>::max() ?

Does the above constant denote the max number of integers that can be
read into the stream? Or is the bound in terms of the number of
characters? For example, should the bound be halved if every integer
has exactly two digits?

Thanks in advance.
 
Victor Bazarov

> Suppose infile is an ifstream bound to a text file containing
> integers. The operation infile >> x1; reads integers into the
> stream. How many times can this operation be done before there are
> possible problems?

Problems? What kind of problems? In a working program the extraction
operation can be done an unlimited number of times.
> I know this question isn't exactly platform
> independent but are there guidelines that can be given for writing
> portable code?

The expression

infile >> x1;

is quite portable. Why are you concerned? Do you know something I don't?
> How does the number of such operations (before using another stream)
> compare with std::numeric_limits<std::streamsize>::max() ?

8-O Huh? It's greater.
> Does the above constant denote the max number of integers that can be
> read into the stream?

Integers are not "read into the stream". They are extracted from the
stream buffer into a variable. The buffer is replenished from the
source (whatever that might be) periodically, but how often or when
is not defined. Unless you're providing your own implementation of the
stream buffer, you shouldn't concern yourself with that.
> Or is the bound in terms of the number of
> characters? For example, should the bound be halved if every integer
> has exactly two digits?

WHAT???

V
 
James Kanze

> Suppose infile is an ifstream bound to a text file containing
> integers. The operation infile >> x1; reads integers into the
> stream. How many times can this operation be done before
> there are possible problems?

As many times as there are numbers in the file.
> I know this question isn't exactly platform independent but
> are there guidelines that can be given for writing portable
> code?

The real guideline is to check the results after every
extraction.
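
A minimal sketch of that guideline, assuming the file holds plain ints separated by whitespace. The names read_all_ints and demo_read_all are made up for this illustration; an in-memory stream stands in for the ifstream.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: collect every integer the stream can deliver.
// The stream state is tested after each extraction, before the value is used.
std::vector<int> read_all_ints(std::istream& in)
{
    std::vector<int> values;
    int x = 0;
    while (in >> x) {           // false as soon as an extraction fails
        values.push_back(x);    // x is only used after a successful read
    }
    return values;
}

// Usage sketch: an istringstream stands in for the file.
std::vector<int> demo_read_all()
{
    std::istringstream in("10 20 30\n");
    return read_all_ints(in);
}
```

The loop runs once per number in the input and stops at the first extraction that fails, whatever the reason for the failure.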
> How does the number of such operations (before using another
> stream) compare with
> std::numeric_limits<std::streamsize>::max() ?

A lot of systems maintain the current position in the file in an
std::streamsize. Which means that
std::numeric_limits<std::streamsize>::max() is also the maximum
number of bytes in the file. And that there can't be that many
numbers, since each number requires at least two bytes (one
digit and a separator).

But it's not something one worries about. A lot of files will
be smaller. Just check the status after the read (and before
using the value), and everything will be OK.
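
If it matters why the reading stopped, the state bits can be inspected once the loop ends. The following is only a sketch: ReadEnd, sum_ints and demo_bad_token are hypothetical names, and the classification is deliberately simple. A token that fails right at the end of the input (a lone '-' or an out-of-range number as the very last token) also sets eofbit and would be reported as EndOfInput here.

```cpp
#include <istream>
#include <sstream>

// Hypothetical names, for illustration only.
enum ReadEnd { EndOfInput, BadFormat, IoError };

// Adds up the integers in the stream and reports why reading stopped.
ReadEnd sum_ints(std::istream& in, long long& sum)
{
    sum = 0;
    int x = 0;
    while (in >> x) {
        sum += x;                       // x is valid: the extraction succeeded
    }
    if (in.bad()) return IoError;       // the underlying read itself failed
    if (in.eof()) return EndOfInput;    // ran out of characters
    return BadFormat;                   // the next token was not an int
}

// Usage sketch: reading stops at the 'x', after 1 and 2 have been summed.
ReadEnd demo_bad_token(long long& sum)
{
    std::istringstream in("1 2 x 4");
    return sum_ints(in, sum);
}
```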
> Does the above constant denote the max number of integers that
> can be read into the stream? Or is the bound in terms of the
> number of characters? For example, should the bound be halved
> if every integer has exactly two digits?

That number may be (probably is) the maximum number of bytes you
can read from a file in a C++ program. However, it's normally
irrelevant, since most systems do not require that all files
have the maximum length. The number of values you can read is
limited by the number of values in the file.
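
For what it is worth, the arithmetic behind the "halved" question can be written down. This is a rough sketch under two assumptions that are not guaranteed anywhere: that the file size really is capped at std::numeric_limits<std::streamsize>::max() bytes, and that every value is followed by exactly one separator character. rough_value_limit is a hypothetical name.

```cpp
#include <ios>
#include <limits>

// Rough upper bound on how many values fit in a file of at most
// numeric_limits<streamsize>::max() bytes, assuming each value takes
// digits_per_value digits plus one separator character.
// digits_per_value is expected to be a small positive number.
std::streamsize rough_value_limit(std::streamsize digits_per_value)
{
    const std::streamsize max_bytes =
        std::numeric_limits<std::streamsize>::max();
    return max_bytes / (digits_per_value + 1);
}
```

Under those assumptions one-digit values give about half the byte limit, and two-digit values give about a third of it rather than a half, because the separator counts as well.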

Note that if, instead of reading a file, you are reading from a
device that doesn't support seeking (like a pipe, a keyboard or
a socket), then the position is irrelevant, and you can easily
read more.
 
Victor Bazarov

James said:
>> Suppose infile is an ifstream bound to a text file containing
>> integers. The operation infile >> x1; reads integers into the
>> stream. How many times can this operation be done before
>> there are possible problems?

> As many times as there are numbers in the file.
>> I know this question isn't exactly platform independent but
>> are there guidelines that can be given for writing portable
>> code?

> The real guideline is to check the results after every
> extraction.
>> How does the number of such operations (before using another
>> stream) compare with
>> std::numeric_limits<std::streamsize>::max() ?

> A lot of systems maintain the current position in the file in an
> std::streamsize. Which means that
> std::numeric_limits<std::streamsize>::max() is also the maximum
> number of bytes in the file. And that there can't be that many
> numbers, since each number requires at least two bytes (one
> digit and a separator).
> [..]

What if the "file" is actually a serial connection that, like the
Energizer Bunny, just keeps going, and going, and... Will the system
also try to keep track of the "current position" on a socket, for
example? I know, I know, the OP asked about a text file...

V
 
Jorgen Grahn

James Kanze wrote: ....
>> A lot of systems maintain the current position in the file in an
>> std::streamsize. Which means that
>> std::numeric_limits<std::streamsize>::max() is also the maximum
>> number of bytes in the file. And that there can't be that many
>> numbers, since each number requires at least two bytes (one
>> digit and a separator).
>> [..]

> What if the "file" is actually a serial connection that, like the
> Energizer Bunny, just keeps going, and going, and... Will the system
> also try to keep track of the "current position" on a socket, for
> example? I know, I know, the OP asked about a text file...

I am obviously too lazy to check what the standard says about
std::numeric_limits<std::streamsize>::max(), but I hope it's just a
case of unfortunate naming and that it has to do with seekable
streams only (like James hinted at elsewhere).

I'd be very disappointed if you couldn't use iostreams with "infinite
streams", which (on Unix) includes pipes, (TCP) sockets, /dev/random, ...
I expect to be able to use std::cin/cerr constantly for years.

/Jorgen
 
James Kanze

James said:
>> A lot of systems maintain the current position in the file
>> in an std::streamsize. Which means that
>> std::numeric_limits<std::streamsize>::max() is also the
>> maximum number of bytes in the file. And that there can't
>> be that many numbers, since each number requires at least
>> two bytes (one digit and a separator).
>> [..]
> What if the "file" is actually a serial connection that, like
> the Energizer Bunny, just keeps going, and going, and... Will
> the system also try to keep track of the "current position" on
> a socket, for example?

That's actually a good question. I don't know what the standard
says about it---probably that it's unspecified. (The standard
doesn't require the system to try to keep track of the "current
position". But in practice, it has to, in some way, in order to
know where the next data is to come from.) I suspect that in
practice, most systems "try" to keep track of it, in the sense
that they update it each time with the number of bytes read, but
that they fail in the case of something like a socket, pipe, or
keyboard; but that it doesn't matter, because the next data are
determined automatically, and those devices don't support
seeking (which would also require the position).
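
On a stream that does support positioning, the bookkeeping is visible through tellg(). A small sketch, using a std::istringstream as a stand-in for a regular file; positions_after_reads and demo_positions are hypothetical names. What tellg() reports on a pipe, socket or keyboard is left to the implementation and is not shown here.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Records the position reported by tellg() after each successful extraction.
std::vector<std::streamoff> positions_after_reads(std::istream& in)
{
    std::vector<std::streamoff> positions;
    int x = 0;
    while (in >> x) {
        // A value of -1 means the stream could not report a position.
        positions.push_back(static_cast<std::streamoff>(in.tellg()));
    }
    return positions;
}

// Usage sketch: each extraction stops at the character following the number.
std::vector<std::streamoff> demo_positions()
{
    std::istringstream in("12 345 6\n");
    return positions_after_reads(in);
}
```

For the text "12 345 6\n" the recorded positions are 2, 6 and 8: each one is the index of the separator that ended the number.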
 
James Kanze

James Kanze wrote: ...
>>> A lot of systems maintain the current position in the file
>>> in an std::streamsize. Which means that
>>> std::numeric_limits<std::streamsize>::max() is also the
>>> maximum number of bytes in the file. And that there can't
>>> be that many numbers, since each number requires at least
>>> two bytes (one digit and a separator).
>>> [..]
>> What if the "file" is actually a serial connection that,
>> like the Energizer Bunny, just keeps going, and going,
>> and... Will the system also try to keep track of the
>> "current position" on a socket, for example? I know, I
>> know, the OP asked about a text file...
> I am obviously too lazy to check what the standard says about
> std::numeric_limits<std::streamsize>::max(), but I hope it's
> just a case of unfortunate naming and that it has to do with
> seekable streams only (like James hinted at elsewhere).
>
> I'd be very disappointed if you couldn't use iostreams with
> "infinite streams", which (on Unix) includes pipes, (TCP)
> sockets, /dev/random, ... I expect to be able to use
> std::cin/cerr constantly for years.

Given that the standard doesn't require support for such
things, it doubtlessly doesn't say anything. Disk file
access (even without seek) often does involve the "current
position", at least internally. To quote from the man page
of "read" (the lowest level system function which accesses
the data) on Solaris:

On files that support seeking (for example, a regular
file), the read() starts at a position in the file
given by the file offset associated with fildes. The
file offset is incremented by the number of bytes
actually read.

Files that do not support seeking (for example,
terminals) always read from the current position. The
value of a file offset associated with such a file is
undefined.

But also:

For regular files, no data transfer will occur past the
offset maximum established in the open file description
associated with fildes.
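
The distinction the man page draws can be probed from a program. This is a POSIX-only sketch and not portable C++; has_file_offset is a hypothetical name. It relies on lseek() failing with ESPIPE on pipes, FIFOs and sockets, which POSIX requires. For other non-seekable devices, terminals included, the result of lseek() is implementation-defined, so the check says nothing reliable about them.

```cpp
#include <sys/types.h>
#include <unistd.h>

// POSIX-only sketch: asks the system for the current offset of a descriptor
// without moving it. Returns false when the system reports no usable offset,
// as it must for pipes, FIFOs and sockets.
bool has_file_offset(int fd)
{
    return lseek(fd, 0, SEEK_CUR) != static_cast<off_t>(-1);
}
```

Called on the read end of a pipe created with pipe(), the function returns false; on a descriptor for a regular file it is expected to return true.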

Internally, the system maintains the position as a 64 bit
value. When compiling in 32 bit mode, std::streamsize is 32
bits, and files are opened by default in a mode which only
allows 2^32 as the offset maximum, so the limitation holds.
(The C++ standard library could open the files in a way that
would allow 64 bit seeks and reads, even in 32 bit mode.
I'm pretty sure it doesn't, since we've had problems with
log data being lost when the log file size was greater than
2^32.)

Generally speaking, a lot of systems allow files larger than
2^32 bytes, even when compiling in 32 bit mode. In such cases,
several solutions are possible:

-- If the system has two modes for accessing the files,
like Solaris, the library code just uses the 32 bit
mode, and the system behaves as if files couldn't be
bigger than 2^32 bytes. I suspect that this is the most
frequently used solution. (It's certainly the easiest
to implement, if the system supports it, and I suspect
that most systems, or at least most Unix, do.)

-- If the system doesn't have such support, the library
could keep track of the position as well, and simulate
it.

-- Alternatively, the library could either use a 64 bit
type for std::streamsize (if one exists on the
implementation) or define it as a class type, using 2 or
more smaller integral types in the implementation, using
whatever system requests are necessary to support full
64 bit file positioning at the system level. In many
ways, this would be the best solution. But if it means
making std::streamsize a class type, it will probably
break code. Incorrect code, since the standard doesn't
require that std::streamsize be an integral type, or
even that it reasonably convert to one, but such code
exists, and is, I fear, widespread. (If the system
supports long long, as most do nowadays,
std::streamsize could be a typedef to this.)

-- Finally, I'm sure that some libraries just ignore the
issue. If the system defaults to limiting the file size
to 2^32 in 32 bit code, this is identical to the first
case above. If it doesn't, then the library isn't
conforming: std::istream::tellg can return an apparently
valid position, but seeking to it will not go to the
right place. Still, conforming or not, it wouldn't
surprise me to encounter such a system.
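
Whether a position returned by tellg() can be trusted is easy enough to test on a given stream. A sketch, with hypothetical names position_round_trips and demo_round_trip: it remembers the position after the first value, reads the second value, seeks back, and reads again. To exercise the problem described above it would have to be run on a real file larger than the limit in question; the in-memory stream below only shows the mechanics.

```cpp
#include <istream>
#include <sstream>

// Returns true if seeking back to a position obtained from tellg()
// makes the next extraction yield the same value as before.
bool position_round_trips(std::istream& in)
{
    int first = 0, a = 0, b = 0;
    if (!(in >> first)) return false;

    const std::istream::pos_type mark = in.tellg();
    if (mark == std::istream::pos_type(-1)) return false;  // no position available

    if (!(in >> a)) return false;
    in.clear();                      // drop eofbit if that value ended the input
    if (!in.seekg(mark)) return false;

    return (in >> b) && a == b;      // same value again: the seek landed correctly
}

// Usage sketch with an in-memory stream.
bool demo_round_trip()
{
    std::istringstream in("1 2 3");
    return position_round_trips(in);
}
```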
 
Jorgen Grahn

> Given that the standard doesn't require support for such
> things, it doubtlessly doesn't say anything.

Yes, I should have written "I'd be very disappointed /with my
compiler/ if I couldn't use iostreams with 'infinite streams' ...".

/Jorgen
 
