Kevin said:
All I know is that unget() is much better on my PC, with my compiler, than
seekg. Is this likely to be true in general for relatively small (<100)
numbers of bytes? When would I want to use putback(char) instead?
To best answer these questions, let's have a look at the underlying
machinery. IOStreams are built on top of stream buffers (that is,
objects of type 'std::basic_streambuf<cT, traits>'). As the name says,
this class provides the concept of a buffer, although it is possible to
create unbuffered stream buffers (that is, the name is somewhat
misleading). File streams are very likely to use the internal buffer,
except, maybe, when using some special files like a tty or a named
pipe. If a buffer is set up for the stream buffer, most operations
are simple pointer operations: check whether the pointers are in the
allowed range and do something with the respective character.
For 'sungetc()', the stream buffer function called by the input stream's
'unget()', this basically means checking whether the current read
pointer is at the beginning of the buffer and, if it is not, moving it
one character back. This operation is very likely to be very fast. If
the read pointer is at the beginning of the buffer, 'sungetc()' calls
'pbackfail(traits::eof())'.
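In code, the fast path is roughly the following. This is only a
simplified sketch of the behaviour the standard describes, written as
a member of a hypothetical derived class (the names 'sketch_buf' and
'my_sungetc' are made up) so that the protected pointer functions are
accessible; it is not the code of any particular library:

    #include <streambuf>

    struct sketch_buf : std::streambuf {
        // what sungetc() boils down to
        int_type my_sungetc() {
            if (eback() < gptr()) {                  // put back position left?
                gbump(-1);                           // step one character back
                return traits_type::to_int_type(*gptr());
            }
            return pbackfail(traits_type::eof());    // hit the buffer boundary
        }
    };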
The operation of 'sputbackc()', which is, as you have correctly guessed,
the stream buffer function called by the input stream's 'putback()'
function, is a little bit more complex and slower: it starts by
checking whether the current position is at the beginning of the
buffer and, if it is not, whether the previous character matches
the one being put back. If either of these checks fails, 'sputbackc()'
calls 'pbackfail(c)' with the put back character as argument
(after being converted to 'int_type' using 'traits::to_int_type()').
Otherwise the current read position is moved one character back.
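The corresponding sketch for 'sputbackc()' differs only in the extra
check of the previous character (again just an illustration with
made-up names, not real library code):

    #include <streambuf>

    struct sketch_buf2 : std::streambuf {
        // what sputbackc(c) boils down to
        int_type my_sputbackc(char_type c) {
            if (eback() < gptr() && traits_type::eq(c, gptr()[-1])) {
                gbump(-1);                           // same fast path as above
                return traits_type::to_int_type(*gptr());
            }
            return pbackfail(traits_type::to_int_type(c)); // boundary/mismatch
        }
    };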
For the case where putting back a character does not hit a buffer
boundary, this explains why 'unget()' should be fast. A few
questions obviously remain:
- How many characters can be safely put back? The answer is quite
simple: none. If you are at a buffer boundary, put back can fail,
and there is no guarantee in the standard on the number of
available put back positions. I would expect any reasonable
standard library implementation to allow at least one character
to be put back, but this is really a quality of implementation
issue - and it is unclear what is better quality here: there is
rarely a need for put back (eg. none in the standard library I/O
functions) and providing a put back buffer would incur unnecessary
overhead. Also, this problem is easily worked around by
providing a filtering stream buffer which allows eg. a specified
number of put back characters (a sketch of such a stream buffer
follows this list).
- What does 'pbackfail()' do? Well, it obviously tries to back up
one position in the stream. In case a wrong character was put back,
it can choose to accept it (ie. using 'putback()' you might be
able to put characters into the stream which have not been there).
In case of hitting the beginning of the buffer, it might read the
previous page or simply put the character passed to 'pbackfail()'
into the buffer after making room somehow, thereby assuming that
the character was the right one (that is, 'putback()' might be
successful when 'unget()' is not).
- What happens when the END of a buffer is reached? Are characters
retained for put back? When the end of the input buffer is
reached, 'underflow()' is called. This function is supposed to
make a new buffer with at least one character available. It can set
up the new buffer in such a way that old characters are retained:
the buffer is set up with the call 'setg(begin, current, end)'.
The first argument is the beginning of the buffer, the second is
the current read position (ie. it points to the character made
available by 'underflow()'), and the third is the end of the
buffer. That is, the range [begin, current) is available for put
back. A library can copy "n" characters from the end of the
previous buffer to the beginning of the new buffer (the sketch
below shows how). Unfortunately, there is no guarantee that
"n > 0" for file buffers.
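To make this concrete, here is a minimal sketch of a filtering stream
buffer whose 'underflow()' retains a few already consumed characters
for put back. The class name 'putback_filter', the buffer sizes, and
the use of 'sgetn()' on the wrapped stream buffer are choices made up
for the illustration, not anything the standard mandates:

    #include <algorithm>
    #include <cstring>
    #include <streambuf>
    #include <vector>

    // Hypothetical filtering stream buffer: it reads from another stream
    // buffer and keeps up to 'putback_' characters of already consumed
    // input in front of the current read position.
    class putback_filter : public std::streambuf {
    public:
        explicit putback_filter(std::streambuf& src, std::size_t putback = 4)
            : src_(&src), putback_(putback < 1 ? 1 : putback),
              buffer_(putback_ + 1024)
        {
            char* p = buffer_.data() + putback_;
            setg(p, p, p);            // empty get area: first read underflows
        }

    protected:
        int_type underflow() override {
            if (gptr() < egptr())                     // characters still left
                return traits_type::to_int_type(*gptr());

            // Move the last few consumed characters into the put back area.
            std::size_t keep = std::min(putback_,
                static_cast<std::size_t>(gptr() - eback()));
            char* base = buffer_.data();
            std::memmove(base + putback_ - keep, gptr() - keep, keep);

            // Refill the rest of the buffer from the wrapped stream buffer.
            std::streamsize n = src_->sgetn(base + putback_,
                static_cast<std::streamsize>(buffer_.size() - putback_));
            if (n <= 0)
                return traits_type::eof();

            setg(base + putback_ - keep,   // begin: [begin, current) is put back
                 base + putback_,          // current read position
                 base + putback_ + n);     // end of the filled buffer
            return traits_type::to_int_type(*gptr());
        }

    private:
        std::streambuf*   src_;
        std::size_t       putback_;
        std::vector<char> buffer_;
    };

The 'setg()' call at the end is the whole trick: everything between
its first and second argument stays available for 'unget()' and
'putback()'.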
In practical terms, this means that you cannot rely on put back
doing anything useful for the standard streams! There are a few ways
to work around this problem:
- Check the documentation of the standard library you are using: It
might provide better guarantees for file streams. Of course, this
way you become dependent on a particular implementation.
- If the documentation does not tell you anything, you might be able
to look at the implementation. Note, however, that this is a very
dangerous path because the implementation may change in the next
version.
- The safest approach would be the creation of a simple filtering
stream buffer (like the 'putback_filter' sketched above): if you know
that you are simply reading the stream from beginning to end, except
for putting back a maximum of "n" characters, such a filtering stream
buffer is simple to write; a usage example follows this list. If
you mix things with seeking within the stream, things become
somewhat more complex...
- Avoid put back in the first place. What is the point of processing
read characters again? There is no problem with peeking at the
current read position: this always works, and for many cases it
is sufficient.
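For illustration, this is how the hypothetical 'putback_filter'
sketched above could be put in front of a file stream (the file name
and the number of guaranteed put back characters are made up):

    #include <fstream>
    #include <istream>

    int main() {
        std::ifstream file("data.txt");
        putback_filter filter(*file.rdbuf(), 8);   // guarantee 8 put back chars
        std::istream   in(&filter);

        char c;
        while (in.get(c) && c != ';')
            ;                                      // scan up to the first ';'
        if (in)
            in.unget();                            // safe: the filter keeps it
    }

Seeking on 'in' would simply fail unless the filter also implements
'seekoff()'/'seekpos()', which is the complication hinted at above.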
Kevin said:
The only explanation I can think of is the file may be buffered somehow;
There is no guarantee that files are buffered (and, in fact, you can
turn off buffering by calling 'pubsetbuf(0, 0)' on the stream buffer)
but I would bet that buffered file streams are the default on all
implementations: unbuffered file reading is just slow.
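For example (the file name is made up; the standard only promises the
unbuffered behaviour if the call happens before any I/O on the file):

    #include <fstream>

    int main() {
        std::ifstream in;
        in.rdbuf()->pubsetbuf(0, 0);   // request unbuffered operation...
        in.open("data.txt");           // ...before any I/O on the file
        // each read now goes (more or less) directly to the OS
    }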
Kevin continued:
then ungetting might take you past the beginning of the buffer, whereas
putback'ing will be able to expand the buffer in this case.
This is roughly the deal. Of course, you cannot count on it being
the case...