complexity for tellg()

toton

Hi,
I am reading a big file and need to record the current file position
as I go, so that I can store the positions for later direct access.
However, tellg appears to be a very costly function! Its code suggests
that it should just return the current buffer position, and thus should
be a very low-cost function.
To explain,
{
    boost::progress_timer t;
    std::ifstream in("Y:/Data/workspaces/tob4f/tob4f.dat");
    std::string line;
    while(in){
        int pos = in.tellg();
        std::getline(in,line);
    }
}
This code takes 0.58 sec on my computer with the in.tellg() line
commented out; with that line uncommented, it takes 120.8 sec (varies a
little).

Can anyone explain the reason and suggest a possible workaround?
I am using MS Visual Studio 7.1 and the standard library provided with
Visual Studio 7.1.
 
Alf P. Steinbach

Most likely the cause is conversion of CRLF to LF, which you've
specified by (1) opening the file in text mode and (2) compiling with a
Windows compiler.

One cure could then be to open the file in binary mode, and handle
newlines as appropriate (or not).
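
A minimal sketch of the binary-mode approach suggested above. The helper name read_lines_binary is hypothetical, and whether binary mode actually removes the tellg slowdown on a given library is an assumption worth timing rather than a guarantee. The sketch records the offset before each line and strips a trailing '\r', so a CRLF file yields the same line text as an LF file.

```cpp
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads the file at `path` in binary mode and returns
// (offset of line start, line text) pairs. No newline translation happens
// in binary mode, so the offsets are plain byte offsets into the file.
std::vector<std::pair<std::streamoff, std::string> >
read_lines_binary(const std::string& path) {
    std::vector<std::pair<std::streamoff, std::string> > lines;
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    std::string line;
    for (;;) {
        const std::streampos pos = in.tellg();  // position before the read
        if (!std::getline(in, line)) {
            break;  // end of file, read error, or file could not be opened
        }
        // A CRLF file leaves '\r' at the end of each line; drop it.
        if (!line.empty() && line[line.size() - 1] == '\r') {
            line.erase(line.size() - 1);
        }
        lines.push_back(
            std::make_pair(static_cast<std::streamoff>(pos), line));
    }
    return lines;
}
```

Testing the result of getline, rather than looping on while(in), avoids processing a stale line after the final read fails.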
 
John Harrison

The reason is that tellg performs a seek to the current position. This
flushes the input buffer so dramatically slowing down your program.

Looks as through the defintion is streambuf (which is used by all
streams) is such that the only way to find the current position is to
perform a seek to the current position.

john
 
John Harrison

Looks as through the defintion is streambuf (which is used by all
streams) is such that the only way to find the current position is to
perform a seek to the current position.

Let me try that again

Looks as though the definition of streambuf (which is used by all
streams) is such that the only way to find the current position is to
perform a seek to the current position.

john
 
Carlo Capelli

Of course you can approach the problem computing the position yourself, if
you know the size of the input read.
Not elegant, but it works for simple cases...

std::ifstream in("Y:/Data/workspaces/tob4f/tob4f.dat");
size_t pos = 0;
std::string line;
while(in){
    // int pos = in.tellg();
    std::getline(in,line);
    pos += line.length() + 2; // account for line terminator...
}

Bye Carlo
 
John Harrison

John said:
Let me try that again

Looks as though the definition of streambuf (which is used by all
streams) is such that the only way to find the current position is to
perform a seek to the current position.

john

Let me really try this again, I shouldn't speculate on things I have no
real knowledge of.

I would imagine that the *likely* reason is that calling tellg in your
particular circumstances is causing the input buffer to flush.
Certainly the slowdown you are observing would be consistent with that.

However the only way to know for sure would be a careful examination of
the library code, or use of a debugger to step into the library code.

john
 
toton

Alf P. Steinbach said:
Most likely the cause is conversion of CRLF to LF, which you've
specified by (1) opening the file in text mode and (2) compiling with a
Windows compiler.

One cure could then be to open the file in binary mode, and handle
newlines as appropriate (or not).

There are enough bad things related to newlines ...
seekg and tellg don't match when the newline char is \n and the file is
opened in text mode.
For the unix file,
std::string line;
while(in){
    int pos = in.tellg();
    std::getline(in,line);
    std::cout<<pos<<" "<<line<<std::endl;

    if(line==".PEN_DOWN"){
        in.seekg(pos);
        break;
    }
}
std::getline(in,line);///This doesn't print .PEN_DOWN !
std::cout<<line<<std::endl;
Now if I open it in binary mode, this problem is solved.
But that creates another set of problems:
for a unix file it is now fine, but for a windows file \r is attached at
the end of each line, as the newline char is \n. So I need to remove \r
from the line if it is present.

I wonder what getline will return for a mac file, where the newline
terminator is \r only. Will it return the whole file as a single line?
Is there any std api support that takes care of all these things and
still keeps seekg & tellg consistent?

Thanks
abir
 
P.J. Plauger

There are enough bad things related to new line ...
seekg and tellg doesn't match when newline char is \n , and file is
opened in text mode.

That shouldn't be, if you're just using seekg to return to a place
earlier memorized by tellg.
For the unix file,
std::string line;
while(in){
int pos = in.tellg();
std::getline(in,line);
std::cout<<pos<<" "<<line<<std::endl;

if(line==".PEN_DOWN"){
in.seekg(pos);
break;
}
}
std::getline(in,line);///This doesn't print .PEN_DOWN !
std::cout<<line<<std::endl;
Now if I open it in binary mode, Then this problem is solved.
But it creates another set of problems,
for unix file now it is fine, but for windows file \r is attached at
the end of line, as newline char is \n. So I need to remove \r from
the line if it is present.

If you wrote the file in binary mode, the \r characters wouldn't
be appended in the first place. It is important that you read and
write consistently, at least if you don't want to deal with local
conventions for reading and writing text files.
I wonder, what will getline will return in case of a mac file where
newline terminator is \r only. Will it return the total file as single
line ?

If you write in text mode and read in binary mode, that could happen,
yes.
Is there any std api support to take care of all these things, and yet
to make seekg & tellg consistent ?

Yes, it's called the Standard C++ library, if you use it right.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Pete Becker

toton said:
There are enough bad things related to new line ...
seekg and tellg doesn't match when newline char is \n , and file is
opened in text mode.

Sure it does. See below.

Be careful: you're mixing two different things. In C++ source code, '\n'
is the character that's used to mark the end of a line, and '\r' is the
character that is used to mark a carriage return. That has only a
historical connection with the ASCII line feed (newline) character, whose
value is 0x0A, and the ASCII carriage return character, whose value is
0x0D.

For text files, if you know the conventions that your operating system
uses, you can talk about the details of how line ends are represented in
the text file. But from a high level language perspective, that's
irrelevant detail: it's up to the I/O library to translate things, so
that when you write the character '\n' it does whatever is appropriate
to mark the end of a line using the OS's conventions. Similarly, when
you read a text file, the I/O library translates whatever the OS uses to
mark the end of a line into a single '\n' character.

The problem you're running into is that you're apparently not using
native text files, since you're talking about unix files, mac files, and
Windows. The I/O library isn't prepared to deal with all of them. When
you move text files from one system to another, use a utility like ftp
that understands line ending conventions and does the appropriate
translations. Don't expect Unix I/O libraries to understand Windows file
conventions, or vice versa.

--

-- Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and
Reference." (www.petebecker.com/tr1book)
 
toton

Maybe I was unable to express the problem clearly.
1) I am not writing the file; I am only reading it. It is a text
file, but nothing fixes whether the line terminator is \n, \r\n, or
\r. It all depends on who saved the file and with which editor.
So this is a question about parsing ...
The file looks something like this
.X_DIM 20701
.Y_DIM 27000
.X_POINTS_PER_MM 100
.Y_POINTS_PER_MM 100
.POINTS_PER_SECOND 200
.COMMENT YES_PRES_ORG 0
.COMMENT YES_PRES_EXT 1023
.DT 3975234
.PEN_DOWN
.COMMENT .PEN_WIDTH 1
.COMMENT .PEN_WIDTH_ORG 1
.COMMENT .PEN_COLOR 0x0

Now I need to remember a past position using tellg() and go back to that
position using seekg().
The cases are:
1) The file is opened in text mode and contains \n as terminator.
seekg doesn't place the file pointer at the position saved by tellg (as
shown in my previous program). It works as expected when the newline is
\r\n.
2) The file is opened in binary mode. If the file contains \n as line
terminator, seekg & tellg work as expected. If the file contains \r\n as
terminator, the returned string contains \r, which needs to be
removed.
3) This one I haven't tested. Several mac files have \r as the newline
char. What will std::getline(stream,str) return? The whole file or
only the line?

So my question is: how do I check which newline char is in use, so
that I can parse all of the files properly?
Note that the files are not written by me; I only read them.
All the tests were done with MSVC 7.1; gcc might give just the
opposite result (I will check it quickly).
 
P.J. Plauger

.....
May be I am unable to express the problem clearly.
1) I am not writing the file, I am reading the file only. It is a text
file, but nothing is fixed like line terminator will be \n or \r\n or
\r . It all depends on who saved the file using which editor .

Then it *is* fixed, but not by you. If, as Pete Becker said, the
file was written as text on one system and read on another, the
lines might not be terminated as the reading system expects. And if
you read the file as binary, you have to know what line terminators
look like.
So this is the question for parsing ...
The file looks something like this
.X_DIM 20701
.Y_DIM 27000
.X_POINTS_PER_MM 100
.Y_POINTS_PER_MM 100
.POINTS_PER_SECOND 200
.COMMENT YES_PRES_ORG 0
.COMMENT YES_PRES_EXT 1023
.DT 3975234
.PEN_DOWN
.COMMENT .PEN_WIDTH 1
.COMMENT .PEN_WIDTH_ORG 1
.COMMENT .PEN_COLOR 0x0

Now I need to remember past position using tellg() , and go to that
position using seekg().
The cases are,
1) file is opened in text mode . The file contains \n as terminator.
seekg doesn't place file pointer to proper pos saved by tellg (as
given in my previous program ) . It works as expected when newline is
\r\n.

You're violating the Windows notion of text file, so it's possible
you're confusing the underlying C library, which the Standard C++
uses for basic file operations. Convert the file to Windows format
and seekg/tellg should work fine.
2) The file is opened in binary mode . The file contains \n as line
terminator.
seekg & tellg works as expected.

Right. No surprise.
The file contains \r\n as
terminator . the returned string contains \r , which need to be
removed.

Yep. You're now violating the C/C++ conventions for text streams
internal to a program. The \r is considered part of the text line,
not part of the line terminator.
3) This one I hadn't tested. Several mac files have \r as newline
char. What std::getline(stream,str ) will return ? The whole page or
the line only ?

The whole works, unless you specify \r as the line terminator.
Thus my questions are, how to check which newline char to use , so
that I can parse all of the files properly ?

Well, you have to know what they are, don't you? Or at least all the
possible options. One approach is to read the file as binary and be
prepared for any of \n, \n\r, \r, or \r\n as line terminators. It's
kinda hard to use getline directly that way, but you can write your
own version.
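
A sketch of such a hand-written getline, assuming the stream does no newline translation (binary mode). The names getline_any and split_lines_any are hypothetical. It accepts \n, \r, \r\n, or \n\r as a terminator by swallowing the opposite character when it immediately follows the first one. That rule is ambiguous for files that mix conventions, so treat it as a starting point.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical replacement for std::getline that treats "\n", "\r",
// "\r\n" and "\n\r" as line terminators. The terminator is consumed
// and not stored in `line`.
std::istream& getline_any(std::istream& in, std::string& line) {
    line.clear();
    bool read_something = false;
    char c;
    while (in.get(c)) {
        read_something = true;
        if (c == '\n' || c == '\r') {
            // A two-character terminator is the same pair in either order.
            const char partner = (c == '\n') ? '\r' : '\n';
            if (in.peek() == std::char_traits<char>::to_int_type(partner)) {
                in.get();  // swallow the second half of the pair
            }
            return in;
        }
        line += c;
    }
    // Last line without a terminator: report it as a successful read, but
    // keep eofbit set so that the next call fails and ends the loop.
    if (read_something && in.eof()) {
        in.clear(std::ios::eofbit);
    }
    return in;
}

// Small usage helper: splits in-memory text into lines with getline_any.
std::vector<std::string> split_lines_any(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (getline_any(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

Reading character by character through get() is simple but not fast; a version working on the stream buffer directly would be the next step if speed matters.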
It should be noted, files are not written by me, I just read it.
And all the test's are done with MSVC 7.1 , gcc might give just
opposite result (I will check it quickly ) .

I think "opposite" is an over simplification. It's just different.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
toton

Sure. You got the point right. I am using unix-encoded files on a
windows machine. They came from an http download, as a zipped folder,
so no translation was done. So I need to handle them in
binary mode. Moreover, the file format doesn't specify what the newline
is (I hope they make it \n someday).
So at present I need to handle both (i.e. from unix => windows &
unix, from windows => unix & windows). I am currently doing this
using binary mode and discarding the \r, if any. However, I wonder
how to handle it if some mac file with \r ever comes along!
I just want to know, as the std library doesn't handle this, how it can
be done.

abir
 
Pete Becker

toton said:
Sure. You got the right point. I am using a unix encoded file in
windows machine. Those came from http download, as a zipped folder and
thus doesn't handle the translation.

Most versions of unzip have a -a option that translates line terminators.

--

-- Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and
Reference." (www.petebecker.com/tr1book)
 
