Question about reading from stream.


Carfield Yim

Hi all, we are currently using the following code to read a file to
which another process is continuously appending content (like
tail -f something to process):

infile.seekg(_currentFilePointer);
infile.read(_buffer, _buffer_size);
_bytesLeftInBuffer = infile.gcount();

I suspect this is in fact very inefficient. What is the preferred way
to read from a growing file? Use getc()? But wouldn't that need to
loop many times?
 

Vaclav Haisman

Carfield Yim wrote, On 25.3.2009 16:32:
> Hi all, we are currently using the following code to read a file to
> which another process is continuously appending content (like
> tail -f something to process):
>
> infile.seekg(_currentFilePointer);
> infile.read(_buffer, _buffer_size);
> _bytesLeftInBuffer = infile.gcount();
>
> I suspect this is in fact very inefficient. What is the preferred way
> to read from a growing file? Use getc()? But wouldn't that need to
> loop many times?
If you are concerned about raw speed then do not use streams for IO; there
are many layers of abstraction that make things less than spectacular. That
said, do you know that IO is the bottleneck of your application? Avoid
premature optimization.
 

Victor Bazarov

Vaclav said:
> Carfield Yim wrote, On 25.3.2009 16:32:
> If you are concerned about raw speed then do not use streams for IO; there
> are many layers of abstraction that make things less than spectacular. That
> said, do you know that IO is the bottleneck of your application? Avoid
> premature optimization.

Those are good suggestions, and we all can agree that to optimize one
most often needs to measure first. But it does not take a measurement
to know that IO is a bottleneck. In every application. Hardware is
slow. And one needs to keep things like IO in mind when devising the
approach to serialization. Some optimization is not premature, like
picking quick sort over bubble sort: you don't need measurements for
that, you can use the measurements people have collected over the years.

On the flip side, once the sort is abstracted, one algorithm can
probably be replaced with another easily; so, to the OP: don't integrate
reading/writing into your code too tightly. Create an abstraction layer
so you can use a different method of serializing once you figure that
you might need that.
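
To make that concrete, here is a minimal sketch of such an abstraction
layer (the class names and members are hypothetical, not something from
this thread): callers depend only on the interface, so the iostream-based
implementation below could later be swapped for a lower-level one without
touching the processing code.

#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical interface: hides how the bytes are actually obtained.
class ByteSource
{
public:
    virtual ~ByteSource() {}
    // Reads up to bufferSize bytes into buffer and returns the number
    // of bytes actually read (0 if nothing new is available yet).
    virtual std::size_t read(char* buffer, std::size_t bufferSize) = 0;
};

// One possible implementation, using the same ifstream technique as
// the code in the original question.
class IfstreamSource : public ByteSource
{
public:
    explicit IfstreamSource(const std::string& name)
        : in_(name.c_str(), std::ios::binary), offset_(0) {}

    std::size_t read(char* buffer, std::size_t bufferSize)
    {
        in_.clear();                      // reset eofbit from a previous attempt
        in_.seekg(offset_);               // go back to where we stopped
        in_.read(buffer, static_cast<std::streamsize>(bufferSize));
        std::size_t n = static_cast<std::size_t>(in_.gcount());
        offset_ += n;                     // remember the position for next time
        return n;
    }

private:
    std::ifstream  in_;
    std::streamoff offset_;
};

Whether a later read() on the same ifstream actually sees data appended
after end of file was first reached is implementation-dependent; see
James Kanze's reply below.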

V
 

James Kanze

> Hi all, we are currently using the following code to read a file to
> which another process is continuously appending content (like
> tail -f something to process):
>
> infile.seekg(_currentFilePointer);
> infile.read(_buffer, _buffer_size);
> _bytesLeftInBuffer = infile.gcount();
>
> I suspect this is in fact very inefficient. What is the preferred way
> to read from a growing file? Use getc()? But wouldn't that need to
> loop many times?

There isn't really anything to support this in C++. Once a
filebuf has seen end of file, it stops; end of file is the end.
If you want something like tail -f, you'll have to use a lower
level, open and read, under Unix, for example. (I don't know if
CreateFile/ReadFile will work for this under Windows.)
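
For illustration only (not from the original post), a minimal sketch of
that lower-level approach under POSIX; the file name and the one-second
polling interval are placeholders:

#include <fcntl.h>      // open
#include <unistd.h>     // read, close, sleep
#include <cstdio>       // perror, fwrite

int main()
{
    // open()/read() do not treat end of file as sticky: once the writer
    // appends more data, the next read() simply returns the new bytes.
    int fd = open("something.log", O_RDONLY);   // placeholder file name
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buffer[4096];
    for (;;) {
        ssize_t n = read(fd, buffer, sizeof buffer);
        if (n > 0) {
            fwrite(buffer, 1, static_cast<size_t>(n), stdout);  // process the new bytes
        } else if (n == 0) {
            sleep(1);       // at end of file for now; wait for the writer
        } else {
            perror("read");
            break;
        }
    }
    close(fd);
    return 0;
}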
 

Carfield Yim

> Those are good suggestions, and we all can agree that to optimize one
> most often needs to measure first.  But it does not take a measurement
> to know that IO is a bottleneck.  In every application.  Hardware is
> slow.  And one needs to keep things like IO in mind when devising the
> approach to serialization.  Some optimization is not premature, like
> picking quick sort over bubble sort: you don't need measurements for
> that, you can use the measurements people have collected over the years.
>
> On the flip side, once the sort is abstracted, one algorithm can
> probably be replaced with another easily; so, to the OP: don't integrate
> reading/writing into your code too tightly.  Create an abstraction layer
> so you can use a different method of serializing once you figure that
> you might need that.


Thanks, btw I am not trying to optimize it yet; I just feel it is kind of
tedious to seek every time. Of course maybe that is required.
 

Vaclav Haisman

Victor Bazarov wrote, On 25.3.2009 22:41:
> Those are good suggestions, and we all can agree that to optimize one
> most often needs to measure first. But it does not take a measurement
> to know that IO is a bottleneck. In every application. Hardware is
> slow. [...]
I partially disagree. IO is always slow, that's true. But for many
applications the time spent doing and waiting for IO is not the majority of
their run time. Processing of data can take a lot more time than the raw IO
itself. In such applications IO is not the bottleneck.
 

Victor Bazarov

Vaclav said:
> Victor Bazarov wrote, On 25.3.2009 22:41:
> I partially disagree.

With what, exactly?
> IO is always slow, that's true.

That's it. Period. Slow. When you read data from a file or write data
to a file, IO is the bottleneck, not conversions (if any), not
compression (if any), not creation of any other auxiliary objects...
> But for many
> applications the time spent doing and waiting for IO is not the majority of
> their run time.

No, but what does their overall run time have to do with the fact that
during reading or writing data the interaction with the device through
the platform abstractions (isn't that what the streams are?) is the
slowest part?

Why do customers care about startup time or the time it takes to load a
file into the application? They only do it a few times a day. And if
the application is stable, you don't have to shut it down at all, ever,
right? But for some reason people still try to make the startup
quicker, loading of files threaded, and so on. Why? It only takes a
few minutes? Yes, but those are often the minutes of waiting deducted
from the lives of our customers, you know.
> Processing of data can take a lot more time than the raw IO
> itself. In such applications IO is not the bottleneck.

In my overall life IO takes really, really tiny portion. It's not a
bottleneck at all. But then again, nothing is. Breathing, maybe. Or
thinking, decision making. But we're not talking about overall run of
the program, are we?

Sorry, didn't mean to snap (if it appeared that way).

V
 

James Kanze

[...]
> With what, exactly?
> That's it. Period. Slow. When you read data from a file or
> write data to a file, IO is the bottleneck, not conversions
> (if any), not compression (if any), not creation of any other
> auxiliary objects...

That's usually true, but not always. I've seen cases where
compression was a "bottleneck" (although this is relative---in
the final code, more time was spent on compression than on the
physical writes, but the program was still faster with the
compression, since it wrote a lot less). And I've seen cases
where allocations were far more significant than the
actual writes. So there are exceptions.

And of course, if you're writing to a pipe under Unix, the
writes can be very, very fast.
> No, but what does their overall run time have to do with the
> fact that during reading or writing data the interaction with
> the device through the platform abstractions (isn't that what
> the streams are?) is the slowest part?

Well, if you eliminate the other causes, IO will end up being
the slowest remaining part.
> Why do customers care about startup time or the time it takes
> to load a file into the application?  They only do it a few
> times a day.

That depends on the application. I've written servers that run
for years at a time---startup time isn't significant. But I've
also written a lot of Unix filters, which are invoked
interactively (often on a block of text in the editor). In such
cases, start-up time can be an issue. (If you doubt it, try
using one written in Java---where loading the classes ensures a
significantly long start up time.)
> And if the application is stable, you don't have to shut it
> down at all, ever, right?

It depends on the application. What if it's a compiler? Or a
Unix filter like grep or sed? For that matter, if clients share
no data directly, there are strong arguments for starting up a
new instance of a server for each connection; you don't want
start up time to be too long there, either. (Note that on the
server I currently work on, the start-up time is several tens of
seconds---the time to reconstruct all of the persistent data
structures in memory, resynchronize with the data base, etc.)
 

Carfield Yim

> Thanks, btw I am not trying to optimize it yet; I just feel it is kind of
> tedious to seek every time. Of course maybe that is required.

In fact what I really wanted to ask is: is it usual to

seek and read, save the position, then seek and read again

when processing a growing file? It looks like this requires moving the
file cursor again and again.
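
For what it's worth, a rough sketch of that usual pattern with iostreams
(the file name and buffer size are placeholders; whether the filebuf
actually returns data appended after it has once reported end of file is
implementation-dependent, as James Kanze notes above):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream infile("something.log", std::ios::binary);  // placeholder name
    std::streamoff currentFilePointer = 0;
    char buffer[4096];

    for (;;) {
        infile.clear();                    // clear eofbit/failbit from the previous read
        infile.seekg(currentFilePointer);  // seek back to the saved position
        infile.read(buffer, sizeof buffer);
        std::streamsize bytesRead = infile.gcount();
        if (bytesRead > 0) {
            std::cout.write(buffer, bytesRead);  // process the newly read bytes
            currentFilePointer += bytesRead;     // save the position for the next pass
        }
        // else: nothing new yet; a real program would sleep or wait here
    }
}

The repeated seekg is not what makes this expensive; the bigger issue,
per James Kanze's reply, is that a filebuf which has already seen end of
file may never report newly appended data at all.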
 

coal

    [...]
> It depends on the application.  What if it's a compiler?  Or a
> Unix filter like grep or sed?  For that matter, if clients share
> no data directly, there are strong arguments for starting up a
> new instance of a server for each connection;

I think the arguments in favor of a long running server are
stronger than those against it in the case of compilers.
The design and implementation have to be such that separate
requests do not interfere with each other. There are some
steps that you can take that help in that area, but don't
require anything like a new process and a completely fresh
start. Besides the basic efficiencies afforded, there's
a lot of basic information that doesn't change between
requests. Why rescan/prepare for <vector> billions of
times when it doesn't change one little bit? It surprises
me that you question this given what I know of your
background.
> you don't want
> start up time to be too long there, either.

It has to be done efficiently. Single-run compilers are a
luxury that is fading. I harp on this, but gcc needs to
be overhauled twice. First a rewrite in C++ and then a
rewrite to be an on line compiler. The first phase of
the on line part could be to simply run once and exit
after each request. That though would have to be
replaced by a real server approach that runs like a
champ. They are so far away from this it isn't even
funny. As far as I know all they are working on is
C++0X support. Some of that is important, too, but
they shouldn't keep ignoring these other matters.
It may be that gcc has just become a dinosaur that
can't adapt to the times. They certainly haven't
done a good job keeping up in some respects. Where's
the "gcc on line" page like some compilers have? And
even the compilers that have that type of page
haven't done much with them in the past ten years.

I understood this stuff in 1999 so don't think it
should be a surprise to people now. The internet is
here to stay. I didn't vote for George W. Bust (I lived
in Texas when he was governor and I voted for him as
governor at least once, but by the time he ran for
President I was on to him. I didn't vote for Barack
Obama, aka B.O., either.), but one thing Bust got right
was to encourage people to bring new services on line in
many of his speeches. The US would be busted even worse
if not for that consistent encouragement.


Brian Wood
Ebenezer Enterprises
www.webEbenezer.net

"Trust in the L-rd with all your heart and lean not on
your own understanding. In all your ways acknowledge
him and he will direct your paths." Proverbs 3:5,6
 

James Kanze

[...]
>> It depends on the application. What if it's a compiler? Or a
>> Unix filter like grep or sed? For that matter, if clients share
>> no data directly, there are strong arguments for starting up a
>> new instance of a server for each connection;
> I think the arguments in favor of a long running server are
> stronger than those against it in the case of compilers.
> The design and implementation have to be such that separate
> requests do not interfere with each other. There are some
> steps that you can take that help in that area, but don't
> require anything like a new process and a completely fresh
> start. Besides the basic efficiencies afforded, there's
> a lot of basic information that doesn't change between
> requests. Why rescan/prepare for <vector> billions of
> times when it doesn't change one little bit? It surprises
> me that you question this given what I know of your
> background.

The contents of std::vector are data, not code, and don't
evolve. There's nothing wrong with having it cached somewhere,
maybe loaded by mmap, but just keeping a server up so that
compilations won't have to reread it seems a bit overkill.
(There's also the fact that formally, the effects of compiling
an include file depend on what macros are defined beforehand.
Even something like std::vector: the user doesn't have the right
to define any macros which might conflict, but most
implementations have two or more different versions, depending
on the settings of various macros.)

Anyway, my comments were, largely, based on the way things are,
rather than how they could be. I've not actually given the idea
of implementing a compiler as a server much thought, but today's
compilers are not implemented that way. I certainly don't want
to imply that things have to be like they are.
> It has to be done efficiently. Single-run compilers are a
> luxury that is fading. I harp on this, but gcc needs to
> be overhauled twice. First a rewrite in C++ and then a
> rewrite to be an on line compiler. The first phase of
> the on line part could be to simply run once and exit
> after each request. That though would have to be
> replaced by a real server approach that runs like a
> champ. They are so far away from this it isn't even
> funny.

So are most of the other compiler implementers, as far as I
know.

I'm not too sure what the server approach would buy us in most
cases, as opposed, say, to decent pre-compiled headers and
caching. (If I were implementing a compiler today, it would
definitely make intensive use of caching; as you say, there's no
point in reparsing std::vector every time you compile.)
> As far as I know all they are working on is
> C++0X support. Some of that is important, too, but
> they shouldn't keep ignoring these other matters.
> It may be that gcc has just become a dinosaur that
> can't adapt to the times. They certainly haven't
> done a good job keeping up in some respects. Where's
> the "gcc on line" page like some compilers have?

The "xxx on line" pages I know of for other compilers are just
front ends, which run the usual, batch mode compiler. It would
be trivial for someone to do this with g++---if Comeau wanted,
for example, I doubt that it would take Greg more than a half a
day to modify his page so you could use either his compiler or
g++ (for comparison purposes?).

I'm certainly in favor of making compilers accessible on-line,
but that's a different issue. No professional organization
would use such a compiler, except for test or comparison
purposes.
 

coal

    [...]

> The contents of std::vector are data, not code, and don't
> evolve.  There's nothing wrong with having it cached somewhere,
> maybe loaded by mmap, but just keeping a server up so that
> compilations won't have to reread it seems a bit overkill.
> (There's also the fact that formally, the effects of compiling
> an include file depend on what macros are defined beforehand.
> Even something like std::vector: the user doesn't have the right
> to define any macros which might conflict, but most
> implementations have two or more different versions, depending
> on the settings of various macros.)

In that case, it might make sense to only have the most common
case cached and reparse the file if someone is doing something
somewhat unusual. That's how I would probably start to support
the idea.
> Anyway, my comments were, largely, based on the way things are,
> rather than how they could be.  I've not actually given the idea
> of implementing a compiler as a server much thought, but today's
> compilers are not implemented that way.  I certainly don't want
> to imply that things have to be like they are.
>
> So are most of the other compiler implementers, as far as I
> know.

Well, I think some C++ compilers are written in C++ so they
would be in better shape (potentially) in my opinion than gcc.
But probably few if any of them are being rewritten to be
on line servers. I believe that will change. I read an
article in the Wall Street Journal about how some companies
are giving away their software because over the past 18 months
their software stopped selling. That's not exactly news
except that the practice of giving away software is spreading
and companies that have previously been immune to some of the
market forces are now having to play by the rules that the
smaller guys accept.
> I'm not too sure what the server approach would buy us in most
> cases, as opposed, say, to decent pre-compiled headers and
> caching.  (If I were implementing a compiler today, it would
> definitely make intensive use of caching; as you say, there's no
> point in reparsing std::vector every time you compile.)

People have mentioned on here that it is a headache to
manage more than a couple of compilers on your own system.
Mainly, though, it buys efficiency.  A new release is made in
one place and then anyone may use it without having to
download and install it.  This avoids the case where a user
accidentally corrupts his installation and then has to
download and reinstall the compiler.  The main advantage
might be the speed with which new releases and fixes can be
made available.

The "xxx on line" pages I know of for other compilers are just
front ends, which run the usual, batch mode compiler.  It would
be trivial for someone to do this with g++---if Comeau wanted,
for example, I doubt that it would take Greg more than a half a
day to modify his page so you could use either his compiler or
g++ (for comparison purposes?).

Someone will probably do it eventually.
> I'm certainly in favor of making compilers accessible on-line,
> but that's a different issue.  No professional organization
> would use such a compiler, except for test or comparison
> purposes.

What about using it within their intranet? I think the choice
comes down to using a service for free on a public site or
paying for it and using it behind a firewall. Individuals and
smaller organizations will probably go with the free approach
and may use some techniques to protect their work.


Brian Wood
Ebenezer Enterprises
www.webEbenezer.net
 

coal

> What about using it within their intranet?  I think the choice
> comes down to using a service for free on a public site or
> paying for it and using it behind a firewall.  Individuals and
> smaller organizations will probably go with the free approach
> and may use some techniques to protect their work.

I would like to point out a couple more things. Those that
can afford to buy the service and use it privately are paying
a high price for the privacy because they have to pay for
managing/maintaining it themselves. They have to patch it
whenever they want to pick up a fix. They better be a very
well run organization in order to make that work. And say
Comeau had an on line version of his compiler. How can he
prevent someone who works for an organization that has bought
a license from him from copying his software and then trying
to make some money by making illegal copies of it? It's tough
to stop. This is a reason why Google and others who have the
on line versions of programs similar to Microsoft Office are
in better shape in my opinion than Microsoft going forward.
Microsoft drags its feet in developing/promoting their on
line versions.

In my case, I don't trust most politicians or governments
to do the right thing by me. So I make the service
available for free on line, but don't think it's a good
idea to sell it to someone because it might be illegally
copied after that. Software piracy/theft was a problem
before the recent economic problems, so I doubt matters
will improve and they may get worse. The truth is the
Chinese, Indian, Russian, etc. governments just don't
care about an American boy like me. They gotta think
about helping their people out. If that means
betraying me or Microsoft, that's OK with them.


Brian Wood
Ebenezer Enterprises
www.webEbenezer.net
 

woodbrian77

On Mar 27, 8:31 pm, (e-mail address removed) wrote:
    [...]
>> I'm not too sure what the server approach would buy us in most
>> cases, as opposed, say, to decent pre-compiled headers and
>> caching.  (If I were implementing a compiler today, it would
>> definitely make intensive use of caching; as you say, there's no
>> point in reparsing std::vector every time you compile.)
>
> People have mentioned on here that it is a headache to
> manage more than a couple of compilers on your own system.
> Mainly, though, it buys efficiency.  A new release is made in
> one place and then anyone may use it without having to
> download and install it.  This avoids the case where a user
> accidentally corrupts his installation and then has to
> download and reinstall the compiler.  The main advantage
> might be the speed with which new releases and fixes can be
> made available.

I felt like there were some other things to list, but couldn't
think of them at the time. On one of Comeau's pages it says:

"Evaluating Comeau C++ for your purchase. Some customers ask
about a test, demo, eval or trial version of Comeau C++. This
form is the next best thing."

Well, that form is helpful, but I'm not sure he can use it
to replace demos if he has a big client that wants more
than they can tell from the on line form. In that case he
would have to do something additional to provide them with
a demo. If he had the compiler fully on line, he would be
able to totally avoid demo-related expenses.


Brian Wood
Ebenezer Enterprises
www.webEbenezer.net
 
