Number formatting

Chris Theis

Hi all,

I'm currently facing something which is quite annoying and probably one of
you might have an idea of how to solve it efficiently. I have some software
(upon which I have no influence!!!!) which delivers data in scientific
notation and I have to read it. This is fairly simple, but here is the
tricky thing. This software is written in FORTRAN and shows the following
feature, which IMHO is rather a bug than a feature. If numbers get very
small like 7.0614E-238 it starts writing them out as 7.0614-238. So when I
parse the file what I get is 7.0614 because the minus is seen as a
separator. Of course I could start reading all the data as strings,
tokenizing them and start checking for this rather quirky behavior, but this
would slow down the process of reading the data which can be really huge!
Does anybody have an idea on how to "fix" this problem? I cannot change
the software which delivers these values, which IMHO are corrupted even
though they are FORTRAN standard compliant.

Cheers
Chris
 
Phlip

Chris said:
I'm currently facing something which is quite annoying and probably one of
you might have an idea of how to solve it efficiently. I have some
software (upon which I have no influence!!!!) which delivers data in
scientific notation and I have to read it. This is fairly simple, but here
is the tricky thing. This software is written in FORTRAN and shows the
following feature, which IMHO is rather a bug than a feature. If numbers
get very small like 7.0614E-238 it starts writing them out as 7.0614-238.
So when I parse the file what I get is 7.0614 because the minus is seen as
a separator. Of course I could start reading all the data as strings,
tokenizing them and start checking for this rather quirky behavior, but
this would slow down the process of reading the data which can be really
huge!

How do you know that parsing the - would slow the program down?

Here's a reprehensibly simple parser:

http://c2.com/cgi/wiki?MsWindowsResourceLint

Here's one of its member functions:

string const &
pullNextToken()
{
    m_priorToken = m_currentToken;
    extractNextToken();
    return m_currentToken;
}

Here's a unit test on that function:

TEST_(TestCase, pullNextToken)
{
    Source aSource("a b\nc\n d");

    string token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("a", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("b", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("c", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("d", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("" , token); // EOF!
}

Now imagine if you wrote a dirt-simple parser, using fstream goodies, and
you also wrote unit tests like that. You could add a test that calls a hard
function ten thousand times, and then asserts that the CPU time didn't
exceed some obvious limit, like a thousandth of a second.

You will probably discover that your parser is not slow. If you only stream
characters, and never buffer strings into std::string (possibly slow), then
all your code might run inside the CPU's cache, without excessive data
motion on the main bus.

Never guess what could be slow; measure.
 
F.J.K.

Pretty much every programmer of scientific code has had that "joy". I'd
be interested myself, whether there's some secret "Fortran locale",
that would make all of this obsolete. Looking at LC_NUMERIC and co. I
doubt so :(

In C++ I use code like the following. If you really, really need to go
for speed, you'll have to roll your parser yourself. However, if speed
was an absolute issue, you'd be reading/writing binary data anyways, so
there's no point. Btw, it would be pretty easy to fix this problem from
the fortran side.

#include <cmath>
#include <iostream>
#include <sstream>

// Wrapper around double that also accepts Fortran output such as
// "7.0614-238", where the 'E' of the exponent has been dropped.
struct fortran_double {
    fortran_double& operator=(double d) {
        value = d;
        return *this;
    }
    operator double() const { return value; }
    friend std::istream& operator>>(std::istream& in, fortran_double& fd);
private:
    double value = 0.0;
};

std::istream& operator>>(std::istream& in, fortran_double& fd) {
    double d;
    if (!(in >> d)) return in;   // no number read: leave fd untouched
    if (in.good()) {             // peek() only if we are not at end of input
        // A sign directly after the mantissa is an exponent without 'E'.
        const int ch = in.peek();
        if (ch == '+' || ch == '-') {
            int exponent;
            if (in >> exponent) d *= std::pow(10.0, exponent);
        }
    }
    fd = d;
    return in;
}

int main() {
    double x = 0;
    fortran_double fd;
    std::istringstream in("1.2344-200");
    in >> fd;
    x = fd;
    std::cout << x << "\n";
}
 
Chris Theis

Phlip said:
Never guess what could be slow; measure.

Hi Phlip,

now you're actually guessing that I didn't measure, aren't you? ;-) Well,
the thing is that at one point I would have to use strings to assemble the
total number and finally convert it into a double. All of this is more work
than simply reading and storing a value. Therefore, I was looking for a
solution which doesn't necessarily need to re-assemble numbers via
strings/characters but rather some way to emulate this quirky FORTRAN
format. Although, I more and more get the impression that this simply
doesn't work and I will have to try to convince the responsible people
to adjust their format specifiers, as it's just a couple of key punches for
them, whereas I would have to invest quite some time to solve this.

Thanks
Chris
 
Phlip

Howard said:
You've got a little crush on that "unit test" thingie, don't you? C'mon,
fess up, you know you like it...

A "crush"? You might also call it a marriage...

Chris said:
now you're actually guessing that I didn't measure, aren't you? ;-)

I answer "premature optimization is the root of all evil" too often here...

Chris said:
Well, the thing is that at one point I would have to use strings to
assemble the total number and finally convert it into a double.

At the bottom of my post I hinted that dealing in streams instead of strings
would be faster, and more like a parser.

So if you put my technique together with F.J.K.'s, you could use his main()
as your first unit test.

Chris said:
All of this is more work than simply reading and storing a value.

More coding for you or more work for the CPU? F.J.K.'s solution shows how to
parse and treat each number as you get it, without putting the numbers into
separate std::string objects or anything like that.

Chris said:
...Although, I more and more get the impression that this simply doesn't
work and I will have to try to convince the responsible people to
adjust their format specifiers, as it's just a couple of key punches for
them, whereas I would have to invest quite some time to solve this.

And in terms of process, one fixes a bug as close as possible to its source.
Don't output a bug, then detect it and clean up after it with extra
statements.
 
Chris Theis

Hi there,

F.J.K. said:
Pretty much every programmer of scientific code has had that "joy". I'd
be interested myself, whether there's some secret "Fortran locale",
that would make all of this obsolete. Looking at LC_NUMERIC and co. I
doubt so :(

I did some research but I honestly doubt so too :-(

F.J.K. said:
In C++ I use code like the following. If you really, really need to go
for speed, you'll have to roll your parser yourself. However, if speed
was an absolute issue, you'd be reading/writing binary data anyways, so
there's no point.

Binary is a little complicated as we have to remain portable for a lot of
platforms and there are some backwards compatibility issues with the program
delivering the data already. So this topic is unfortunately a little touchy
and beyond my influence.

F.J.K. said:
Btw, it would be pretty easy to fix this problem from
the fortran side.

Yes that's for sure - it would be adding "E3" to the format string and
that's it. But the tricky thing is to convince the responsible person, a
hardcore FORTRAN developer, to acknowledge that something like 7.0631-236 is
an expression and not a proper scientific notation for a value ;-)

Thanks for the code - it's pretty much what I finally came up with and
implemented as a first work-around.

Thanks a lot guys
Chris
 
