Number formatting

Chris Theis · Oct 11, 2006

Hi all,

I'm currently facing something quite annoying, and perhaps one of you has an
idea how to solve it efficiently. I have some software (over which I have no
influence!) that delivers data in scientific notation, and I have to read
it. That is fairly simple, but here is the tricky part. The software is
written in FORTRAN and shows the following behavior, which IMHO is a bug
rather than a feature: when numbers get very small, like 7.0614E-238, it
writes them out as 7.0614-238. So when I parse the file I get 7.0614,
because the minus is seen as a separator. Of course I could read all the
data as strings, tokenize them and check for this quirky behavior, but that
would slow down reading the data, which can be really huge! Does anybody
have an idea how to "fix" this? I cannot change the software that delivers
these values, which IMHO are corrupted even though they are FORTRAN standard
compliant.

Cheers
Chris

Phlip · Oct 11, 2006

How do you know that parsing the - would slow the program down?

Here's a reprehensibly simple parser:

http://c2.com/cgi/wiki?MsWindowsResourceLint

Here's one of its member functions:

string const &
pullNextToken()
{
    m_priorToken = m_currentToken;
    extractNextToken();
    return m_currentToken;
}

Here's a unit test on that function:

TEST_(TestCase, pullNextToken)
{
    Source aSource("a b\nc\n d");

    string token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("a", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("b", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("c", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("d", token);
    token = aSource.pullNextToken();
    CPPUNIT_ASSERT_EQUAL("" , token); // EOF!
}

Now imagine if you wrote a dirt-simple parser, using fstream goodies, and
you also wrote unit tests like that. You could add a test that calls a hard
function ten thousand times, and then asserts that the CPU time didn't
exceed some obvious limit, like a thousandth of a second.

You will probably discover that your parser is not slow. If you only stream
characters, and never buffer strings into std::string (possibly slow), then
all your code might run inside the CPU's cache, without excessive data
motion on the main bus.

Never guess what could be slow; measure.

F.J.K. · Oct 11, 2006

Pretty much every programmer of scientific code has had that "joy". I'd
be interested myself whether there's some secret "Fortran locale" that
would make all of this obsolete. Looking at LC_NUMERIC and co., I doubt
it.

In C++ I use code like the following. If you really, really need to go
for speed, you'll have to roll your own parser. However, if speed were
an absolute issue, you'd be reading/writing binary data anyway, so
there's no point. Btw, it would be pretty easy to fix this problem from
the Fortran side.

#include <cmath>
#include <iostream>
#include <sstream>

// A double that can be read from Fortran output in which the 'E' has been
// dropped in front of a three-digit exponent, e.g. 1.2344-200.
struct fortran_double {
    fortran_double() : value(0.0) {}
    fortran_double& operator=(double d) {
        value = d;
        return *this;
    }
    operator double() const { return value; }
    friend std::istream& operator>>(std::istream& in, fortran_double& fd);
private:
    double value;
};

std::istream& operator>>(std::istream& in, fortran_double& fd) {
    double d;
    if (!(in >> d))
        return in;                      // no number: leave fd untouched
    // A sign directly after the mantissa is an exponent without its 'E'.
    // Skip peek() at end of input, where it would set failbit.
    if (!in.eof()) {
        int ch = in.peek();
        if (ch == '+' || ch == '-') {
            int exponent;
            if (in >> exponent)
                d *= std::pow(10.0, exponent);  // replaces the exp10 helper
        }
    }
    fd = d;
    return in;
}

int main() {
    fortran_double fd;
    std::istringstream in("1.2344-200");
    in >> fd;
    double x = fd;
    std::cout << x << "\n";
}

Chris Theis · Oct 11, 2006

Phlip said:
How do you know that parsing the - would slow the program down?

Never guess what could be slow; measure.

Hi Phlip,

now you're actually guessing that I didn't measure, aren't you? ;-) Well,
the thing is that at some point I would have to use strings to assemble the
complete number and finally convert it into a double. All of this is more
work than simply reading and storing a value. Therefore, I was looking for a
solution that doesn't need to re-assemble numbers via strings/characters but
rather some way to emulate this quirky FORTRAN format. However, I
increasingly get the impression that this simply doesn't work and that I
will have to try to convince the responsible people to adjust their format
specifiers, as it's just a couple of keystrokes for them, whereas I would
have to invest quite some time to solve this.

Thanks
Chris

Howard · Oct 11, 2006

Phlip said:
Here's a unit test on that function:

Now imagine if you wrote a dirt-simple parser, using fstream goodies, and
you also wrote unit tests like that.

You've got a little crush on that "unit test" thingie, don't you? C'mon,
fess up, you know you like it...

;-)

Phlip · Oct 12, 2006

Howard said:
You've got a little crush on that "unit test" thingie, don't you? C'mon,
fess up, you know you like it...

A "crush"? You might also call it a marriage...

now you're actually guessing that I didn't measure, aren't you? ;-)

I answer "premature optimization is the root of all evil" too often here...

Well, the thing is that at one point I would have to use strings to
assemble the total number and finally convert it into a double.

At the bottom of my post I hinted that dealing in streams instead of strings
would be faster, and more like a parser.

So if you put my technique together with F.J.K.'s, you could use his main()
as your first unit test.

All of this is more work than simply reading and storing a value.

More coding for you or more work for the CPU? F.J.K.'s solution shows how to
parse and treat each number as you get it, without putting the numbers into
separate std::string objects or anything like that.

...Although, I more and more get the impression that this simply doesn't
work and I will have to try to convince the responsible people to
adjust their format specifiers, as it's just a couple of key punches for
them, whereas I would have to invest quite some time to solve this.

And in terms of process, one fixes a bug as close as possible to its source.
Don't output a bug, then detect it and clean up after it with extra
statements.

Chris Theis · Oct 12, 2006

Hi there,

Pretty much every programmer of scientific code has had that "joy". I'd
be interested myself, whether there's some secret "Fortran locale",
that would make all of this obsolete. Looking at LC_NUMERIC and co., I
doubt it.

I did some research, but I honestly doubt it too :-(

In C++ I use code like the following. If you really, really need to go
for speed, you'll have to roll your own parser. However, if speed were
an absolute issue, you'd be reading/writing binary data anyway, so
there's no point.

Binary is a little complicated, as we have to remain portable across a lot
of platforms, and there are already some backward compatibility issues with
the program delivering the data. So this topic is unfortunately a little
touchy and beyond my influence.

Btw, it would be pretty easy to fix this problem from
the fortran side.

Yes, that's for sure - it would just mean adding "E3" to the format string.
But the tricky thing is to convince the person responsible, a hardcore
FORTRAN developer, to acknowledge that something like 7.0631-236 is an
expression and not proper scientific notation for a value ;-)

Thanks for the code - it's pretty much what I finally came up with and
implemented as a first work-around.

Thanks a lot guys
Chris
