Reading an array from file?


fdm

Hi I have a .txt file containing:

[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,
0.119546, -0.00335514, -0.217619, -0.107065, 0, -0.0329693, -0.148395,
0.104663, -0.550282, -1.26802, -0.705694, 0.0873308, -0.309962, -0.802861,
0, 0.063379,
0.398289, -1.44105, -1.53938, -1.7608, -1.38484, -0.711425, -1.10221, -1.59358,
0,
0.298058, -0.564321, -1.91097, -3.58063, -6.30183, -4.78945, -1.61198, -0.70215,
-0.954023, 0,
0.54965, -0.57544, -2.33652, -6.10376, -4.54323, -4.77601, -4.48725, -0.489267,
-0.570523, 0,
0.668925, -0.46068, -2.42157, -4.74326, -12.8757, -6.57763, -1.16318, -3.09268,
-0.411637, 0,
0.0390142, -0.273687, -0.624816, -1.51086, -2.18197, -1.86934,
0.297622, -1.07632, -0.0820767, 0, 0.0166208, -0.543326, 0.543721, -1.87936,
1.06337, 0.0752932, -0.0704278, -0.307334, -0.99684, 0,
0.00486263, -0.12788, -0.25644, -0.491107,
0.201335, -1.09141, -0.694021, -0.24188, -0.212387, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, -0.000112593, 0.0110926,
0.0167261, -0.049946, -0.0783788, -0.0384529,
0.0407556, -0.154947, -0.0463077, 0, -0.0182507, 0.00359299, 0.00784705,
0.270986, 1.90373, 0.0225429, -0.684401, -0.250102, 0.0345052,
0, -0.0636621, -0.364021, -1.0357, -2.70395, -4.77634, -0.987079, -0.837127,
1.46826, 0.682614,
0, -0.0313031, -0.717254, -0.545265, -17.2006, -31.0422, -20.0047, -2.02679,
-1.18301, 0.0228328, 0, -0.0125886, -4.34123,
0.0787134, -45.9453, -66.6283, -50.7082, 1.52779, -1.68643, -0.339206, 0,
0.65181, -8.32657, 6.24457, -37.9488, -110.562, -54.1226,
3.39012, -0.0921196, 0.12512, 0, 1.67071, 0.694154, -3.71556,
9.19359, -8.64445, 14.5316, -1.12059, -0.852576, 0.59615, 0,
0.001542, -0.94513, -0.844656, -6.95102, 1.63441, -5.0893, -3.16847,
1.19829, 0.0257344, 0, -0.186464, -1.54877, 0.321253,
0.403886, -0.983199, -1.91005, -0.53617, -0.353148, -0.0942512, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]

Now I would like to read this array into a C++ container/array so I can do
something like:

double first = container[0]; // -0.00231844

The number of elements in the file can vary from file to file so I was
thinking of using a std::vector. The below code:

std::vector<std::string> file;
std::string line;
file.clear();
std::ifstream infile(parameters_path.c_str(), std::ios_base::in);
while (getline(infile, line, '\n')) {
    file.push_back(line);
}

std::cout << "Read " << file.front() << " lines.\n";


stores the whole array from the file as a single string in the vector. But I
still need to "tokenize" this string into its double parts. Before throwing
myself into cumbersome code I would like to hear if anyone has a good idea
to do this the right way.
 

Victor Bazarov

fdm said:
Hi I have a .txt file containing:

[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853, 0.119546,
[ ... ]
0, 0, 0, 0, 0]

Are the brackets part of the file or part of your message showing the
contents of the file?
Now I would like to read this array into a C++ container/array so I can
do something like:

double first = container[0]; // -0.00231844

The number of elements in the file can vary from file to file so I was
thinking of using a std::vector. The below code:

std::vector<std::string> file;
std::string line;
file.clear();
std::ifstream infile (parameters_path.c_str(), std::ios_base::in);
while (getline(infile, line, '\n')) {
file.push_back (line);
}

std::cout << "Read " << file.front() << " lines.\n";


stores the whole array from the file as a single string in the vector.
But I still need to "tokenize" this string into its double parts. Before
throwing myself into cumbersome code

But you already have! Why don't you just read doubles from the file
instead of reading lines and parsing them later?
I would like to hear if anyone has
a good idea to do this the right way.

// if your file does have the leading bracket:
std::vector<double> values;
while (infile)
{
    char dummy_char;
    if (infile >> dummy_char)
    {
        double d;
        if (infile >> d)
            values.push_back(d);
    }
}

// if your file does NOT have the leading bracket:
std::vector<double> values;
while (infile)
{
    double d;
    if (infile >> d)
        values.push_back(d);
    char dummy_char;
    infile >> dummy_char;
}

And the second variation can probably be used even for the file with the
bracket, I didn't check.
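
(On closer inspection, with a leading '[' the very first "infile >> d" in
the second variation fails and sets failbit, so that loop stops at once.
A combined sketch that should handle the file with or without the bracket
-- consume an optional leading character, then alternate value and
separator; the file name below is only a placeholder:)

#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream infile("testdoubles.txt"); // placeholder name
    std::vector<double> values;

    char c;
    if (infile >> c && c != '[')   // consume an optional leading '['...
        infile.putback(c);         // ...or put the character back

    double d;
    while (infile >> d)            // read a value,
    {
        values.push_back(d);
        infile >> c;               // then skip the ',' (or the final ']')
    }
    std::cout << "read " << values.size() << " doubles\n";
}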

V
 

fdm

Victor Bazarov said:
fdm said:
Hi I have a .txt file containing:

[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853, 0.119546,
[ ... ]
0, 0, 0, 0, 0, 0, 0, 0]

Are the brackets part of the file or part of your message showing the
contents of the file?
Now I would like to read this array into a C++ container/array so I can
do something like:

double first = container[0]; // -0.00231844

The number of elements in the file can vary from file to file so I was
thinking of using a std::vector. The below code:

std::vector<std::string> file;
std::string line;
file.clear();
std::ifstream infile (parameters_path.c_str(), std::ios_base::in);
while (getline(infile, line, '\n')) {
file.push_back (line);
}

std::cout << "Read " << file.front() << " lines.\n";


stores the whole array from the file as a single string in the vector.
But I still need to "tokenize" this string into its double parts. Before
throwing myself into cumbersome code

But you already have! Why don't you just read doubles from the file
instead of reading lines and parsing them later?
I would like to hear if anyone has
a good idea to do this the right way.

// if your file does have the leading bracket:
std::vector<double> values;
while (infile)
{
char dummy_char;
if (infile >> dummy_char)
{
double d;
if (infile >> d)
values.push_back(d);
}
}

I have now tried:

std::ifstream infile(path_to_file.c_str(), std::ios_base::in);

// if your file does have the leading bracket: Yes it does
std::vector<double> values;
while (infile) {
    char dummy_char;
    if (infile >> dummy_char) {
        double d;
        if (infile >> d) {
            values.push_back(d);
            std::cout << "val = " << d << std::endl;
        }
    }
}

The loop only gets executed once since all the values are on one line (this
is the format) and no doubles are created.

Should the line not be read and then parsed to doubles where ',' is used as
a delimiter?
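
Something like the below is what I had in mind -- just a sketch,
splitting each line on ',' and converting the tokens, untested:

#include <sstream>
#include <string>
#include <vector>

// Split one line on ',' and convert each token to a double.
// A token may carry a stray '[' or ']' from the file format.
std::vector<double> tokenize(const std::string& line)
{
    std::vector<double> values;
    std::istringstream ss(line);
    std::string token;
    while (std::getline(ss, token, ',')) {
        std::istringstream conv(token);
        double d;
        if (conv >> d) {                 // plain number
            values.push_back(d);
        } else {
            conv.clear();                // token started with '[' or ']'
            char bracket;
            if (conv >> bracket >> d)    // skip it and retry
                values.push_back(d);
        }
    }
    return values;
}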
 

Victor Bazarov

fdm said:
Victor Bazarov said:
fdm said:
Hi I have a .txt file containing:

[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853, 0.119546,
[ ... ]
-0.53617, -0.353148, -0.0942512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[..]

I have now tried:

std::ifstream infile(path_to_file.c_str(), std::ios_base::in);

// if your file does have the leading bracket: Yes it does
std::vector<double> values;
while (infile) {
char dummy_char;
if (infile >> dummy_char) {
double d;
if (infile >> d) {
values.push_back(d);
std::cout << "val = " << d << std::endl;
}
}
}

The loop only gets executed once since all the values are on one line
(this is the format) and no doubles are created.

I don't understand this statement, sorry. I tested that loop on a
stringstream with three doubles separated by commas, and it read all
three of them.
Should the line not be read and then parsed to doubles where ',' is used
as a delimiter?

Why bother?

OK, I am going to test on a real file with five doubles... Gimme five
minutes...

V
 

Victor Bazarov

fdm said:
[..]
I have now tried:

std::ifstream infile(path_to_file.c_str(), std::ios_base::in);

// if your file does have the leading bracket: Yes it does
std::vector<double> values;
while (infile) {
char dummy_char;
if (infile >> dummy_char) {
double d;
if (infile >> d) {
values.push_back(d);
std::cout << "val = " << d << std::endl;
}
}
}

The loop only gets executed once since all the values are on one line
(this is the format) and no doubles are created.

Should the line not be read and then parsed to doubles where ',' is used
as a delimiter?

OK, here are the results:
------------------------- code
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::string path_to_file("testdoubles.txt");
    std::ifstream infile(path_to_file.c_str(), std::ios_base::in);

    // if your file does have the leading bracket: Yes it does
    std::vector<double> values;
    while (infile) {
        char dummy_char;
        if (infile >> dummy_char) {
            double d;
            if (infile >> d) {
                values.push_back(d);
                std::cout << "val = " << d << std::endl;
            }
        }
    }

    std::cout << "read " << values.size() << " doubles\n";

    return 0;
}
-------------------------------- file 'testdoubles.txt'
[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853]
-----------------------------------------------------------

Compiled with VC++ 2008 and run, here is the output:
-------------------------------------
val = -0.00231844
val = -0.02326
val = 0.0484723
val = 0.0782189
val = 0.0917853
read 5 doubles
-------------------------------------

I took your first 5 doubles, closed them with the bracket, and placed
the TXT file next to the executable. Everything went fine, as you can
see here. Would you take my word for it? <shrug> Now, I have no idea
what you're doing wrong, but you must be doing something wrong.

V
 

Jerry Coffin

Hi I have a .txt file containing:

[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,

[ ... ]
I
still need to "tokenize" this string into its double parts. Before throwing
myself into cumbersome code I would like to hear if anyone has a good idea
to do this the right way.

I'm not sure if it's really "right" (some would argue that it's
downright wrong), but for your purposes, the commas are essentially
the same as white space -- i.e. they're something between the data
that you ignore. As such, I'd probably just create a locale where
',' is treated as white space, and then let the iostream handle the
"parsing" involved:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <vector>

struct digits_only: std::ctype<char>
{
    digits_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::ctype_base::mask rc[std::ctype<char>::table_size];
        static bool inited = false;

        if (!inited)
        {
            // Everything is a "space"
            std::fill_n(rc, std::ctype<char>::table_size,
                        std::ctype_base::space);

            // unless it can be part of a floating point number.
            std::fill_n(rc+'0', 10, std::ctype_base::mask());
            rc['.'] = std::ctype_base::mask();
            rc['-'] = std::ctype_base::mask();
            rc['+'] = std::ctype_base::mask();
            rc['e'] = std::ctype_base::mask();
            rc['E'] = std::ctype_base::mask();
            inited = true;
        }
        return rc;
    }
};

int main() {
    std::cin.imbue(std::locale(std::locale(), new digits_only()));

    std::vector<double> data;
    std::copy(std::istream_iterator<double>(std::cin),
              std::istream_iterator<double>(),
              std::back_inserter(data));

    std::copy(data.begin(), data.end(),
              std::ostream_iterator<double>(std::cout, "\n"));
    return 0;
}

This should accept any valid input. Most invalid data in the input
file will simply be ignored. A few things in the input could stop the
reading at that point, though. Just for example, input that contained
text would fail at the first 'e' or 'E' in the text -- those can't be
ignored because they _could_ be part of a f.p. number, but by itself
(without some digits) such a character makes the attempted conversion
fail.
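
If the data comes from a file instead of std::cin, the same facet can
be imbued on the stream before the first read. A sketch (the path is
just a stand-in for the OP's real one):

#include <fstream>
#include <iterator>
#include <locale>
#include <vector>

// digits_only as defined above.

int main() {
    std::ifstream infile("params.txt"); // stand-in path
    infile.imbue(std::locale(std::locale(), new digits_only()));

    std::vector<double> data(
        (std::istream_iterator<double>(infile)),
        std::istream_iterator<double>());
    return 0;
}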
 

fdm

Victor Bazarov said:
fdm said:
[..]

I took your first 5 doubles, closed them with the bracket, and placed the
TXT file next to the executable. Everything went fine, as you can see
here. Would you take my word for it? <shrug> Now, I have no idea what
you're doing wrong, but you must be doing something wrong.


I was specifying a wrong path; now it works perfectly.

Just to make sure I understand: the below while loop will be executed as
long as there is content in the file that can be converted to doubles:

while (infile) {
    char dummy_char;
    if (infile >> dummy_char) {
        double d;
        if (infile >> d) {
            values.push_back(d);
            std::cout << "val = " << d << std::endl;
        }
    }
}

I assume that it runs from left to right, line by line, through the file,
and that each 'd' is a valid double it passes on its run.

Normally I would expect some kind of 'infile.next()' operation but this
seems to be implicitly invoked in the loop or what?
 

Victor Bazarov

fdm said:
[..]
I was specifying a wrong path; now it works perfectly.

Just to make sure I understand: the below while loop will be executed as
long as there is content in the file that can be converted to doubles:

while (infile) {
char dummy_char;
if (infile >> dummy_char) {
double d;
if (infile >> d) {
values.push_back(d);
std::cout << "val = " << d << std::endl;
}
}
}

I assume that it runs from left to right line by line in the file

Files don't have lines. Lines are invented by people. Files have
bytes. Some byte value people decided to call "a line break", and
assume that whatever is in the file between here and that "line break"
is "a line". In fact, to the operator>> that reads a double a "line
break" is just whitespace, a separator.

High-level formatted I/O bundles up whitespace. That's why if you make
a special locale that treats commas as whitespace (like Jerry suggested)
it would be like you had made a find-and-replace operation and swapped all
commas for more spaces.

The loop skips a char, then reads a *field* (anything that can be
converted into a double value) as long as it can. Then it stops,
converts the field and assigns the value to 'd', then starts over. By
that I mean that it reads out another character (the comma), then reads
another field, converts into a double, and so on.
and each 'd' is a valid double it passes on its run.

Normally I would expect some kind of 'infile.next()' operation but this
seems to be implicitly invoked in the loop or what?

I am not sure I understand the question. *Reading* from the file moves
the "cursor", there is no need to do any additional movement (by calling
"next" or whatever).

V
 

James Kanze

Hi I have a .txt file containing:
[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,
[ ... ]
I still need to "tokenize" this string into its double
parts. Before throwing myself into cumbersome code I would
like to hear if anyone has a good idea to do this the right
way.
I'm not sure if it's really "right" (some would argue that
it's downright wrong), but for your purposes, the commas are
essentially the same as white space -- i.e. they're something
between the data that you ignore. As such, I'd probably just
create a locale where ',' is treated as white space, and then
let the iostream handle the "parsing" involved:

Where commas and opening and closing brackets are white space.

But a lot depends on how much error checking is deemed
necessary. In his case, a priori, commas aren't "just white
space", since something like "1.2,,3.4" should probably produce
an error, and if the file doesn't start with a '[' and end with
an ']', it's also an error.

There are several ways of tackling this; Victor's loop, but
checking the value of dummy_char, is probably the simplest for a
beginner to write and understand. If error checking on the
input format is not desired (although it generally isn't a good
idea to drop such error checking), the easiest solution is
probably a FloatWithSeparator class, something like:

class FloatWithSeparator
{
public:
    operator double() const
    {
        return myValue ;
    }
    friend std::istream& operator>>(
        std::istream& source,
        FloatWithSeparator& dest )
    {
        char separator ;
        source >> separator >> dest.myValue ;
        return source ;
    }

private:
    double myValue ;
} ;

You can then define the vector with something like:

std::vector< double > v(
    (std::istream_iterator< FloatWithSeparator >( file )),
    (std::istream_iterator< FloatWithSeparator >()) ) ;
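
The extra parentheses around the iterator arguments keep the declaration
from being parsed as a function declaration. Put together, a sketch (the
file name is purely illustrative):

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// FloatWithSeparator as above.

int main()
{
    std::ifstream file( "params.txt" ) ; // illustrative path
    std::vector< double > v(
        (std::istream_iterator< FloatWithSeparator >( file )),
        (std::istream_iterator< FloatWithSeparator >()) ) ;
    std::cout << "read " << v.size() << " doubles\n" ;
    return 0 ;
}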
 

Jerry Coffin

[ ... ]
But a lot depends on how much error checking is deemed
necessary. In his case, a priori, commas aren't "just white
space", since something like "1.2,,3.4" should probably produce
an error, and if the file doesn't start with a '[' and end with
an ']', it's also an error.

That could well be -- when I posted this, I hadn't yet seen his post
confirming that there really even _was_ a leading and trailing
bracket, rather than that being something he added for posting.

He still hasn't said anything about how much (if at all) he cares
about verifying that the input is really in the correct format.
Sometimes you need to verify the input rigorously, but others
(including a few projects I've been involved with) they knew their
data was a mess, and that dealing with it all correctly just wasn't
going to happen -- but they wanted a best effort to read as much as
possible as reasonably as possible.

Unfortunately, the OP hasn't really told us enough to figure out
which category this data falls into. What he posted _looks_
sufficiently "regular" that it might make sense to just reject
anything that doesn't look perfect -- but that's purely a guess based
on a sample size of 1...
 

James Kanze

Victor Bazarov said:
fdm said:
[..]
I was specifying a wrong path; now it works perfectly.
Just to make sure I understand: the below while loop will be
executed as long as there is content in the file that can
be converted to doubles:
while (infile) {
char dummy_char;
if (infile >> dummy_char) {
double d;
if (infile >> d) {
values.push_back(d);
std::cout << "val = " << d << std::endl;
}
}
}
I assume that it runs from left to right line by line in the
file
Files don't have lines.

Of course they do. The C++ (and the C) standard says they do.
Lines are invented by people.

So were files.
Files have bytes.

Files come in all sorts of varieties and flavors, depending on
the OS. Although it's true that on almost all systems, files
contain "bytes" (Windows is, I think, the only exception), those
bytes are structured in various ways; text files are structured
in lines, for example.
Some byte value people decided to call "a line break", and
assume that whatever is in the file between here and that
"line break" is "a line".

That's more or less the Unix point of view. It's not the C/C++
point of view (where we have both text and binary files), nor
the point of view of most OS's.
In fact, to the operator>> that reads a double a "line break"
is just whitespace, a separator.

If the file is opened in text mode, the system reads lines, and
appends a '\n' to the end of each line. What a line is depends
on the system. And a '\n' is considered white space by
std::istream (at least in the usual locales---Jerry suggested a
locale where ',' was also considered white space, but one could
just as easily create a locale where '\n', or even ' ', was not
white space).

If the file is opened in binary mode, the system reads bytes
(whatever that means on the system), and the input isn't
structured in lines (although it might very well contain bytes
whose numeric value corresponds to '\n').
High-level formatted I/O bundles up whitespace.

The correct word is "skips", not bundles up. And only at the
start of each input, and only if std::ios::skipws is set in
std::ios::fmtfield (the default).
That's why if you make a special locale that treats commas as
whitespace (like Jerry suggested) it would be like you had made a
find-and-replace operation and swapped all commas for more
spaces.
The loop skips a char, then reads a *field* (anything that can
be converted into a double value) as long as it can.

The loop reads two fields, one a char, and the other a double.
It skips any preceding white space for both fields.

A more idiomatic way of writing the loop would be:

char sep ;
double d ;
while ( infile >> sep >> d ) {
// Check if sep is the expected value?
values.push_back( d ) ;
std::cout << "val = " << d << std::endl;
}

Whether this is preferable to your version, I don't know; if
there is an error in the format of the file, your version makes
it easier to determine where. (On the other hand, if you want
really clear error messages, you'll modify the reading in some
way in order to be able to output the line number. I typically
use a filtering streambuf for this, but that's definitly a
technique that a beginner wouldn't apply. Note too that any
good error handling will make the code three to five times more
complicated.)
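
A sketch of that check, assuming only '[' or ',' may legally precede a
value:

char sep ;
double d ;
while ( infile >> sep >> d ) {
    if ( sep != '[' && sep != ',' ) {
        std::cerr << "unexpected separator: " << sep << '\n' ;
        break ;
    }
    values.push_back( d ) ;
}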
Then it stops, converts the field and assigns the value to
'd', then starts over. By that I mean that it reads out
another character (the comma), then reads another field,
converts into a double, and so on.

Above all, it skips any whitespace before the comma AND before
the double.
I am not sure I understand the question.

He seems to be taking the Pascal view, in which a "file" is a
sliding window in the input stream---you can read without
advancing the position (although the more frequently used
commands also advance it). In C or C++, this can easily be done
at the character level, e.g. istream::peek() and istream::get(),
or by using the streambuf directly. For higher level objects,
however, it doesn't work. (Note that he couldn't read his file
at all in Pascal, except by declaring it as a FILE OF CHARACTER,
and doing all the parsing and conversions himself.)
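
For reference, a sketch of that character-level lookahead (the stream
name is illustrative):

int c = infile.peek() ; // look at the next character without consuming it
if ( c == '[' )
    infile.get() ;      // consume it only once we know we want it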
 

Alf P. Steinbach

* James Kanze:
Files come in all sorts of varieties and flavors, depending on
the OS. Although it's true that on almost all systems, files
contain "bytes" (Windows is, I think, the only exception), those
bytes are structured in various ways; text files are structured
in lines, for example.

Files in Windows only contain "bytes". But they can have areas where they don't
really contain anything, namely, a sparse file. ;-) Also, a Windows file can
contain more than one stream (sequence) of bytes. Whether these features are
present depends on the filesystem the file resides on. The default filesystem,
NTFS, supports these features. But any way you look at it, at the bottom of the
interpretation ladder there are bytes. And nothing else.

Don military boots and kick the person who misinformed you, in his/her ass.

But, be careful: be sure to not seriously hurt the brain residing there. ;-)



Cheers,

- Alf
 

James Kanze

Alf P. Steinbach said:
Files in Windows only contain "bytes". But they can have areas
where they don't really contain anything, namely, a sparse
file. ;-) Also, a Windows file can contain more than one
stream (sequence) of bytes. Whether these features are present
depends on the filesystem the file resides on. The default
filesystem, NTFS, supports these features. But any way you look
at it, at the bottom of the interpretation ladder there are
bytes. And nothing else.

Well, there's a certain level where everything is just bytes.
But I was under the impression that Windows used UTF-16 for text
at the system level, and that files could (and text files
generally did) contain UTF-16---i.e. 16 bit entities. (And
under Windows on a PC, a byte is 8 bits.)

On the other hand, now that you mention it... When I ported some
of my file handling classes to Windows, filenames for CreateFile
were always LPCTSTR, whatever that is (but a narrow character
string literal converts implicitly to it, as does the result of
std::string::c_str()), which makes me wonder why people argue
that std::fstream must have a form which takes a wchar_t string
as filename argument. According to the documentation, WriteFile
and ReadFile take what I assume to be a void* (LPCVOID or
LPVOID), which doesn't say much one way or the other, but the
length argument is specified as "number of bytes".
Don military boots and kick the person who misinformed you, in
his/her ass.
But, be careful: be sure to not seriously hurt the brain
residing there. ;-)

I think it was more an impression I got from postings here, from
which I got the impression that all (or most) of the API was
ambivalent; change a macro or a compiler option, and you got a
different set of system API's, which expected wchar_t (and the
type of TCHAR changed as well). I was probably just
extrapolating too much into it.
 

Jerry Coffin

[ ... ]
Well, there's a certain level where everything is just bytes.
But I was under the impression that Windows used UTF-16 for text
at the system level, and that files could (and text files
generally did) contain UTF-16---i.e. 16 bit entities. (And
under Windows on a PC, a byte is 8 bits.)

They can, but they far more often contain something like ISO 8859.

In the end, the OS is mostly agnostic about the content of text
files. As you'd expect, it includes some utilities that know how to
work with text files, and most of those can work with files
containing either 8-bit or 16-bit entities, and even guess which a
particular file contains (though the guess isn't always right).
On the other hand, now that you mention it... When I ported some
of my file handling classes to Windows, filenames for CreateFile
were always LPCTSTR, whatever that is (but a narrow character
string literal converts implicitly to it, as does the results of
std::string.c_str()), which makes me wonder why people argue
that std::fstream must have a form which takes a wchar_t string
as filename argument.

Just FWIW, LPCTSTR is something like long pointer to const text
string (where 'text' means char's or wchar_t's depending on whether
_UNICODE was defined or not when compiling).

If you don't have _UNICODE defined, CreateFile will accept a char *.
If you do define it, CreateFile accepts a wchar_t *.

In reality, most of the functions in Windows that take strings come
in two flavors: an 'A' version and a 'W' version, so the headers look
something like this:

HANDLE CreateFileW(wchar_t const *, /* ... */);
HANDLE CreateFileA(char const *, /* ... */);

#ifdef _UNICODE
#define CreateFile CreateFileW
#else
#define CreateFile CreateFileA
#endif

The 'A' version, however, is a small stub that converts the string
from the current code page to UTF-16, and then (in essence) feeds
that result to the 'W' version. That can lead to a problem if you use
the 'A' version -- if your current code page doesn't contain a
character corresponding to a character in the file name, you may not
be able to create that file name with the 'A' version at all.

The 'W' version lets you specify UTF-16 characters directly, so it
can specify any file name that can exist -- but fstream::fstream and
fstream::open usually act as wrappers for the 'A' version.

Of course, you _could_ work around this without changing the fstream
interface -- for example, you could write it to expect a UTF-8
string, convert it to UTF-16, and then pass the result to CreateFileW
-- but I don't know of anybody who does so. As I recall, there are
also some characters that can't be encoded as UTF-8, so even that
wouldn't be a perfect solution, though it would usually be adequate.
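
Just to illustrate, a sketch of that workaround (error handling omitted;
the helper name is mine):

#include <string>
#include <vector>
#include <windows.h>

// Convert a UTF-8 name to UTF-16, then open with the 'W' API.
HANDLE open_utf8(std::string const &name)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, name.c_str(), -1, NULL, 0);
    std::vector<wchar_t> wide(len);
    MultiByteToWideChar(CP_UTF8, 0, name.c_str(), -1, &wide[0], len);

    return CreateFileW(&wide[0], GENERIC_READ, FILE_SHARE_READ,
                       NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
}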
According to the documentation, WriteFile
and ReadFile take what I assume to be a void* (LPCVOID or
LPVOID), which doesn't say much one way or the other, but the
length argument is specified as "number of bytes".

Right -- the OS just passes this data through transparently.
Fundamentally it's about like write() on Unix -- it just deals with a
stream of bytes; any other structure is entirely up to you and what
you choose to write and how you choose to interpret data you read.

[ ... ]
I think it was more an impression I got from postings here, from
which I got the impression that all (or most) of the API was
ambivalent; change a macro or a compiler option, and you got a
different set of system API's, which expected wchar_t (and the
type of TCHAR changed as well). I was probably just
extrapolating too much into it.

I think that sounds about right. Most functions that accept a
_string_ come in two flavors, one that accepts a narrow string and
another that accepts a wide string. From its viewpoint, when you
write to a file, however, that's not really a string, but just raw
data, so there's just one version that passes the data through
without interpretation or modification.
 

James Kanze

[ ... ]
Well, there's a certain level where everything is just
bytes. But I was under the impression that Windows used
UTF-16 for text at the system level, and that files could
(and text files generally did) contain UTF-16---i.e. 16 bit
entities. (And under Windows on a PC, a byte is 8 bits.)
They can, but they far more often contain something like ISO
8859.
In the end, the OS is mostly agnostic about the content of
text files. As you'd expect, it includes some utilities that
know how to work with text files, and most of those can work
with files containing either 8-bit or 16-bit entities, and
even guess which a particular file contains (though the guess
isn't always right).
Just FWIW, LPCTSTR is something like long pointer to const
text string (where 'text' means char's or wchar_t's depending
on whether _UNICODE was defined or not when compiling).

In other words, you don't know what you're getting. That sounds
like the worst of both worlds.
If you don't have _UNICODE defined, CreateFile will accept a
char *. If you do define it, CreateFile accepts a wchar_t *.
In reality, most of the functions in Windows that take strings
come in two flavors: an 'A' version and a 'W' version, so the
headers look something like this:
HANDLE CreateFileW(wchar_t const *, /* ... */);
HANDLE CreateFileA(char const *, /* ... */);
#ifdef _UNICODE
#define CreateFile CreateFileW
#else
#define CreateFile CreateFileA
#endif

Hopefully, they do use an inline function in the #ifdef, and not
a macro.
The 'A' version, however, is a small stub that converts the
string from the current code page to UTF-16, and then (in
essence) feeds that result to the 'W' version. That can lead
to a problem if you use the 'A' version -- if your current
code page doesn't contain a character corresponding to a
character in the file name, you may not be able to create that
file name with the 'A' version at all.

Hopefully, they have a code page for UTF-8.

And what happens with the name when it is actually passed to the
file system? Most file systems I have mounted won't support
UTF-16 in filenames---the file system will read it as a NTMB
string, and stop at the first byte with 0. (Also, the file
servers are often big endian, and not little endian.) I'm
pretty sure that NFS doesn't support UTF-16 in the protocol, and
I don't think SMB does either.
The 'W' version lets you specify UTF-16 characters directly,
so it can specify any file name that can exist -- but
fstream::fstream and fstream::open usually act as wrappers for
the 'A' version.
Of course, you _could_ work around this without changing the
fstream interface -- for example, you could write it to expect
a UTF-8 string, convert it to UTF-16, and then pass the result
to CreateFileW -- but I don't know of anybody who does so. As
I recall, there are also some characters that can't be encoded
as UTF-8, so even that wouldn't be a perfect solution, though
it would usually be adequate.

UTF-8 can encode anything in Unicode. And more; basically, in
its most abstract form, it's just a means of encoding 32 bit
values as sequences of 8 bit bytes, and can handle any 32 bit
value. (The Unicode definition of UTF-8 does introduce some
restrictions---I don't think encodings of surrogates are
allowed, for example, and codes Unicode forbids, like 0xFFFF,
certainly aren't. But in the basic original UTF-8, there's no
problem with those either.)
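
To illustrate the basic scheme, a sketch of an encoder for the original
form (only the first four byte lengths shown):

#include <string>

std::string toUtf8( unsigned long c )
{
    std::string result ;
    if ( c < 0x80 ) {                               // 1 byte
        result += char( c ) ;
    } else if ( c < 0x800 ) {                       // 2 bytes
        result += char( 0xC0 | (c >> 6) ) ;
        result += char( 0x80 | (c & 0x3F) ) ;
    } else if ( c < 0x10000 ) {                     // 3 bytes
        result += char( 0xE0 | (c >> 12) ) ;
        result += char( 0x80 | ((c >> 6) & 0x3F) ) ;
        result += char( 0x80 | (c & 0x3F) ) ;
    } else {                                        // 4 bytes
        result += char( 0xF0 | (c >> 18) ) ;
        result += char( 0x80 | ((c >> 12) & 0x3F) ) ;
        result += char( 0x80 | ((c >> 6) & 0x3F) ) ;
        result += char( 0x80 | (c & 0x3F) ) ;
    }
    return result ;
}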
Right -- the OS just passes this data through transparently.
Fundamentally it's about like write() on Unix -- it just deals
with a stream of bytes; any other structure is entirely up to
you and what you choose to write and how you choose to
interpret data you read.

In other words, there is no transfer of 16 bit entities. It's
up to the writer to write it as bytes, and the reader to read it
as bytes, and the two to agree how to do so. (In practice, of
course, if the two are both on the same machine, this won't be a
problem. But in practice, in the places I've worked, most of
the files on the PC's have been remote mounted on a Sparc, which
is big-endian.)
 

Alf P. Steinbach

* James Kanze:
[ ... ]
Well, there's a certain level where everything is just
bytes. But I was under the impression that Windows used
UTF-16 for text at the system level, and that files could
(and text files generally did) contain UTF-16---i.e. 16 bit
entities. (And under Windows on a PC, a byte is 8 bits.)
They can, but they far more often contain something like ISO
8859.
In the end, the OS is mostly agnostic about the content of
text files. As you'd expect, it includes some utilities that
know how to work with text files, and most of those can work
with files containing either 8-bit or 16-bit entities, and
even guess which a particular file contains (though the guess
isn't always right).
Just FWIW, LPCTSTR is something like long pointer to const
text string (where 'text' means char's or wchar_t's depending
on whether _UNICODE was defined or not when compiling).

In other words, you don't know what you're getting. That sounds
like the worst of both worlds.

T was a feature enabling compilation of C and C++ for both Windows 9x (narrow
characters only) and NT (wide characters, representing Unicode).

T is not used today except by (1) those who need to support old 9x *and* are
using some libraries that really require narrow characters (namely, in practice,
DLL-based MFC), and (2) utter novices, being misled by Microsoft example code
(which apparently also is written by utter novices), and (3) incompetents.

We'd not want any kind of macros like that in the standard, and neither have
they anything to do in any quality app.

Hopefully, they do use an inline function in the #ifdef, and not
a macro.

No, it's all macros.

Thousands of them.

:)

Hopefully, they have a code page for UTF-8.

No. Or, technically yes, there's a designation, and the APIs happily convert to
and from that codepage, correctly. But as of Windows XP UTF-8 is not supported
by standard Windows programs, in particular the command interpreter (where
commands can just fail silently when you change to codepage 65001) -- I don't
know whether that's been fixed in Vista or Windows 7.

And what happens with the name when it is actually passed to the
file system? Most file systems I have mounted won't support
UTF-16 in filenames---the file system will read it as a NTMB
string, and stop at the first byte with 0. (Also, the file
servers are often big endian, and not little endian.) I'm
pretty sure that NFS doesn't support UTF-16 in the protocol, and
I don't think SMB does either.

The NTFS filesystem stores filenames with UTF-16 encoding.

UTF-8 can encode anything in Unicode. And more; basically, in
its most abstract form, it's just a means of encoding 32 bit
values as sequences of 8 bit bytes, and can handle any 32 bit
value. (The Unicode definition of UTF-8 does introduce some
restrictions---I don't think encodings of surrogates are
allowed, for example, and codes Unicode forbids, like 0xFFFF,
certainly aren't. But in the basic original UTF-8, there's no
problem with those either.)



In other words, there is no transfer of 16 bit entities. It's
up to the writer to write it as bytes, and the reader to read it
as bytes, and the two to agree how to do so. (In practice, of
course, if the two are both on the same machine, this won't be a
problem. But in practice, in the places I've worked, most of
the files on the PC's have been remote mounted on a Sparc, which
is big-endian.)

The basic problem is that while the g++ compiler doesn't support a Byte Order
Mark at the start of a UTF-8 source file, the MSVC compiler requires it.
 

Jerry Coffin

On Aug 6, 8:27 pm, Jerry Coffin <[email protected]> wrote:

[ ... ]
In other words, you don't know what you're getting. That sounds
like the worst of both worlds.

I can't say I've ever run into a situation where I didn't get what I
wanted or didn't know what I was going to get. At the same time, for
_most_ new development, I'd ignore all that and use the "W" versions
of functions directly. Those are really its native functions, and
they're always a bit faster, require less storage, and have at least
the same capabilities as the "A" versions of the same (and sometimes
more).

[ ... ]
Hopefully, they do use an inline function in the #ifdef, and not
a macro.

I haven't rechecked recently, but the last time I looked, it was a
macro.
Hopefully, they have a code page for UTF-8.

Yes, thankfully, they do.

[ ... ]
And what happens with the name when it is actually passed to the
file system? Most file systems I have mounted won't support
UTF-16 in filenames---the file system will read it as a NTMB
string, and stop at the first byte with 0. (Also, the file
servers are often big endian, and not little endian.) I'm
pretty sure that NFS doesn't support UTF-16 in the protocol, and
I don't think SMB does either.

This is one of the places that I think the GUI way of doing things is
helpful -- you're normally giving the user a list of files from the
server, and then passing the server back a name picked from the list.

As far as the mechanics go, I've never looked very carefully -- I
suspect it's up to the FS driver to translate names as well as
possible, and (particularly) ensure that translations work
bidirectionally, so if you get a name from the remote server, and
then pass that same name back, it signifies the original file.

[ ... ]
UTF-8 can encode anything in Unicode. And more; basically, in
its most abstract form, it's just a means of encoding 32 bit
values as sequences of 8 bit bytes, and can handle any 32 bit
value. (The Unicode definition of UTF-8 does introduce some
restrictions---I don't think encodings of surrogates are
allowed, for example, and codes Unicode forbids, like 0xFFFF,
certainly aren't. But in the basic original UTF-8, there's no
problem with those either.)

I think we're mostly dealing with a difference in how terminology is
being used, but I also think it's more or less irrelevant -- as long
as you use UTF-8, you'll almost certainly be able to represent any
file name there is.

[ ... ]
In other words, there is no transfer of 16 bit entities. It's
up to the writer to write it as bytes, and the reader to read it
as bytes, and the two to agree how to do so. (In practice, of
course, if the two are both on the same machine, this won't be a
problem. But in practice, in the places I've worked, most of
the files on the PC's have been remote mounted on a Sparc, which
is big-endian.)

As long as the files are only being used on PCs, and stored on
SPARCs, that shouldn't matter. Just to act as a file server, all it
has to do is ensure that the stream of bytes that was sent to it
matches the stream of bytes it plays back.

We're on fairly familiar ground here though -- Windows being involved
doesn't really change anything. If you're writing a file of Unicode
text, putting a BOM at the beginning should be enough to let anything
that "knows" Unicode read it. If the file needs to contain anything
much more complex, you probably want to use some standardized
encoding format like ASN.1 or XDR. Choosing between those is usually
pretty easy as well: you use XDR when you can, and ASN.1 if you have
to (e.g. to exchange data with something that only understands ASN.1,
or if you really need the data to be self-describing).
 

James Kanze

[ ... ]
In other words, you don't know what you're getting. That
sounds like the worst of both worlds.
I can't say I've ever run into a situation where I didn't get
what I wanted or didn't know what I was going to get.

A library with inline functions or template code?

More generally, how do you ensure that all components of an
application are compiled with the same value for _UNICODE?
At the same time, for _most_ new development, I'd ignore all
that and use the "W" versions of functions directly. Those are
really its native functions, and they're always a bit faster,
require less storage, and have at least the same capabilities
as the "A" versions of the same (and sometimes more).

That sounds reasonable.
[ ... ]
Hopefully, they have a code page for UTF-8.
Yes, thankfully, they do.

What about Alf's claim that it doesn't really work?

More generally, if you're going to do this sort of thing, you
need to offer a bit more flexibility. Filenames can come from
many different sources, and depending on the origin, the
encoding may not be the same.
[ ... ]
And what happens with the name when it is actually passed to
the file system? Most file systems I have mounted won't
support UTF-16 in filenames---the file system will read it
as a NTMB string, and stop at the first byte with 0. (Also,
the file servers are often big endian, and not little
endian.) I'm pretty sure that NFS doesn't support UTF-16 in
the protocol, and I don't think SMB does either.
This is one of the places that I think the GUI way of doing
things is helpful -- you're normally giving the user a list
of files from the server, and then passing the server back a
name picked from the list.

Not when you're creating new files. And most of my programs
don't run under a GUI; they're servers, which run 24 hours a
day. Of course, they don't run under Windows either, so the
question is moot:). But the question remains---picking up the
name from a GUI is fine for interactive programs, but a lot of
programs aren't interactive.
[ ... ]
UTF-8 can encode anything in Unicode. And more; basically,
in its most abstract form, it's just a means of encoding 32
bit values as sequences of 8 bit bytes, and can handle any 32
bit value. (The Unicode definition of UTF-8 does introduce
some restrictions---I don't think encodings of surrogates
are allowed, for example, and codes Unicode forbids, like
0xFFFF, certainly aren't. But in the basic original UTF-8,
there's no problem with those either.)
I think we're mostly dealing with a difference in how
terminology is being used,

UTF-8 really does have two commonly accepted meanings. The
original UTF-8 was just a means of formatting 16, and later 31
bit entities as bytes, and could handle any value that could be
represented in 31 bits. The Unicode definition clearly
restricts it somewhat, but their site is down right now, so I
can't see exactly how. If nothing else, they only allow values
in the range 0-0x10FFFF (which means that the longest sequence
is only 4 bytes, rather than 6), but I'm sure that there are
other restrictions as well.
but I also think it's more or less irrelevant -- as long as
you use UTF-8, you'll almost certainly be able to represent
any file name there is.
Yes.

[ ... ]
In other words, there is no transfer of 16 bit entities.
It's up to the writer to write it as bytes, and the reader
to read it as bytes, and the two to agree how to do so. (In
practice, of course, if the two are both on the same
machine, this won't be a problem. But in practice, in the
places I've worked, most of the files on the PC's have been
remote mounted on a Sparc, which is big-endian.)
As long as the files are only being used on PCs, and stored on
SPARCs, that shouldn't matter. Just to act as a file server,
all it has to do is ensure that the stream of bytes that was
sent to it matches the stream of bytes it plays back.

And that the file name matches, somehow. But typically, this
isn't the case---I regularly share files between systems, and
this seems to be the case for everyone where I work.
We're on fairly familiar ground here though -- Windows being
involved doesn't really change anything. If you're writing a
file of Unicode text, putting a BOM at the beginning should be
enough to let anything that "knows" Unicode read it. If the
file needs to contain anything much more complex, you probably
want to use some standardized encoding format like ASN.1 or
XDR. Choosing between those is usually pretty easy as well:
you use XDR when you can, and ASN.1 if you have to (e.g. to
exchange data with something that only understands ASN.1, or
if you really need the data to be self-describing).

I agree that standard (and simple) solutions exist. Putting a
BOM at the start of a text file allows immediate identification
of the encoding format. But how many editors that you know
actually do this? (For non-text files, of course, you have to
define a format, and the defined formats do tend to work
everywhere. Although there's still the question of what to do
if you have a filename embedded in an otherwise non-text
format.)
 

Jorgen Grahn

Hi I have a .txt file containing:

[-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,
0.119546, -0.00335514, -0.217619, -0.107065, 0, -0.0329693, -0.148395,
0.104663, -0.550282, -1.26802, -0.705694, 0.0873308, -0.309962, -0.802861, ....
1.19829, 0.0257344, 0, -0.186464, -1.54877, 0.321253,
0.403886, -0.983199, -1.91005, -0.53617, -0.353148, -0.0942512, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]

Now I would like to read this array into a C++ container/array so I can do
something like:

double first = container[0]; // -0.00231844
....

What I'd do in this case is: define the file format semi-formally
(i.e. not just with an example). Then everything falls into place for
me.

I'd probably decide the brackets and commas carry no information, and
that I needed support for comments. Then I'd be left with a
line-oriented file format where each line is
- a comment introduced by #
- an empty line (nothing but whitespace)
- a series of whitespace-delimited doubles, parsable using strtod()
or
- a syntax error

I prefer not to let iostreams do the tokenization, partly because I
don't know it that well, and partly because I don't want it to define
the file format. I want a file format which I can explain to a
Perl/whatever programmer without mentioning C++ (strtod() documentation
is easy to come by).
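
A sketch of what the per-line parser could look like (parse_line is a
made-up helper, not a drop-in):

#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// One line is a '#' comment, blank, or whitespace-delimited doubles.
void parse_line(const std::string& line, std::vector<double>& out)
{
    const char* p = line.c_str();
    for (;;) {
        while (std::isspace((unsigned char)*p))
            ++p;
        if (*p == '\0' || *p == '#')   // empty line or comment
            return;
        char* end;
        double d = std::strtod(p, &end);
        if (end == p)                  // not parsable: syntax error
            throw std::runtime_error("syntax error: " + line);
        out.push_back(d);
        p = end;
    }
}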

/Jorgen
 

Jerry Coffin

On Aug 7, 4:00 pm, Jerry Coffin <[email protected]> wrote:

[ ... ]
A library with inline functions or template code?

Oh, don't get me wrong -- I'm not saying there aren't any situations
that could/would cause problems, only that I've been able to avoid
problems from it so far.
More generally, how do you ensure that all components of an
application are compiled with the same value for _UNICODE?

That, quite frankly, can be a pain. Most of the libraries and such
I've used came with both versions; from what I've seen that seems to
be fairly common.

[ UTF-8 code page ]
What about Alf's claim that it doesn't really work?

I didn't see his post saying that. In any case, I've used it (some)
and can say it works to some degree, but I'll openly admit that I've
never really put it under a lot of stress either -- nearly all of my
code gets used primarily in the US, where the conversion is usually
trivial.
More generally, if you're going to do this sort of thing, you
need to offer a bit more flexibility. Filenames can come from
many different sources, and depending on the origin, the
encoding may not be the same.

True -- but ultimately, nothing you can do really gets away from
problems. For any heterogeneous client and server, there's at least
some possibility of a discrepancy between how the client and server
will interpret things -- and even when they _seem_ homogeneous (e.g.
both running the same variant of Unix) a difference in file system
could still lead to a problem.

[ Figuring out valid names on a server ]
Not when you're creating new files. And most of my programs
don't run under a GUI; they're servers, which run 24 hours a
day. Of course, they don't run under Windows either, so the
question is moot:). But the question remains---picking up the
name from a GUI is fine for interactive programs, but a lot of
programs aren't interactive.

Even when the main program isn't interactive, configuration for it
can be.

Ultimately you're right though -- it would be nice if you could
depend on (for example) being able to query a server about some basic
characteristics of a shared/exported file system, so you could
portably figure out what it allows. Right now, virtually all such
"knowledge" is encoded implicitly in client code (or simply doesn't
exist -- the client just passes a string through and hopes for the
best).

[ ... ]
And that the file name matches, somehow. But typically, this
isn't the case---I regularly share files between systems, and
this seems to be the case for everyone where I work.

I wish I could offer something positive here, but I doubt I can.
Ultimately, this depends more on the FS than the OS though -- just
for example, regardless of the OS, an ISO 9660 FS (absent something
like Joliet extensions) places draconian restrictions on file names.

[ ... ]
I agree that standard (and simple) solutions exist. Putting a
BOM at the start of a text file allows immediate identification
of the encoding format. But how many editors that you know
actually do this?

A few -- Windows Notepad knows how to create and work with UTF-8,
UTF-16LE, and UTF-16BE, all including BOMs (or whatever you call the UTF-8
signature). The current version of Visual Studio also seems to work
fine with UTF-8 and UTF-16 (BE & LE) text files as well. It preserves
the BOM and endianess when saving a modified version -- but if you
want to use it to create a new file with UTF-16BE encoding (for
example) that might be a bit more difficult (I haven't tried to very
hard, but I don't immediately see a "Unicode big endian" option like
Notepad provides).
 
