Reading an array from file?

Discussion in 'C++' started by fdm, Aug 4, 2009.

  1. fdm

    fdm Guest

    Hi I have a .txt file containing:

    [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,
    0.119546, -0.00335514, -0.217619, -0.107065, 0, -0.0329693, -0.148395,
    0.104663, -0.550282, -1.26802, -0.705694, 0.0873308, -0.309962, -0.802861,
    0, 0.063379,
    0.398289, -1.44105, -1.53938, -1.7608, -1.38484, -0.711425, -1.10221, -1.59358,
    0,
    0.298058, -0.564321, -1.91097, -3.58063, -6.30183, -4.78945, -1.61198, -0.70215,
    -0.954023, 0,
    0.54965, -0.57544, -2.33652, -6.10376, -4.54323, -4.77601, -4.48725, -0.489267,
    -0.570523, 0,
    0.668925, -0.46068, -2.42157, -4.74326, -12.8757, -6.57763, -1.16318, -3.09268,
    -0.411637, 0,
    0.0390142, -0.273687, -0.624816, -1.51086, -2.18197, -1.86934,
    0.297622, -1.07632, -0.0820767, 0, 0.0166208, -0.543326, 0.543721, -1.87936,
    1.06337, 0.0752932, -0.0704278, -0.307334, -0.99684, 0,
    0.00486263, -0.12788, -0.25644, -0.491107,
    0.201335, -1.09141, -0.694021, -0.24188, -0.212387, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, -0.000112593, 0.0110926,
    0.0167261, -0.049946, -0.0783788, -0.0384529,
    0.0407556, -0.154947, -0.0463077, 0, -0.0182507, 0.00359299, 0.00784705,
    0.270986, 1.90373, 0.0225429, -0.684401, -0.250102, 0.0345052,
    0, -0.0636621, -0.364021, -1.0357, -2.70395, -4.77634, -0.987079, -0.837127,
    1.46826, 0.682614,
    0, -0.0313031, -0.717254, -0.545265, -17.2006, -31.0422, -20.0047, -2.02679,
    -1.18301, 0.0228328, 0, -0.0125886, -4.34123,
    0.0787134, -45.9453, -66.6283, -50.7082, 1.52779, -1.68643, -0.339206, 0,
    0.65181, -8.32657, 6.24457, -37.9488, -110.562, -54.1226,
    3.39012, -0.0921196, 0.12512, 0, 1.67071, 0.694154, -3.71556,
    9.19359, -8.64445, 14.5316, -1.12059, -0.852576, 0.59615, 0,
    0.001542, -0.94513, -0.844656, -6.95102, 1.63441, -5.0893, -3.16847,
    1.19829, 0.0257344, 0, -0.186464, -1.54877, 0.321253,
    0.403886, -0.983199, -1.91005, -0.53617, -0.353148, -0.0942512, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0]

    Now I would like to read this array into a C++ container/array so I can do
    something like:

    double first = container[0]; // -0.00231844

    The number of elements in the file can vary from file to file so I was
    thinking of using a std::vector. The below code:

    std::vector<std::string> file;
    std::string line;
    file.clear();
    std::ifstream infile (parameters_path.c_str(), std::ios_base::in);
    while (getline(infile, line, '\n')) {
        file.push_back (line);
    }

    std::cout << "Read " << file.front() << " lines.\n";


    stores the whole array from the file as a single string in the vector. But I
    still need to "tokenize" this string into its double parts. Before throwing
    myself into cumbersome code I would like to hear if anyone has a good idea
    to do this the right way.
     
    fdm, Aug 4, 2009
    #1

  2. fdm wrote:
    > Hi I have a .txt file containing:
    >
    > [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853, 0.119546,
    > [..]
    > 0, 0, 0, 0, 0]


    Are the brackets part of the file or part of your message showing the
    contents of the file?

    >
    > Now I would like to read this array into a C++ container/array so I can
    > do something like:
    >
    > double first = container[0]; // -0.00231844
    >
    > The number of elements in the file can vary from file to file so I was
    > thinking of using a std::vector. The below code:
    >
    > std::vector<std::string> file;
    > std::string line;
    > file.clear();
    > std::ifstream infile (parameters_path.c_str(), std::ios_base::in);
    > while (getline(infile, line, '\n')) {
    >     file.push_back (line);
    > }
    >
    > std::cout << "Read " << file.front() << " lines.\n";
    >
    >
    > stores the whole array from the file as a single string in the vector.
    > But I still need to "tokenize" this string into its double parts. Before
    > throwing myself into cumbersome code


    But you already have! Why don't you just read doubles from the file
    instead of reading lines and parsing them later?

    > I would like to hear if anyone has
    > a good idea to do this the right way.


    // if your file does have the leading bracket:
    std::vector<double> values;
    while (infile)
    {
        char dummy_char;
        if (infile >> dummy_char)
        {
            double d;
            if (infile >> d)
                values.push_back(d);
        }
    }

    // if your file does NOT have the leading bracket:
    std::vector<double> values;
    while (infile)
    {
        double d;
        if (infile >> d)
            values.push_back(d);
        char dummy_char;
        infile >> dummy_char;
    }

    And the second variation probably can't be used for the file with the
    bracket as it stands: the first attempted read of a double would fail
    on the '[' and set failbit, so nothing more would be extracted.

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Aug 4, 2009
    #2

  3. fdm

    fdm Guest

    "Victor Bazarov" <> wrote in message
    news:h59k29$scm$...
    > fdm wrote:
    >> Hi I have a .txt file containing:
    >>
    >> [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853, 0.119546,
    >> [..]
    >> 0, 0, 0, 0, 0]

    >
    > Are the brackets part of the file or part of your message showing the
    > contents of the file?
    >
    >> [..]
    >> stores the whole array from the file as a single string in the vector.
    >> But I still need to "tokenize" this string into its double parts. Before
    >> throwing myself into cumbersome code

    >
    > But you already have! Why don't you just read doubles from the file
    > instead of reading lines and parsing them later?
    >
    >> I would like to hear if anyone has
    >> a good idea to do this the right way.

    >
    > // if your file does have the leading bracket:
    > std::vector<double> values;
    > while (infile)
    > {
    >     char dummy_char;
    >     if (infile >> dummy_char)
    >     {
    >         double d;
    >         if (infile >> d)
    >             values.push_back(d);
    >     }
    > }

    I have now tried:

    std::ifstream infile(path_to_file.c_str(), std::ios_base::in);

    // if your file does have the leading bracket: Yes it does
    std::vector<double> values;
    while (infile) {
        char dummy_char;
        if (infile >> dummy_char) {
            double d;
            if (infile >> d) {
                values.push_back(d);
                std::cout << "val = " << d << std::endl;
            }
        }
    }

    The loop only gets executed once, since all the values are on one line
    (that is the format), and no doubles are read.

    Should the line not be read and then parsed to doubles where ',' is used as
    a delimiter?
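
    What I had in mind was something along these lines (just a sketch of
    the getline-and-tokenize approach, untested; it needs <sstream> and
    <string> in addition to the above):

    std::vector<double> values;
    std::string line;
    while (std::getline(infile, line)) {
        std::istringstream tokens(line);
        std::string token;
        while (std::getline(tokens, token, ',')) {
            // strip any '[' or ']' still attached to the token
            std::string::size_type pos;
            while ((pos = token.find_first_of("[]")) != std::string::npos)
                token.erase(pos, 1);
            std::istringstream conv(token);
            double d;
            if (conv >> d)
                values.push_back(d);
        }
    }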
     
    fdm, Aug 4, 2009
    #3
  4. fdm wrote:
    > "Victor Bazarov" <> wrote in message
    > news:h59k29$scm$...
    >> fdm wrote:
    >>> Hi I have a .txt file containing:
    >>>
    >>> [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853, 0.119546,
    >>> [..]
    >>> 0, 0, 0, 0, 0]

    >> [..]

    >
    > I have now tried:
    >
    > std::ifstream infile(path_to_file.c_str(), std::ios_base::in);
    >
    > // if your file does have the leading bracket: Yes it does
    > std::vector<double> values;
    > while (infile) {
    >     char dummy_char;
    >     if (infile >> dummy_char) {
    >         double d;
    >         if (infile >> d) {
    >             values.push_back(d);
    >             std::cout << "val = " << d << std::endl;
    >         }
    >     }
    > }
    >
    > The loop only gets executed once, since all the values are on one
    > line (that is the format), and no doubles are read.


    I don't understand this statement, sorry. I tested that loop on a
    stringstream with three doubles separated by commas, and it read all
    three of them.

    > Should the line not be read and then parsed to doubles where ',' is used
    > as a delimiter?


    Why bother?

    OK, I am going to test on a real file with five doubles... Gimme five
    minutes...

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Aug 4, 2009
    #4
  5. fdm wrote:
    > [..]
    > I have now tried:
    >
    > std::ifstream infile(path_to_file.c_str(), std::ios_base::in);
    >
    > // if your file does have the leading bracket: Yes it does
    > std::vector<double> values;
    > while (infile) {
    >     char dummy_char;
    >     if (infile >> dummy_char) {
    >         double d;
    >         if (infile >> d) {
    >             values.push_back(d);
    >             std::cout << "val = " << d << std::endl;
    >         }
    >     }
    > }
    >
    > The loop only gets executed once, since all the values are on one
    > line (that is the format), and no doubles are read.
    >
    > Should the line not be read and then parsed to doubles where ',' is used
    > as a delimiter?


    OK, here are the results:
    ------------------------- code
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <vector>

    int main()
    {
        std::string path_to_file("testdoubles.txt");
        std::ifstream infile(path_to_file.c_str(), std::ios_base::in);

        // if your file does have the leading bracket: Yes it does
        std::vector<double> values;
        while (infile) {
            char dummy_char;
            if (infile >> dummy_char) {
                double d;
                if (infile >> d) {
                    values.push_back(d);
                    std::cout << "val = " << d << std::endl;
                }
            }
        }

        std::cout << "read " << values.size() << " doubles\n";

        return 0;
    }
    -------------------------------- file 'testdoubles.txt'
    [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853]
    -----------------------------------------------------------

    Compiled with VC++ 2008 and run, here is the output:
    -------------------------------------
    val = -0.00231844
    val = -0.02326
    val = 0.0484723
    val = 0.0782189
    val = 0.0917853
    read 5 doubles
    -------------------------------------

    I took your first 5 doubles, closed them with the bracket, and placed
    the TXT file next to the executable. Everything went fine, as you can
    see here. Would you take my word for it? <shrug> Now, I have no idea
    what you're doing wrong, but you must be doing something wrong.

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Aug 4, 2009
    #5
  6. Jerry Coffin

    Jerry Coffin Guest

    In article <4a784940$0$303$>,
    says...
    >
    > Hi I have a .txt file containing:
    >
    > [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,


    [ ... ]

    > I
    > still need to "tokenize" this string into its double parts. Before throwing
    > myself into cumbersome code I would like to hear if anyone has a good idea
    > to do this the right way.


    I'm not sure if it's really "right" (some would argue that it's
    downright wrong), but for your purposes, the commas are essentially
    the same as white space -- i.e. they're something between the data
    that you ignore. As such, I'd probably just create a locale where
    ',' is treated as white space, and then let the iostream handle the
    "parsing" involved:

    #include <iostream>
    #include <locale>
    #include <algorithm>
    #include <iterator>
    #include <vector>

    struct digits_only: std::ctype<char>
    {
        digits_only(): std::ctype<char>(get_table()) {}

        static std::ctype_base::mask const* get_table()
        {
            static std::ctype_base::mask
                rc[std::ctype<char>::table_size];
            static bool inited = false;

            if (!inited)
            {
                // Everything is a "space"
                std::fill_n(rc, std::ctype<char>::table_size,
                            std::ctype_base::space);

                // unless it can be part of a floating point number
                // (note: 10, not 9 -- '9' must count as part of a number).
                std::fill_n(rc+'0', 10, std::ctype_base::mask());
                rc['.'] = std::ctype_base::mask();
                rc['-'] = std::ctype_base::mask();
                rc['+'] = std::ctype_base::mask();
                rc['e'] = std::ctype_base::mask();
                rc['E'] = std::ctype_base::mask();
                inited = true;
            }
            return rc;
        }
    };

    int main() {
        std::cin.imbue(std::locale(std::locale(), new digits_only()));

        std::vector<double> data;
        std::copy(std::istream_iterator<double>(std::cin),
                  std::istream_iterator<double>(),
                  std::back_inserter(data));

        std::copy(data.begin(), data.end(),
                  std::ostream_iterator<double>(std::cout, "\n"));
        return 0;
    }
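
    (To read from the OP's file rather than standard input, the same
    locale can be imbued on an ifstream before the first read -- a
    sketch, reusing the parameters_path name from the earlier posts and
    needing <fstream>:)

    std::ifstream infile(parameters_path.c_str());
    infile.imbue(std::locale(std::locale(), new digits_only()));

    // extra parentheses keep this from parsing as a function declaration
    std::vector<double> data((std::istream_iterator<double>(infile)),
                             std::istream_iterator<double>());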

    This should accept any valid input. Most invalid data in the input
    file will simply be ignored, though a few things could stop the
    input early. Just for example, input that contained text would fail
    at the first 'e' or 'E' in the text -- those can't be treated as
    whitespace because they _could_ be part of a f.p. number, but by
    itself (without some digits) such a character makes the conversion
    fail.

    --
    Later,
    Jerry.
     
    Jerry Coffin, Aug 4, 2009
    #6
  7. fdm

    fdm Guest

    "Victor Bazarov" <> wrote in message
    news:h5a1dt$cuc$...
    > [..]
    >
    > I took your first 5 doubles, closed them with the bracket, and placed the
    > TXT file next to the executable. Everything went fine, as you can see
    > here. Would you take my word for it? <shrug> Now, I have no idea what
    > you're doing wrong, but you must be doing something wrong.



    I was specifying a wrong path; now it works perfectly.

    Just to make sure I understand: the below while loop will be executed
    as long as there is content in the file that can be converted to
    doubles:

    while (infile) {
        char dummy_char;
        if (infile >> dummy_char) {
            double d;
            if (infile >> d) {
                values.push_back(d);
                std::cout << "val = " << d << std::endl;
            }
        }
    }

    I assume that it runs from left to right, line by line, through the
    file, and that each 'd' is a valid double it passes on its run.

    Normally I would expect some kind of 'infile.next()' operation, but
    this seems to be implicitly invoked in the loop -- or what?
     
    fdm, Aug 4, 2009
    #7
  8. fdm wrote:
    > [..]
    > I was specifying a wrong path, now it works perfect.
    >
    > Just to make sure I understand. The below while loop will be executed as
    > long as there are content
    > in the file that can be converted to doubles:
    >
    > while (infile) {
    >     char dummy_char;
    >     if (infile >> dummy_char) {
    >         double d;
    >         if (infile >> d) {
    >             values.push_back(d);
    >             std::cout << "val = " << d << std::endl;
    >         }
    >     }
    > }
    >
    > I assume that it runs from left to right line by line in the file


    Files don't have lines. Lines are invented by people. Files have
    bytes. Some byte value people decided to call "a line break", and
    assume that whatever is in the file between here and that "line break"
    is "a line". In fact, to the operator>> that reads a double a "line
    break" is just whitespace, a separator.

    High-level formatted I/O bundles up whitespace. That's why, if you make
    a special locale that treats commas as whitespace (like Jerry
    suggested), it would be like you had done a find-and-replace operation
    and swapped all commas for spaces.

    The loop skips a char, then reads a *field* (anything that can be
    converted into a double value) as long as it can. Then it stops,
    converts the field and assigns the value to 'd', then starts over. By
    that I mean that it reads out another character (the comma), then reads
    another field, converts into a double, and so on.
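
    For instance (a quick sketch, using <sstream>; the embedded line
    break is just another separator to operator>>):

    std::istringstream s("[1.0,\n2.0]");
    char c;
    double d;
    while (s >> c >> d)
        std::cout << d << '\n';   // prints 1, then 2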

    > and
    > each 'd' is a valid double it passes on its run.
    >
    > Normally I would expect some kind of 'infile.next()' operation but this
    > seems to be implicitly invoked in the loop or what?


    I am not sure I understand the question. *Reading* from the file moves
    the "cursor", there is no need to do any additional movement (by calling
    "next" or whatever).

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Aug 4, 2009
    #8
  9. James Kanze

    James Kanze Guest

    On Aug 4, 10:00 pm, Jerry Coffin <> wrote:
    > In article <4a784940$0$303$>,
    > says...
    > > Hi I have a .txt file containing:


    > > [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,


    > [ ... ]


    > > I still need to "tokenize" this string into its double
    > > parts. Before throwing myself into cumbersome code I would
    > > like to hear if anyone has a good idea to do this the right
    > > way.


    > I'm not sure if it's really "right" (some would argue that
    > it's downright wrong), but for your purposes, the commas are
    > essentially the same as white space -- i.e. they're something
    > between the data that you ignore. As such, I'd probably just
    > create a locale where ',' is treated as white space, and then
    > let the iostream handle the "parsing" involved:


    Where commas and the opening and closing brackets are white space.

    But a lot depends on how much error checking is deemed
    necessary. In his case, a priori, commas aren't "just white
    space", since something like "1.2,,3.4" should probably produce
    an error, and if the file doesn't start with a '[' and end with
    a ']', it's also an error.

    There are several ways of tackling this; Victor's loop, but
    checking the value of dummy_char, is probably the simplest for a
    beginner to write and understand. If error checking on the
    input format is not desired (although it generally isn't a good
    idea to drop such error checking), the easiest solution is
    probably a FloatWithSeparator class, something like:

    class FloatWithSeparator
    {
    public:
        operator double() const
        {
            return myValue ;
        }
        friend std::istream& operator>>(
            std::istream&       source,
            FloatWithSeparator& dest )
        {
            char separator ;
            source >> separator >> dest.myValue ;
            return source ;
        }

    private:
        double myValue ;
    } ;

    You can then define the vector with something like:

    std::vector< double > v(
        (std::istream_iterator< FloatWithSeparator >( file )),
        (std::istream_iterator< FloatWithSeparator >()) ) ;
    // (The extra parentheses keep the declaration from being parsed
    // as a function declaration.)

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Aug 5, 2009
    #9
  10. Jerry Coffin

    Jerry Coffin Guest

    In article <f9827e8b-1626-4a50-b2a1-9ba85da52669
    @s15g2000yqs.googlegroups.com>, says...

    [ ... ]

    > But a lot depends on how much error checking is deemed
    > necessary. In his case, a priori, commas aren't "just white
    > space", since something like "1.2,,3.4" should probably produce
    > an error, and if the file doesn't start with a '[' and end with
    > a ']', it's also an error.


    That could well be -- when I posted this, I hadn't yet seen his post
    confirming that there really even _was_ a leading and trailing
    bracket, rather than that being something he added for posting.

    He still hasn't said anything about how much (if at all) he cares
    about verifying that the input is really in the correct format.
    Sometimes you need to verify the input rigorously; in other cases
    (including a few projects I've been involved with) the data was known
    to be a mess, and dealing with it all correctly just wasn't going to
    happen -- but a best effort was wanted to read as much as possible,
    as reasonably as possible.

    Unfortunately, the OP hasn't really told us enough to figure out
    which category this data falls into. What he posted _looks_
    sufficiently "regular" that it might make sense to just reject
    anything that doesn't look perfect -- but that's purely a guess based
    on a sample size of 1...

    --
    Later,
    Jerry.
     
    Jerry Coffin, Aug 5, 2009
    #10
  11. James Kanze

    James Kanze Guest

    On Aug 5, 12:13 am, Victor Bazarov <> wrote:
    > fdm wrote:
    > > [..]
    > > I was specifying a wrong path, now it works perfect.


    > > Just to make sure I understand. The below while loop will be
    > > executed as long as there are content in the file that can
    > > be converted to doubles:


    > > while (infile) {
    > >     char dummy_char;
    > >     if (infile >> dummy_char) {
    > >         double d;
    > >         if (infile >> d) {
    > >             values.push_back(d);
    > >             std::cout << "val = " << d << std::endl;
    > >         }
    > >     }
    > > }


    > > I assume that it runs from left to right line by line in the
    > > file


    > Files don't have lines.


    Of course they do. The C++ (and the C) standard says they do.

    > Lines are invented by people.


    So were files.

    > Files have bytes.


    Files come in all sorts of varieties and flavors, depending on
    the OS. Although it's true that on almost all systems, files
    contain "bytes" (Windows is, I think, the only exception), those
    bytes are structured in various ways; text files are structured
    in lines, for example.

    > Some byte value people decided to call "a line break", and
    > assume that whatever is in the file between here and that
    > "line break" is "a line".


    That's more or less the Unix point of view. It's not the C/C++
    point of view (where we have both text and binary files), nor
    the point of view of most OS's.

    > In fact, to the operator>> that reads a double a "line break"
    > is just whitespace, a separator.


    If the file is opened in text mode, the system reads lines, and
    appends a '\n' to the end of each line. What a line is depends
    on the system. And a '\n' is considered white space by
    std::istream (at least in the usual locales---Jerry suggested a
    locale where ',' was also considered white space, but one could
    just as easily create a locale where '\n', or even ' ', was not
    white space).

    If the file is opened in binary mode, the system reads bytes
    (whatever that means on the system), and the input isn't
    structured in lines (although it might very well contain bytes
    whose numeric value corresponds to '\n').

    > High-level formatted I/O bundles up whitespace.


    The correct word is "skips", not bundles up. And only at the
    start of each input, and only if std::ios::skipws is set in the
    stream's format flags (as it is by default).

    > That's why, if you make a special locale that treats commas as
    > whitespace (like Jerry suggested), it would be like you had done a
    > find-and-replace operation and swapped all commas for spaces.


    > The loop skips a char, then reads a *field* (anything that can
    > be converted into a double value) as long as it can.


    The loop reads two fields, one a char, and the other a double.
    It skips any preceding white space for both fields.

    A more idiomatic way of writing the loop would be:

    char sep ;
    double d ;
    while ( infile >> sep >> d ) {
        // Check if sep is the expected value?
        values.push_back( d ) ;
        std::cout << "val = " << d << std::endl;
    }

    Whether this is preferable to your version, I don't know; if
    there is an error in the format of the file, your version makes
    it easier to determine where. (On the other hand, if you want
    really clear error messages, you'll modify the reading in some
    way in order to be able to output the line number. I typically
    use a filtering streambuf for this, but that's definitely a
    technique that a beginner wouldn't apply. Note too that any
    good error handling will make the code three to five times more
    complicated.)
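
    As an illustration of that middle ground, a variant of the loop
    which does verify the separators might look something like this (a
    sketch only, not tested):

    std::vector< double > values ;
    bool ok = false ;
    char sep ;
    if ( infile >> sep && sep == '[' ) {
        double d ;
        while ( infile >> d ) {
            values.push_back( d ) ;
            if ( ! (infile >> sep) || sep == ']' )
                break ;             // end of list (or stream error)
            if ( sep != ',' ) {
                infile.setstate( std::ios::failbit ) ;
                break ;             // unexpected separator
            }
        }
        ok = infile && sep == ']' ; // well formed only if we saw ']'
    }
    if ( ! ok )
        std::cerr << "format error in input" << std::endl ;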

    > Then it stops, converts the field and assigns the value to
    > 'd', then starts over. By that I mean that it reads out
    > another character (the comma), then reads another field,
    > converts into a double, and so on.


    Above all, it skips any whitespace before the comma AND before
    the double.

    > > and


    > > each 'd' is a valid double it passes on its run.


    > > Normally I would expect some kind of 'infile.next()'
    > > operation but this seems to be implicitly invoked in the
    > > loop or what?


    > I am not sure I understand the question.


    He seems to be taking the Pascal view, in which a "file" is a
    sliding window in the input stream---you can read without
    advancing the position (although the more frequently used
    commands also advance it). In C or C++, this can easily be done
    at the character level, e.g. istream::peek() and istream::get(),
    or by using the streambuf directly. For higher level objects,
    however, it doesn't work. (Note that he couldn't read his file
    at all in Pascal, except by declaring it as a FILE OF CHARACTER,
    and doing all the parsing and conversions himself.)
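
    At the character level, the peeking looks something like this (a
    tiny sketch):

    int next = infile.peek() ;  // examine the next character; the
                                // stream position does not advance
    if ( next == ',' )
        infile.get() ;          // now actually consume it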

    > *Reading* from the file moves the "cursor", there is no need
    > to do any additional movement (by calling "next" or whatever).


    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Aug 6, 2009
    #11
  12. * James Kanze:
    >
    > Files come in all sorts of varieties and flavors, depending on
    > the OS. Although it's true that on almost all systems, files
    > contain "bytes" (Windows is, I think, the only exception), those
    > bytes are structured in various ways; text files are structured
    > in lines, for example.


    Files in Windows only contain "bytes". But they can have areas where they
    don't really contain anything, namely, in a sparse file. ;-) Also, a
    Windows file can contain more than one stream (sequence) of bytes. Whether
    these features are present depends on the filesystem the file resides on.
    The default filesystem, NTFS, supports these features. But any way you look
    at it, at the bottom of the interpretation ladder there are bytes. And
    nothing else.

    Don military boots and kick the person who misinformed you, in his/her ass.

    But, be careful: be sure to not seriously hurt the brain residing there. ;-)



    Cheers,

    - Alf
     
    Alf P. Steinbach, Aug 6, 2009
    #12
  13. James Kanze

    James Kanze Guest

    On Aug 6, 10:26 am, "Alf P. Steinbach" <> wrote:
    > * James Kanze:
    > > Files come in all sorts of varieties and flavors, depending on
    > > the OS. Although it's true that on almost all systems, files
    > > contain "bytes" (Windows is, I think, the only exception), those
    > > bytes are structured in various ways; text files are structured
    > > in lines, for example.


    > Files in Windows only contain "bytes". But they can have areas
    > where they don't really contain anything, namely, a sparse
    > file. ;-) Also, a Windows file can contain more than one
    > stream (sequence) of bytes. Whether these features are present
    > depend on the filesystem the file resides on. The default
    > filesystem, NTFS, support these features. But any way you look
    > at it, at the bottom of the interpretation ladder there are
    > bytes. And nothing else.


    Well, there's a certain level where everything is just bytes.
    But I was under the impression that Windows used UTF-16 for text
    at the system level, and that files could (and text files
    generally did) contain UTF-16---i.e. 16 bit entities. (And
    under Windows on a PC, a byte is 8 bits.)

    On the other hand, now that you mention it... When I ported some
    of my file handling classes to Windows, filenames for CreateFile
    were always LPCTSTR, whatever that is (but a narrow character
    string literal converts implicitly to it, as does the result of
    std::string::c_str()), which makes me wonder why people argue
    that std::fstream must have a form which takes a wchar_t string
    as filename argument. According to the documentation, WriteFile
    and ReadFile take what I assume to be a void* (LPCVOID or
    LPVOID), which doesn't say much one way or the other, but the
    length argument is specified as "number of bytes".

    > Don military boots and kick the person who misinformed you, in
    > his/her ass.


    > But, be careful: be sure to not seriously hurt the brain
    > residing there. ;-)


    I think it was more an impression I got from postings here,
    which suggested that all (or most) of the API was ambivalent;
    change a macro or a compiler option, and you got a different set
    of system API's, which expected wchar_t (and the type of TCHAR
    changed as well). I was probably just extrapolating too much
    into it.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Aug 6, 2009
    #13
  14. Jerry Coffin

    Jerry Coffin Guest

    In article <693fed3c-761e-4429-b6b0-9a6f77a52748
    @c14g2000yqm.googlegroups.com>, says...

    [ ... ]

    > Well, there's a certain level where everything is just bytes.
    > But I was under the impression that Windows used UTF-16 for text
    > at the system level, and that files could (and text files
    > generally did) contain UTF-16---i.e. 16 bit entities. (And
    > under Windows on a PC, a byte is 8 bits.)


    They can, but they far more often contain something like ISO 8859.

    In the end, the OS is mostly agnostic about the content of text
    files. As you'd expect, it includes some utilities that know how to
    work with text files, and most of those can work with files
    containing either 8-bit or 16-bit entities, and even guess which a
    particular file contains (though the guess isn't always right).
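
    (One common heuristic is to sniff for a BOM at the front of the
    file -- a crude sketch, assuming a stream opened in binary mode;
    the sniff_encoding name is made up for the example:)

    // Guess the encoding of a text file from its first bytes.
    std::string sniff_encoding(std::istream &f)
    {
        int b0 = f.get(), b1 = f.get();
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xEF && b1 == 0xBB && f.get() == 0xBF) return "UTF-8";
        return "unknown (assume 8-bit)";
    }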

    > On the other hand, now that you mention it... When I ported some
    > of my file handling classes to Windows, filenames for CreateFile
    > were always LPCTSTR, whatever that is (but a narrow character
    > string literal converts implicitly to it, as does the result of
    > std::string::c_str()), which makes me wonder why people argue
    > that std::fstream must have a form which takes a wchar_t string
    > as filename argument.


    Just FWIW, LPCTSTR is something like long pointer to const text
    string (where 'text' means char's or wchar_t's depending on whether
    _UNICODE was defined or not when compiling).

    If you don't have _UNICODE defined, CreateFile will accept a char *.
    If you do define it, CreateFile accepts a wchar_t *.

    In reality, most of the functions in Windows that take strings come
    in two flavors: an 'A' version and a 'W' version, so the headers look
    something like this:

    HANDLE CreateFileW(wchar_t const *, /* ... */);
    HANDLE CreateFileA(char const *, /* ... */);

    #ifdef _UNICODE
    #define CreateFile CreateFileW
    #else
    #define CreateFile CreateFileA
    #endif

    The 'A' version, however, is a small stub that converts the string
    from the current code page to UTF-16, and then (in essence) feeds
    that result to the 'W' version. That can lead to a problem if you use
    the 'A' version -- if your current code page doesn't contain a
    character corresponding to a character in the file name, you may not
    be able to create that file name with the 'A' version at all.

    The 'W' version lets you specify UTF-16 characters directly, so it
    can specify any file name that can exist -- but fstream::fstream and
    fstream::open usually act as wrappers for the 'A' version.

    Of course, you _could_ work around this without changing the fstream
    interface -- for example, you could write it to expect a UTF-8
    string, convert it to UTF-16, and then pass the result to CreateFileW
    -- but I don't know of anybody who does so. As I recall, there are
    also some characters that can't be encoded as UTF-8, so even that
    wouldn't be a perfect solution, though it would usually be adequate.
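
    (A sketch of that workaround, using MultiByteToWideChar with the
    CP_UTF8 code page; error handling omitted, and the create_file_utf8
    helper is a made-up name:)

    #include <windows.h>
    #include <string>
    #include <vector>

    // Convert a UTF-8 name to UTF-16 and hand it to the 'W' API.
    HANDLE create_file_utf8(std::string const &utf8_name)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0,
                                      utf8_name.c_str(), -1, 0, 0);
        std::vector<wchar_t> wide(len);
        MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1,
                            &wide[0], len);
        return CreateFileW(&wide[0], GENERIC_READ, FILE_SHARE_READ, 0,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    }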

    > According to the documentation, WriteFile
    > and ReadFile take what I assume to be a void* (LPCVOID or
    > LPVOID), which doesn't say much one way or the other, but the
    > length argument is specified as "number of bytes".


    Right -- the OS just passes this data through transparently.
    Fundamentally it's about like write() on Unix -- it just deals with a
    stream of bytes; any other structure is entirely up to you and what
    you choose to write and how you choose to interpret data you read.

    [ ... ]

    > I think it was more an impression I got from postings here,
    > which suggested that all (or most) of the API was ambivalent;
    > change a macro or a compiler option, and you got a different set
    > of system API's, which expected wchar_t (and the type of TCHAR
    > changed as well). I was probably just extrapolating too much
    > into it.


    I think that sounds about right. Most functions that accept a
    _string_ come in two flavors, one that accepts a narrow string and
    another that accepts a wide string. From its viewpoint, when you
    write to a file, however, that's not really a string, but just raw
    data, so there's just one version that passes the data through
    without interpretation or modification.

    --
    Later,
    Jerry.
     
    Jerry Coffin, Aug 6, 2009
    #14
  15. James Kanze

    James Kanze Guest

    On Aug 6, 8:27 pm, Jerry Coffin <> wrote:
    > In article <693fed3c-761e-4429-b6b0-9a6f77a52748
    > @c14g2000yqm.googlegroups.com>, says...


    > [ ... ]


    > > Well, there's a certain level where everything is just
    > > bytes. But I was under the impression that Windows used
    > > UTF-16 for text at the system level, and that files could
    > > (and text files generally did) contain UTF-16---i.e. 16 bit
    > > entities. (And under Windows on a PC, a byte is 8 bits.)


    > They can, but they far more often contain something like ISO
    > 8859.


    > In the end, the OS is mostly agnostic about the content of
    > text files. As you'd expect, it includes some utilities that
    > know how to work with text files, and most of those can work
    > with files containing either 8-bit or 16-bit entities, and
    > even guess which a particular file contains (though the guess
    > isn't always right).


    > > On the other hand, now that you mention it... When I ported
    > > some of my file handling classes to Windows, filenames for
    > > CreateFile were always LPCTSTR, whatever that is (but a
    > > narrow character string literal converts implicitly to it,
    > > as does the result of std::string::c_str()), which makes me
    > > wonder why people argue that std::fstream must have a form
    > > which takes a wchar_t string as filename argument.


    > Just FWIW, LPCTSTR is something like long pointer to const
    > text string (where 'text' means char's or wchar_t's depending
    > on whether _UNICODE was defined or not when compiling).


    In other words, you don't know what you're getting. That sounds
    like the worst of both worlds.

    > If you don't have _UNICODE defined, CreateFile will accept a
    > char *. If you do define it, CreateFile accepts a wchar_t *.


    > In reality, most of the functions in Windows that take strings
    > come in two flavors: an 'A' version and a 'W' version, so the
    > headers look something like this:


    > HANDLE CreateFileW(wchar_t const *, /* ... */);
    > HANDLE CreateFileA(char const *, /* ... */);


    > #ifdef _UNICODE
    > #define CreateFile CreateFileW
    > #else
    > #define CreateFile CreateFileA
    > #endif


    Hopefully, they do use an inline function in the #ifdef, and not
    a macro.

    > The 'A' version, however, is a small stub that converts the
    > string from the current code page to UTF-16, and then (in
    > essence) feeds that result to the 'W' version. That can lead
    > to a problem if you use the 'A' version -- if your current
    > code page doesn't contain a character corresponding to a
    > character in the file name, you may not be able to create that
    > file name with the 'A' version at all.


    Hopefully, they have a code page for UTF-8.

    And what happens with the name when it is actually passed to the
    file system? Most file systems I have mounted won't support
    UTF-16 in filenames---the file system will read it as an NTMB
    string, and stop at the first byte with 0. (Also, the file
    servers are often big endian, and not little endian.) I'm
    pretty sure that NFS doesn't support UTF-16 in the protocol, and
    I don't think SMB does either.

    > The 'W' version lets you specify UTF-16 characters directly,
    > so it can specify any file name that can exist -- but
    > fstream::fstream and fstream::open usually act as wrappers for
    > the 'A' version.


    > Of course, you _could_ work around this without changing the
    > fstream interface -- for example, you could write it to expect
    > a UTF-8 string, convert it to UTF-16, and then pass the result
    > to CreateFileW -- but I don't know of anybody who does so. As
    > I recall, there are also some characters that can't be encoded
    > as UTF-8, so even that wouldn't be a perfect solution, though
    > it would usually be adequate.


    UTF-8 can encode anything in Unicode. And more; basically, in
    its most abstract form, it's just a means of encoding 32 bit
    values as sequences of 8 bit bytes, and can handle any 32 bit
    value. (The Unicode definition of UTF-8 does introduce some
    restrictions---I don't think encodings of surrogates are
    allowed, for example, and codes Unicode forbids, like 0xFFFF,
    certainly aren't. But in the basic original UTF-8, there's no
    problem with those either.)
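
    The encoding itself is mechanical; a sketch for code points up to
    0x10FFFF (with none of Unicode's checks for surrogates or other
    forbidden values; needs <string>):

    // Return the UTF-8 encoding of a single code point.
    std::string encode_utf8(unsigned long cp)
    {
        std::string out;
        if (cp < 0x80) {
            out += char(cp);                        // 1 byte, ASCII
        } else if (cp < 0x800) {
            out += char(0xC0 | (cp >> 6));          // 2 bytes
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += char(0xE0 | (cp >> 12));         // 3 bytes
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {
            out += char(0xF0 | (cp >> 18));         // 4 bytes
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
        return out;
    }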

    > > According to the documentation, WriteFile and ReadFile take
    > > what I assume to be a void* (LPCVOID or LPVOID), which
    > > doesn't say much one way or the other, but the length
    > > argument is specified as "number of bytes".


    > Right -- the OS just passes this data through transparently.
    > Fundamentally it's about like write() on Unix -- it just deals
    > with a stream of bytes; any other structure is entirely up to
    > you and what you choose to write and how you choose to
    > interpret data you read.


    In other words, there is no transfer of 16 bit entities. It's
    up to the writer to write it as bytes, and the reader to read it
    as bytes, and the two to agree how to do so. (In practice, of
    course, if the two are both on the same machine, this won't be a
    problem. But in practice, in the places I've worked, most of
    the files on the PC's have been remote mounted on a Sparc, which
    is big-endian.)
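
    On the reading side, that agreement amounts to reassembling the 16
    bit units from bytes explicitly, whatever the host's byte order --
    a sketch (needs <istream> and <vector>):

    // Read 16-bit little-endian code units from a byte stream.
    std::vector<unsigned short> read_utf16le(std::istream &in)
    {
        std::vector<unsigned short> units;
        for (;;) {
            int lo = in.get();
            int hi = in.get();
            if (!in)
                break;              // EOF (or error) mid-stream
            units.push_back(static_cast<unsigned short>(lo | (hi << 8)));
        }
        return units;
    }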

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Aug 7, 2009
    #15
  16. * James Kanze:
    > On Aug 6, 8:27 pm, Jerry Coffin <> wrote:
    >> In article <693fed3c-761e-4429-b6b0-9a6f77a52748
    >> @c14g2000yqm.googlegroups.com>, says...

    >
    >> [ ... ]

    >
    >>> Well, there's a certain level where everything is just
    >>> bytes. But I was under the impression that Windows used
    >>> UTF-16 for text at the system level, and that files could
    >>> (and text files generally did) contain UTF-16---i.e. 16 bit
    >>> entities. (And under Windows on a PC, a byte is 8 bits.)

    >
    >> They can, but they far more often contain something like ISO
    >> 8859.

    >
    >> In the end, the OS is mostly agnostic about the content of
    >> text files. As you'd expect, it includes some utilities that
    >> know how to work with text files, and most of those can work
    >> with files containing either 8-bit or 16-bit entities, and
    >> even guess which a particular file contains (though the guess
    >> isn't always right).

    >
    >>> On the other hand, now that you mention it... When I ported
    >>> some of my file handling classes to Windows, filenames for
    >>> CreateFile were always LPCTSTR, whatever that is (but a
    >>> narrow character string literal converts implicitly to it,
    >>> as does the result of std::string::c_str()), which makes me
    >>> wonder why people argue that std::fstream must have a form
    >>> which takes a wchar_t string as filename argument.

    >
    >> Just FWIW, LPCTSTR is something like long pointer to const
    >> text string (where 'text' means char's or wchar_t's depending
    >> on whether _UNICODE was defined or not when compiling).

    >
    > In other words, you don't know what you're getting. That sounds
    > like the worst of both worlds.


    The T was a feature enabling compilation of C and C++ for both Windows 9x
    (narrow characters only) and NT (wide characters, representing Unicode).

    The T is not used today except by (1) those who need to support old 9x
    *and* are using some libraries that really require narrow characters
    (namely, in practice, DLL-based MFC), and (2) utter novices, being misled
    by Microsoft example code (which apparently also is written by utter
    novices), and (3) incompetents.

    We'd not want any kind of macros like that in the standard, and they have
    no business in any quality app either.


    >> If you don't have _UNICODE defined, CreateFile will accept a
    >> char *. If you do define it, CreateFile accepts a wchar_t *.

    >
    >> In reality, most of the functions in Windows that take strings
    >> come in two flavors: an 'A' version and a 'W' version, so the
    >> headers look something like this:

    >
    >> HANDLE CreateFileW(wchar_t const *, /* ... */);
    >> HANDLE CreateFileA(char const *, /* ... */);

    >
    >> #ifdef _UNICODE
    >> #define CreateFile CreateFileW
    >> #else
    >> #define CreateFile CreateFileA
    >> #endif

    >
    > Hopefully, they do use an inline function in the #ifdef, and not
    > a macro.


    No, it's all macros.

    Thousands of them.

    :)


    >> The 'A' version, however, is a small stub that converts the
    >> string from the current code page to UTF-16, and then (in
    >> essence) feeds that result to the 'W' version. That can lead
    >> to a problem if you use the 'A' version -- if your current
    >> code page doesn't contain a character corresponding to a
    >> character in the file name, you may not be able to create that
    >> file name with the 'A' version at all.

    >
    > Hopefully, they have a code page for UTF-8.


    No. Or, technically yes, there's a designation, and the APIs happily convert to
    and from that codepage, correctly. But as of Windows XP UTF-8 is not supported
    by standard Windows programs, in particular the command interpreter (where
    commands can just fail silently when you change to codepage 65001) -- I don't
    know whether that's been fixed in Vista or Windows 7.


    > And what happens with the name when it is actually passed to the
    > file system? Most file systems I have mounted won't support
    > UTF-16 in filenames---the file system will read it as an NTMB
    > string, and stop at the first byte with 0. (Also, the file
    > servers are often big endian, and not little endian.) I'm
    > pretty sure that NFS doesn't support UTF-16 in the protocol, and
    > I don't think SMB does either.


    The NTFS filesystem stores filenames with UTF-16 encoding.


    >> The 'W' version lets you specify UTF-16 characters directly,
    >> so it can specify any file name that can exist -- but
    >> fstream::fstream and fstream::open usually act as wrappers for
    >> the 'A' version.

    >
    >> Of course, you _could_ work around this without changing the
    >> fstream interface -- for example, you could write it to expect
    >> a UTF-8 string, convert it to UTF-16, and then pass the result
    >> to CreateFileW -- but I don't know of anybody who does so. As
    >> I recall, there are also some characters that can't be encoded
    >> as UTF-8, so even that wouldn't be a perfect solution, though
    >> it would usually be adequate.

    >
    > UTF-8 can encode anything in Unicode. And more; basically, in
    > its most abstract form, it's just a means of encoding 32 bit
    > values as sequences of 8 bit bytes, and can handle any 32 bit
    > value. (The Unicode definition of UTF-8 does introduce some
    > restrictions---I don't think encodings of surrogates are
    > allowed, for example, and codes Unicode forbids, like 0xFFFF,
    > certainly aren't. But in the basic original UTF-8, there's no
    > problem with those either.)
    >
    >>> According to the documentation, WriteFile and ReadFile take
    >>> what I assume to be a void* (LPCVOID or LPVOID), which
    >>> doesn't say much one way or the other, but the length
    >>> argument is specified as "number of bytes".

    >
    >> Right -- the OS just passes this data through transparently.
    >> Fundamentally it's about like write() on Unix -- it just deals
    >> with a stream of bytes; any other structure is entirely up to
    >> you and what you choose to write and how you choose to
    >> interpret data you read.

    >
    > In other words, there is no transfer of 16 bit entities. It's
    > up to the writer to write it as bytes, and the reader to read it
    > as bytes, and the two to agree how to do so. (In practice, of
    > course, if the two are both on the same machine, this won't be a
    > problem. But in practice, in the places I've worked, most of
    > the files on the PC's have been remote mounted on a Sparc, which
    > is big-endian.)


    The basic problem is that while the g++ compiler doesn't support a Byte
    Order Mark at the start of a UTF-8 source code file, the MSVC compiler
    requires it.
     
    Alf P. Steinbach, Aug 7, 2009
    #16
  17. Jerry Coffin

    Jerry Coffin Guest

    In article <3ca0c757-cb5a-46ae-ab91-9e4aa27d18f1
    @q14g2000vbi.googlegroups.com>, says...
    >
    > On Aug 6, 8:27 pm, Jerry Coffin <> wrote:


    [ ... ]

    > > Just FWIW, LPCTSTR is something like long pointer to const
    > > text string (where 'text' means char's or wchar_t's depending
    > > on whether _UNICODE was defined or not when compiling).

    >
    > In other words, you don't know what you're getting. That sounds
    > like the worst of both worlds.


    I can't say I've ever run into a situation where I didn't get what I
    wanted or didn't know what I was going to get. At the same time, for
    _most_ new development, I'd ignore all that and use the "W" versions
    of functions directly. Those are really its native functions, and
    they're always a bit faster, require less storage, and have at least
    the same capabilities as the "A" versions of the same (and sometimes
    more).

    [ ... ]

    > > #ifdef _UNICODE
    > > #define CreateFile CreateFileW
    > > #else
    > > #define CreateFile CreateFileA
    > > #endif

    >
    > Hopefully, they do use an inline function in the #ifdef, and not
    > a macro.


    I haven't rechecked recently, but the last time I looked, it was a
    macro.

    > > The 'A' version, however, is a small stub that converts the
    > > string from the current code page to UTF-16, and then (in
    > > essence) feeds that result to the 'W' version. That can lead
    > > to a problem if you use the 'A' version -- if your current
    > > code page doesn't contain a character corresponding to a
    > > character in the file name, you may not be able to create that
    > > file name with the 'A' version at all.

    >
    > Hopefully, they have a code page for UTF-8.


    Yes, thankfully, they do.

    [ ... ]

    > And what happens with the name when it is actually passed to the
    > file system? Most file systems I have mounted won't support
    > UTF-16 in filenames---the file system will read it as an NTMB
    > string, and stop at the first byte with 0. (Also, the file
    > servers are often big endian, and not little endian.) I'm
    > pretty sure that NFS doesn't support UTF-16 in the protocol, and
    > I don't think SMB does either.


    This is one of the places that I think the GUI way of doing things is
    helpful -- you're normally giving the user a list of files from the
    server, and then passing the server back a name picked from the list.

    As far as the mechanics go, I've never looked very carefully -- I
    suspect it's up to the FS driver to translate names as well as
    possible, and (particularly) ensure that translations work
    bidirectionally, so if you get a name from the remote server, and
    then pass that same name back, it signifies the original file.

    [ ... ]

    > UTF-8 can encode anything in Unicode. And more; basically, in
    > its most abstract form, it's just a means of encoding 31 bit
    > values as sequences of 8 bit bytes, and can handle any 31 bit
    > value. (The Unicode definition of UTF-8 does introduce some
    > restrictions---I don't think encodings of surrogates are
    > allowed, for example, and codes Unicode forbids, like 0xFFFF,
    > certainly aren't. But in the basic original UTF-8, there's no
    > problem with those either.)


    I think we're mostly dealing with a difference in how terminology is
    being used, but I also think it's more or less irrelevant -- as long
    as you use UTF-8, you'll almost certainly be able to represent any
    file name there is.

    [ ... ]

    > In other words, there is no transfer of 16 bit entities. It's
    > up to the writer to write it as bytes, and the reader to read it
    > as bytes, and the two to agree how to do so. (In practice, of
    > course, if the two are both on the same machine, this won't be a
    > problem. But in practice, in the places I've worked, most of
    > the files on the PCs have been remote mounted on a Sparc, which
    > is big-endian.)


    As long as the files are only being used on PCs, and stored on
    SPARCs, that shouldn't matter. Just to act as a file server, all it
    has to do is ensure that the stream of bytes that was sent to it
    matches the stream of bytes it plays back.

    We're on fairly familiar ground here though -- Windows being involved
    doesn't really change anything. If you're writing a file of Unicode
    text, putting a BOM at the beginning should be enough to let anything
    that "knows" Unicode read it. If the file needs to contain anything
    much more complex, you probably want to use some standardized
    encoding format like ASN.1 or XDR. Choosing between those is usually
    pretty easy as well: you use XDR when you can, and ASN.1 if you have
    to (e.g. to exchange data with something that only understands ASN.1,
    or if you really need the data to be self-describing).
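
    For the plain-text case, writing the signature is about as simple as
    it gets (a minimal sketch; the file name is made up, and the stream
    is opened in binary mode so nothing reinterprets the bytes):

        #include <fstream>

        int main()
        {
            std::ofstream out("data.txt", std::ios_base::binary);
            out << "\xEF\xBB\xBF";             // UTF-8 BOM / signature
            out << "-0.00231844, -0.02326\n";  // payload, already UTF-8
        }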

    --
    Later,
    Jerry.
     
    Jerry Coffin, Aug 7, 2009
    #17
  18. James Kanze

    James Kanze Guest

    On Aug 7, 4:00 pm, Jerry Coffin <> wrote:
    > In article <3ca0c757-cb5a-46ae-ab91-9e4aa27d18f1
    > @q14g2000vbi.googlegroups.com>, says...
    > > On Aug 6, 8:27 pm, Jerry Coffin <> wrote:


    > [ ... ]


    > > > Just FWIW, LPCTSTR is something like long pointer to const
    > > > text string (where 'text' means char's or wchar_t's
    > > > depending on whether _UNICODE was defined or not when
    > > > compiling).


    > > In other words, you don't know what you're getting. That
    > > sounds like the worst of both worlds.


    > I can't say I've ever run into a situation where I didn't get
    > what I wanted or didn't know what I was going to get.


    A library with inline functions or template code?

    More generally, how do you ensure that all components of an
    application are compiled with the same value for _UNICODE?

    > At the same time, for _most_ new development, I'd ignore all
    > that and use the "W" versions of functions directly. Those are
    > really its native functions, and they're always a bit faster,
    > require less storage, and have at least the same capabilities
    > as the "A" versions of the same (and sometimes more).


    That sounds reasonable.

    > [ ... ]
    > > > The 'A' version, however, is a small stub that converts
    > > > the string from the current code page to UTF-16, and then
    > > > (in essence) feeds that result to the 'W' version. That
    > > > can lead to a problem if you use the 'A' version -- if
    > > > your current code page doesn't contain a character
    > > > corresponding to a character in the file name, you may not
    > > > be able to create that file name with the 'A' version at
    > > > all.


    > > Hopefully, they have a code page for UTF-8.


    > Yes, thankfully, they do.


    What about Alf's claim that it doesn't really work?

    More generally, if you're going to do this sort of thing, you
    need to offer a bit more flexibility. Filenames can come from
    many different sources, and depending on the origin, the
    encoding may not be the same.

    > [ ... ]
    > > And what happens with the name when it is actually passed to
    > > the file system? Most file systems I have mounted won't
    > > support UTF-16 in filenames---the file system will read it
    > > as an NTMB string, and stop at the first byte with 0. (Also,
    > > the file servers are often big endian, and not little
    > > endian.) I'm pretty sure that NFS doesn't support UTF-16 in
    > > the protocol, and I don't think SMB does either.


    > This is one of the places that I think the GUI way of doing
    > things is helpful -- you're normally giving the user a list
    > of files from the server, and then passing the server back a
    > name picked from the list.


    Not when you're creating new files. And most of my programs
    don't run under a GUI; they're servers, which run 24 hours a
    day. Of course, they don't run under Windows either, so the
    question is moot:). But the question remains---picking up the
    name from a GUI is fine for interactive programs, but a lot of
    programs aren't interactive.

    > [ ... ]
    > > UTF-8 can encode anything in Unicode. And more; basically,
    > > in its most abstract form, it's just a means of encoding 31
    > > bit values as sequences of 8 bit bytes, and can handle any
    > > 31 bit value. (The Unicode definition of UTF-8 does
    > > introduce some restrictions---I don't think encodings of
    > > surrogates are allowed, for example, and codes Unicode
    > > forbids, like 0xFFFF, certainly aren't. But in the basic
    > > original UTF-8, there's no problem with those either.)


    > I think we're mostly dealing with a difference in how
    > terminology is being used,


    UTF-8 really does have two commonly accepted meanings. The
    original UTF-8 was just a means of formatting 16-bit, and later
    31-bit, entities as bytes, and could handle any value that could be
    represented in 31 bits. The Unicode definition clearly
    restricts it somewhat, but their site is down right now, so I
    can't see exactly how. If nothing else, they only allow values
    in the range 0-0x10FFFF (which means that the longest sequence
    is only 4 bytes, rather than 6), but I'm sure that there are
    other restrictions as well.
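
    To make the difference concrete, here is a minimal encoder for the
    original scheme (my own sketch; the Unicode version would
    additionally reject surrogates, the forbidden codes, and anything
    above 0x10FFFF, which caps sequences at four bytes):

        #include <string>

        // Encode one value (up to 31 bits) with the original UTF-8
        // scheme: a lead byte carrying the length, then 6-bit
        // continuation bytes.  Sketch only.
        std::string encode_utf8(unsigned long c)
        {
            std::string r;
            if (c < 0x80UL) {
                r += char(c);
                return r;
            }
            int n = c < 0x800UL      ? 2 : c < 0x10000UL   ? 3
                  : c < 0x200000UL   ? 4 : c < 0x4000000UL ? 5 : 6;
            r += char(((0xFF << (8 - n)) & 0xFF) | (c >> (6 * (n - 1))));
            for (int i = n - 2; i >= 0; --i)
                r += char(0x80 | ((c >> (6 * i)) & 0x3F));
            return r;
        }

    With the original rules encode_utf8(0x7FFFFFFF) produces six bytes;
    under the Unicode definition nothing beyond four bytes is legal.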

    > but I also think it's more or less irrelevant -- as long as
    > you use UTF-8, you'll almost certainly be able to represent
    > any file name there is.


    Yes.

    > [ ... ]
    > > In other words, there is no transfer of 16 bit entities.
    > > It's up to the writer to write it as bytes, and the reader
    > > to read it as bytes, and the two to agree how to do so. (In
    > > practice, of course, if the two are both on the same
    > > machine, this won't be a problem. But in practice, in the
    > > places I've worked, most of the files on the PCs have been
    > > remote mounted on a Sparc, which is big-endian.)


    > As long as the files are only being used on PCs, and stored on
    > SPARCs, that shouldn't matter. Just to act as a file server,
    > all it has to do is ensure that the stream of bytes that was
    > sent to it matches the stream of bytes it plays back.


    And that the file name matches, somehow. But typically, this
    isn't the case---I regularly share files between systems, and
    this seems to be the case for everyone where I work.

    > We're on fairly familiar ground here though -- Windows being
    > involved doesn't really change anything. If you're writing a
    > file of Unicode text, putting a BOM at the beginning should be
    > enough to let anything that "knows" Unicode read it. If the
    > file needs to contain anything much more complex, you probably
    > want to use some standardized encoding format like ASN.1 or
    > XDR. Choosing between those is usually pretty easy as well:
    > you use XDR when you can, and ASN.1 if you have to (e.g. to
    > exchange data with something that only understands ASN.1, or
    > if you really need the data to be self-describing).


    I agree that standard (and simple) solutions exist. Putting a
    BOM at the start of a text file allows immediate identification
    of the encoding format. But how many editors that you know
    actually do this? (For non-text files, of course, you have to
    define a format, and the defined formats do tend to work
    everywhere. Although there's still the question of what to do
    if you have a filename embedded in an otherwise non-text
    format.)

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Aug 8, 2009
    #18
  19. Jorgen Grahn

    Jorgen Grahn Guest

    On Tue, 4 Aug 2009 16:43:34 +0200, fdm <> wrote:
    > Hi I have a .txt file containing:
    >
    > [-0.00231844, -0.02326, 0.0484723, 0.0782189, 0.0917853,
    > 0.119546, -0.00335514, -0.217619, -0.107065, 0, -0.0329693, -0.148395,
    > 0.104663, -0.550282, -1.26802, -0.705694, 0.0873308, -0.309962, -0.802861,

    ....
    > 1.19829, 0.0257344, 0, -0.186464, -1.54877, 0.321253,
    > 0.403886, -0.983199, -1.91005, -0.53617, -0.353148, -0.0942512, 0, 0, 0, 0,
    > 0, 0, 0, 0, 0, 0, 0]
    >
    > Now I would like to read this array into a C++ container/array so I can do
    > something like:
    >
    > double first = container[0]; // -0.00231844

    ....

    What I'd do in this case is: define the file format semi-formally
    (i.e. not just with an example). Then everything falls into place for
    me.

    I'd probably decide the brackets and commas carry no information, and
    that I needed support for comments. Then I'd be left with a
    line-oriented file format where each line is
    - a comment introduced by #
    - an empty line (nothing but whitespace)
    - a series of whitespace-delimited doubles, parsable using strtod()
    or
    - a syntax error

    I prefer not to let iostreams do the tokenization, partly because I
    don't know it that well, and partly because I don't want it to define
    the file format. I want a file format which I can explain to a
    Perl/whatever programmer without mentioning C++ (strtod() documentation
    is easy to come by).
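
    A reader for exactly that format is short (my own sketch -- the
    function name is made up, but strtod() and the line structure are
    as described above):

        #include <cctype>
        #include <cstdlib>
        #include <fstream>
        #include <stdexcept>
        #include <string>
        #include <vector>

        // Read the line-oriented format described above: '#' starts a
        // comment, blank lines are ignored, everything else must be
        // whitespace-delimited doubles in strtod() syntax.
        std::vector<double> read_doubles(const char* filename)
        {
            std::vector<double> result;
            std::ifstream in(filename);
            std::string line;
            while (std::getline(in, line)) {
                std::string::size_type hash = line.find('#');
                if (hash != std::string::npos)
                    line.erase(hash);              // strip the comment
                const char* p = line.c_str();
                for (;;) {
                    char* end;
                    double d = std::strtod(p, &end);
                    if (end == p)
                        break;                     // no more numbers here
                    result.push_back(d);
                    p = end;
                }
                while (std::isspace((unsigned char)*p))
                    ++p;                           // blank lines are fine
                if (*p != '\0')
                    throw std::runtime_error("syntax error");
            }
            return result;
        }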

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, Aug 8, 2009
    #19
  20. Jerry Coffin

    Jerry Coffin Guest

    In article <bdbbf8aa-5da2-443a-bd20-98969e1b7633
    @v2g2000vbb.googlegroups.com>, says...
    >
    > On Aug 7, 4:00 pm, Jerry Coffin <> wrote:


    [ ... ]

    > > I can't say I've ever run into a situation where I didn't get
    > > what I wanted or didn't know what I was going to get.

    >
    > A library with inline functions or template code?


    Oh, don't get me wrong -- I'm not saying there aren't any situations
    that could/would cause problems, only that I've been able to avoid
    problems from it so far.

    > More generally, how do you ensure that all components of an
    > application are compiled with the same value for _UNICODE?


    That, quite frankly, can be a pain. Most of the libraries and such
    I've used came with both versions; from what I've seen that seems to
    be fairly common.

    [ UTF-8 code page ]

    > What about Alf's claim that it doesn't really work?


    I didn't see his post saying that. In any case, I've used it (some)
    and can say it works to some degree, but I'll openly admit that I've
    never really put it under a lot of stress either -- nearly all of my
    code gets used primarily in the US, where the conversion is usually
    trivial.
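
    For what it's worth, the conversion itself is only a few lines
    (CP_UTF8, MultiByteToWideChar and CreateFileW are the real Win32
    names; the wrapper is an illustrative sketch):

        #include <windows.h>

        // Sketch: open a file whose name arrives as UTF-8, going
        // through the native 'W' interface and bypassing the ANSI
        // code page entirely.
        HANDLE open_utf8(const char* utf8_name)
        {
            wchar_t wname[MAX_PATH];
            if (MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1,
                                    wname, MAX_PATH) == 0)
                return INVALID_HANDLE_VALUE;
            return CreateFileW(wname, GENERIC_READ, FILE_SHARE_READ, 0,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
        }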

    > More generally, if you're going to do this sort of thing, you
    > need to offer a bit more flexibility. Filenames can come from
    > many different sources, and depending on the origine, the
    > encoding may not be the same.


    True -- but ultimately, nothing you can do really gets away from
    problems. For any heterogeneous client and server, there's at least
    some possibility of a discrepancy between how the client and server
    will interpret things -- and even when they _seem_ homogeneous (e.g.
    both running the same variant of Unix) a difference in file system
    could still lead to a problem.

    [ Figuring out valid names on a server ]

    > Not when you're creating new files. And most of my programs
    > don't run under a GUI; they're servers, which run 24 hours a
    > day. Of course, they don't run under Windows either, so the
    > question is moot:). But the question remains---picking up the
    > name from a GUI is fine for interactive programs, but a lot of
    > programs aren't interactive.


    Even when the main program isn't interactive, configuration for it
    can be.

    Ultimately you're right though -- it would be nice if you could
    depend on (for example) being able to query a server about some basic
    characteristics of a shared/exported file system, so you could
    portably figure out what it allows. Right now, virtually all such
    "knowledge" is encoded implicitly in client code (or simply doesn't
    exist -- the client just passes a string through and hopes for the
    best).

    [ ... ]

    > And that the file name matches, somehow. But typically, this
    > isn't the case---I regularly share files between systems, and
    > this seems to be the case for everyone where I work.


    I wish I could offer something positive here, but I doubt I can.
    Ultimately, this depends more on the FS than the OS though -- just
    for example, regardless of the OS, an ISO 9660 FS (absent something
    like Joliet extensions) places draconian restrictions on file names.

    [ ... ]

    > I agree that standard (and simple) solutions exist. Putting a
    > BOM at the start of a text file allows immediate identification
    > of the encoding format. But how many editors that you know
    > actually do this?


    A few -- Windows Notepad knows how to create and work with UTF-8,
    UTF-16LE, and UTF-16BE, all including BOMs (or whatever you call the
    UTF-8 signature). The current version of Visual Studio also seems to
    work fine with UTF-8 and UTF-16 (BE & LE) text files. It preserves
    the BOM and endianness when saving a modified version -- but if you
    want to use it to create a new file with UTF-16BE encoding (for
    example), that might be a bit more difficult (I haven't tried very
    hard, but I don't immediately see a "Unicode big endian" option like
    Notepad provides).
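
    Going the other way -- recognizing such a file when reading it --
    the signature check is equally mechanical (a sketch that handles
    only the three encodings mentioned here):

        #include <fstream>
        #include <string>

        // Sketch: classify a text file by its leading byte-order mark.
        std::string detect_bom(const char* filename)
        {
            std::ifstream in(filename, std::ios_base::binary);
            unsigned char b[3] = { 0, 0, 0 };
            in.read(reinterpret_cast<char*>(b), 3);
            if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
            if (b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
            if (b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
            return "no BOM / unknown";
        }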

    --
    Later,
    Jerry.
     
    Jerry Coffin, Aug 9, 2009
    #20
