wide character file to wstring - unexpected results

Discussion in 'C++' started by Christopher, Dec 14, 2011.

  1. Christopher

    Christopher Guest

    I loaded a file using these two blocks of code and examined the
    results. I did not see what I expected. Each wchar_t seems to have its
    byte order swapped when looking at the results as bytes. When
    examining the contents of the wstring, extra '0' characters are
    inserted before each expected character.

    My colleague claims that it's some Microsoft/Intel thing. That doesn't
    help me write code that handles it, though.

    Can someone explain?


    //---
    // Load the file as wide character text
    {
        // Load the Init Document
        std::wifstream initDocFile(initDocumentPath.c_str());
        ASSERT_TRUE( initDocFile );

        // Copy the contents of the file into a string
        std::wstring initDoc(
            (std::istreambuf_iterator<wchar_t, std::char_traits<wchar_t> >(initDocFile)),
            (std::istreambuf_iterator<wchar_t, std::char_traits<wchar_t> >()));
        ASSERT_FALSE( initDoc.empty() );

        // Close the file
        initDocFile.close();
    }
    //-----

    Hovering over initDoc in Visual Studio 2008 shows:
    <
    0
    A
    0
    T
    0
    etc, etc


    //---
    // Load the file as bytes
    {
        // Load the Init Document
        std::ifstream initDocFile(initDocumentPath.c_str(),
                                  std::fstream::binary);
        ASSERT_TRUE( initDocFile );

        // Get the size of the file
        initDocFile.seekg(0, std::ios::end);
        std::streampos numBytes = initDocFile.tellg();
        initDocFile.seekg(0, std::ios::beg);

        // Copy the contents of the file into a vector
        std::vector<char> initDoc(numBytes);
        initDocFile.read(&initDoc[0], numBytes);
        ASSERT_FALSE( initDoc.empty() );

        // Close the file
        initDocFile.close();
    }
    //-----

    Hovering over initDoc in Visual Studio 2008 shows:
    60
    0
    65
    0
    etc.
    etc.

    //----

    Looking at the file in a hex editor shows:
    3C 00 41 00 54 00 etc. etc.
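
The hex dump above (3C 00 41 00 54 00) is what little-endian UTF-16 looks like on disk: the low byte of each 16-bit code unit comes first. One way to get from the raw bytes to a wstring without relying on locales is to pair the bytes up by hand. The following is only a minimal sketch under that assumption; `decodeUtf16Le` is a hypothetical helper name, not something from the original code, and it does not combine surrogate pairs on platforms where wchar_t is wider than 16 bits.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original post): treats the input as
// little-endian UTF-16 and stores one 16-bit code unit per wchar_t.
// A trailing odd byte, if any, is ignored.
std::wstring decodeUtf16Le(const std::vector<char>& bytes)
{
    std::wstring out;
    std::size_t i = 0;

    // Skip a little-endian byte order mark (FF FE) if the data starts with one.
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE) {
        i = 2;
    }

    for (; i + 1 < bytes.size(); i += 2) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<wchar_t>(lo | (hi << 8)));
    }
    return out;
}
```

Fed the six bytes from the hex dump, this produces the three characters "<AT" with no interleaved zero characters.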

    Furthermore,
    1) I cannot double-click the file and open it as XML on Windows Server
    2003. It says "Invalid character. Error processing resource"
    2) I cannot hover over initDoc in Visual Studio 2008, click the down
    arrow, and open the variable in the text visualizer; it shows "<"
    3) I cannot hover over initDoc in Visual Studio 2008, click the down
    arrow, and open the variable in the XML visualizer; it shows "A
    declaration was not closed. Error processing resource"

    Someone help me to understand.
    Christopher, Dec 14, 2011

  2. Christopher

    Christopher Guest

    On Dec 14, 5:41 pm, Sam wrote:
    > I am assuming, based on your description, that your file contents are coded
    > in UTF-16.
    >
    > If so, each two-byte code point should've been read into a single wchar_t.
    > That's what a wchar_t is, after all. Sounds like your std::wifstream thought
    > that your file contents were coded in, probably, ISO-8859-1, and you're
    > seeing the results.


    Sounds reasonable.

    > Double-check that you've set your global locale correctly to reflect that
    > your system environment uses UTF-16 coding, or imbue a UTF-16 locale into
    > your std::wifstream.


    As I understand it, in Visual Studio, if a project is set to use
    Unicode, then any wide strings are UTF-16. I also assume the Windows
    API calls that read and write files treat text as UTF-16. That's a
    question for an MS newsgroup, though.

    My questions here are:
    How do I set a "global locale"?
    How do I imbue a UTF-16 locale into a stream?
    Are there built-in UTF-16 locales?
    Are there built-in UTF-8 locales?
    Are there built-in conversion methods?

    I am googling the hell out of facets and locales and finding very
    little, aside from similarly frustrated people.
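
For the "imbue" question, one possible approach is sketched below. It assumes a compiler that ships the C++11 <codecvt> header (so not Visual Studio 2008; the header was later deprecated in C++17), and `readUtf16LeFile` is a hypothetical helper name. The idea is to build a locale that carries a std::codecvt_utf16 facet and imbue it into the stream before opening the file in binary mode. The global locale, by contrast, is replaced with std::locale::global; which named locales exist beyond "C" and "" is implementation-defined.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: reads a file assumed to hold little-endian UTF-16
// and returns its contents as a wstring. Returns an empty string if the
// file cannot be opened.
std::wstring readUtf16LeFile(const std::string& path)
{
    std::wifstream in;

    // The locale takes ownership of the facet allocated with new.
    // std::little_endian matches the 3C 00 41 00 byte layout in the file.
    in.imbue(std::locale(
        std::locale::classic(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

    // Binary mode keeps the runtime from translating line endings
    // in the middle of a two-byte code unit.
    in.open(path.c_str(), std::ios::in | std::ios::binary);

    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

The stream is imbued before it is opened so that no bytes are converted with the default facet first.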



    > > Furthermore,
    > > 1) I cannot double click the file and open it as XML on Windows Server
    > > 2003. It says "Invalid character. Error processing resource"

    >
    > If that's the case, then this has nothing to do with your code, and the
    > file's coding does not match your system locale.
    > The file must've been generated on a system that uses a locale with a
    > different character set/code point.


    I think that the encoding is not valid anywhere because of the mix and
    match between multibyte, wide, ASCII, UTF-16, UTF-8, Windows-generated
    text, third-party-library-generated text, streaming, etc. used
    throughout the project I am in, without any regard for consistency in
    character encoding.

    I am trying to decipher what they "thought it was" and how to get it
    into something usable.


    > Additionally, all XML files should be coded in UTF-8 anyway, not UTF-16, and
    > not ISO-8859-1.


    It's not XML that follows the rules. It's "XML" that only resembles
    XML in its use of tags, which some developer put into a file using
    Windows API functions.
    Christopher, Dec 15, 2011