unicode text file

Discussion in 'C++' started by Koulbak, May 20, 2005.

  1. Koulbak

    Koulbak Guest

    I have some Unicode (UTF-8) text files. I _tried_ to write a simple
    program that reads one of them and writes it to the standard output but...
    of course it doesn't work. Is there an easy way to do it? Thanks, K.

    This is my program.

    #include <fstream>
    #include <iostream>
    #include <string>

    using namespace std;

    int main() {
        ifstream infile("in.txt");
        string s;
        while (infile >> s) {
            cout << s;
        }
    }
     
    Koulbak, May 20, 2005
    #1

  2. Koulbak

    Mike Wahler Guest

    "Koulbak" <> wrote in message
    news:428de2b4$...
    >I have some unicode (utf8) text file. I _tried_ to write a simple program
    >that read one of them and write it to the standard output but... of course
    >it doesn't work. There is an easy way to do it? Thanks, K.
    >
    > This is my program.
    >
    > #include <fstream>
    > #include <iostream>
    > #include <string>
    >
    > using namespace std;
    >
    > int main(){
    > ifstream infile ("in.txt");


    You should check here that the file was opened successfully
    before attempting to read from it.

    > string s;
    > while (infile >> s) {
    > cout << s;
    > }
    > }


    Try using 'wifstream' and 'wcout'.

    -Mike
     
    Mike Wahler, May 20, 2005
    #2

  3. Koulbak

    Koulbak Guest

    Mike Wahler wrote:
    [read unicode text file]
    >>int main(){
    >> ifstream infile ("in.txt");

    >
    > You should here check that file was opened successfully
    > before attempting to read from it.


    In the real program of course I do it, but in my post I put only the
    essential part of the question.

    >> string s;
    >> while (infile >> s) {
    >> cout << s;
    >> }
    >>}

    >
    >
    > Try using 'wifstream' and 'wcout'.


    1. I tried that; it doesn't compile.

    error C2679: binary '>>' : no operator found which takes a right-hand
    operand of type 'std::string' (or there is no acceptable conversion)

    I then switched to wstring as well; it compiles, but it doesn't work
    correctly: it prints a lot of garbage.

    2. I thought that C++ made it possible to do this in exactly the
    standard way (avoiding special constructs such as wcout), maybe by setting
    some library option. Is that not true at all?

    Thanks a lot, K.
     
    Koulbak, May 20, 2005
    #3
  4. Koulbak wrote:

    > 1 Tried, it doesn't compile.
    >
    > error C2679: binary '>>' : no operator found which takes a right-hand
    > operand of type 'std::string' (or there is no acceptable conversion)



    You should use wstring. A wchar_t string literal is prefixed with L. For example:


    wstring s= L"Some string";



    > I added also wstring and it compile but it doens't work correctly: it
    > prints a lot of garbage.
    >
    > 2 I thought that with C++ there was the possibility to use exactly the
    > standard way (avoid special construct as wcout) maybe setting some
    > library option. Is it not at all true?


    These *are* standard facilities. All string facilities come with their wchar_t equivalents
    (including the facilities of the C-subset).



    --
    Ioannis Vranos

    http://www23.brinkster.com/noicys
     
    Ioannis Vranos, May 21, 2005
    #4
  5. Koulbak

    Koulbak Guest

    Ioannis Vranos wrote:
    > You should use wstring. [...]


    I added wstring; it doesn't work.

    >> 2 I thought that with C++ there was the possibility to use exactly the
    >> standard way (avoid special construct as wcout) maybe setting some
    >> library option. Is it not at all true?

    >
    >
    > These *are* standard facilities. All string facilities come with their
    > wchar_t equivalents (including the facilities of the C-subset).


    Sorry, I was not clear at all. I would like to avoid implementation
    details as much as possible. I don't want to call Unicode-specific
    functions explicitly; I just want to tell the compiler or the library that
    my character encoding is Unicode and then read a file exactly in the usual way.

    I would like to avoid learning a new set of functions to read and
    manipulate Unicode characters, Unicode strings and so on. If that is
    possible, of course.

    Thanks, K.
     
    Koulbak, May 21, 2005
    #5
  6. Koulbak

    Rapscallion Guest

    Koulbak wrote:
    > >> string s;
    > >> while (infile >> s) {
    > >> cout << s;
    > >> }
    > >>}

    >
    > 1 Tried, it doesn't compile.
    >
    > error C2679: binary '>>' : no operator found which takes a right-hand
    > operand of type 'std::string' (or there is no acceptable conversion)


    Either you have not included all the necessary header files, or you have
    included the wrong ones (or you have the wrong files in your include path).

    > I added also wstring and it compiles but it doesn't work correctly: it
    > prints a lot of garbage.


    wstring is not appropriate for UTF-8.
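
    UTF-8 is a byte-oriented encoding, so a plain `std::string` can hold it unchanged. The catch is that `size()` then counts bytes, not characters. A small sketch of the difference, where the helper `utf8_length` is a hypothetical name and well-formed input is assumed:

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {  // not a continuation byte
            ++n;
        }
    }
    return n;
}
```

    For a string holding "h", a two-byte "é" and "llo", `size()` reports 6 bytes while `utf8_length` reports 5 code points.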

    R.C.
     
    Rapscallion, May 21, 2005
    #6
  7. Koulbak

    Koulbak Guest

    [....]
    >> I added also wstring and it compiles but it doesn't work correctly: it
    >> prints a lot of garbage.
    >
    > wstring is not appropriate for UTF-8.


    OK, that's the problem. My encoding is UTF-8.

    Any solution?
    Thanks, K.
     
    Koulbak, May 21, 2005
    #7
  8. Koulbak

    Rapscallion Guest

    Rapscallion, May 21, 2005
    #8
  9. Koulbak

    Old Wolf Guest

    Koulbak wrote:
    > I have some unicode (utf8) text file. I _tried_ to write a
    > simple program that read one of them and write it to the
    > standard output but... of course it doesn't work. There
    > is an easy way to do it? Thanks, K.
    >
    > This is my program.
    >
    > #include <fstream>
    > #include <iostream>
    > #include <string>
    >
    > using namespace std;
    >
    > int main(){
    > ifstream infile ("in.txt");
    > string s;
    > while (infile >> s) {
    > cout << s;
    > }
    > }


    istream >> string skips any leading whitespace (including newlines)
    and then reads one word (up to the next whitespace), so the whitespace
    itself is never stored or printed.
    To read line by line, you would write:

    while (getline(infile, s))
        cout << s << '\n';

    This is safe for UTF-8 files: the newline byte (0x0A) never occurs
    inside a multi-byte UTF-8 sequence, so getline cannot split a character.

    To output the whole file at once:

    cout << infile.rdbuf();

    I'm assuming you want to output UTF-8 on stdout (Standard
    C++ offers no facilities for converting UTF-8 to a stream
    of wide characters). Can you clarify your intention?

    The best thing to do (IMHO) would be to open the file in
    binary mode, and also force std::cout into binary mode. (This
    would require system-specific code.) Then no translation
    will occur and it will work correctly.

    If you can't force cout to binary, then it *might* work to
    open the input in text mode too, and hope that the input
    conversions match the output conversions!
     
    Old Wolf, May 21, 2005
    #9
  10. Koulbak wrote:

    > Sorry I was not clear at all. I would like to avoid as mush as possible
    > the implementation details. I don't want to use explicitely unicode
    > function but simply say to the compiler or to the library that my
    > character code is unicode and then read a file exactly in the usual way.
    >
    > I would like to avoid to learn a new set of function to read and
    > manipulate unicode character, unicode string and so on. Of course if it
    > is possible.


    wchar_t represents the largest character set of a system; char mainly represents a byte
    and 1-byte character sets. If you have to deal with various character sets, it is better
    to stick to wchar_t and its corresponding facilities (which are the same as the plain
    char facilities, with an additional w in their name).
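
    For illustration, the pairing looks roughly like this. The two `describe` overloads are hypothetical names, written once with the char facilities and once with their w counterparts. Note that this only switches the character type; it does not by itself decode UTF-8 input.

```cpp
#include <cstring>
#include <cwchar>
#include <sstream>
#include <string>

// Narrow version: std::string, std::ostringstream, std::strlen.
std::string describe(const char* text) {
    std::ostringstream out;
    out << text << " has " << std::strlen(text) << " chars";
    return out.str();
}

// Wide version: std::wstring, std::wostringstream, std::wcslen.
std::wstring describe(const wchar_t* text) {
    std::wostringstream out;
    out << text << L" has " << std::wcslen(text) << L" chars";
    return out.str();
}
```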



    --
    Ioannis Vranos

    http://www23.brinkster.com/noicys
     
    Ioannis Vranos, May 21, 2005
    #10
  11. Ioannis Vranos, May 21, 2005
    #11
  12. Old Wolf wrote:

    > The best thing to do (IMHO) would be to open the file in
    > binary mode, and also force std::cout into binary mode. (This
    > would require a system-specific code). Then, no translation
    > will occur and it will work correctly.
    >
    > If you can't force cout to binary, then it *might* work to
    > open the input in text mode too, and hope that the input
    > conversions match the output conversions!



    What is wrong with the use of wcout?



    --
    Ioannis Vranos

    http://www23.brinkster.com/noicys
     
    Ioannis Vranos, May 21, 2005
    #12
  13. >> The best thing to do (IMHO) would be to open the file in
    >> binary mode, and also force std::cout into binary mode.
    >> (This would require a system-specific code). Then, no
    >> translation will occur and it will work correctly.
    >>
    >> If you can't force cout to binary, then it *might* work to
    >> open the input in text mode too, and hope that the input
    >> conversions match the output conversions!


    > What is wrong with the use of wcout?


    UTF-8 is a stream of 1-byte chars, with characters beyond ASCII
    coded as multi-byte sequences. I guess you need to read such
    a stream as a char or binary stream and then decode each line
    to UTF-16 with an appropriate routine, for example
    MultiByteToWideChar (and WideCharToMultiByte for the reverse direction)
    on the Win32 platform. Other APIs exist on *nix platforms, such as iconv.
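
    To show what such a decoding routine does, here is a portable hand-written sketch. It is not MultiByteToWideChar or iconv, and it decodes to UTF-32 code points rather than UTF-16 to keep the code short. The name `decode_utf8` is hypothetical, and C++11 is assumed for `std::u32string`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes a UTF-8 byte string into UTF-32 code points.
// Throws std::runtime_error on bad lead/continuation bytes or truncation.
// Simplification: overlong forms and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string result;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append the low 6 payload bits
        }
        result.push_back(cp);
        i += extra + 1;
    }
    return result;
}
```

    Production code would normally use a tested library routine instead, since full validation has more cases than this sketch covers.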

    --
    Serge
     
    Serge Skorokhodov (216716244), May 21, 2005
    #13
  14. Koulbak

    Guest

    > I have some unicode (utf8) text file. I _tried_ to write a simple
    > program that read one of them and write it to the standard output
    > but... of course it doesn't work


    What character set do you want to use when writing to standard output?

    If you want it to write using a character set other than the UTF-8 that
    it read in, you need to do some conversion. You have to do this
    explicitly. It will not happen automatically.

    Assuming that your program is going to actually do something with the
    text, rather than just reading it in and then writing it out again, you
    need to decide what character set you want to use internally. I mostly
    use UTF-8 internally and for input/output, so there is rarely any
    conversion. I store this in chars. This is on Unix, and I'm in the
    "western hemisphere". I understand that Windows programmers tend to
    use UTF-16 quite often and that would also be sensible for non-European
    languages. For that you should use wchars. You should not be using
    ASCII for any new applications.

    To actually perform the conversion you need something like the iconv
    library. This is supported just about everywhere, but you'll want a
    C++ wrapper for it to make it more palatable.
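
    A rough sketch of what such a wrapper might look like, assuming a POSIX iconv is available (on some systems it needs linking with -liconv). The class name `IconvConverter` is made up for illustration. The sketch omits the final flush call that stateful encodings need, and it assumes the `char**` input parameter used by glibc; a few platforms declare it `const char**`.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical RAII wrapper around a POSIX iconv conversion descriptor.
class IconvConverter {
public:
    // `to` and `from` are encoding names as understood by the local iconv,
    // for example "UTF-8" or "UCS-4LE".
    IconvConverter(const char* to, const char* from)
        : cd_(iconv_open(to, from)) {
        if (cd_ == (iconv_t)(-1)) {
            throw std::runtime_error("iconv_open: unsupported conversion");
        }
    }
    ~IconvConverter() { iconv_close(cd_); }

    IconvConverter(const IconvConverter&) = delete;
    IconvConverter& operator=(const IconvConverter&) = delete;

    // Converts a complete input string; throws on invalid or truncated input.
    std::string convert(const std::string& input) {
        iconv(cd_, nullptr, nullptr, nullptr, nullptr);  // reset to initial state
        std::string in = input;  // iconv takes a non-const char**
        char* src = &in[0];
        std::size_t src_left = in.size();
        std::string out;
        char buf[256];
        while (src_left > 0) {
            char* dst = buf;
            std::size_t dst_left = sizeof buf;
            std::size_t rc = iconv(cd_, &src, &src_left, &dst, &dst_left);
            int err = errno;  // save before any other call
            out.append(buf, sizeof buf - dst_left);
            // E2BIG only means the output buffer filled up: loop again.
            if (rc == (std::size_t)(-1) && err != E2BIG) {
                throw std::runtime_error("iconv: invalid or incomplete input");
            }
        }
        return out;
    }

private:
    iconv_t cd_;
};
```

    With this, `IconvConverter("UCS-4LE", "UTF-8").convert(text)` would turn UTF-8 bytes into 4-byte little-endian code units, provided the local iconv knows both encoding names.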

    Regards, Phil.
     
    , May 21, 2005
    #14
  15. Koulbak

    Koulbak Guest

    wrote:
    >>I have some unicode (utf8) text file. I _tried_ to write a simple
    >>program that read one of them and write it to the standard output
    >> but... of course it doesn't work

    >
    > What character set do you want to use when writing to standard output? [..]
    > If you want it to write using a character set other than the UTF-8 that
    > it read in, you need to do some conversion. You have to do this
    > explicitly. It will not happen automatically.



    Thanks! I think I perfectly understood the problem.

    My program was only an exercise, but the goal was to learn how to "set up"
    the library (?) to read Unicode (or possibly another encoding), manipulate
    it using the string functionality of the standard library and then
    write it back in a particular encoding to a file or to the standard output.

    > Assuming that your program is going to actually do
    > something with the text, rather than just reading
    > it in and then writing it out again, you need to
    > decide what character set you want to use internally.


    Is it really necessary that I specify the internal encoding? At my level
    (scholastic level) I have no performance problems, so if a
    default encoding exists, that is OK for me.

    So I would like to specify the input file encoding and the output file
    encoding, then use my program, for example:

    string s;
    while (infile >> s) {
        if (s == "hello") {
            // delete "hello" from input
        } else {
            cout << s;
        }
    }

    If possible, I don't want to specify wstring, wcout and so on, because:
    1. I don't want to change the program the day I need a different encoding.
    2. A program written without wstring, wcout etc. is more natural and
    general and doesn't touch the implementation level.

    [Old Wolf wrote]
    > I'm assuming you want to output UTF-8 on stdout (Standard
    > C++ offers no facilities for converting UTF-8 to a stream
    > of wide characters). Can you clarify your intention?


    I hope it's clearer now.

    Thanks to all for the help. K.
     
    Koulbak, May 23, 2005
    #15
