ifstream >> string with UTF-8?

Discussion in 'C++' started by Wolfnoliir, Sep 9, 2009.

  1. Wolfnoliir

    Wolfnoliir Guest

    Hi,
    Here is a question that must come up all the time, but I can't find a
    solution.

    I would like to read a word or a line from a UTF-8 encoded file into a
    string, but I get '�'s ('?') instead.
    The strange thing is, this works fine from standard input:
    cin >> someString; //works fine
    cout << someString;
    but
    someIfStream >> someString;
    cout << someString;
    prints question marks instead of accented characters!
    (I'm using Linux and g++ 4.3.3)

    Does anyone have an idea why that is or a solution to the problem?
     
    Wolfnoliir, Sep 9, 2009
    #1

  2. Wolfnoliir wrote:
    > I would like to read a word or a line from a UTF-8 encoded file into a
    > string, but I get '�'s ('?') instead.
    > The strange thing is, this works fine from standard input:
    > cin >> someString; //works fine
    > cout << someString;
    > but
    > someIfStream >> someString;
    > cout << someString;
    > prints question marks instead of accented characters!
    > (I'm using Linux and g++ 4.3.3)
    >
    > Does anyone have an idea why that is or a solution to the problem?


    Use your "working" 'cin' solution, but redirect the input to be from
    your file:

    your_test_app < file_with_utf8

    and see if there is any difference. As to the cause, my guess would be
    that your file stream gets dissynchronised from the encoding POV.

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Sep 9, 2009
    #2

  3. Wolfnoliir

    Wolfnoliir Guest

    Victor Bazarov wrote:
    >
    > Use your "working" 'cin' solution, but redirect the input to be from
    > your file:
    >
    > your_test_app < file_with_utf8
    >
    > and see if there is any difference. As to the cause, my guess would be
    > that your file stream gets dissynchronised from the encoding POV.
    >
    > V


    Indeed I get the same result when I do:
    your_test_app < file_with_utf8

    I'm not actually sure my file is utf-8. It probably isn't considering
    that when I do this:
    echo éoiàuè > txt
    your_test_app < txt
    it prints out correctly.

    But how can I know what encoding my file is in?
    Once I know that I think I can just convert it with iconv.
     
    Wolfnoliir, Sep 9, 2009
    #3
  4. Wolfnoliir wrote:
    > Victor Bazarov wrote:
    >>
    >> Use your "working" 'cin' solution, but redirect the input to be from
    >> your file:
    >>
    >> your_test_app < file_with_utf8
    >>
    >> and see if there is any difference. As to the cause, my guess would
    >> be that your file stream gets dissynchronised from the encoding POV.
    >>
    >> V

    >
    > Indeed I get the same result when I do:
    > your_test_app < file_with_utf8
    >
    > I'm not actually sure my file is utf-8.


    Uh... Then why are you trying to treat it as such?

    > It probably isn't considering
    > that when I do this:
    > echo éoiàuè > txt
    > your_test_app < txt
    > it prints out correctly.


    But you said that 'cin' worked OK, while your ifstream attempt didn't.
    You need to find out what is different with your ifstream code compared
    to the 'cin'.

    >
    > But how can I know what encoding my file is in?


    Not sure it's a C++ question, to be honest. A file is a file, it
    contains bytes. The encoding is something you think up, apply, and it's
    not part of the file itself, AFAIUI. You get different results based on
    different encodings you apply. The "correctness" of those results is
    also in your head only.
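
    To make the "a file contains bytes" point concrete, here is a minimal
    sketch that prints the raw bytes of a string in hex. The helper name
    to_hex is made up for this illustration. Under the assumption that the
    text is either UTF-8 or ISO-8859-1, 'é' appears as the two bytes C3 A9
    in the first case and as the single byte E9 in the second.

```cpp
#include <string>

// Hypothetical helper: render every byte of `s` as two uppercase hex
// digits, separated by single spaces. It does not interpret the bytes
// in any encoding; it only shows what is actually stored.
std::string to_hex(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}
```

    Calling to_hex on a word read with someIfStream >> someString and
    printing the result shows which bytes the file really contains,
    independently of how the terminal chooses to display them.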

    > Once I know that I think I can just convert it with iconv.


    What's 'iconv'?

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Sep 9, 2009
    #4
  5. Wolfnoliir

    Wolfnoliir Guest

    Victor Bazarov wrote:
    > Wolfnoliir wrote:
    >> Victor Bazarov wrote:
    >>>
    >>> Use your "working" 'cin' solution, but redirect the input to be from
    >>> your file:
    >>>
    >>> your_test_app < file_with_utf8
    >>>
    >>> and see if there is any difference. As to the cause, my guess would
    >>> be that your file stream gets dissynchronised from the encoding POV.
    >>>
    >>> V

    >>
    >> Indeed I get the same result when I do:
    >> your_test_app < file_with_utf8
    >>
    >> I'm not actually sure my file is utf-8.

    >
    > Uh... Then why are you trying to treat it as such?
    >
    >> It probably isn't considering
    >> that when I do this:
    >> echo éoiàuè > txt
    >> your_test_app < txt
    >> it prints out correctly.

    >
    > But you said that 'cin' worked OK, while your ifstream attempt didn't.
    > You need to find out what is different with your ifstream code compared
    > to the 'cin'.


    There's nothing different. As I said in my last message, I was wrong.
    It's just that my file has a different encoding from the one my
    terminal uses for standard input (probably utf-8).

    >
    >>
    >> But how can I know what encoding my file is in?

    >
    > Not sure it's a C++ question, to be honest. A file is a file, it
    > contains bytes. The encoding is something you think up, apply, and it's
    > not part of the file itself, AFAIUI. You get different results based on
    > different encodings you apply. The "correctness" of those results is
    > also in your head only.
    >
    >> Once I know that I think I can just convert it with iconv.

    >
    > What's 'iconv'?


    Iconv is a Unix utility that converts a text file from one encoding
    (e.g. utf-16) to another (e.g. utf-8).
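
    For the simplest case, the same kind of conversion can be sketched
    directly in C++. This is only an illustration under a loud assumption:
    the input is ISO-8859-1 (Latin-1), where every byte value equals its
    Unicode code point. It is not how iconv itself works, and the function
    name latin1_to_utf8 is invented for the example. Other encodings need a
    real conversion library.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
// ASSUMPTION: the input really is ISO-8859-1, so each byte is the code
// point U+0000..U+00FF of the character it represents.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            // ASCII is encoded identically in both encodings.
            out += static_cast<char>(c);
        } else {
            // U+0080..U+00FF becomes a two-byte sequence 110xxxxx 10xxxxxx.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

    With this, a word read from a Latin-1 file could be passed through
    latin1_to_utf8 before being written to cout on a UTF-8 terminal.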

    >
    > V


    If nobody knows of a utility to find out what encoding my file is
    using, I'll just go and look somewhere else then.

    Thanks for your interest in my problem.
     
    Wolfnoliir, Sep 9, 2009
    #5
  6. Wolfnoliir wrote:
    >If nobody knows of a utility to find out what encoding my file is
    >using, I'll just go and look somewhere else then.


    Is the 'file' command any help?

    >
    >Thanks for your interest in my problem.


    --
    Richard Herring
     
    Richard Herring, Sep 9, 2009
    #6
  7. Wolfnoliir

    Wolfnoliir Guest

    Richard Herring wrote:
    > Is the 'file' command any help?


    $ file dic-fr.txt
    dic-fr.txt: ISO-8859 text

    Indeed it is. Thank you very much!
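
    A similar check can be roughed out in C++ when a program has to decide
    for itself whether its input could be UTF-8. The sketch below is a
    simplified structural test, not the algorithm the 'file' command uses,
    and the name looks_like_utf8 is made up here. It verifies lead bytes
    and continuation bytes only, so a few ill-formed sequences (some
    overlong forms, surrogates) still pass.

```cpp
#include <string>

// Hypothetical helper: return true if `s` has the byte structure of UTF-8.
// SIMPLIFICATION: only lead-byte ranges and continuation bytes are checked.
bool looks_like_utf8(const std::string& s) {
    const std::string::size_type n = s.size();
    std::string::size_type i = 0;
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::string::size_type extra;  // continuation bytes expected
        if (c < 0x80)                   extra = 0;  // ASCII
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;  // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;  // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;  // sequence cut off at the end
        for (std::string::size_type k = 1; k <= extra; ++k) {
            unsigned char next = static_cast<unsigned char>(s[i + k]);
            if ((next & 0xC0) != 0x80) return false;  // must be 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

    Text in an ISO-8859 encoding that contains accented letters almost
    always fails this test, because a byte such as E9 is then followed by
    an ordinary ASCII letter rather than by continuation bytes. Pure ASCII
    passes, since it is valid in both encodings.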
     
    Wolfnoliir, Sep 9, 2009
    #7