isspace

Discussion in 'C++' started by gervaz, Jan 28, 2010.

  1. gervaz

    gervaz Guest

    Hi all, is there a C++ function similar to isspace that can handle
    wchar_t characters? Does the regex library handle wchar_t?

    Thanks,
    Mattia
    gervaz, Jan 28, 2010
    #1

  2. gervaz

    gervaz Guest

    On Jan 28, 9:40 pm, Paavo Helde <> wrote:
    > gervaz <> wrote in news:198ffd0f-8a21-4d23-802f-
    > :
    >
    > > Hi all, is there a C++ function similar to isspace that can handle
    > > w_chars? Does the regex library handles w_chars?

    >
    > Yes, there is a template function declared in <locale> and named
    > std::isspace, curiously enough.
    >
    > There is no regex library in the official C++ standard yet I think. The
    > Boost regex library is fully templated and ought to support wchar_t as
    > well, but I have not tried this. According to Boost documentation one needs
    > a separate ICU library for full Unicode support though.
    >
    > hth
    > Paavo


    Well, take a look at my snippet:

    std::ifstream infile(argv[1]);

    std::string s;

    while (getline(infile, s))
    {
        s.erase(std::remove_if(s.begin(), s.end(), std::isspace), s.end());
        std::cout << s;
    }

    Using <locale> on VC++2008 I get an error reporting that std::isspace
    expects 2 arguments, and I still don't know whether a file containing
    Unicode characters can be handled correctly.
    The regex library I referred to is the one in the new C++0x version.

    Mattia
    gervaz, Jan 28, 2010
    #2

  3. gervaz

    James Kanze Guest

    On 28 Jan, 21:25, gervaz <> wrote:
    > Using locale on VC++2008 I've got an error reporting that
    > std::isspace expects 2 arguments,


    That's because std::isspace requires two arguments, the
    character to be tested, and the locale.

    > and still I don't know if the file contains unicode characters
    > can be correctly handles.


    The functions in <locale> are pretty useless, since they only
    handle single byte characters. The "approved" solution is to
    read into a wstring using wifstream (imbued with the
    appropriate locale), and use isspace (again with the appropriate
    locale) on the wchar_t in the wstring.
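
    A minimal sketch of the two-argument form, not code from the thread:
    the helper name strip_spaces is hypothetical, and a C++11 compiler is
    assumed for the lambda. Wrapping the call in a lambda supplies the
    locale argument and also avoids the overload ambiguity that arises when
    std::isspace is passed directly to std::remove_if.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Remove every character that the given locale classifies as whitespace.
// Uses the two-argument std::isspace template declared in <locale>.
std::wstring strip_spaces(std::wstring s, const std::locale& loc = std::locale())
{
    s.erase(std::remove_if(s.begin(), s.end(),
                           [&loc](wchar_t c) { return std::isspace(c, loc); }),
            s.end());
    return s;
}
```

    Which non-ASCII characters count as whitespace depends on the locale
    passed in, so the result for such characters may differ between systems.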

    --
    James Kanze
    James Kanze, Jan 28, 2010
    #3
  4. gervaz

    gervaz Guest

    On Jan 28, 11:46 pm, James Kanze <> wrote:
    >
    > The functions in <locale> are pretty useless, since they only
    > handle single byte characters.  The "approved" solution is to
    > read into a wstring using wifstream (embedded with the
    > appropriate locale), and use isspace (again with the appropriate
    > locale) on the wchar_t in the wstring.
    >
    > --
    > James Kanze


    Ok, well, suppose I want to use UTF-8 encoding, how do I specify it
    using locale? And where can I find a list of the possible locale
    encoding configurations (e.g. if I wanted to correctly decode a web
    page just by parsing the first bytes looking for 'charset')?

    Thanks, Mattia
    gervaz, Jan 28, 2010
    #4
  5. gervaz

    gervaz Guest

    On 29 Jan, 09:09, Paavo Helde <> wrote:
    > > Ok, well, suppose I want to use UTF-8 encoding, how do I specify it

    >
    > With UTF-8 one is using char, not wchar_t. Note that if char is a signed
    > type, then one must take care to cast char to unsigned char in places
    > where a non-negative value is expected.
    >
    > For historical reasons the locale and encoding stuff has been mixed up. Are
    > you more interested in locales or in encodings? Locales affect such stuff
    > as the character representing the decimal point in numbers, the look of
    > dates, whether V and W are sorted together or separately, and
    > whether Cyrillic characters are considered alphabetic characters or not.
    > Encoding is a fully different business, specifying for example how those
    > Cyrillic characters are encoded in the binary data, if at all.
    >
    > If you just want to translate different encodings, then you do not need
    > any locale stuff at all. When a web page comes in, you do not know if the
    > decimal point used in numbers therein is a dot or a comma, for example,
    > so strictly speaking you cannot set the correct locale for processing the
    > page. What you can do is to look at BOM markers and charset encoding, and
    > to translate the file from its charset to the encoding you are using
    > internally, for example. For that, again no locales are needed, but
    > instead one needs some kind of system-specific code or another library like
    > iconv.
    >
    > > using locale? And where can I find a list of the possible locale
    > > encoding configuration (e.g. if I wanted to correctly decode a web
    > > page just parsing the first bytes looking for 'charset')?

    >
    > http://www.iana.org/assignments/character-sets
    >
    > But you don't want to deal with this by yourself. Use a library like
    > iconv.
    >
    > hth
    > Paavo


    Ok, so suppose I want to split a Russian text into words, and the basic
    method looks at every character in order to decide whether it is a space.
    What do you suggest?
    gervaz, Jan 29, 2010
    #5
  6. gervaz

    James Kanze Guest

    On Jan 28, 11:54 pm, gervaz <> wrote:
    > Ok, well, suppose I want to use UTF-8 encoding, how do I
    > specify it using locale? And where can I find a list of the
    > possible locale encoding configuration (e.g. if I wanted to
    > correctly decode a web page just parsing the first bytes
    > looking for 'charset')?


    There are no standard names for locales -- you'll have to read
    your system documentation. Posix defines a standard *format*
    for names under Unix systems. But you'll still have to read the
    documentation to see what is present, *and* what the default
    encoding is, since if UTF-8 is the default, it may not be
    present in the name. (Actually, I can't find a definition of
    this format in the Posix standard. But it is common to Solaris,
    HP-UX, AIX and Linux, at least, and seems to be at least a de
    facto standard. The problem is that it doesn't necessarily
    represent the default encoding, so UTF-8 might be "en_US.utf8"
    or "en_US", the latter only if the default encoding is UTF-8.)

    --
    James Kanze
    James Kanze, Jan 30, 2010
    #6
  7. gervaz

    Stefan Ram Guest

    James Kanze <> writes:
    >There are no standard names for locales


    AFAIK, C90 defines a locale by the name of "C",
    which should also be visible from C++.
    Stefan Ram, Jan 30, 2010
    #7
  8. gervaz

    James Kanze Guest

    On Jan 29, 8:09 am, Paavo Helde <> wrote:

    [...]
    > > Ok, well, suppose I want to use UTF-8 encoding, how do I specify it


    > With UTF-8 one is using char, not wchar_t. Note that if char
    > is a signed type, then one must take care to cast char to
    > unsigned char in places where a non-negative value is
    > expected.


    He didn't make clear whether he meant internal or external
    encoding. One can use UTF-8 externally (and probably should for
    any new projects), and still use wchar_t and UTF-16 or UTF-32
    internally.

    > By historic reasons the locale and encoding stuff has been mixed up.


    The reasons aren't just historical. Functions like isalpha have
    to know the encoding if they are to work. Logically, of course,
    locale and encoding are, or should be, two completely separate
    concepts, but practically, at the technical level, that would
    mean specifying both a locale and an encoding for things like
    isalpha. (Note that the design of <locale> leaves a bit to be
    desired here, since it links isalpha purely to the ctype facet;
    logically, it should depend on both ctype and codecvt.
    Practically, however, I'll admit that I wouldn't like to
    implement a design that handled this correctly.)

    > Are you more interested in locales or in encodings? Locales
    > affect such stuff as the character of representing the
    > decimal point in numbers, look of the dates and whether V and
    > W are sorted together or separately, and whether cyrillic
    > characters are considered alphabetic characters or not.
    > Encoding is a fully different business, specifying for example
    > how those cyrillic characters are encoded in the binary data,
    > if at all.


    The character encoding does affect whether isalpha(0xE9) should
    return true (ISO 8859-1) or false (UTF-8).

    > If you just want to translate different encodings, then you do
    > not need any locale stuff at all. When a web page comes in,
    > you do not know if the decimal point used in numbers therein
    > is a dot or a comma, for example, so strictly speaking you
    > cannot set the correct locale for processing the page. What
    > you can do is to look at BOM markers and charset encoding, and
    > to translate the file from its charset to the encoding you are
    > using internally, for example. For that, again no locales are
    > needed, but instead one needs some kind os system-sepcific
    > code or other library like iconv.


    Strictly speaking, when a web page comes in, you don't even know
    how comma or dot are encoded in it. In practice, all of the
    codesets used in web pages have the first 128 values in common.
    And the header should be written using just those values until
    it's reached the point where it specifies the encoding. (Also
    in practice, a lot of headers don't bother to specify the
    encoding, so it's worthwhile to develop some pragmatic
    heuristics to guess it. If the data starts with a BOM, then
    it's Unicode, and the BOM will allow you to determine the
    format. If the data contains 0's in the first four bytes, it's
    almost certainly some format of UTF-16 or UTF-32, and you can
    determine which by the number and position of the zeros.
    Otherwise, I'd treat it as undetermined ASCII based until I
    encountered a byte value larger than 127---if that byte value
    was part of a legal UTF-8 code, I'd shift to UTF-8, otherwise to
    ISO-8859-1, but that's really just a guess.)
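
    The BOM part of this heuristic can be sketched as follows. This is an
    illustration only: the Encoding enum and the function name detect_bom
    are hypothetical, and the zero-byte and UTF-8 validity checks described
    above are not covered.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Unknown };

// Guess the encoding of a buffer from its byte order mark, if it has one.
Encoding detect_bom(const std::string& data)
{
    // Byte at index i as an unsigned value, or 0x100 (never a valid byte)
    // when the buffer is too short.
    const auto b = [&data](std::size_t i) -> unsigned {
        return i < data.size() ? static_cast<unsigned char>(data[i]) : 0x100u;
    };
    // UTF-32 LE must be tested before UTF-16 LE, because its BOM
    // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE). A UTF-16 LE
    // text whose first character is U+0000 would be misreported here.
    if (b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00) return Encoding::Utf32LE;
    if (b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF) return Encoding::Utf32BE;
    if (b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return Encoding::Utf8;
    if (b(0) == 0xFF && b(1) == 0xFE) return Encoding::Utf16LE;
    if (b(0) == 0xFE && b(1) == 0xFF) return Encoding::Utf16BE;
    return Encoding::Unknown;
}
```

    A result of Encoding::Unknown only means that no BOM was found; the
    data may still be UTF-8 or any other encoding.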

    > > using locale? And where can I find a list of the possible
    > > locale encoding configuration (e.g. if I wanted to correctly
    > > decode a web page just parsing the first bytes looking for
    > > 'charset')?


    > http://www.iana.org/assignments/character-sets


    But that doesn't tell you what the name of the locale on your
    system might be.

    --
    James Kanze
    James Kanze, Jan 30, 2010
    #8
  9. gervaz

    James Kanze Guest

    On Jan 29, 6:12 pm, Paavo Helde <> wrote:
    > gervaz <> wrote
    > innews::


    [...]
    > > Ok, so suppose I want to split a russian text into words and
    > > the base method look at every character in order to decide
    > > if a space is found, what do you suggest?


    > If you mean space as ASCII character 32, then I would use the text
    > encoded in UTF-8 and compare each byte with ' '.


    > However, if you mean any whitespace, then I would start by
    > finding out at unicode.org site if there are any non-ASCII
    > whitespace characters defined in the standard Russian locale.
    > If there are, and wchar_t on the given platform is wide enough
    > to represent all of them in a single wchar_t, then I could
    > encode the text as UTF-16 or UTF-32 as appropriate for wchar_t
    > on the given platform and use std::isspace<wchar_t>() with the
    > Russian locale.


    > Or I could keep the text in UTF-8 and use my own custom
    > function for checking for the whitespace, checking directly
    > for all Unicode whitespace characters as listed in
    > http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29,
    > this seems to me much less error-prone
    > than worrying if Russian locale and std::isspace are working
    > correctly on all platforms.


    FWIW: I have code floating around which implements all of the
    isxxx functions for UTF-8, using tables which are generated
    automatically from the UnicodeData.txt file. It's in my TODO
    list to get it up at my site, but I'm still really in the
    process of moving and getting reestablished in a new job in a
    new city in a new country (on a new computer as well), so I
    probably won't be getting around to it very soon.
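
    As a rough illustration of the custom-function approach quoted above
    (this is not the table-driven code mentioned in this post), the sketch
    below hard-codes the UTF-8 byte sequences of the code points that carry
    the Unicode White_Space property and splits a UTF-8 string on them. The
    names utf8_space_length and split_words are hypothetical, and the input
    is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the byte length of the Unicode whitespace character starting at
// s[i], or 0 if none starts there. Requires i < s.size().
std::size_t utf8_space_length(const std::string& s, std::size_t i)
{
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 == 0x20 || (b0 >= 0x09 && b0 <= 0x0D)) return 1;  // ASCII whitespace
    const unsigned char b1 = i + 1 < s.size() ? static_cast<unsigned char>(s[i + 1]) : 0;
    const unsigned char b2 = i + 2 < s.size() ? static_cast<unsigned char>(s[i + 2]) : 0;
    if (b0 == 0xC2 && (b1 == 0x85 || b1 == 0xA0)) return 2;  // U+0085, U+00A0
    if (b0 == 0xE1 && b1 == 0x9A && b2 == 0x80) return 3;    // U+1680
    if (b0 == 0xE2 && b1 == 0x80 &&
        ((b2 >= 0x80 && b2 <= 0x8A) ||                       // U+2000..U+200A
         b2 == 0xA8 || b2 == 0xA9 || b2 == 0xAF)) return 3;  // U+2028, U+2029, U+202F
    if (b0 == 0xE2 && b1 == 0x81 && b2 == 0x9F) return 3;    // U+205F
    if (b0 == 0xE3 && b1 == 0x80 && b2 == 0x80) return 3;    // U+3000
    return 0;
}

// Split a UTF-8 string into words separated by any Unicode whitespace.
std::vector<std::string> split_words(const std::string& s)
{
    std::vector<std::string> words;
    std::string current;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t n = utf8_space_length(s, i);
        if (n == 0) {
            current += s[i];  // ordinary byte: part of the current word
            ++i;
        } else {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
            i += n;           // skip the whole whitespace sequence
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

    Because the text stays in UTF-8, this works without any locale; the
    cost is that the list of whitespace code points is fixed in the code.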

    --
    James Kanze
    James Kanze, Jan 30, 2010
    #9
  10. gervaz

    James Kanze Guest

    On Jan 30, 12:05 pm, -berlin.de (Stefan Ram) wrote:
    > James Kanze <> writes:
    > >There are no standard names for locales


    > AFAIK, C90 defines a locale by the name of "C",
    > which should also be visible from C++.


    And Posix defines "POSIX". Neither of which are really useful
    for anything.

    --
    James Kanze
    James Kanze, Jan 30, 2010
    #10
  11. gervaz

    gervaz Guest

    On Jan 30, 1:15 pm, James Kanze <> wrote:
    > On Jan 30, 12:05 pm, -berlin.de (Stefan Ram) wrote:
    >
    > > James Kanze <> writes:
    > > >There are no standard names for locales

    > >   AFAIK, C90 defines a locale by the name of "C",
    > >   which should also be visible from C++.

    >
    > And Posix defines "POSIX".  Neither of which are really useful
    > for anything.
    >
    > --
    > James Kanze


    Ok, so I think that I will open my file specifying to use UTF-8
    encoding, but how can I do it in C++?
    gervaz, Jan 30, 2010
    #11
  12. gervaz

    gervaz Guest

    On Jan 31, 10:39 am, Paavo Helde <> wrote:
    > >> Ok, so I think that I will open my file specifying to use UTF-8
    > >> encoding, but how can I do it in C++?

    >
    > > You can open it as a narrow stream and read in as binary UTF-8, or
    > > (maybe) you can open it as a wide stream and get an automatic
    > > translation from UTF-8 to wchar_t. The following example assumes that
    > > you have a file test1.utf8 containing valid UTF-8 text. It reads the
    > > file in as a wide stream and prints out the numeric values of all
    > > wchar_t characters.

    >
    > > #include <iostream>
    > > #include <fstream>
    > > #include <locale>
    > > #include <string>

    >
    > > int main() {
    > >     std::wifstream is;
    > >     const std::locale filelocale("en_US.UTF8");
    > >     is.imbue(filelocale);
    > >     is.open("test1.utf8");

    >
    > >     std::wstring s;
    > >     while(std::getline(is, s)) {
    > >         for (std::wstring::size_type j=0; j<s.length(); ++j) {
    > >             std::cout << s[j] << " ";
    > >         }
    > >         std::cout << "\n";
    > >     }
    > > }

    >
    > > (Tested on Linux with a recent gcc, I am not too sure if this works on
    > > Windows. First, wchar_t in MSVC is too narrow for real Unicode, at
    > > best one might get UTF-16 as a result.)

    >
    > For curiosity, I tested this also on Windows with MSVC9, and as expected
    > it did not work, the locale construction immediately threw an exception
    > (bad locale name). Neither did any alterations work ("english.UTF8",
    > ".UTF8", ".utf-8", ".65001").
    >
    > Thus, if one wants any portability it seems the best approach currently
    > is still to read in binary UTF-8 and perform any needed conversions by
    > hand.
    >
    > Paavo


    Under Windows, you have to use const std::locale
    filelocale("English_Australia.1252") according to
    http://docs.moodle.org/en/Table_of_locales.
    I've tested it in VC++08 and it works. Any suggestion on how to handle
    the difference between the two naming schemes?

    Thanks, Mattia
    gervaz, Jan 31, 2010
    #12
  13. gervaz

    James Kanze Guest

    On Jan 31, 9:39 am, Paavo Helde <> wrote:
    > Paavo Helde <> wrote
    > innews:Xns9D116950C4paavo256@216.196.109.131:


    [...]
    > >> Ok, so I think that I will open my file specifying to use UTF-8
    > >> encoding, but how can I do it in C++?


    > > You can open it as a narrow stream and read in as binary
    > > UTF-8, or (maybe) you can open it as a wide stream and get
    > > an automatic translation from UTF-8 to wchar_t. The
    > > following example assumes that you have a file test1.utf8
    > > containing valid UTF-8 text. It reads the file in as a wide
    > > stream and prints out the numeric values of all wchar_t
    > > characters.


    > > #include <iostream>
    > > #include <fstream>
    > > #include <locale>
    > > #include <string>


    > > int main() {
    > > std::wifstream is;
    > > const std::locale filelocale("en_US.UTF8");


    The above line supposes 1) that you're on a Unix platform
    (because it uses the Unix conventions for naming locales), and
    2) that the "en_US.UTF8" locale has been installed---under that
    name. (I've worked on a lot of systems where this was not the
    case.)

    > > is.imbue(filelocale);
    > > is.open("test1.utf8");


    > > std::wstring s;
    > > while(std::getline(is, s)) {
    > > for (std::wstring::size_type j=0; j<s.length(); ++j) {
    > > std::cout << s[j] << " ";
    > > }
    > > std::cout << "\n";
    > > }
    > > }


    > > (Tested on Linux with a recent gcc, I am not too sure if
    > > this works on Windows. First, wchar_t in MSVC is too narrow
    > > for real Unicode, at best one might get UTF-16 as a result.)


    UTF-16 is "real Unicode". Just like UTF-8.

    > For curiosity, I tested this also on Windows with MSVC9, and
    > as expected it did not work, the locale construction
    > immediately threw an exception (bad locale name). Neither did
    > any alterations work ("english.UTF8", ".UTF8", ".utf-8",
    > ".65001").


    That's because Windows uses different conventions for naming
    locales. (Windows Vista and later claim that names conforming
    to RFC 4646 are used, see
    http://msdn.microsoft.com/en-us/library/dd373814(VS.85).aspx.
    Except that RFC 4646 doesn't seem to contain information
    concerning the character encoding. I'm guessing that Windows
    would use the code page for this---65001 for UTF-8. But I don't
    know how it has to be added to the "en-US".)

    > Thus, if one wants any portability it seems the best approach
    > currently is still to read in binary UTF-8 and perform any
    > needed conversions by hand.


    It should be sufficient to find out how the different locales are
    named for each system, and read this information in from some
    sort of configuration file.
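
    One way this might look, as a sketch under assumptions: the list of
    candidate names would come from the configuration file (parsing it is
    not shown), and the function name first_available_locale is
    hypothetical. The std::locale constructor throws std::runtime_error
    when a name is not recognised, so each candidate can simply be tried in
    turn.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Try each candidate name in order and return the first locale that the
// runtime accepts; fall back to the classic "C" locale if none works.
std::locale first_available_locale(const std::vector<std::string>& names)
{
    for (const std::string& name : names) {
        try {
            return std::locale(name.c_str());
        } catch (const std::runtime_error&) {
            // Name not recognised on this system; try the next one.
        }
    }
    return std::locale::classic();
}
```

    A caller could pass, for example, the names "en_US.UTF8" and
    "en_US.utf8" and imbue the stream with the result; which names are
    accepted depends entirely on the system.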

    --
    James Kanze
    James Kanze, Jan 31, 2010
    #13
  14. gervaz

    gervaz Guest

    On Jan 31, 2:17 pm, Paavo Helde <> wrote:
    > > Under Windows, you have to use const std::locale filelocale
    > > ("English_Australia.1252") according to
    > >http://docs.moodle.org/en/Table_of_locales, I've tested it in VC++08
    > > and it works. Any suggestion in how to handle the dualism?

    >
    > Did you actually test the results? It seems this is reading UTF-8 in
    > unaltered, so there is no point to use a wide stream in the first place.
    >
    > Paavo


    Well, yeah, although using an example file like
    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt and using
    plain std::string, std::ifstream and std::cout everything works fine,
    if I put a 'w' in front of all these types the output fails,
    producing:

    UTF-8 encoded sample plain-text file
    Γ

    Why??
    gervaz, Jan 31, 2010
    #14
  15. gervaz

    Öö Tiib Guest

    On Feb 1, 12:27 am, gervaz <> wrote:
    > Well, yeah. Using an example file like
    > http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt with
    > plain std::string, std::ifstream and std::cout, everything works fine,
    > but if I put the 'w' in front of all these types the output fails,
    > producing:
    >
    > UTF-8 encoded sample plain-text file
    > Γ
    >
    > Why??


    Because C++ does not convert from UTF-8 to UTF-16 just like that.
    UTF-8 fits into std::string. std::wstring typically holds UTF-16 when
    sizeof(wchar_t) is 2 and UTF-32 when sizeof(wchar_t) is 4. Support for
    portable character handling is weak in the standard library, not sure
    why. The POSIX functions do not help much either, since most
    implementations were made before Unicode was defined.

    If you really want to convert, use the platform's support. Most
    platforms support Unicode (for example MultiByteToWideChar() on
    Windows). If you want a portable solution, use a library that
    provides conversions, such as ICU: http://site.icu-project.org/
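
For small jobs, the conversion can also be done by hand, as suggested earlier in the thread. The following is only a minimal sketch, not a replacement for ICU: the function name utf8_to_utf32 is made up for illustration, and it assumes a C++11 compiler for char32_t and std::u32string (on older compilers a vector of unsigned long would serve the same purpose).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string utf8_to_utf32(const std::string& in) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
    // used to reject overlong encodings.
    static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates and values past U+10FFFF.
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Read the file with a plain std::ifstream and std::getline into a std::string, then pass each line to the function. The result does not depend on sizeof(wchar_t) or on which locale names the platform accepts.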
    Öö Tiib, Feb 1, 2010
    #15
  16. gervaz

    Jorgen Grahn Guest

    On Fri, 2010-01-29, Paavo Helde wrote:
    ....
    > Or I could keep the text in UTF-8 and use my own custom function for
    > checking for the whitespace, checking directly for all Unicode whitespace
    > characters as listed in
    > http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29, this
    > seems to me much less error-prone than worrying if Russian locale and
    > std::isspace are working correctly on all platforms.


    Worrying? "I don't support doing analysis of Russian text on a
    platform with broken Russian locales" sounds like something you can
    happily say.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, Feb 3, 2010
    #16
  17. gervaz

    gervaz Guest

    On 3 Feb, 09:56, Jorgen Grahn <> wrote:
    > On Fri, 2010-01-29, Paavo Helde wrote:
    >
    > ...
    >
    > > Or I could keep the text in UTF-8 and use my own custom function for
    > > checking for the whitespace, checking directly for all Unicode whitespace
    > > characters as listed in
    > > http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29, this
    > > seems to me much less error-prone than worrying if Russian locale and
    > > std::isspace are working correctly on all platforms.

    >
    > Worrying? "I don't support doing analysis of Russian text on a
    > platform with broken Russian locales" sounds like something you can
    > happily say.
    >
    > /Jorgen
    >
    > --
    >   // Jorgen Grahn <grahn@  Oo  o.   .  .
    > \X/     snipabacken.se>   O  o   .


    Ok, to summarize things learned so far:
    UTF-8 can be handled by simply using std::string (hence char).
    UTF-16 and UTF-32 can be handled by std::wstring and wchar_t, but not
    reliably, because the size of wchar_t is implementation-specific.
    Now, something like:

    std::ifstream is;
    const std::locale filelocale("Russian_Russia.1251");
    is.imbue(filelocale);
    is.open(argv[1]);

    std::string s;
    while (std::getline(is, s))
    {
        for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
        {
            std::cout << *it;
            if (std::isspace(*it, filelocale))
                std::cout << "space found!" << std::endl;
        }
        std::cout << std::endl;
    }

    Works if we give a Russian text as input (although cout isn't
    able to display the Russian characters correctly).
    If we are under Linux, something like

    std::locale filelocale;
    try
    {
        filelocale = std::locale("Russian_Russia.1251");
    }
    catch (...)
    {
        try
        {
            filelocale = std::locale("ru_utf8");
        }
        catch (...)
        {
            throw;
        }
    }

    Could that work? Any suggestions? (I don't even know the specific
    exception that has to be caught. Just experimenting...)
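
For reference, the std::locale constructor that takes a name throws std::runtime_error when the name is not valid on the platform, so that is the exception to catch. A minimal sketch of the fallback idea follows; the helper name first_available_locale is made up for illustration, and the candidate names in the usage note are assumptions about what a given system has installed.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: return the first locale in `names` that this
// platform can construct. std::locale's named constructor throws
// std::runtime_error when the name is unknown, so each failure is
// caught and the next candidate is tried.
std::locale first_available_locale(const std::vector<std::string>& names) {
    for (const std::string& name : names) {
        try {
            return std::locale(name.c_str());
        } catch (const std::runtime_error&) {
            // Name not supported here; fall through to the next one.
        }
    }
    throw std::runtime_error("none of the candidate locale names is available");
}
```

A call such as first_available_locale({"Russian_Russia.1251", "ru_RU.UTF-8"}) would try the Windows-style name first and a POSIX-style name second. Note that those two locales imply different encodings (code page 1251 versus UTF-8), so the input file still has to match whichever one is picked.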

    Thanks, Mattia
    gervaz, Feb 3, 2010
    #17
  18. gervaz

    Jorgen Grahn Guest

    On Wed, 2010-02-03, Paavo Helde wrote:
    > Jorgen Grahn <> wrote in
    > news::
    >
    >> On Fri, 2010-01-29, Paavo Helde wrote:
    >> ...
    >>> Or I could keep the text in UTF-8 and use my own custom function for
    >>> checking for the whitespace, checking directly for all Unicode
    >>> whitespace characters as listed in
    >>> http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29, this
    >>> seems to me much less error-prone than worrying if Russian locale and
    >>> std::isspace are working correctly on all platforms.

    >>
    >> Worrying? "I don't support doing analysis of Russian text on a
    >> platform with broken Russian locales" sounds like something you can
    >> happily say.

    >
    > Happily to who? My boss? Or the customer?


    You can't *always* say it, but you cannot always bend over backwards
    either in an effort to please some minority. Your posting seemed to
    imply that you should always implement your own logic, in case some
    target platform happens to be broken.
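
For scale, the "own logic" being discussed is small. The sketch below is one possible shape for it, under assumptions: the names is_unicode_space and strip_spaces are made up, the text is assumed to be already decoded to code points, char32_t and std::u32string need a C++11 compiler, and the list follows the code points carrying the Unicode White_Space property.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true if `cp` is one of the code points with the
// Unicode White_Space property. No locale is consulted.
bool is_unicode_space(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D) ||  // tab, LF, VT, FF, CR
           cp == 0x0020 ||                    // space
           cp == 0x0085 ||                    // next line
           cp == 0x00A0 ||                    // no-break space
           cp == 0x1680 ||                    // ogham space mark
           (cp >= 0x2000 && cp <= 0x200A) ||  // en quad .. hair space
           cp == 0x2028 || cp == 0x2029 ||    // line / paragraph separator
           cp == 0x202F ||                    // narrow no-break space
           cp == 0x205F ||                    // medium mathematical space
           cp == 0x3000;                      // ideographic space
}

// Same erase/remove_if idiom as in the original question, applied to
// decoded code points instead of raw bytes.
std::u32string strip_spaces(std::u32string s) {
    s.erase(std::remove_if(s.begin(), s.end(), is_unicode_space), s.end());
    return s;
}
```

Whether this is worth maintaining instead of relying on the platform's locales is exactly the trade-off argued about above.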

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, Feb 9, 2010
    #18