isspace

gervaz

Hi all, is there a C++ function similar to isspace that can handle
wchar_t characters? Does the regex library handle wchar_t as well?

Thanks,
Mattia
 
gervaz

Yes, there is a template function declared in <locale> and named
std::isspace, curiously enough.

There is no regex library in the official C++ standard yet, I think. The
Boost regex library is fully templated and ought to support wchar_t as
well, but I have not tried this. According to Boost documentation one needs
a separate ICU library for full Unicode support though.

hth
Paavo

Well, take a look at my snippet:

std::ifstream infile(argv[1]);

std::string s;

while (getline(infile, s))
{
    s.erase(std::remove_if(s.begin(), s.end(), std::isspace), s.end());
    std::cout << s;
}

Using <locale> on VC++ 2008 I get an error reporting that std::isspace
expects 2 arguments, and I still don't know whether a file containing
Unicode characters can be handled correctly.
The regex library I was referring to is the one in the new C++0x standard.

Mattia
 
James Kanze

Using locale on VC++2008 I've got an error reporting that
std::isspace expects 2 arguments,

That's because std::isspace requires two arguments, the
character to be tested, and the locale.
and still I don't know if the file contains unicode characters
can be correctly handles.

The functions in <locale> are pretty useless, since they only
handle single byte characters. The "approved" solution is to
read into a wstring using wifstream (imbued with the
appropriate locale), and use isspace (again with the appropriate
locale) on the wchar_t in the wstring.
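
As a rough illustration of the two-argument form, here is a minimal sketch that wraps std::isspace in a lambda so that std::remove_if can call it. It assumes a C++11 compiler; the helper name strip_spaces is made up for this example, and which characters count as whitespace depends entirely on the locale that is passed in.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper (not from the thread): remove every character that the
// given locale classifies as whitespace. std::isspace from <locale> is a
// template taking the character and the locale, so it is wrapped in a lambda
// before being handed to std::remove_if.
std::wstring strip_spaces(std::wstring s, const std::locale& loc)
{
    s.erase(std::remove_if(s.begin(), s.end(),
                           [&loc](wchar_t c) { return std::isspace(c, loc); }),
            s.end());
    return s;
}
```

With the classic "C" locale only the ASCII whitespace characters are removed; a named locale may classify more characters as whitespace, if the system has it installed.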
 
gervaz


Ok, well, suppose I want to use UTF-8 encoding, how do I specify it
using locale? And where can I find a list of the possible locale
encoding configurations (e.g. if I wanted to correctly decode a web
page just by parsing the first bytes looking for 'charset')?

Thanks, Mattia
 
gervaz

Ok, well, suppose I want to use UTF-8 encoding, how do I specify it

With UTF-8 one is using char, not wchar_t. Note that if char is a signed
type, then one must take care to cast char to unsigned char in places
where a non-negative value is expected.
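
A minimal sketch of that cast, assuming the single-argument std::isspace from <cctype> and the default "C" locale, so only ASCII whitespace is recognised. The helper name strip_ascii_spaces is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: drop the ASCII whitespace bytes from a UTF-8 string
// held in a std::string. Every byte is converted to unsigned char before it
// reaches std::isspace, because passing a negative value (any UTF-8 lead or
// continuation byte where char is signed) is undefined behaviour.
std::string strip_ascii_spaces(std::string s)
{
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](char c) {
                               return std::isspace(static_cast<unsigned char>(c)) != 0;
                           }),
            s.end());
    return s;
}
```

The multi-byte sequences pass through untouched, since none of their bytes is below 0x80.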

For historical reasons the locale and encoding stuff has been mixed up. Are
you more interested in locales or in encodings? Locales affect such stuff
as the character representing the decimal point in numbers, the look of
dates, whether V and W are sorted together or separately, and
whether Cyrillic characters are considered alphabetic characters or not.
Encoding is a fully different business, specifying for example how those
Cyrillic characters are encoded in the binary data, if at all.

If you just want to translate different encodings, then you do not need
any locale stuff at all. When a web page comes in, you do not know if the
decimal point used in numbers therein is a dot or a comma, for example,
so strictly speaking you cannot set the correct locale for processing the
page. What you can do is to look at BOM markers and charset encoding, and
to translate the file from its charset to the encoding you are using
internally, for example. For that, again no locales are needed, but
instead one needs some kind of system-specific code or another library like
iconv.
using locale? And where can I find a list of the possible locale
encoding configuration (e.g. if I wanted to correctly decode a web
page just parsing the fist bytes looking for 'charset')?

http://www.iana.org/assignments/character-sets

But you don't want to deal with this by yourself. Use a library like
iconv.

hth
Paavo

Ok, so suppose I want to split a Russian text into words, and the basic
method looks at every character in order to decide whether it is a space.
What do you suggest?
 
James Kanze

Ok, well, suppose I want to use UTF-8 encoding, how do I
specify it using locale? And where can I find a list of the
possible locale encoding configuration (e.g. if I wanted to
correctly decode a web page just parsing the first bytes
looking for 'charset')?

There are no standard names for locales -- you'll have to read
your system documentation. Posix defines a standard *format*
for names under Unix systems. But you'll still have to read the
documentation to see what is present, *and* what the default
encoding is, since if UTF-8 is the default, it may not be
present in the name. (Actually, I can't find a definition of
this format in the Posix standard. But it is common to Solaris,
HP-UX, AIX and Linux, at least, and seems to be at least a de
facto standard. The problem is that it doesn't necessarily
represent the default encoding, so UTF-8 might be "en_US.utf8"
or "en_US", the latter only if the default encoding is UTF-8.)
 
James Kanze

[...]
With UTF-8 one is using char, not wchar_t. Note that if char
is a signed type, then one must take care to cast char to
unsigned char in places where a non-negative value is
expected.

He didn't make clear whether he meant internal or external
encoding. One can use UTF-8 externally (and probably should for
any new projects), and still use wchar_t and UTF-16 or UTF-32
internally.
By historic reasons the locale and encoding stuff has been mixed up.

The reasons aren't just historical. Functions like isalpha have
to know the encoding if they are to work. Logically, of course,
locale and encoding are, or should be, two completely separate
concepts, but practically, at the technical level, that would
mean specifying both a locale and an encoding for things like
isalpha. (Note that the design of <locale> leaves a bit to be
desired here, since it links isalpha purely to the ctype facet;
logically, it should depend on both ctype and codecvt.
Practically, however, I'll admit that I wouldn't like to
implement a design that handled this correctly.)
Are you more interested in locales or in encodings? Locales
affect such stuff as the character of representing the
decimal point in numbers, look of the dates and whether V and
W are sorted together or separately, and whether cyrillic
characters are considered alphabetic characters or not.
Encoding is a fully different business, specifying for example
how those cyrillic characters are encoded in the binary data,
if at all.

The character encoding does affect whether isalpha(0xE9) should
return true (ISO 8859-1) or false (UTF-8).
If you just want to translate different encodings, then you do
not need any locale stuff at all. When a web page comes in,
you do not know if the decimal point used in numbers therein
is a dot or a comma, for example, so strictly speaking you
cannot set the correct locale for processing the page. What
you can do is to look at BOM markers and charset encoding, and
to translate the file from its charset to the encoding you are
using internally, for example. For that, again no locales are
needed, but instead one needs some kind of system-specific
code or other library like iconv.

Strictly speaking, when a web page comes in, you don't even know
how comma or dot are encoded in it. In practice, all of the
codesets used in web pages have the first 128 values in common.
And the header should be written using just those values until
it's reached the point where it specifies the encoding. (Also
in practice, a lot of headers don't bother to specify the
encoding, so it's worthwhile to develop some pragmatic
heuristics to guess it. If the data starts with a BOM, then
it's Unicode, and the BOM will allow you to determine the
format. If the data contains 0's in the first four bytes, it's
almost certainly some format of UTF-16 or UTF-32, and you can
determine which by the number and position of the zeros.
Otherwise, I'd treat it as undetermined ASCII based until I
encountered a byte value of 128 or larger---if that byte value
was part of a legal UTF-8 code, I'd shift to UTF-8, otherwise to
ISO-8859-1, but that's really just a guess.)
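
The guessing heuristic described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names Encoding, utf8_sequence_at and guess_encoding are invented, the zero-byte test only looks at the first four bytes, and the UTF-8 check is deliberately simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Encoding { Ascii, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Latin1 };

// True if a plausible multi-byte UTF-8 sequence starts at d[i]. Simplified:
// overlong forms and encoded surrogates are not rejected.
bool utf8_sequence_at(const std::string& d, std::size_t i)
{
    assert(i < d.size());
    const unsigned char lead = static_cast<unsigned char>(d[i]);
    std::size_t extra = 0;
    if (lead >= 0xC2 && lead <= 0xDF) extra = 1;
    else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;
    else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;
    else return false;
    if (i + extra >= d.size()) return false;          // sequence is cut off
    for (std::size_t k = 1; k <= extra; ++k)
        if ((static_cast<unsigned char>(d[i + k]) & 0xC0) != 0x80) return false;
    return true;
}

// Hypothetical helper: guess the encoding of raw bytes, in the order
// BOM, zero bytes, first byte with the high bit set.
Encoding guess_encoding(const std::string& d)
{
    // Byte at position i, or 0x100 ("no such byte") past the end.
    auto b = [&d](std::size_t i) -> unsigned {
        return i < d.size() ? static_cast<unsigned char>(d[i]) : 0x100u;
    };

    // 1. Byte order marks. UTF-32LE is tested before UTF-16LE because both
    //    start with FF FE.
    if (b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return Encoding::Utf8;
    if (b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00) return Encoding::Utf32LE;
    if (b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF) return Encoding::Utf32BE;
    if (b(0) == 0xFF && b(1) == 0xFE) return Encoding::Utf16LE;
    if (b(0) == 0xFE && b(1) == 0xFF) return Encoding::Utf16BE;

    // 2. No BOM: number and position of zero bytes among the first four.
    if (d.size() >= 4) {
        const bool z0 = b(0) == 0, z1 = b(1) == 0, z2 = b(2) == 0, z3 = b(3) == 0;
        if (z0 && z1 && z2 && !z3) return Encoding::Utf32BE;   // 00 00 00 xx
        if (!z0 && z1 && z2 && z3) return Encoding::Utf32LE;   // xx 00 00 00
        if (z0 && !z1 && z2 && !z3) return Encoding::Utf16BE;  // 00 xx 00 yy
        if (!z0 && z1 && !z2 && z3) return Encoding::Utf16LE;  // xx 00 yy 00
    }

    // 3. ASCII based until the first byte >= 0x80 decides between UTF-8 and
    //    ISO-8859-1.
    for (std::size_t i = 0; i < d.size(); ++i)
        if (static_cast<unsigned char>(d[i]) >= 0x80)
            return utf8_sequence_at(d, i) ? Encoding::Utf8 : Encoding::Latin1;
    return Encoding::Ascii;
}
```

As with the heuristic itself, the result is only a guess; a declared charset in the header should take precedence when there is one.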

But that doesn't tell you what the name of the locale on your
system might be.
 
James Kanze


[...]
If you mean space as ASCII character 32, then I would use the text
encoded in UTF-8 and compare each byte with ' '.
However, if you mean any whitespace, then I would start by
finding out at unicode.org site if there are any non-ASCII
whitespace characters defined in the standard Russian locale.
If there are, and wchar_t on the given platform is wide enough
to represent all of them in a single wchar_t, then I could
encode the text as UTF-16 or UTF-32 as appropriate for wchar_t
on the given platform and use std::isspace<wchar_t>() with the
Russian locale.
Or I could keep the text in UTF-8 and use my own custom
function for checking for the whitespace, checking directly
for all Unicode whitespace characters as listed in
http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29.
This seems to me much less error-prone
than worrying if Russian locale and std::isspace are working
correctly on all platforms.

FWIW: I have code floating around which implements all of the
isxxx functions for UTF-8, using tables which are generated
automatically from the UnicodeData.txt file. It's in my TODO
list to get it up at my site, but I'm still really in the
process of moving and getting reestablished in a new job in a
new city in a new country (on a new computer as well), so I
probably won't be getting around to it very soon.
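
In the meantime, the "custom function on UTF-8" idea quoted above can be sketched without any tables for the whitespace case alone. This is not the table-driven code mentioned here; it is a hand-written illustration that matches the UTF-8 byte sequences of the characters with the Unicode White_Space property, and it assumes the input is valid UTF-8. The names utf8_space_length and split_words are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the Unicode White_Space character that starts at s[i],
// or 0 if no whitespace character starts there.
std::size_t utf8_space_length(const std::string& s, std::size_t i)
{
    // Byte at position k, or 0 past the end (0 never matches below).
    auto at = [&s](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0u;
    };
    const unsigned b0 = at(i), b1 = at(i + 1), b2 = at(i + 2);

    if ((b0 >= 0x09 && b0 <= 0x0D) || b0 == 0x20) return 1;     // TAB..CR, SPACE
    if (b0 == 0xC2 && (b1 == 0x85 || b1 == 0xA0)) return 2;     // U+0085, U+00A0
    if (b0 == 0xE1 && b1 == 0x9A && b2 == 0x80) return 3;       // U+1680
    if (b0 == 0xE2 && b1 == 0x80 &&
        ((b2 >= 0x80 && b2 <= 0x8A) ||                          // U+2000..U+200A
         b2 == 0xA8 || b2 == 0xA9 || b2 == 0xAF)) return 3;     // U+2028, U+2029, U+202F
    if (b0 == 0xE2 && b1 == 0x81 && b2 == 0x9F) return 3;       // U+205F
    if (b0 == 0xE3 && b1 == 0x80 && b2 == 0x80) return 3;       // U+3000
    return 0;
}

// Split UTF-8 text into words separated by any run of Unicode whitespace.
std::vector<std::string> split_words(const std::string& s)
{
    std::vector<std::string> words;
    std::string current;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t n = utf8_space_length(s, i);
        assert(i + n <= s.size());   // a match never runs past the end
        if (n == 0) {
            current += s[i];         // ordinary byte: part of the current word
            ++i;
        } else {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
            i += n;                  // skip the whole whitespace character
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Matching bytes directly is safe for valid UTF-8 because a lead byte such as 0xE2 never occurs in the middle of another character's encoding.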
 
James Kanze

AFAIK, C90 defines a locale by the name of "C",
which should also be visible from C++.

And Posix defines "POSIX". Neither of which are really useful
for anything.
 
gervaz

And Posix defines "POSIX".  Neither of which are really useful
for anything.

Ok, so I think that I will open my file specifying to use UTF-8
encoding, but how can I do it in C++?
 
gervaz

(e-mail address removed):
You can open it as a narrow stream and read in as binary UTF-8, or
(maybe) you can open it as a wide stream and get an automatic
translation from UTF-8 to wchar_t. The following example assumes that
you have a file test1.utf8 containing valid UTF-8 text. It reads the
file in as a wide stream and prints out the numeric values of all
wchar_t characters.

#include <iostream>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::wifstream is;
    // Unix-style locale name; it must be installed on the system.
    const std::locale filelocale("en_US.UTF8");
    is.imbue(filelocale);
    is.open("test1.utf8");
    std::wstring s;
    while (std::getline(is, s)) {
        for (std::wstring::size_type j = 0; j < s.length(); ++j) {
            // Cast so that the numeric value is printed on any compiler.
            std::cout << static_cast<unsigned long>(s[j]) << " ";
        }
        std::cout << "\n";
    }
}

(Tested on Linux with a recent gcc; I am not too sure if this works on
Windows. First, wchar_t in MSVC is too narrow for real Unicode, at
best one might get UTF-16 as a result.)

For curiosity, I tested this also on Windows with MSVC9, and as expected
it did not work, the locale construction immediately threw an exception
(bad locale name). Neither did any alterations work ("english.UTF8",
".UTF8", ".utf-8", ".65001").

Thus, if one wants any portability it seems the best approach currently
is still to read in binary UTF-8 and perform any needed conversions by
hand.

Paavo
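
For the "read in binary UTF-8 and convert by hand" route, a minimal sketch might look like the following. It assumes C++11 (for std::u32string); the names read_bytes and decode_utf8 are invented for this example, and the decoder is simplified: malformed input becomes U+FFFD, but overlong forms and encoded surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Read a whole file as raw bytes, with no locale-dependent translation.
// Returns an empty string if the file cannot be opened.
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Decode UTF-8 bytes into Unicode code points.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {                           // stray continuation or invalid lead
            out += char32_t(0xFFFD);
            ++i;
            continue;
        }

        bool ok = i + extra < in.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out += cp; i += extra + 1; }
        else    { out += char32_t(0xFFFD); ++i; }
    }
    assert(out.size() <= in.size());   // every code point consumed >= 1 byte
    return out;
}
```

Something like decode_utf8(read_bytes("test1.utf8")) would then give the code points without involving any locale name at all.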

Under Windows, you have to use
const std::locale filelocale("English_Australia.1252");
according to http://docs.moodle.org/en/Table_of_locales.
I've tested it in VC++ 2008 and it works. Any suggestions on how to handle
the dualism?

Thanks, Mattia
 
James Kanze


[...]
const std::locale filelocale("en_US.UTF8");

The above line supposes 1) that you're on a Unix platform
(because it uses the Unix conventions for naming locales), and
2) that the "en_US.UTF8" locale has been installed---under that
name. (I've worked on a lot of systems where this was not the
case.)
is.imbue(filelocale);
is.open("test1.utf8");
std::wstring s;
while(std::getline(is, s)) {
for (std::wstring::size_type j=0; j<s.length(); ++j) {
std::cout << s[j] << " ";
}
std::cout << "\n";
}
}
(Tested on Linux with a recent gcc, I am not too sure if
this works on Windows. First, wchar_t in MSVC is too narrow
for real Unicode, at best one might get UTF-16 as a result.)

UTF-16 is "real Unicode". Just like UTF-8.
For curiosity, I tested this also on Windows with MSVC9, and
as expected it did not work, the locale construction
immediately threw an exception (bad locale name). Neither did
any alterations work ("english.UTF8", ".UTF8", ".utf-8",
".65001").

That's because Windows uses different conventions for naming
locales. (Windows Vista and later claim that names conforming
to RFC 4646 are used, see
http://msdn.microsoft.com/en-us/library/dd373814(VS.85).aspx.
Except that RFC 4646 doesn't seem to contain information
concerning the character encoding. I'm guessing that Windows
would use the code page for this---65001 for UTF-8. But I don't
know how it has to be added to the "en-US".)
Thus, if one wants any portability it seems the best approach
currently is still to read in binary UTF-8 and perform any
needed conversions by hand.

It should be sufficient to find out how the different locales are
named for each system, and read this information in from some
sort of configuration file.
 
gervaz

Under Windows, you have to use const std::locale filelocale
("English_Australia.1252") according to
http://docs.moodle.org/en/Table_of_locales, I've tested it in VC++08
and it works. Any suggestion in how to handle the dualism?

Did you actually test the results? It seems this is reading UTF-8 in
unaltered, so there is no point to use a wide stream in the first place.

Paavo

Well, yeah. Using an example file like
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt with
plain std::string, std::ifstream and std::cout, everything works fine.
If I put the 'w' in front of all these types, the output fails,
producing:

UTF-8 encoded sample plain-text file
Γ

Why??
 
Öö Tiib


Because C++ does not convert from UTF-8 to UTF-16 just like that.
UTF-8 fits into std::string. std::wstring is UTF-16 when
sizeof(wchar_t) is 2 and UTF-32 when sizeof(wchar_t) is 4. The support for
character portability is weak in the STL, not sure why. Also POSIX
functions do not help much since most implementations were made before
Unicode was defined.

If you really want to convert, then use the platform's support. Most
platforms support Unicode (for example MultiByteToWideChar() in
Windows). If you want a portable solution, then use a library that is
capable of providing conversions, like ICU: http://site.icu-project.org/
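
To make the sizeof(wchar_t) point concrete, here is a small sketch of how already-decoded code points would have to be stored in a std::wstring on each kind of platform. It is only an illustration with an invented name (to_wide), assuming C++11; a real program is probably better served by the platform call or ICU as suggested above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: store Unicode code points in a std::wstring, as UTF-16
// (with surrogate pairs) when wchar_t is 16 bits wide, as UTF-32 otherwise.
std::wstring to_wide(const std::u32string& code_points)
{
    std::wstring out;
    for (char32_t cp : code_points) {
        assert(cp <= 0x10FFFF);   // precondition: a valid Unicode code point
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // Outside the Basic Multilingual Plane: split into two surrogates.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

The same code point can therefore occupy one or two wchar_t elements depending on the platform, which is why length and indexing of a std::wstring are not portable measures of "characters".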
 
Jorgen Grahn

Or I could keep the text in UTF-8 and use my own custom function for
checking for the whitespace, checking directly for all Unicode whitespace
characters as listed in
http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29.
This seems to me much less error-prone than
worrying if Russian locale and std::isspace are working correctly on all
platforms.

Worrying? "I don't support doing analysis of Russian text on a
platform with broken Russian locales" sounds like something you can
happily say.

/Jorgen
 
gervaz

On Fri, 2010-01-29, Paavo Helde wrote:

...



Ok, to summarize things learned so far:
UTF-8 can be handled by simply using std::string (hence char).
UTF-16 and UTF-32 are handled by std::wstring and wchar_t, but not
reliably, because the size of wchar_t is implementation-specific.
Now, something like:

std::ifstream is;
const std::locale filelocale("Russian_Russia.1251");
is.imbue(filelocale);
is.open(argv[1]);

std::string s;
while (std::getline(is, s))
{
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
    {
        std::cout << *it;
        if (std::isspace(*it, filelocale))
            std::cout << "space found!" << std::endl;
    }
    std::cout << std::endl;
}

This works if we give a Russian text as input (although cout isn't
able to display the Russian characters correctly).
If we are under Linux, something like

try
{
    const std::locale filelocale("Russian_Russia.1251");
}
catch (...)
{
    try
    {
        const std::locale filelocale("ru_utf8");
    }
    catch (...)
    {
        throw;
    }
}

Can that work? Any suggestions? (I don't even know the specific exception that
has to be caught. Just experimenting...)

Thanks, Mattia
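
One possible shape for such a fallback, as a sketch only: the std::locale constructor reports an unknown name by throwing std::runtime_error, so the candidates can be tried in order inside a function, which also avoids the problem that a locale declared inside a try block goes out of scope at its closing brace. The helper name first_available_locale is made up, and C++11 is assumed.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: return the first locale in `names` that this system
// can construct, or the classic "C" locale if none of them is available.
std::locale first_available_locale(const std::vector<std::string>& names)
{
    for (const std::string& name : names) {
        try {
            return std::locale(name.c_str());
        } catch (const std::runtime_error&) {
            // Unknown locale name on this system: try the next candidate.
        }
    }
    return std::locale::classic();
}
```

A call might look like first_available_locale({"Russian_Russia.1251", "ru_RU.CP1251"}), where the second name is only a guess at what a Unix system might call the equivalent locale; the actual names have to come from the system documentation or a configuration file, as suggested earlier in the thread.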
 
Jorgen Grahn

Happily to who? My boss? Or the customer?

You can't *always* say it, but you cannot always bend over backwards
either in an effort to please some minority. Your posting seemed to
imply that you should always implement your own logic, in case some
target platform happens to be broken.

/Jorgen
 
