gervaz
Hi all, is there a C++ function similar to isspace that can handle
wide characters (wchar_t)? Does the regex library handle wchar_t?
Thanks,
Mattia
Yes, there is a template function declared in <locale> and named
std::isspace, curiously enough.
There is no regex library in the official C++ standard yet, I think. The
Boost regex library is fully templated and ought to support wchar_t as
well, but I have not tried this. According to the Boost documentation, one
needs a separate ICU library for full Unicode support, though.
hth
Paavo
Well, take a look at my snippet:
std::ifstream infile(argv[1]);
std::string s;
while (getline(infile, s))
{
    s.erase(std::remove_if(s.begin(), s.end(), std::isspace), s.end());
    std::cout << s;
}
Using <locale> on VC++2008, I get an error reporting that
std::isspace expects 2 arguments,
and I still don't know whether a file containing Unicode characters
can be handled correctly.
That's because std::isspace requires two arguments, the
character to be tested, and the locale.
The functions in <locale> are pretty useless, since they only
handle single-byte characters. The "approved" solution is to
read into a wstring using wifstream (imbued with the
appropriate locale), and use isspace (again with the appropriate
locale) on the wchar_t in the wstring.
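As a rough sketch of the two-argument call described above, the snippet's erase/remove_if line could be wrapped as follows. The helper name strip_whitespace is made up for illustration, and the default of std::locale::classic() is an assumption; a real program would pass whatever locale it imbued the stream with. The lambda needs C++11, so on an older compiler such as VC++2008 a small function object would be needed instead.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper: removes every character that the given locale
// classifies as whitespace. It accepts std::string and std::wstring alike,
// because the <locale> version of std::isspace is a template on the
// character type.
template <typename CharT>
std::basic_string<CharT> strip_whitespace(
    std::basic_string<CharT> s,
    const std::locale& loc = std::locale::classic())
{
    s.erase(std::remove_if(s.begin(), s.end(),
                           // Two arguments: the character and the locale
                           [&loc](CharT c) { return std::isspace(c, loc); }),
            s.end());
    return s;
}
```

Calling strip_whitespace(line, infile.getloc()) on each line read from a wifstream would tie the classification to the stream's own locale.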
Ok, well, suppose I want to use UTF-8 encoding, how do I specify it using locale?
With UTF-8 one is using char, not wchar_t. Note that if char is a signed
type, then one must take care to cast char to unsigned char in places
where a non-negative value is expected.
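A minimal sketch of that cast, using the one-argument isspace from <cctype>. The function name count_space_bytes is invented for the example. Without the cast, any UTF-8 byte above 0x7F would be passed as a negative value on platforms where char is signed, which is undefined behaviour for the <cctype> functions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes that the C library's isspace()
// treats as whitespace in the current C locale.
std::size_t count_space_bytes(const std::string& utf8)
{
    std::size_t n = 0;
    for (char c : utf8) {
        // Convert to unsigned char first so the argument is never negative
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++n;
        }
    }
    return n;
}
```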
For historical reasons the locale and encoding stuff has been mixed up. Are
you more interested in locales or in encodings? Locales affect such things
as the character representing the decimal point in numbers, the look of
dates, whether V and W are sorted together or separately, and
whether Cyrillic characters are considered alphabetic characters or not.
Encoding is an entirely different business, specifying for example how those
Cyrillic characters are encoded in the binary data, if at all.
If you just want to translate between encodings, then you do not need
any locale stuff at all. When a web page comes in, you do not know whether the
decimal point used in numbers therein is a dot or a comma, for example,
so strictly speaking you cannot set the correct locale for processing the
page. What you can do is look at BOM markers and the declared charset, and
translate the file from its charset to the encoding you are using
internally, for example. For that, again, no locales are needed;
instead one needs some kind of system-specific code or a library like
iconv.
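The BOM check mentioned above can be sketched without any locale machinery. The enum Bom and the function detect_bom are names invented for this example. It only looks at the leading bytes, so it says nothing about files without a BOM or about a charset declared in an HTTP header or meta tag. It also cannot tell a UTF-32LE BOM from a UTF-16LE BOM followed by a NUL character, and here assumes the former.

```cpp
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: inspects the first bytes of a buffer and reports
// which byte order mark, if any, it starts with.
Bom detect_bom(const std::string& bytes)
{
    auto starts_with = [&bytes](const std::string& prefix) {
        return bytes.compare(0, prefix.size(), prefix) == 0;
    };
    // UTF-32LE is tested before UTF-16LE because it begins with the
    // same two bytes (FF FE).
    if (starts_with(std::string("\xFF\xFE\x00\x00", 4))) return Bom::Utf32LE;
    if (starts_with(std::string("\x00\x00\xFE\xFF", 4))) return Bom::Utf32BE;
    if (starts_with("\xEF\xBB\xBF")) return Bom::Utf8;
    if (starts_with("\xFF\xFE")) return Bom::Utf16LE;
    if (starts_with("\xFE\xFF")) return Bom::Utf16BE;
    return Bom::None;
}
```

The actual conversion from the detected charset to the internal encoding is still a job for a library like iconv.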
And where can I find a list of the possible locale
encoding configurations (e.g. if I wanted to correctly decode a web
page by just parsing the first bytes looking for 'charset')?
http://www.iana.org/assignments/character-sets
But you don't want to deal with this by yourself. Use a library like
iconv.
hth
Paavo
James Kanze said:
There are no standard names for locales.
If you mean space as ASCII character 32, then I would keep the text
encoded in UTF-8 and compare each byte with ' '.
However, if you mean any whitespace, then I would start by
finding out on the unicode.org site whether there are any non-ASCII
whitespace characters defined in the standard Russian locale.
If there are, and wchar_t on the given platform is wide enough
to represent all of them in a single wchar_t, then I could
encode the text as UTF-16 or UTF-32 as appropriate for wchar_t
on the given platform and use std::isspace<wchar_t>() with the
Russian locale.
Or I could keep the text in UTF-8 and use my own custom
function for checking for whitespace, checking directly
for all Unicode whitespace characters as listed in
http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29.
This seems to me much less error-prone
than worrying whether the Russian locale and std::isspace are working
correctly on all platforms.
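A custom check of that kind might look like the sketch below, which matches the UTF-8 byte sequences of the whitespace characters directly, with no locale involved. The function names are made up for the example. The character list is an assumption based on the Unicode White_Space property as I understand it (U+0009 to U+000D, U+0020, U+0085, U+00A0, U+1680, U+2000 to U+200A, U+2028, U+2029, U+202F, U+205F, U+3000) and should be checked against current Unicode data before being relied on.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte length of the Unicode whitespace
// character starting at position i of a UTF-8 string, or 0 if none
// starts there.
std::size_t utf8_space_length(const std::string& s, std::size_t i)
{
    if (i >= s.size()) return 0;
    // Bytes past the end read as 0, which matches none of the patterns
    auto byte = [&s](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0u;
    };
    const unsigned b0 = byte(i), b1 = byte(i + 1), b2 = byte(i + 2);
    if ((b0 >= 0x09 && b0 <= 0x0D) || b0 == 0x20) return 1;   // ASCII
    if (b0 == 0xC2 && (b1 == 0x85 || b1 == 0xA0)) return 2;   // U+0085, U+00A0
    if (b0 == 0xE1 && b1 == 0x9A && b2 == 0x80) return 3;     // U+1680
    if (b0 == 0xE2 && b1 == 0x80 &&
        ((b2 >= 0x80 && b2 <= 0x8A) ||                        // U+2000..U+200A
         b2 == 0xA8 || b2 == 0xA9 || b2 == 0xAF)) return 3;   // U+2028/9, U+202F
    if (b0 == 0xE2 && b1 == 0x81 && b2 == 0x9F) return 3;     // U+205F
    if (b0 == 0xE3 && b1 == 0x80 && b2 == 0x80) return 3;     // U+3000
    return 0;
}

// Copies the input, skipping every whitespace sequence found above.
std::string strip_unicode_whitespace(const std::string& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_space_length(s, i);
        if (n != 0) {
            i += n;
        } else {
            out += s[i];
            ++i;
        }
    }
    return out;
}
```

Copying the other bytes one at a time is safe here because none of the patterns begins with a UTF-8 continuation byte, so a match can never start in the middle of another character.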
On Jan 30, 12:05 pm, Stefan Ram wrote, in reply to "There are no standard names for locales":
AFAIK, C90 defines a locale by the name of "C",
which should also be visible from C++.
James Kanze said:
And Posix defines "POSIX". Neither of which are really useful
for anything.
Ok, so I think that I will open my file specifying to use UTF-8
encoding, but how can I do it in C++?
You can open it as a narrow stream and read in as binary UTF-8, or
(maybe) you can open it as a wide stream and get an automatic
translation from UTF-8 to wchar_t. The following example assumes
that you have a file test1.utf8 containing valid UTF-8 text. It
reads the file in as a wide stream and prints out the numeric
values of all wchar_t characters.
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
  std::wifstream is;
  // Locale names are platform-specific; this is the name tested on Linux.
  const std::locale filelocale("en_US.UTF8");
  is.imbue(filelocale);
  is.open("test1.utf8");
  std::wstring s;
  while (std::getline(is, s)) {
    for (std::wstring::size_type j = 0; j < s.length(); ++j) {
      // Cast so that the numeric value is printed, not a character
      std::cout << static_cast<unsigned long>(s[j]) << " ";
    }
    std::cout << "\n";
  }
}
(Tested on Linux with a recent gcc; I am not too sure whether this works
on Windows. For one thing, wchar_t in MSVC is too narrow for real Unicode,
so at best one might get UTF-16 as a result.)
Out of curiosity, I also tested this on Windows with MSVC9, and as
expected it did not work: the locale construction immediately threw
an exception (bad locale name). Nor did any variations work
("english.UTF8", ".UTF8", ".utf-8", ".65001").
Thus, if one wants any portability, it seems the best approach
currently is still to read in binary UTF-8 and perform any needed
conversions by hand.
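To give an idea of what converting by hand involves, here is a minimal UTF-8 decoder sketch. The name decode_utf8 is invented, and using std::u32string (C++11) as the target is an assumption made here to sidestep the 16-bit wchar_t on MSVC. It rejects malformed input by throwing, which is only one of several possible policies.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points and
// throws std::runtime_error on malformed input.
std::u32string decode_utf8(const std::string& in)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The input string would typically come from a std::ifstream opened with std::ios::binary, so that the bytes arrive unaltered.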
Paavo
Under Windows, you have to use const std::locale filelocale
("English_Australia.1252"), according to
http://docs.moodle.org/en/Table_of_locales. I've tested it in VC++08
and it works. Any suggestions on how to handle the dualism?
Did you actually test the results? It seems this is reading the UTF-8 in
unaltered, so there is no point in using a wide stream in the first place.
Paavo
Well, yeah. Using an example file like
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt with
plain std::string, std::ifstream and std::cout, everything works fine.
If I put the 'w' in front of all these types, the output fails,
producing:
UTF-8 encoded sample plain-text file
Γ
Why??
On Fri, 2010-01-29, Paavo Helde wrote:
...
Or I could keep the text in UTF-8 and use my own custom function for
checking for the whitespace, checking directly for all Unicode whitespace
characters as listed in
http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29, this
seems to me much less error-prone than worrying if Russian locale and
std::isspace are working correctly on all platforms.
Worrying? "I don't support doing analysis of Russian text on a
platform with broken Russian locales" sounds like something you can
happily say.
/Jorgen
Happily to who? My boss? Or the customer?