Problem with UTF-8

Charles

I'm designing a C++ application for the web (with FastCGI) and it has
to use UTF-8 because there will be users who will type Asian glyphs.
When I compile the application, if I use ANSI, no problem, it compiles
properly. But if I save the files as UTF-8, I get this error message:

%g++ -o cgi-bin/test.fcgi test.cpp
test.csp.cpp:1: error: stray '\239' in program
test.csp.cpp:1: error: stray '\187' in program
test.csp.cpp:1: error: stray '\191' in program
test.csp.cpp:1: error: invalid token
test.csp.cpp:1: error: expected constructor, destructor, or type
conversion before '<' token
test.csp.cpp: In function `int main()':
test.csp.cpp:5: error: `cout' was not declared in this scope
test.csp.cpp:5: error: `endl' was not declared in this scope
%

I guess this is because UTF-8 format adds some extra info in the
header of the file. Do you know how I could use UTF-8 with my
application? Other than that, do some of you use C++ and FastCGI? What
do you think? So far I've been really pleased with the low resource
usage and with the outstanding speed. Thanks.

Charles.
 
Nemanja Trifunovic

Charles said:
I'm designing a C++ application for the web (with FastCGI) and it has
to use UTF-8 because there will be users who will type Asian glyphs.
When I compile the application, if I use ANSI, no problem, it compiles
properly. But if I save the files as UTF-8, I get this error message:

%g++ -o cgi-bin/test.fcgi test.cpp
test.csp.cpp:1: error: stray '\239' in program
test.csp.cpp:1: error: stray '\187' in program
test.csp.cpp:1: error: stray '\191' in program
test.csp.cpp:1: error: invalid token
test.csp.cpp:1: error: expected constructor, destructor, or type
conversion before '<' token
test.csp.cpp: In function `int main()':
test.csp.cpp:5: error: `cout' was not declared in this scope
test.csp.cpp:5: error: `endl' was not declared in this scope
%

I guess this is because UTF-8 format adds some extra info in the
header of the file. Do you know how I could use UTF-8 with my
application?

You should process UTF-8 encoded data without a need to save your
source files in that encoding. For instance, take a look at
http://utfcpp.sourceforge.net/
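
To illustrate the idea without depending on any library, here is a minimal sketch that handles UTF-8 bytes in a source file that itself contains only ASCII. The names `utf8_length` and `sample` are made up for this example, and the function assumes its input is already valid UTF-8 (it does no validation, which a real library would do).

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 byte string by counting every byte that is
// NOT a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Two CJK characters (U+65E5, U+672C) written as hex escapes, so the
// source file stays plain ASCII even though the data is UTF-8.
const std::string sample = "\xE6\x97\xA5\xE6\x9C\xAC";
```

Here `sample.size()` is 6 (bytes) while `utf8_length(sample)` is 2 (characters). User input arriving through FastCGI would be handled the same way, as bytes at run time, independent of how the .cpp file is saved.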
 
Charles

Nemanja Trifunovic said:
You should process UTF-8 encoded data without a need to save your
source files in that encoding. For instance, take a look at
http://utfcpp.sourceforge.net/

Nice, thanks.

Charles.
 
James Kanze

Charles said:
I'm designing a C++ application for the web (with FastCGI) and it has
to use UTF-8 because there will be users who will type Asian glyphs.
When I compile the application, if I use ANSI, no problem, it compiles
properly. But if I save the files as UTF-8, I get this error message:
%g++ -o cgi-bin/test.fcgi test.cpp
test.csp.cpp:1: error: stray '\239' in program
test.csp.cpp:1: error: stray '\187' in program
test.csp.cpp:1: error: stray '\191' in program
test.csp.cpp:1: error: invalid token
test.csp.cpp:1: error: expected constructor, destructor, or type
conversion before '<' token
test.csp.cpp: In function `int main()':
test.csp.cpp:5: error: `cout' was not declared in this scope
test.csp.cpp:5: error: `endl' was not declared in this scope
%

Something funny is going on. First, of course, if the file only
contains characters in the basic source character set, whether
it is UTF-8 or ASCII shouldn't make a difference---all of the
characters in the basic source character set are identical in
the two encodings. Even stranger, however, are the error
messages: g++ normally displays the uninterpretable character in
*octal*. But octal with an 8 or 9 in it? Something is very
strange about your g++.

Charles said:
I guess this is because UTF-8 format adds some extra info in
the header of the file.

It shouldn't.

Charles said:
Do you know how I could use UTF-8 with my application?

My editor at home is configured to use UTF-8, and it saves my
C++ files in "UTF-8". And I've never had any problems. (When I
write the comments in French, they look funny on my machine at
work, because it doesn't have any UTF-8 fonts installed, but
other than that, the compiler doesn't complain.)

Before anything else, however, I'd try to find out why your
installation of g++ is inserting 8's and 9's into its octal.
Then I'd write a very, very simple program (hello, world) with
my editor, and look at a hex dump of it, to see what it is
actually writing to the file---if the editor automatically
inserts junk you didn't insert, it may not be usable for program
development.
 
Charles

James Kanze said:
Before anything else, however, I'd try to find out why your
installation of g++ is inserting 8's and 9's into its octal.
Then I'd write a very, very simple program (hello, world) with
my editor, and look at a hex dump of it, to see what it is
actually writing to the file---if the editor automatically
inserts junk you didn't insert, it may not be usable for program
development.


Thanks James, will do.

Charles.
 
Ron Natalie

Charles said:
%g++ -o cgi-bin/test.fcgi test.cpp
test.csp.cpp:1: error: stray '\239' in program
test.csp.cpp:1: error: stray '\187' in program
test.csp.cpp:1: error: stray '\191' in program
test.csp.cpp:1: error: invalid token
test.csp.cpp:1: error: expected constructor, destructor, or type
conversion before '<' token
test.csp.cpp: In function `int main()':
test.csp.cpp:5: error: `cout' was not declared in this scope
test.csp.cpp:5: error: `endl' was not declared in this scope
%

I guess this is because UTF-8 format adds some extra info in the
header of the file. Do you know how I could use UTF-8 with my
application? Other than that, do some of you use C++ and FastCGI? What
do you think? So far I've been really pleased with the low resource
usage and with the outstanding speed. Thanks.


The character set of the execution is INDEPENDENT of the character
set the program is written in. C++ only has barely adequate half-assed
wide character support. You must make sure that you have no characters
not in the basic set in the source file (outside of string/character
literals).

It looks like the first line has some cruft in it. Delete it and
retype it being careful not to use any characters not in the basic
set. You may need to use a different text editor.
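
One way to locate such characters is to scan the file for bytes outside the ASCII range. The sketch below is deliberately simplistic: `StrayByte` and `find_non_ascii` are names invented here, and it flags every byte of 0x80 or above without trying to recognise string or character literals, so it will also report bytes inside literals.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Position and value of one byte outside the 7-bit ASCII range.
struct StrayByte {
    std::size_t line;     // 1-based line number
    std::size_t column;   // 1-based byte offset within the line
    unsigned char value;  // the offending byte
};

// Scan the whole source text and collect every byte >= 0x80.
// Simplification: literals and comments are not treated specially.
std::vector<StrayByte> find_non_ascii(const std::string& source) {
    std::vector<StrayByte> found;
    std::size_t line = 1;
    std::size_t column = 1;
    for (unsigned char c : source) {
        if (c >= 0x80) {
            found.push_back({line, column, c});
        }
        if (c == '\n') {
            ++line;
            column = 1;
        } else {
            ++column;
        }
    }
    return found;
}
```

An empty result means the file is pure ASCII; three hits at line 1, columns 1 to 3, point at something prepended to the file rather than at anything typed.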
 
Ole Nielsby

Charles said:
I'm designing a C++ application for the web (with FastCGI) and it has
to use UTF-8 because there will be users who will type Asian glyphs.
When I compile the application, if I use ANSI, no problem, it compiles
properly. But if I save the files as UTF-8, I get this error message:

%g++ -o cgi-bin/test.fcgi test.cpp
test.csp.cpp:1: error: stray '\239' in program
test.csp.cpp:1: error: stray '\187' in program
test.csp.cpp:1: error: stray '\191' in program

Notepad inserts these very bytes in front of all utf-8 files. See

http://en.wikipedia.org/wiki/Utf-8#Windows
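
If changing editors is not an option, the three bytes can be removed before the file reaches the compiler. A minimal sketch, assuming the file contents have already been read into a `std::string` (the name `strip_utf8_bom` is made up for this example):

```cpp
#include <string>

// Return the text without a leading UTF-8 byte order mark (EF BB BF).
// Text that does not start with those three bytes is returned unchanged.
std::string strip_utf8_bom(const std::string& text) {
    static const std::string bom = "\xEF\xBB\xBF";
    if (text.compare(0, bom.size(), bom) == 0) {
        return text.substr(bom.size());
    }
    return text;
}
```

Only an exact three-byte match at the very start is removed, so a file without the mark passes through untouched.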
 
James Kanze

Ole Nielsby said:
Notepad inserts these very bytes in front of all utf-8 files. See

Interesting. However, in that case, I would expect to see
'\357', '\273' and '\277' as the stray bytes, rather than the
rather weird values he saw. (I wonder: is g++ assigning these
to a signed char, and doing the conversion to octal without
noticing that it's dealing with a negative value? But I see the
correct values when I try it with g++.)
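
For comparison, the sketch below prints the three bytes EF BB BF both ways. The helper names are invented for this example. The decimal forms come out as 239, 187 and 191, the same numbers as in the reported errors, which suggests (an inference from the arithmetic, not something confirmed in this thread) that the g++ in question printed the stray bytes in decimal rather than octal.

```cpp
#include <cstdio>
#include <string>

// Format a byte as a backslash followed by its DECIMAL value.
std::string as_decimal_escape(unsigned char byte) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\%u", static_cast<unsigned>(byte));
    return buf;
}

// Format a byte as a backslash followed by its OCTAL value.
std::string as_octal_escape(unsigned char byte) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\%o", static_cast<unsigned>(byte));
    return buf;
}
```

For 0xEF, 0xBB and 0xBF the decimal form gives `\239`, `\187`, `\191`, and the octal form gives `\357`, `\273`, `\277`.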

As far as I can tell, even if the compiler processed the file as
UTF-8, a BOM is illegal in a C++ program, unless the compiler
were to simply eliminate it in phase 1 (where it maps the
physical source file characters to the basic source character
set---in an implementation defined manner). It might be worth
modifying the standard to require a few more characters to be
recognized as white space: requiring '\r' and the BOM would make
life a lot easier in practice.
 
