Strange Spanish Question Mark (¿) Appearance at the Very Beginning of Output HTML Files Using C++

maria

I only use C++ with Visual Studio 6.0 for string manipulations in
thousands of HTML pages on my website. Many times, the output files of
many of my C++ programs contain a Spanish question mark (¿) as their
first character. What creates it? How do we avoid it?
Thanks!

maria
 
Jack Klein

My crystal ball says that the error is on line 42.

If you, who can look at the source code all you want, haven't found
the error, how do you expect anyone else to do so, without seeing the
code?

That character is there because your program writes it there. The
answer can only be found in your program.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.club.cc.cmu.edu/~ajo/docs/FAQ-acllc.html
 
James Kanze

More likely, the error is related to his environment, and
character coding issues. If he writes his code supposing one
encoding (perhaps unintentionally, he might just suppose the
encoding of his editor), and views the output in another, weird
things can and will happen.

Of course, you're right that we don't have enough information to
even start guessing, and knowing exactly what he's trying to
output would certainly help. As would knowing how he's viewing
the files, to determine the presence of this character; even
more useful would be a hex dump of the start of the file. (In
Unicode and ISO 8859-1, the inverted question mark is 0xBF. I
don't see anything offhand which might insert such a character
arbitrarily, but perhaps the actual value in the file is
something else, which his viewing environment displays as an
inverted question mark.)
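
A hex dump of the first few bytes, as suggested above, takes only a few lines of C++. The following is a minimal sketch, not code from the original program: the function name hexDumpStart and its parameters are made up for illustration. It opens the file in binary mode and returns the leading bytes as two-digit hex values.

```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Returns the first `count` bytes of the file at `path` as space-separated
// two-digit hex values, e.g. "EF BB BF 3C". Returns an empty string if the
// file cannot be opened. (Function and parameter names are illustrative.)
std::string hexDumpStart(const std::string& path, std::size_t count = 16)
{
    // Binary mode, so the runtime does not translate any bytes on Windows.
    std::ifstream in(path.c_str(), std::ios::binary);

    std::ostringstream hex;
    hex << std::hex << std::uppercase << std::setfill('0');

    char c;
    for (std::size_t i = 0; i < count && in.get(c); ++i) {
        if (i != 0) {
            hex << ' ';
        }
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        hex << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(c));
    }
    return hex.str();
}
```

Calling hexDumpStart on one of the affected output files and printing the result shows what is really there: a file that starts with a UTF-8 byte order mark begins with EF BB BF, while a file that contains a lone ISO 8859-1 inverted question mark begins with BF.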
 
maria

That is very correct, James. There is no way I have told my C++
programs to write a Spanish question mark at the beginning of a page.
The code is very simple:

string entry;
.... ...
while (getline(in, entry)) {
    out.write(entry.c_str(), entry.size());
    out.put('\n');
    ...
    ...
}
.... ...

The viewing environment is determined by the "UEdit"
and "Search and Replace" programs, and Firefox. They all see/create
a Spanish Question Mark or other little symbols, like
an inverted exclamation mark accompanied by a European-style quotation
mark, only AT THE BEGINNING of the file.
All these marks can be eliminated by the "Search and Replace" program.
After their elimination, they do not show up again anywhere before
the files get generated again by using C++ with Visual Studio 6.0.
I was just wondering why they are created to begin with, and why they
only show up at the beginning of the output file.
Thank you very much!

maria
 
Alf P. Steinbach

* maria:
I was just wondering why they are created to begin with, and why they
only show up at the beginning of the output file.

The last may be a clue. Are you sure that it isn't a Unicode UTF-8 byte
order mark (BOM)? It would appear as "ï»¿". Check out
<http://en.wikipedia.org/wiki/Byte_order_mark>. As to how it gets
there, and whether you should leave it there or remove it, well,
anyone's guess without knowing much about your program, data, processing
etc., but it's most likely not a C++ issue.

Cheers, & hth.,

- Alf
 
maria

Alf,

You are absolutely right. I get either a Spanish question mark
or the BOM as you mentioned above.
I believe that you have answered my question completely!
Thank you!

maria
 
maria

Alf,

Let me add that newer versions of UEdit allow you to save a file
without the BOM. I think this started with version 11.20. It is now
version 13.20. The reason I am interested in this is that
my Firefox browser does show these little characters on my pages,
occasionally.
Thanks again.

maria
 
Alf P. Steinbach

I'm sorry, discussing this further is off-topic in clc++, and I'm not
sure about the right group for discussing HTML/XML/whatever the document
is, so, sorry about no hint for right group. The FAQ only mentions
groups for other things. Perhaps either some Usenet HTML or XML group,
or a Mozilla community forum, or simply check whether adhering to strict
standards (check out the W3C pages, <url: http://www.w3.org/>) helps.

Cheers, & hth.,

- Alf
 
James Kanze

The last may be a clue. Are you sure that it isn't a Unicode
UTF-8 byte order mark (BOM)? It would appear as "ï»¿".

That's the first thing that occurred to me as well---that's why I
asked about his toolset. If he's outputting wchar_t Unicode,
the library could very well insert a BOM at the start of the
file; who knows how another tool, which thought it was reading
char data, might interpret this.

Your suggestion of a BOM somehow converted to UTF-8 is a good
one, however, since one of the bytes would be 0xBF, which is the
inverted question mark. In UTF-8, a BOM is illegal, and should
never be output, but if he's outputting UTF-16, which is being
naïvely converted to UTF-8, I wouldn't be surprised to see it.
Check out <http://en.wikipedia.org/wiki/Byte_order_mark>. As
to how it gets there, and whether you should leave it there or
remove it, well, anyone's guess without knowing much about
your program, data, processing etc., but it's most likely not
a C++ issue.

Well, it's sort of a C++ issue, in that the committee is
addressing it: the next version of C++ will (conditionally?)
have two types, char16_t and char32_t, which are guaranteed to
be UTF-16 and UTF-32 (if they are present). But anything
involving character encodings will of necessity go beyond
language issues---there's no way C++ can have anything to do
with e.g. the encodings of the fonts in your printer.
 
Alf P. Steinbach

* James Kanze -> Alf:
This is off-topic, but since it challenges what I wrote I respond
anyhow. I'm setting follow-ups to [comp.programming] because it's
slightly less off-topic there. I think. ;-)

Your suggestion of a BOM somehow converted to UTF-8 is a good
one,

No, I meant precisely a UTF-8 BOM, not anything "somehow". The Unicode
BOM is simply encoded using whatever encoding is used. And with UTF-8
that means it ends up the same no matter the byte order, whereas with
UTF-16 and UTF-32 it indicates the byte order by ending up differently
depending on the byte order.
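
To make that concrete: the BOM is the code point U+FEFF, which comes out as EF BB BF in UTF-8, as FE FF or FF FE in UTF-16 (big or little endian), and as 00 00 FE FF or FF FE 00 00 in UTF-32. The following is a minimal sketch of telling them apart from the first bytes of a file; the function name detectBom and the returned labels are invented for this illustration.

```cpp
#include <cstddef>
#include <string>

// Names the encoding suggested by a BOM at the start of `bytes`, or returns
// "none" if no BOM is recognised. (Name and labels are illustrative only.)
std::string detectBom(const std::string& bytes)
{
    const std::size_t n = bytes.size();

    // Copy up to four leading bytes as unsigned values, so that bytes
    // >= 0x80 compare correctly even where plain char is signed.
    unsigned char b[4] = {0, 0, 0, 0};
    for (std::size_t i = 0; i < 4 && i < n; ++i) {
        b[i] = static_cast<unsigned char>(bytes[i]);
    }

    // UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
    // starts with the UTF-16LE BOM (FF FE). This is a heuristic: the same
    // four bytes could also be UTF-16LE text that begins with U+0000.
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UTF-32BE";
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32LE";
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";
    return "none";
}
```

Only the UTF-8 form contains the byte 0xBF, which is why a tool that reads the file as ISO 8859-1 or a Windows code page shows an inverted question mark among the leading characters.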

however, since one of the bytes would be 0xBF, which is the
inverted question mark. In UTF-8, a BOM is illegal,

Not according to e.g. <url: http://unicode.org/faq/utf_bom.html#29>:
"Yes, UTF-8 can contain a BOM".


Cheers,

- Alf
 
Pete Becker

Well, it's sort of a C++ issue, in that the committee is
addressing it: the next version of C++ will (conditionally?)
have two types, char16_t and char32_t, which are guaranteed to
be UTF-16 and UTF-32 (if they are present).

Not conditionally. They're built-in types, equivalent in size, etc. to
uint_least16_t and uint_least32_t, respectively. For a few more
details, see the article at my web site.
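
As a rough illustration of the size claim, here is a small sketch written against the types as they were eventually standardised in C++11; it will not compile with older compilers such as Visual Studio 6.0. The constant names bom16 and bom32 are made up for this example.

```cpp
#include <cstdint>

// The size relationship described above, checked at compile time.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t),
              "char16_t has the size of uint_least16_t");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t),
              "char32_t has the size of uint_least32_t");

// U+FEFF, the byte order mark, fits in a single code unit of either type.
// (bom16 and bom32 are illustrative names.)
constexpr char16_t bom16 = u'\uFEFF';
constexpr char32_t bom32 = U'\uFEFF';
```

Note that these types describe code units in memory, not bytes in a file; how a value such as bom16 is written to disk, and in which byte order, is still up to the code that does the output.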
 
