UTF-8 encoding problem

shreshth.luthra · Oct 18, 2006

Hi All,

I have a GUI that accepts a Unicode string and searches a given set of
XML files for that string.

Now, I have two XML files, both saved in UTF-8 format and containing
characters from different languages.

Both of them start with a UTF-8 BOM, but only the first file also
declares UTF-8 in the XML declaration at the top of the file.

Now, when I search that directory for a character from another language
using a third-party GUI for desktop search, it shows that the character
exists in the first file (which also has the XML declaration), but not
in the second file (which has only the BOM).

Initially I thought the problem was mainly because UTF-8 supports both
multibyte and Unicode, but I could not find much on it.

Please help.

Regards,
Shreshth

Ron Natalie · Oct 18, 2006

> Initially I thought the problem was mainly because UTF-8 supports both
> multibyte and Unicode, but I could not find much on it.

What does this have to do with C++ at all?
UTF-8 is a multibyte encoding of Unicode (which is effectively a
32-bit character space), but I doubt that's your problem.
Your problem is that your document doesn't conform to the document
rules that the search program is using.
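
To make "multibyte encoding" concrete, here is a minimal Python sketch. The sample characters and the helper name `utf8_length` are chosen only for illustration and do not come from the original post. It shows that UTF-8 uses between one and four bytes per Unicode code point:

```python
# Illustrative only: the sample characters are arbitrary picks that
# happen to need 1, 2, 3 and 4 bytes respectively in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the byte count and the raw bytes in hex.
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex()}")
```

So "multibyte" and "Unicode" are not two alternatives that UTF-8 supports: Unicode is the character set, and UTF-8 is one way of writing its code points as bytes.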

shreshth.luthra · Oct 18, 2006

I know this has nothing to do with C++ in particular, but where better
to ask such a question?

Anyway,

> Your problem is that your document doesn't conform to the document
> rules that the search program is using.

I do not understand what you mean by this.
Of course, I cannot do anything about the search program (which is
certainly using Unicode).

But the question is: if both files are in UTF-8 format, why does the
search program work only for the one that also has UTF-8 in its XML
declaration?
Does the declaration really make any difference in this regard?

Thanks for your reply.

Shreshth

loufoque · Oct 18, 2006

> Both of them start with a UTF-8 BOM, but only the first file also
> declares UTF-8 in the XML declaration at the top of the file.

BOMs are quite useless for UTF-8. They are entirely optional.
And according to the XML spec (AFAIK), the default encoding when no
encoding is declared is UTF-8.
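
As a rough way to check this, the Python sketch below inspects raw XML bytes for a UTF-8 BOM and an encoding declaration, then parses them. The two byte strings are hypothetical stand-ins for the two files in the question, and `describe` and `DECL_RE` are names invented for this example. It says nothing about how the third-party search tool behaves; it only shows what a conforming parser does.

```python
import codecs
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of an XML declaration that sits
# at the very start of the document (after any BOM has been removed).
DECL_RE = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def describe(data: bytes):
    """Return (has_utf8_bom, declared_encoding_or_None) for raw XML bytes."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    match = DECL_RE.match(body)
    declared = match.group(1).decode("ascii") if match else None
    return has_bom, declared

# Hypothetical stand-ins for the two files: same content, both with a
# BOM, but only the first one carries an XML declaration.
text = "<doc>\u0939\u093f</doc>"
first = codecs.BOM_UTF8 + ('<?xml version="1.0" encoding="UTF-8"?>' + text).encode("utf-8")
second = codecs.BOM_UTF8 + text.encode("utf-8")

for name, data in (("first", first), ("second", second)):
    # A conforming parser should yield the same text for both documents.
    print(name, describe(data), repr(ET.fromstring(data).text))
```

If a standard parser reads both documents identically, the difference seen in the search results most likely comes from how the tool itself detects encodings.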

> Now, when I search that directory for a character from another language
> using a third-party GUI for desktop search, it shows that the character
> exists in the first file (which also has the XML declaration), but not
> in the second file (which has only the BOM).

OK, so you have a problem with your broken third-party application.
How is that related to C++?

> Initially I thought the problem was mainly because UTF-8 supports both
> multibyte and Unicode, but I could not find much on it.

Like most of your message, what you say just doesn't make much sense.

> Please help.

Getting a basic understanding of what Unicode and its encoding formats
are would surely help.

loufoque · Oct 18, 2006

Ron said:
> Unicode (which is effectively a 32-bit character space)

Unicode only reserves 2^20 + 2^16 code points.
21 bits are more than enough to store that.
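
The arithmetic can be checked with a few lines of Python. This is a small sketch, assuming the code space runs from U+0000 to U+10FFFF inclusive; the variable names are made up for the example.

```python
# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
code_space_size = MAX_CODE_POINT + 1

# Total number of code points, and the two ways of writing it.
print(code_space_size)
print(code_space_size == 2**20 + 2**16)   # the figure quoted above
print(code_space_size == 17 * 2**16)      # 17 planes of 65,536 code points

# Number of bits needed to represent the largest code point.
print(MAX_CODE_POINT.bit_length())
```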

Peter Jansson · Oct 18, 2006

> I know this has nothing to do with C++ in particular, but where better
> to ask such a question?

The statement above is the best I have seen in a long time here.

If you know your question has "nothing to do with C++ in particular",
then why do you ask in a newsgroup dedicated to the C++ language? That
is like asking for help with your car in a bicycle shop.

You will probably get a much better response if you ask in a forum
dedicated to your problem.

Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/

Bhushan · Oct 19, 2006

Check your third-party search tool's documentation to see how it
searches XML files.
