UTF-8 and strings

  • Thread starter John M. Dlugosz

Miles Bader

MikeP said:
I guess I have a hard time seeing how anything multi-byte is a boon. But,
and it's a big but (not to be confused with a phat azz!), if one doesn't
need "internationalization" (I mean other than English), it's a waste of
effort. Yes?

But that's the thing: if you're just doing things casually but, e.g.,
want to use a few special chars here and there, or allow users more
freedom in what filenames they can use, then UTF-8 _doesn't_ require
much effort; it's a fairly easy tweak to ASCII-only code. If the bulk
of your strings are English, then UTF-8 is also very space-efficient.
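
To make the "easy tweak" point concrete, here is a minimal C++ sketch. It assumes the strings already hold valid UTF-8 in a plain std::string; the helper names (is_continuation_byte, count_code_points, base_name) are made up for illustration and are not from any library. Counting characters needs one extra test per byte, while byte-oriented code that only looks for ASCII delimiters such as '/' needs no change at all, because bytes below 0x80 never appear inside a multi-byte sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Counts code points in a string assumed to hold valid UTF-8.
// Each code point contributes exactly one non-continuation byte.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if (!is_continuation_byte(b)) ++n;
    }
    return n;
}

// Unchanged ASCII-era logic: find the last '/' and return what follows.
// Safe on UTF-8 because 0x2F cannot occur inside a multi-byte sequence.
inline std::string base_name(const std::string& path) {
    std::size_t slash = path.rfind('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}
```

With this, "naïve" stored as UTF-8 occupies 6 bytes but counts as 5 code points, and base_name works on a path with accented file names without knowing anything about the encoding.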

It's UTF-16, which requires even the most trivial parts of
string-handling paths to be completely replaced, that's a pain in the
butt -- and then really offers almost no advantage to offset the
various disadvantages!

The only reasons I can see to use UTF-16 are: (1) you're writing
Windows-only code, never expect to port it, and want to fit better
with Windows library functions that expect UTF-16 strings, or (2)
you're writing an app to handle absolutely _massive_ amounts of CJK
text, and the space savings for CJK text in UTF-16 compared to UTF-8
(typically two bytes per character instead of three) are critical
for you.

Very, very few people are doing (2), so basically that leaves (1).
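
The space trade-off can be checked with a rough C++ sketch. It works on already-decoded code points in a std::u32string; the names utf8_bytes, utf16_bytes, EncodedSize and encoded_size are hypothetical, and the input is assumed to contain only valid Unicode scalar values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
inline std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;     // Latin supplements, Greek, Cyrillic, ...
    if (cp < 0x10000) return 3;   // rest of the BMP, including most CJK
    return 4;                     // supplementary planes
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP,
// a surrogate pair (two units) outside it.
inline std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

struct EncodedSize {
    std::size_t utf8;
    std::size_t utf16;
};

// Sums the encoded size of a whole string in both encodings.
inline EncodedSize encoded_size(const std::u32string& text) {
    EncodedSize total{0, 0};
    for (char32_t cp : text) {
        total.utf8 += utf8_bytes(cp);
        total.utf16 += utf16_bytes(cp);
    }
    return total;
}
```

For U"hello" this gives 5 bytes in UTF-8 against 10 in UTF-16; for the three CJK code points U+65E5 U+672C U+8A9E it gives 9 against 6. So the UTF-16 advantage only shows up when the text is dominated by such characters.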

-Miles
 

John M. Dlugosz

I guess I have a hard time seeing how anything multi-byte is a boon. But,
and it's a big but (not to be confused with a phat azz!), if one doesn't
need "internationalization" (I mean other than English), it's a waste of
effort. Yes?

Since ASCII is a proper subset of UTF-8, you can write plain English
and get one byte per character. So there is no special effort on your
part.

It's rare that you would not care about internationalization. Even if
you don't plan to change your displayed UI into other languages,
people will try using file names and enter strings in their own
language.
 

John M. Dlugosz

But if you KNOW that all you need is what's in the BMP, why not exploit
that, right?

Sure, the project is specified to be localized into 7 languages,
and they all happen to be covered by the Latin-1 character set. So
you decide to use 8-bit chars and assume the Windows program is
running on a system that uses code page 1252 as the default for a
process.

Then one day the boss comes in and says that the next version will be
marketed to China as well.

It is my experience that software projects only get more complex over
time. Plan for it, unless you are planning to be unsuccessful.
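
As a sketch of what the later migration might look like, the C++ function below converts 8-bit Latin-1 text to UTF-8. It assumes strict ISO-8859-1 input, where each byte value equals its Unicode code point; code page 1252 differs in the range 0x80 to 0x9F and would need a lookup table that is not shown. The name latin1_to_utf8 is made up for illustration.

```cpp
#include <cassert>
#include <string>

// Converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Bytes below 0x80 are ASCII and are copied through unchanged; every other
// byte is a code point in 0x80..0xFF and becomes a two-byte sequence.
// Assumption: input is strict Latin-1, not Windows-1252.
inline std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size());
    for (unsigned char b : latin1) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));    // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (b & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}
```

The conversion itself is short. The expensive part is usually every place in the code that assumed one char equals one character, which is the reason to plan for it up front.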

—John
 

Asger-P

Hi ruben

which is why it took 40 plus years to even think about it...

BTW - what you wrote is actually incorrect. I'm not an expert on utf-8
but god knows I've followed enough arguing about it, especially on Rik
Moens conspire mailing list, to understand this basic fact.

What part is incorrect?

Have a look at:
http://en.wikipedia.org/wiki/UTF-8
and you will see that the first 128 code points (0 through 127) are
exactly ASCII, each encoded as a single byte.


Best regards
Asger-P
 

Asger-P

Hi ruben

Strangely enough, this is a specific problem for a specific kind of app,
like a word processor.

I think you are narrowing it a bit too much.
Most applications that interact with the user and their keyboard
need to consider code pages at some level, if they want to be used
outside the region where they were designed.

A simple thing like comparing two strings case-insensitively will often
not work on non-ASCII characters if you use the standard C functions.
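
A small C++ sketch of that failure, assuming UTF-8 strings and a program that never calls setlocale, so std::tolower runs in the default "C" locale and only maps 'A' to 'Z'. The function name iequals_bytewise is made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Byte-wise case-insensitive comparison in the style of classic C code.
// Assumption: default "C" locale, where std::tolower leaves every byte
// outside 'A'..'Z' unchanged.
inline bool iequals_bytewise(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Cast to unsigned char first: passing a negative char to tolower
        // is undefined behavior.
        int ca = std::tolower(static_cast<unsigned char>(a[i]));
        int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return false;
    }
    return true;
}
```

"Hello" and "hELLO" compare equal, but the Danish "Æble" and "æble" do not: in UTF-8 the first letters are the byte pairs C3 86 and C3 A6, and tolower does not touch either byte. Handling this properly needs Unicode-aware case folding from a library such as the Boost.Locale one linked in this post.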

This is an interesting page to read:
http://cppcms.sourceforge.net/boost_locale/html/appendix.html

If you live in a country where English is the language, you
probably haven't seen the errors yourself, but fortunately
I live in Denmark, so I have had to deal with this issue from
day one.


Best regards
Asger-P
 
