strstr for Unicode characters

Kelvin Moss

Hi all,

How could one write an strstr function to work with unicode characters?

Are there existing implementations/solutions/api for doing so?

Any pointers would be appreciated.

Thanks ..
 
Ioannis Papadopoulos

Kelvin said:
Hi all,

How could one write an strstr function to work with unicode characters?

Are there existing implementations/solutions/api for doing so?

Any pointers would be appreciated.

Thanks ..

You could always use memcpy() or memmove().
 
SM Ryan

# Hi all,
#
# How could one write an strstr function to work with unicode characters?
#
# Are there existing implementations/solutions/api for doing so?

String functions should work just fine on UTF-8 encoded Unicode
characters, bearing in mind that non-ASCII characters will have codes
greater than 127 (or less than zero where char is signed) and may be
represented by multiple bytes. Something like strstr, which only looks
for byte sequences without embedded zeros, should be fine, while strchr
can be problematic. There are also wide character (wc...) types
and functions becoming available, which will probably hold 16-bit or
wider Unicode characters.
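
For the wide character route, the standard counterpart of strstr is wcsstr from <wchar.h>. Below is a minimal sketch, assuming the text is already held as wchar_t strings; wide_find is a hypothetical helper name, not a library function, and how wide wchar_t is (16 or 32 bits) depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: index (in wchar_t units) of the first occurrence
   of needle in haystack, or -1 if needle does not occur. */
static long wide_find(const wchar_t *haystack, const wchar_t *needle)
{
    assert(haystack != NULL && needle != NULL);
    const wchar_t *hit = wcsstr(haystack, needle);   /* standard since C95 */
    return hit ? (long)(hit - haystack) : -1L;
}
```

For example, wide_find(L"na\u00efve caf\u00e9", L"caf\u00e9") looks for "café" inside "naïve café". Both characters used here fit in a single wchar_t even when wchar_t is only 16 bits wide.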
 
Une bévue

SM Ryan said:
There is also wide character (wc...) type
and functions becoming available which will probably be 16 bit or
wider unicode characters.

For example, the UTF-16 used by the Mac OS X file system?
 
Kelvin Moss

SM said:
# Hi all,
#
# How could one write an strstr function to work with unicode characters?
#
# Are there existing implementations/solutions/api for doing so?

String functions should work just fine on UTF-8 encoded unicode
characters - minding that nonASCII characters will have codes greater
than 127 (or less than zero) and might be represented by multiple bytes.
For something like strstr which should only be looking for byte
sequences without embedded zeros, it should be fine, while strchr
can be problematically.

Yes, I am dealing with UTF8 encoded Unicode characters. So you mean to
say that as long as I don't have embedded zeroes in the strings strstr
should be fine. Right? I think this assumption may not work quite well
in real applications. Your thoughts?

Thanks ..
 
Stephen Sprunk

Kelvin Moss said:
Yes, I am dealing with UTF8 encoded Unicode characters. So you mean to
say that as long as I don't have embedded zeroes in the strings strstr
should be fine. Right? I think this assumption may not work quite well
in real applications. Your thoughts?

UTF-8 won't have any embedded zeroes by definition (only U+0000 itself
encodes to a zero byte); the encoding was specifically designed to work
transparently with C code that assumed ASCII or some 8-bit ASCII-based
encoding.
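
The "no embedded zeroes" property can be checked by brute force. This is a minimal sketch of a UTF-8 encoder written for illustration; utf8_encode, utf8_encoded_length and utf8_never_emits_zero are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx plus three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}

/* Convenience wrapper: encoded length of cp in bytes (0 if invalid). */
static size_t utf8_encoded_length(uint32_t cp)
{
    unsigned char buf[4];
    return utf8_encode(cp, buf);
}

/* Returns 1 if no code point from U+0001 to U+10FFFF produces a zero byte. */
static int utf8_never_emits_zero(void)
{
    unsigned char buf[4];
    for (uint32_t cp = 1; cp <= 0x10FFFF; ++cp) {
        size_t n = utf8_encode(cp, buf);
        for (size_t i = 0; i < n; ++i)
            if (buf[i] == 0)
                return 0;
    }
    return 1;
}
```

Every lead byte has its top bit set and every continuation byte has the form 10xxxxxx, so a multi-byte sequence can never contain 0x00. That is why the usual NUL terminator stays unambiguous.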

S
 
Ioannis Papadopoulos

Ioannis said:
You could always use memcpy() or memmove().

I do not know what I was thinking at the time. I thought you wanted a
replacement for strcpy. Please ignore my previous reply.

I tried strstr() on Unicode characters and it seems to work.
 
micans

Kelvin said:
Hi all,

How could one write an strstr function to work with unicode characters?

Are there existing implementations/solutions/api for doing so?

Do you want to deal with issues such as normalization? E.g. combining
characters can be represented in (many) different ways. In that case,
I've previously worked in a project that used the (IBM) ICU libraries
(licensed under the X license, GPL compatible).
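
To make the normalization issue concrete, here is a minimal sketch showing two UTF-8 spellings of "café" that look identical on screen but that a byte-wise strstr treats as different. The names cafe_nfc, cafe_nfd and bytes_contain are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "café" with a precomposed U+00E9 (bytes C3 A9). */
static const char cafe_nfc[] = "caf" "\xc3\xa9";

/* "café" as 'e' followed by combining acute U+0301 (bytes CC 81). */
static const char cafe_nfd[] = "cafe" "\xcc\x81";

/* Hypothetical helper: 1 if needle occurs byte-for-byte in haystack. */
static int bytes_contain(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle) != NULL;
}
```

With these definitions, bytes_contain(cafe_nfd, cafe_nfc) reports no match although both strings render as "café", while bytes_contain(cafe_nfd, "cafe") reports a match that stops in the middle of the accented letter. Bringing both strings to the same normalization form before searching avoids the first problem.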

Stijn
 
SM Ryan

Une bévue wrote:
#
# > There is also wide character (wc...) type
# > and functions becoming available which will probably be 16 bit or
# > wider unicode characters.
#
# for example as UTF16 used on Mac OS X File System ???

Mac OS X file paths are a UTF-8 encoding of Unicode (16-bit, I think).
The file name length limit is counted in UTF-8 bytes.
 
SM Ryan

#
# SM Ryan wrote:
# > # Hi all,
# > #
# > # How could one write an strstr function to work with unicode characters?
# > #
# > # Are there existing implementations/solutions/api for doing so?
# >
# > String functions should work just fine on UTF-8 encoded unicode
# > characters - minding that nonASCII characters will have codes greater
# > than 127 (or less than zero) and might be represented by multiple bytes.
# > For something like strstr which should only be looking for byte
# > sequences without embedded zeros, it should be fine, while strchr
# > can be problematically.
#
# Yes, I am dealing with UTF8 encoded Unicode characters. So you mean to
# say that as long as I don't have embedded zeroes in the strings strstr
# should be fine. Right? I think this assumption may not work quite well
# in real applications. Your thoughts?

UTF-8 bytes are ASCII characters plus nonzero bytes; UTF-8 encoding
does not insert zero bytes where none existed before. As long as all
you're doing is shuffling bytes around, you can use most str* functions.
Functions like strchr which expect one char to be one character
will only work on the ASCII subset.
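
A minimal sketch of that difference, assuming the input is valid UTF-8: strstr finds a multi-byte character because it compares whole byte sequences, whereas strchr can only look for one byte at a time. The names utf8_sample, utf8_find and utf8_length are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "naïve café" in UTF-8: ï is C3 AF, é is C3 A9. The literal is split so
   the hex escapes cannot swallow a following hex-digit letter. */
static const char utf8_sample[] = "na" "\xc3\xaf" "ve caf" "\xc3\xa9";

/* Byte offset of needle in haystack, or -1 if it does not occur. */
static long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}

/* Number of code points in a valid UTF-8 string: count every byte that is
   not a continuation byte (10xxxxxx). */
static size_t utf8_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

Here utf8_find(utf8_sample, "\xc3\xa9") locates é as a two-byte needle, while strchr(utf8_sample, '\xc3') merely stops at the first lead byte, which belongs to ï. Offsets and strlen are in bytes, so they differ from the character count returned by utf8_length.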

In FILEs, you have to negotiate with other programs how they will
interpret byte sequences. If all the applications assume UTF-8
encodings in FILEs, and they handle UTF-8 internally, then everything
will be fine.
 
Une bévue

SM Ryan said:
MacOSX file paths are UTF-8 encoding of Unicode (16 bit I think).
The file name length limit is the number of UTF-8 bytes.

You're right. Here is an extract from TN 2078, "Migrating from FSSpecs
to FSRefs":

struct FSRef {
    UInt8 hidden[80]; /* private to File Manager */
};

However, in the paragraph "FSRefs and long Unicode file names" they write:

OSErr FSRefGetName( const FSRef *fsRef, HFSUniStr255 *name )
{
    return( FSGetCatalogInfo(fsRef, kFSCatInfoNone, NULL, name, NULL, NULL) );
}

An HFSUniStr255 is defined as:

struct HFSUniStr255 {
    UInt16 length;        /* number of unicode characters */
    UniChar unicode[255]; /* unicode characters */
};

How file names are encoded

HFS+ disks store file names as UTF-16 in an Apple-modified form of
-------------------------------^^^^^^^
Normalization Form D (decomposed). This form excludes certain
compatibility decompositions and parts of the symbol blocks, in order to
assure round-trip of file names to Mac OS encodings (applications using
the HFS APIs assume they get the same bytes out that they put in).

Did I miss something?
 
