scanning UTF-8 characters

Kamal R. Prasad

Hello,

I am using a lexer (lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first non-ascii character's byte will be >
0x7f in a UTF-8 character. If I look for the same in yytext, will that
suffice? Is there some std function that one can use to operate on the
input stream? I want my code to be locale agnostic.

thanks
-kamal
 

Jack Klein

Kamal R. Prasad said:
Hello,

I am using a lexer (lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first non-ascii character's byte will be >
0x7f in a UTF-8 character. If I look for the same in yytext, will that
suffice? Is there some std function that one can use to operate on the
input stream? I want my code to be locale agnostic.

thanks
-kamal

Neither lex nor UTF-8 is defined by the C language. Information on
UTF-8 can be obtained from http://www.unicode.org. Questions about
lex can be asked in
 

Micah Cowan

Kamal R. Prasad said:
Hello,

I am using a lexer (lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first non-ascii character's byte will be >
0x7f in a UTF-8 character. If I look for the same in yytext, will that
suffice? Is there some std function that one can use to operate on the
input stream? I want my code to be locale agnostic.

Not really topical here in clc and clcm, I'm afraid. I've redirected
to comp.unix.programmer, where I believe you'll find more people able
to answer your question.

The /first/ byte of a multi-byte UTF-8 character will be greater than
0xC0. But, yeah, you should test for the high bit: /all/ of the bytes
in a multi-byte character are greater than 0x7f. The first byte also
encodes how many bytes there are, in total, for this character.
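
A minimal C sketch of that layout (the function name here is invented
for the example): the lead byte tells you the total length, and
continuation bytes always fall in the 0x80-0xBF range.

#include <stdio.h>

/* Total length in bytes of the UTF-8 sequence whose lead byte is b,
   or 0 if b cannot start a sequence (continuation byte or invalid). */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;   /* 0xxxxxxx: plain ASCII             */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte character        */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte character        */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte character        */
    return 0;                           /* 10xxxxxx continuation, or invalid */
}

int main(void)
{
    /* e-acute is 0xC3 0xA9 in UTF-8: a 2-byte character. */
    printf("%d\n", utf8_seq_len(0xC3));   /* 2 */
    printf("%d\n", utf8_seq_len(0xA9));   /* 0: continuation byte */
    return 0;
}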

As to how this fits in with lex, I'm not really qualified to say
much. Is it sufficient to look for the high bit? It depends on what
you intend to do after you've found one. And to be locale agnostic,
you'll probably need something to convert the locale's encoding into
UTF-8 before scanning.
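
One way to do that conversion (from the locale's encoding to UTF-8) is
POSIX iconv(). A rough sketch, assuming a POSIX system; iconv is not
part of standard C:

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_CTYPE, "");            /* honour the user's locale */

    /* Convert from whatever encoding the locale uses into UTF-8. */
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    char inbuf[] = "some input text";   /* whatever the lexer is about to see */
    char outbuf[256];
    char *in = inbuf, *out = outbuf;
    size_t inleft = strlen(inbuf), outleft = sizeof outbuf;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
        perror("iconv");
    else
        printf("converted to %zu bytes of UTF-8\n", sizeof outbuf - outleft);

    iconv_close(cd);
    return 0;
}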
 

Yang Jiao

I don't know whether a lexer (in my case, flex) can do anything to
identify UTF-8 characters; I'm afraid you will have to do that job in
your own code.
 

Jasen Betts

["Followup-To:" header set to comp.lang.c.moderated.]
Kamal R. Prasad said:
I am using a lexer (lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first non-ascii character's byte will be >
0x7f in a UTF-8 character. If I look for the same in yytext, will that
suffice?

In most cases there's a thing called windowing that can, IIRC,
substitute other symbols into the 0x00 to 0x7f range.

Kamal R. Prasad said:
Is there some std function that one can use to operate on the
input stream? I want my code to be locale agnostic.

If you treat bytes above 0x7f as if they were ordinary letters, and
make no assumptions about word length or display width, you should be
fairly safe.
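
In a hand-written scanner that might look like the sketch below; the
function name and the exact word class are invented for the example.

#include <ctype.h>
#include <stdio.h>

/* Treat any byte with the high bit set as part of a word, on the
   assumption that it belongs to a multi-byte UTF-8 character. */
static int is_word_byte(unsigned char c)
{
    return isalnum(c) || c == '_' || c >= 0x80;
}

int main(void)
{
    printf("%d %d %d\n",
           is_word_byte('a'),     /* 1 */
           is_word_byte(' '),     /* 0 */
           is_word_byte(0xC3));   /* 1: lead byte of a multi-byte character */
    return 0;
}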

If you're hoping to identify digits and punctuation in unusual scripts
(Chinese, Sinhala, Sanskrit, Klingon, etc.) you'll need to convert your
UTF-8 stream to Unicode code points and pass those to the lexer.


Bye.
Jasen
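
For that last point, a hand-rolled UTF-8 decoder can be quite small; the
sketch below is only illustrative and does not reject overlong or
out-of-range sequences, which real code should.

#include <stdint.h>
#include <stdio.h>

/* Decode one code point from the NUL-terminated string s into *cp and
   return the number of bytes consumed, or 0 on a malformed sequence. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    uint32_t c;
    size_t len, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }                    /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else return 0;                                        /* bad lead byte */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                                     /* bad continuation */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

int main(void)
{
    const unsigned char text[] = "a\xC3\xA9\xE2\x82\xAC";  /* "a", e-acute, euro sign */
    size_t i = 0;

    while (text[i] != '\0') {
        uint32_t cp;
        size_t n = utf8_decode(text + i, &cp);
        if (n == 0)
            break;                                        /* malformed input */
        printf("U+%04X\n", (unsigned)cp);
        i += n;
    }
    return 0;
}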
 

Douglas A. Gwyn

Kamal R. Prasad said:
I am using a lexer (lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first non-ascii character's byte will be >
0x7f in a UTF-8 character. If I look for the same in yytext, will that
suffice? Is there some std function that one can use to operate on the
input stream? I want my code to be locale agnostic.

You need to check that your version of "lex" supports wide characters,
which most do not. Otherwise you have to lex every possible character
into a token, which is almost certainly not what you want to do.

In most situations it is easier to hand-code a lexer than to use "lex",
and this is a case where that is even more likely to be true.

Convert the UTF-8 to 31-bit "Unicode" code points and handle characters
solely as "wide" characters throughout.
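
One standard-C way to sketch that pipeline is mbrtowc() from <wchar.h>;
note that it converts according to the active locale, so it assumes a
UTF-8 locale is in effect and is not fully locale-agnostic.

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    /* Assumes the environment selects a UTF-8 locale. */
    setlocale(LC_CTYPE, "");

    const char *input = "price: 12 \xE2\x82\xAC";   /* "price: 12" plus a euro sign, in UTF-8 */
    mbstate_t st;
    memset(&st, 0, sizeof st);

    const char *p = input;
    size_t left = strlen(input);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                          /* invalid or incomplete sequence */
        if (n == 0)
            n = 1;                          /* embedded NUL */
        printf("U+%04lX\n", (unsigned long)wc);   /* one "wide" character */
        p += n;
        left -= n;
    }
    return 0;
}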
 
