Wide character, portable function to parse words like O'Clock as one word?

David Mathog · Mar 21, 2008

In English, words like "O'Clock" contain an embedded character
which the C function iswpunct() classifies as punctuation. So
in order to tokenize a string of text containing this type of
word properly, one cannot simply use wcstok(); special
rules like "a quote immediately preceded and followed by an alphabetic
character is not treated as punctuation" must be added.
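
For English, the special rule just described can be sketched roughly as
follows. This is only a minimal sketch under stated assumptions: in_word()
and next_word() are hypothetical names, not standard functions; only the
ASCII apostrophe is handled; and what iswalpha() accepts depends on the
current locale, so non-ASCII text would need a setlocale() call first.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: nonzero if s[i] belongs to a word. Letters always
   do; an apostrophe does only when it sits directly between two letters. */
static int in_word(const wchar_t *s, size_t i)
{
    if (iswalpha((wint_t)s[i]))
        return 1;
    if (s[i] == L'\'' && i > 0 &&
        iswalpha((wint_t)s[i - 1]) && iswalpha((wint_t)s[i + 1]))
        return 1;
    return 0;
}

/* Hypothetical tokenizer: finds the next word at or after *pos, stores its
   starting index in *start, advances *pos past it, and returns its length.
   Returns 0 when no words remain. */
size_t next_word(const wchar_t *s, size_t *pos, size_t *start)
{
    size_t i = *pos;

    while (s[i] != L'\0' && !in_word(s, i))   /* skip separators */
        i++;
    *start = i;
    while (s[i] != L'\0' && in_word(s, i))    /* consume the word */
        i++;
    *pos = i;
    return i - *start;
}
```

Calling next_word() repeatedly with the same pos walks through the string
one word at a time. Unlike wcstok(), it does not modify the input, and it
treats a quote at the edge of a word (as in 'now') as a separator.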

What I'm wondering is whether there is a standard function to do this
somewhere in the "w" set of functions which were added for multilingual
support. I mean, I know what the rules are for English, but the whole
point of the wide characters is to support other languages portably, and
it would seem that somewhere in the LC_CTYPE information set this
information should be present and accessible. That said, I have yet to
find anything in there which seems appropriate. Is there such a function?

Thanks,

David Mathog

Eric Sosman · Mar 21, 2008

Nothing in the Standard suite, certainly. The problem
is a difficult one, because it seems to require knowledge
beyond the merely lexical. For example, the Boy Scouts were
founded by Robert Baden-Powell, and it's clear that the hyphen
does not separate two things: He had a compound name. But if
we encounter a reference to Cheyne-Stokes breathing, John Cheyne
and William Stokes were distinct people with independent names.

The whole business gives me the heebie-jeebies.

Morris Dovey · Mar 21, 2008

Might we not consider "Cheyne-Stokes" to be a single (compound)
adjective?

Eric Sosman · Mar 21, 2008

We might. But then, we might not! I think an attempt
to formulate an all-inclusive, one-size-fits-all lexical rule
for every circumstance is just chasing a will-o'-the-wisp.

Morris Dovey · Mar 21, 2008

I agree that a one-size-fits-all approach would be difficult, at the
very least.

Just out of curiosity, what is your objection to hyphenated
compound adjectives?

Eric Sosman · Mar 21, 2008

<off-topic>

None; I use 'em all the time. I was just trying to point
out that they raise problems for purely lexical attempts to
divide text into a stream of sensible "words." For example,
in a spelling checker you would probably want to decompose
"one-size-fits-all" into four words and check them separately,
while "will-o'-the-wisp" should probably be handled as a
single unit. A mixed strategy might be fruitful: Search the
dictionary for "tic-tac-toe," then for "tic" and "tac-toe,"
"tic-tac" and "toe," and finally for "tic," "tac," and "toe"
individually. (Even that's not perfect, because it would
accept "tit-tax-too," which is probably a misspelling but
could refer to a special fee levied on strip clubs.)
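
The mixed strategy can be sketched roughly as below. This is only an
illustration under stated assumptions: WORDS is a tiny hypothetical
stand-in for a real dictionary, and in_dict() and compound_ok() are
made-up names. The recursion tries the splits in a different order from
the one listed above, but it accepts the same set of inputs: a compound
passes if the whole thing, or every piece of some cut at its hyphens, is
in the dictionary.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a real dictionary: a tiny fixed word list. */
static const wchar_t *const WORDS[] = {
    L"one", L"size", L"fits", L"all",
    L"will-o'-the-wisp", L"tic", L"tac", L"toe"
};

/* Nonzero if the first len characters of s exactly match an entry. */
static int in_dict(const wchar_t *s, size_t len)
{
    for (size_t i = 0; i < sizeof WORDS / sizeof WORDS[0]; i++)
        if (wcslen(WORDS[i]) == len && wcsncmp(WORDS[i], s, len) == 0)
            return 1;
    return 0;
}

/* Nonzero if s[0..len) is a dictionary entry, or can be cut at hyphens
   into consecutive pieces that are all dictionary entries. */
int compound_ok(const wchar_t *s, size_t len)
{
    if (in_dict(s, len))
        return 1;
    for (size_t h = 0; h < len; h++)
        if (s[h] == L'-' && in_dict(s, h) &&
            compound_ok(s + h + 1, len - h - 1))
            return 1;
    return 0;
}
```

With this word list, "one-size-fits-all" passes as four separate words
and "will-o'-the-wisp" passes as a single unit. A real dictionary that
contained "tit," "tax," and "too" would accept "tit-tax-too" as well,
which is the imperfection noted above. The recursion can revisit the same
suffix many times, which is harmless for short compounds.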

It's been said that the hardest part of a spell checker
is tokenizing the text stream into a word stream.

</off-topic>

user923005 · Mar 21, 2008

Ispell does Unicode. I guess it would be a good starting point to see
how that sort of thing might be done:
http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html
