Wide character, portable function to parse words like O'Clock as one word?

David Mathog

In English, words like "O'Clock" contain an embedded character
which the C function iswpunct() classifies as punctuation. So
in order to tokenize a string of text containing this type of
word properly, one cannot simply use wcstok(); special
rules like "a quote immediately preceded and followed by an alphabetic
character is not treated as punctuation" must be added.
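
For illustration, here is a minimal sketch of that apostrophe rule in C. It is not a standard facility; word_length() and count_words() are hypothetical helper names, only the ASCII apostrophe is handled, and what counts as alphabetic depends on the current locale through iswalpha().

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: length of the word starting at s.
   A word is a run of alphabetic characters; an apostrophe is kept
   only when it sits directly between two alphabetic characters. */
size_t word_length(const wchar_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != L'\0') {
        if (iswalpha((wint_t)s[n])) {
            n++;
        } else if (s[n] == L'\'' && n > 0 && iswalpha((wint_t)s[n + 1])) {
            /* n > 0 here means s[n - 1] was alphabetic, so the
               apostrophe is embedded in the word. */
            n++;
        } else {
            break;
        }
    }
    return n;
}

/* Hypothetical helper: number of words in s under the rule above. */
size_t count_words(const wchar_t *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != L'\0') {
        size_t n = word_length(s);
        if (n > 0) {
            count++;
            s += n;
        } else {
            s++;    /* skip a separator or punctuation character */
        }
    }
    return count;
}
```

With this sketch, word_length(L"O'Clock now") covers all of "O'Clock", while a leading or trailing quote mark, as in 'quoted', is left out of the word.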

What I'm wondering is if there is a standard function to do this
somewhere in the "w" set of functions which were added for multilingual
support? I mean, I know what the rules are for English, but the whole
point of the wide characters is to support other languages portably, and
it would seem that somewhere in the LC_CTYPE information set this
information should be present and accessible. That said, I have yet to
find anything in there which seems appropriate. Is there such a function?

Thanks,

David Mathog
 
Eric Sosman

David said:
In English, words like "O'Clock" contain an embedded character
which the C function iswpunct() classifies as punctuation. So
in order to tokenize a string of text containing this type of
word properly, one cannot simply use wcstok(); special
rules like "a quote immediately preceded and followed by an alphabetic
character is not treated as punctuation" must be added.

What I'm wondering is if there is a standard function to do this
somewhere in the "w" set of functions which were added for multilingual
support? I mean, I know what the rules are for English, but the whole
point of the wide characters is to support other languages portably, and
it would seem that somewhere in the LC_CTYPE information set this
information should be present and accessible. That said, I have yet to
find anything in there which seems appropriate. Is there such a function?

Nothing in the Standard suite, certainly. The problem
is a difficult one, because it seems to require knowledge
beyond the merely lexical. For example, the Boy Scouts were
founded by Robert Baden-Powell, and it's clear that the hyphen
does not separate two things: He had a compound name. But if
we encounter a reference to Cheyne-Stokes breathing, John Cheyne
and William Stokes were distinct people with independent names.

The whole business gives me the heebie-jeebies.
 
Morris Dovey

Eric said:
Nothing in the Standard suite, certainly. The problem
is a difficult one, because it seems to require knowledge
beyond the merely lexical. For example, the Boy Scouts were
founded by Robert Baden-Powell, and it's clear that the hyphen
does not separate two things: He had a compound name. But if
we encounter a reference to Cheyne-Stokes breathing, John Cheyne
and William Stokes were distinct people with independent names.

The whole business gives me the heebie-jeebies.

Might we not consider "Cheyne-Stokes" to be a single (compound)
adjective?
 
Eric Sosman

Morris said:
Might we not consider "Cheyne-Stokes" to be a single (compound)
adjective?

We might. But then, we might not! I think an attempt
to formulate an all-inclusive, one-size-fits-all lexical rule
for every circumstance is just chasing a will-o'-the-wisp.
 
Morris Dovey

Eric said:
We might. But then, we might not! I think an attempt
to formulate an all-inclusive, one-size-fits-all lexical rule
for every circumstance is just chasing a will-o'-the-wisp.

I agree that the one-size-fits-all approach would be at least
difficult.

Just out of curiosity, what is your objection to hyphenated
compound adjectives?
 
Eric Sosman

Morris said:
I agree that the one-size-fits-all approach would be at least
difficult.

Just out of curiosity, what is your objection to hyphenated
compound adjectives?

<off-topic>

None; I use 'em all the time. I was just trying to point
out that they raise problems for purely lexical attempts to
divide text into a stream of sensible "words." For example,
in a spelling checker you would probably want to decompose
"one-size-fits-all" into four words and check them separately,
while "will-o'-the-wisp" should probably be handled as a
single unit. A mixed strategy might be fruitful: Search the
dictionary for "tic-tac-toe," then for "tic" and "tac-toe,"
"tic-tac" and "toe," and finally for "tic," "tac," and "toe"
individually. (Even that's not perfect, because it would
accept "tit-tax-too," which is probably a misspelling but
could refer to a special fee levied on strip clubs.)
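
A rough sketch of that mixed strategy in C, under stated assumptions: toy_dict is a hypothetical stand-in for a real word list, and in_dict() and compound_ok() are made-up names. The compound is accepted if it is in the dictionary as a whole, or if it can be cut at hyphens into pieces that are each in the dictionary, where a piece may itself contain hyphens.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical toy dictionary; a real checker would load a word list. */
static const wchar_t *const toy_dict[] = {
    L"one", L"size", L"fits", L"all", L"will-o'-the-wisp", L"tic-tac-toe"
};

/* True if the first len characters of s exactly match an entry. */
int in_dict(const wchar_t *s, size_t len)
{
    size_t count = sizeof toy_dict / sizeof toy_dict[0];

    for (size_t i = 0; i < count; i++) {
        if (wcslen(toy_dict[i]) == len && wcsncmp(toy_dict[i], s, len) == 0)
            return 1;
    }
    return 0;
}

/* True if s[0..len) is an entry, or splits at hyphens into entries.
   The whole string is tried first, then every hyphen as a cut point,
   with the remainder checked recursively. */
int compound_ok(const wchar_t *s, size_t len)
{
    assert(s != NULL);
    if (in_dict(s, len))
        return 1;
    for (size_t i = 1; i + 1 < len; i++) {
        if (s[i] == L'-' && in_dict(s, i)
            && compound_ok(s + i + 1, len - i - 1))
            return 1;
    }
    return 0;
}
```

With this toy dictionary, "one-size-fits-all" passes piece by piece and "will-o'-the-wisp" passes as a unit. The weakness noted above remains: any hyphenated string built from valid pieces, such as "one-all", is accepted too.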

It's been said that the hardest part of a spell checker
is tokenizing the text stream into a word stream.

</off-topic>
 
user923005

David said:
In English, words like "O'Clock" contain an embedded character
which the C function iswpunct() classifies as punctuation. So
in order to tokenize a string of text containing this type of
word properly, one cannot simply use wcstok(); special
rules like "a quote immediately preceded and followed by an alphabetic
character is not treated as punctuation" must be added.

What I'm wondering is if there is a standard function to do this
somewhere in the "w" set of functions which were added for multilingual
support? I mean, I know what the rules are for English, but the whole
point of the wide characters is to support other languages portably, and
it would seem that somewhere in the LC_CTYPE information set this
information should be present and accessible. That said, I have yet to
find anything in there which seems appropriate. Is there such a function?

Ispell does Unicode. I guess it would be a good starting point to see
how that sort of thing might be done:
http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html
 
