Need help parsing with pyparsing...

Just Another Victim of the Ambient Morality · Oct 22, 2007

I'm trying to parse with pyparsing but the grammar I'm using is somewhat
unorthodox. I need to be able to parse something like the following:

UPPER CASE WORDS And Title Like Words

...into two sentences:

UPPER CASE WORDS
And Title Like Words

I'm finding this surprisingly hard to do. The problem is that pyparsing
implicitly treats whitespace as ignorable and is (perhaps
necessarily) greedy with its term matching. All my attempts at the
described parsing either fail to parse or parse incorrectly, like so:

UPPER CASE WORDS A
nd Title Like Words

Frankly, I'm stuck. I don't know how to parse this grammar with
pyparsing.
Does anyone know how to accomplish what I'm trying to do?
Thank you...

Paul McGuire · Oct 22, 2007

Yes, whitespace skipping does get in the way sometimes. In your case,
you need to clarify that each word that is parsed must be followed by
whitespace. See the options and comments in the code below:

from pyparsing import *

data = "UPPER CASE WORDS And Title Like Words"

# Pick ONE of the three options below; each pair of assignments
# replaces the previous pair.

# Option 1 - qualify Word instance with asKeyword=True
upperCaseWord = Word(alphas.upper(), asKeyword=True)
titleLikeWord = Word(alphas.upper(), alphas.lower(), asKeyword=True)

# Option 2 - explicitly state that each upper case word must be followed
# by whitespace (the title words are left unqualified, since the last
# one is followed by the end of the string rather than by whitespace)
upperCaseWord = Word(alphas.upper()) + FollowedBy(White())
titleLikeWord = Word(alphas.upper(), alphas.lower())

# Option 3 - use regex's - note, still have to use lookahead to avoid
# matching 'A' in 'And'
upperCaseWord = Regex(r"[A-Z]+(?=\s)")
titleLikeWord = Regex(r"[A-Z][a-z]*")

# create grammar, with some friendly results names
grammar = (OneOrMore(upperCaseWord)("allCaps") +
           OneOrMore(titleLikeWord)("title"))

# dump out the parsed results
print(grammar.parseString(data).dump())

All three options print out:

['UPPER', 'CASE', 'WORDS', 'And', 'Title', 'Like', 'Words']
- allCaps: ['UPPER', 'CASE', 'WORDS']
- title: ['And', 'Title', 'Like', 'Words']

Once you have this, you can rejoin the words with " ".join, or
whatever you like.
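
As a rough sketch of that last step, the two named groups can be joined back into the two sentences the original question asked for. The names split_sentences, upper_case_word and title_like_word are made up for this illustration, and the word-boundary regexes are a variation on Option 3 rather than code taken from the post:

```python
from pyparsing import OneOrMore, Regex

# Assumption: word-boundary regexes standing in for the options above.
upper_case_word = Regex(r"\b[A-Z]+\b")
title_like_word = Regex(r"\b[A-Z][a-z]*\b")

grammar = (OneOrMore(upper_case_word)("allCaps") +
           OneOrMore(title_like_word)("title"))


def split_sentences(line):
    """Return (upper case sentence, title case sentence) for one line."""
    result = grammar.parseString(line)
    # Each named group holds a list of words; rejoin them with spaces.
    return " ".join(result["allCaps"]), " ".join(result["title"])


upper, title = split_sentences("UPPER CASE WORDS And Title Like Words")
print(upper)  # UPPER CASE WORDS
print(title)  # And Title Like Words
```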

-- Paul

Paul McGuire · Oct 22, 2007

By the way, are these possible data lines?:

A Line With No Upper Case Words
A LINE WITH NO TITLE CASE WORDS
SOME UPPER CASE WORDS A Title That Begins With A One Letter Word

-- Paul

Just Another Victim of the Ambient Morality · Oct 22, 2007

Paul McGuire said:
By the way, are these possible data lines?:

A Line With No Upper Case Words
A LINE WITH NO TITLE CASE WORDS
SOME UPPER CASE WORDS A Title That Begins With A One Letter Word

Thank you for your kind help!
Unfortunately, there are some ambiguities but, hopefully, they'll be
very rare. There will always be an uppercase section followed by
a non-uppercase section. So, your examples will parse like so:

A
Line With No Upper Case Words

...the second example will result in a parse error...

SOME UPPER CASE WORDS A
Title That Begins With A One Letter Word
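
These three outcomes can be checked with a small script. This is only a sketch: it assumes a grammar built from word-boundary regexes (similar in spirit to the options suggested earlier in the thread), and try_split is a hypothetical helper that returns None when the line does not parse:

```python
from pyparsing import OneOrMore, ParseException, Regex

# Assumption: same "uppercase section, then title section" grammar.
upper_case_word = Regex(r"\b[A-Z]+\b")
title_like_word = Regex(r"\b[A-Z][a-z]*\b")
grammar = (OneOrMore(upper_case_word)("allCaps") +
           OneOrMore(title_like_word)("title"))


def try_split(line):
    """Return (uppercase part, title part), or None on a parse error."""
    try:
        result = grammar.parseString(line)
    except ParseException:
        return None
    return " ".join(result["allCaps"]), " ".join(result["title"])


samples = [
    "A Line With No Upper Case Words",
    "A LINE WITH NO TITLE CASE WORDS",
    "SOME UPPER CASE WORDS A Title That Begins With A One Letter Word",
]
for sample in samples:
    # The one-letter word "A" is greedily claimed by the uppercase section.
    print(try_split(sample))
```

The second line fails because the uppercase section swallows every word, leaving nothing for the required title section.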

Occasional errors can be tolerated. The failure I originally posted
about happened all the time which, of course, is not tolerable. The
ambiguities you bring up, especially the last one, are interesting and I'm
not sure how to deal with them without an English grammatical analysis,
which is too much, especially if I'm to integrate it with pyparsing.
Another problem involves the ambiguity of numbers. Some more examples,
if you're interested:

FAHRENHEIT 451 2000 Copies Sold
1984 Book Of The Year

The last example is actually okay but the first one is honestly
ambiguous.
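
To make the number problem concrete, here is a sketch under one assumed rule: a bare number is allowed in either section, and the uppercase section matches greedily. That rule, and the names upper_token, title_token and split_line, are illustrative choices and not something settled in this thread:

```python
from pyparsing import OneOrMore, Regex

# Assumption: a number may appear in either section of the line.
upper_token = Regex(r"\b(?:[A-Z]+|\d+)\b")
title_token = Regex(r"\b(?:[A-Z][a-z]*|\d+)\b")
grammar = (OneOrMore(upper_token)("allCaps") +
           OneOrMore(title_token)("title"))


def split_line(line):
    """Return the two sections as lists of tokens."""
    result = grammar.parseString(line)
    return list(result["allCaps"]), list(result["title"])


# The leading number has to be the uppercase section, so this one is fine.
print(split_line("1984 Book Of The Year"))
# Greedy matching puts both numbers in the uppercase section; whether
# "2000" belongs there cannot be decided from the text alone.
print(split_line("FAHRENHEIT 451 2000 Copies Sold"))
```

Under this rule the ambiguous line always resolves the same way, which may or may not be the split a human reader would choose.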
Thanks again...

Hendrik van Rooyen · Oct 23, 2007

Paul McGuire said:
By the way, are these possible data lines?:

A Line With No Upper Case Words
A LINE WITH NO TITLE CASE WORDS
SOME UPPER CASE WORDS A Title That Begins With A One Letter Word

That last one is a killer, and comes under the heading of "cruel and unusual".

Try this:

THIS IS NONSENSE, SAY I A Fellow Needs A Break

I can't think of a way to handle these without explicitly programming for them.

- Hendrik

Hendrik van Rooyen · Oct 23, 2007

Just Another Victim of the Ambient Morality said:
FAHRENHEIT 451 2000 Copies Sold
1984 Book Of The Year

The last example is actually okay but the first one is honestly
ambiguous.

hey - Fahrenheit 451 - if my memory serves me correctly, by
Ray Bradbury, is a classic of SF. - firemen burn stuff instead of
putting fires out, and they have a marvellous mechanical dog
equipped with a syringe, for hunting down survivors...

And the title purports to be the temp at which paper spontaneously
combusts.

Worth a read, if you haven't yet.

So 'I' have no problem parsing the title, but I have to agree with the
statement above.

- Hendrik

Dennis Lee Bieber · Oct 23, 2007

Hendrik van Rooyen said:
hey - Fahrenheit 451 - if my memory serves me correctly, by
Ray Bradbury, is a classic of SF. - firemen burn stuff instead of
putting fires out, and they have a marvellous mechanical dog

Specifically -- they burn BOOKS (I believe any books, not just
fiction, though as I recall, the ending is one of the "survivors" with a
favorite fiction novel being picked up by a group of rebels -- who
retain the books via memory and oral tellings)

Which, in a way, leads one to 1984... as both "governments"
basically seek total control over the thought processes of the people:
F451 through limiting knowledge to what is presented via monitors [think
of the internet as the only source of information AND ALL content
supplied by the government]; 1984 through cameras monitoring all, and
"newspeak" [the ultimate in "politically correct" language].

And Python (the comedy group) would be the first victims of either
society <G>

--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
