Need help parsing with pyparsing...

Just Another Victim of the Ambient Morality

I'm trying to parse with pyparsing but the grammar I'm using is somewhat
unorthodox. I need to be able to parse something like the following:

UPPER CASE WORDS And Title Like Words

...into two sentences:

UPPER CASE WORDS
And Title Like Words

I'm finding this surprisingly hard to do. The problem is that pyparsing
implicitly treats whitespace as ignorable and is (perhaps necessarily)
greedy with its term matching. Every attempt at the described parsing
either fails to parse or parses incorrectly, like so:

UPPER CASE WORDS A
nd Title Like Words

Frankly, I'm stuck. I don't know how to parse this grammar with
pyparsing.
Does anyone know how to accomplish what I'm trying to do?
Thank you...
 
Paul McGuire

Yes, whitespace skipping does get in the way sometimes. In your case,
you need to clarify that each word that is parsed must be followed by
whitespace. See the options and comments in the code below:

from pyparsing import FollowedBy, OneOrMore, Regex, White, Word, alphas

data = "UPPER CASE WORDS And Title Like Words"

# The three options are alternatives: each pair of assignments below
# overrides the previous one, so keep only the option you want to use.

# Option 1 - qualify Word instance with asKeyword=True
upperCaseWord = Word(alphas.upper(), asKeyword=True)
titleLikeWord = Word(alphas.upper(), alphas.lower(), asKeyword=True)

# Option 2 - explicitly state that each upper case word must be followed
# by whitespace (the title words need no lookahead; requiring trailing
# whitespace there would drop the last word of the line)
upperCaseWord = Word(alphas.upper()) + FollowedBy(White())
titleLikeWord = Word(alphas.upper(), alphas.lower())

# Option 3 - use regex's - note, still have to use lookahead to avoid
# matching 'A' in 'And'
upperCaseWord = Regex(r"[A-Z]+(?=\s)")
titleLikeWord = Regex(r"[A-Z][a-z]*")

# create grammar, with some friendly results names
grammar = (OneOrMore(upperCaseWord)("allCaps") +
           OneOrMore(titleLikeWord)("title"))

# dump out the parsed results
print(grammar.parseString(data).dump())


All three options print out:

['UPPER', 'CASE', 'WORDS', 'And', 'Title', 'Like', 'Words']
- allCaps: ['UPPER', 'CASE', 'WORDS']
- title: ['And', 'Title', 'Like', 'Words']

Once you have this, you can rejoin the words with " ".join, or
whatever you like.
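
As a rough sketch of that last step (not part of the original code), the two named groups can be joined back into the two sentences. The helper name `split_sentences` is made up for illustration, and each section is wrapped in `Group` on the assumption that this keeps the named result a list even when a section holds a single word:

```python
from pyparsing import Group, OneOrMore, Regex

def split_sentences(line):
    """Return (upper_case_sentence, title_sentence) for one input line."""
    # Same regexes as Option 3: the lookahead keeps 'A' in 'And' from
    # being taken as an upper case word.
    upper_word = Regex(r"[A-Z]+(?=\s)")
    title_word = Regex(r"[A-Z][a-z]*")
    grammar = (Group(OneOrMore(upper_word))("allCaps") +
               Group(OneOrMore(title_word))("title"))
    result = grammar.parseString(line)
    # Each named group is a list of words; rejoin them with spaces.
    return " ".join(result["allCaps"]), " ".join(result["title"])

print(split_sentences("UPPER CASE WORDS And Title Like Words"))
```

For the sample line this should give the pair `UPPER CASE WORDS` and `And Title Like Words`.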

-- Paul
 
Paul McGuire

By the way, are these possible data lines?:

A Line With No Upper Case Words
A LINE WITH NO TITLE CASE WORDS
SOME UPPER CASE WORDS A Title That Begins With A One Letter Word

-- Paul
 
Just Another Victim of the Ambient Morality

Paul McGuire said:
By the way, are these possible data lines?:

A Line With No Upper Case Words
A LINE WITH NO TITLE CASE WORDS
SOME UPPER CASE WORDS A Title That Begins With A One Letter Word

Thank you for your kind help!
Unfortunately, there are some ambiguities but, hopefully, they'll be
very rare. There will always be an uppercase section followed by a
non-uppercase section. So, your examples will parse like so:

A
Line With No Upper Case Words

...the second example will result in a parse error...

SOME UPPER CASE WORDS A
Title That Begins With A One Letter Word
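
The rule described here (a greedy uppercase section, then a non-uppercase section, otherwise an error) can be sketched with the standard `re` module instead of pyparsing. This is only an illustration under that assumption; the pattern and the name `split_line` are hypothetical, and the `\b` after each title word is there so that an all-caps word cannot be consumed one letter at a time:

```python
import re

# One or more all-caps words (each followed by whitespace), then one or
# more title-like words (a capital followed by lowercase letters).
LINE = re.compile(r"((?:[A-Z]+\s+)+)((?:[A-Z][a-z]*\b\s*)+)")

def split_line(line):
    """Split a line into (uppercase_section, title_section) or raise ValueError."""
    match = LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError("no uppercase section followed by a title section: %r" % line)
    return match.group(1).strip(), match.group(2).strip()

for text in ("A Line With No Upper Case Words",
             "A LINE WITH NO TITLE CASE WORDS",
             "SOME UPPER CASE WORDS A Title That Begins With A One Letter Word"):
    try:
        print(split_line(text))
    except ValueError as err:
        print("parse error:", err)
```

The one-letter word ends up in the uppercase section simply because the first group is greedy, which is the behaviour described for the first and third lines.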

Occasional errors can be tolerated. My problem was that the failure I
posted happened all the time which, of course, is not tolerable. The
ambiguities you bring up, especially the last one, are interesting and I'm
not sure how to deal with them without an English grammatical analysis,
which is too much, especially if I'm to integrate it with pyparsing.
Another problem involves the ambiguity of numbers. Some more examples,
if you're interested:

FAHRENHEIT 451 2000 Copies Sold
1984 Book Of The Year

The last example is actually okay but the first one is honestly
ambiguous.
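
One way to see why the first line is ambiguous and the second is not is to list every split point that is consistent with the rule, treating a number as acceptable on either side. The sketch below is plain Python, not pyparsing, and the helper names (`candidate_splits` and the `is_*` checks) are invented for illustration:

```python
import re

def is_number(tok):
    return tok.isdigit()

def is_upper(tok):
    return re.fullmatch(r"[A-Z]+", tok) is not None

def is_title(tok):
    return re.fullmatch(r"[A-Z][a-z]*", tok) is not None

def candidate_splits(line):
    """List every (head, tail) split where head is uppercase-or-number
    words and tail is title-or-number words."""
    toks = line.split()
    splits = []
    for i in range(1, len(toks)):
        head, tail = toks[:i], toks[i:]
        if (all(is_upper(t) or is_number(t) for t in head) and
                all(is_title(t) or is_number(t) for t in tail)):
            splits.append((" ".join(head), " ".join(tail)))
    return splits

for text in ("FAHRENHEIT 451 2000 Copies Sold", "1984 Book Of The Year"):
    print(text, "->", candidate_splits(text))
```

A line with exactly one candidate can be accepted as is; a line with several could be logged for review rather than guessed at.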
Thanks again...
 
Hendrik van Rooyen

Paul McGuire said:
By the way, are these possible data lines?:

A Line With No Upper Case Words
A LINE WITH NO TITLE CASE WORDS
SOME UPPER CASE WORDS A Title That Begins With A One Letter Word

That last one is a killer, and comes under the heading of "cruel and unusual".

Try this:

THIS IS NONSENSE, SAY I A Fellow Needs A Break

I can't think of a way to handle these without explicitly programming for them.
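
A minimal sketch of what "explicitly programming for them" might look like: rather than deciding where a one-letter word belongs, flag the one-letter words that sit between the last multi-letter uppercase word and the first multi-letter non-uppercase word, so those lines can be checked by hand. The function name `boundary_candidates` is made up, and the heuristic is only an assumption about what would be useful; it does not treat numbers specially:

```python
import string

def boundary_candidates(line):
    """Return the one-letter words that could belong to either section."""
    # Strip punctuation such as the comma in 'NONSENSE,' before classifying.
    words = [w.strip(string.punctuation) for w in line.split()]
    words = [w for w in words if w]
    # Position of the last multi-letter all-caps word (-1 if there is none).
    last_upper = max((i for i, w in enumerate(words)
                      if len(w) > 1 and w.isupper()), default=-1)
    # Position of the first multi-letter word that is not all caps.
    first_title = min((i for i, w in enumerate(words)
                       if len(w) > 1 and not w.isupper()), default=len(words))
    return words[last_upper + 1:first_title]

print(boundary_candidates("THIS IS NONSENSE, SAY I A Fellow Needs A Break"))
print(boundary_candidates("UPPER CASE WORDS And Title Like Words"))
```

An empty list means the boundary is unambiguous under this heuristic; anything else marks a line worth a second look.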

- Hendrik
 
Hendrik van Rooyen

Just Another Victim of the Ambient Morality said:
FAHRENHEIT 451 2000 Copies Sold
1984 Book Of The Year

The last example is actually okay but the first one is honestly
ambiguous.

hey - Fahrenheit 451 - if my memory serves me correctly, by
Ray Bradbury, is a classic of SF. - firemen burn stuff instead of
putting fires out, and they have a marvellous mechanical dog
equipped with a syringe, for hunting down survivors...

And the title purports to be the temperature at which paper
spontaneously combusts.

Worth a read, if you haven't yet.

So 'I' have no problem parsing the title, but I have to agree with the
statement above.

- Hendrik
 
Dennis Lee Bieber

Hendrik van Rooyen said:
hey - Fahrenheit 451 - if my memory serves me correctly, by
Ray Bradbury, is a classic of SF. - firemen burn stuff instead of
putting fires out, and they have a marvellous mechanical dog

Specifically -- they burn BOOKS (I believe any books, not just
fiction, though as I recall, the ending is one of the "survivors" with a
favorite fiction novel being picked up by a group of rebels -- who
retain the books via memory and oral tellings)

Which, in a way, leads one to 1984... as both "governments"
basically seek total control over the thought processes of the people:
F451 through limiting knowledge to what is presented via monitors [think
of the internet as the only source of information AND ALL content
supplied by the government]; 1984 through cameras monitoring all, and
"newspeak" [the ultimate in "politically correct" language].

And Python (the comedy group) would be the first victims of either
society <G>

 
