Using Groups inside Braces with Regular Expressions

Chris · Jul 14, 2008

I'm trying to delimit sentences in a block of text by defining the
end-of-sentence marker as a period followed by a space followed by an
uppercase letter or end-of-string.

I'd imagine the regex for that would look something like:
[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)

However, Python keeps giving me an "unbalanced parenthesis" error for
the [^] part. If this isn't valid regex syntax, how else would I match
a block of text that doesn't match the delimiter pattern?

Thanks,
Chris

MRAB · Jul 14, 2008

What is the [^(?:[A-Z]|$)] part meant to be doing? Is it meant to be
matching everything up to the end of the sentence?

[...] is a character class, so Python is parsing the character class
as:

[^(?:[A-Z]|$)]
^^^^^^^^^^
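
To make the parse above concrete, here is a minimal sketch (the names PATTERN and error_message are only illustrative). The character class ends at the first "]", so the "|$)" that follows sits outside any group and the ")" has no matching "(". The exact wording of the message may vary between Python versions.

```python
import re

# The pattern from the original question, unchanged.
PATTERN = r'[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)'

try:
    re.compile(PATTERN)
    error_message = None
except re.error as exc:
    # The class is "[^(?:[A-Z]"; the later ")" is then unmatched.
    error_message = str(exc)

print(error_message)
```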

Chris · Jul 14, 2008

It was meant to include everything except the end-of-sentence pattern.
However, I just realized that I can simply replace it with ".*?"
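
A rough sketch of what that replacement might look like, assuming the ".*?" is simply placed in front of the original delimiter (the combined pattern and the sample text are assumptions, not taken from the post). Note that "." does not match a newline unless re.DOTALL is passed, so this version only handles sentences that stay on one line.

```python
import re

# Assumed combination: lazy "anything" followed by the original delimiter.
sentence = re.compile(r'.*?\.\s+(?=[A-Z]|$)')

found = sentence.findall('First one. Second one. ')
print(found)  # ['First one. ', 'Second one. ']
```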

John Machin · Jul 14, 2008

Misleading subject.

[] brackets or "square brackets"
{} braces or "curly brackets"
() parentheses or "round brackets"

I'm trying to delimit sentences in a block of text by defining the
end-of-sentence marker as a period followed by a space followed by an
uppercase letter or end-of-string.

.... which has at least two problems:

(1) You are insisting on at least one space between the period and the
end-of-string (this can be overcome, see later).
(2) Periods are often dropped in after abbreviations and contractions
e.g. "Mr. Geo. Smith". You will get three "sentences" out of that.
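
Both problems can be seen with a short sketch that uses only the delimiter half of the original pattern (the sample strings other than "Mr. Geo. Smith" are made up for illustration):

```python
import re

# Delimiter part of the pattern from the question.
delimiter = re.compile(r'\.\s+(?=[A-Z]|$)')

# (1) No whitespace after the final period, so nothing is split off
#     and the period stays attached.
no_trailing_space = delimiter.split('It ends here.')
print(no_trailing_space)  # ['It ends here.']

# (2) Each abbreviation is treated as the end of a sentence.
abbreviations = delimiter.split('Mr. Geo. Smith')
print(abbreviations)  # ['Mr', 'Geo', 'Smith']
```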

I'd imagine the regex for that would look something like:
[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)

However, Python keeps giving me an "unbalanced parenthesis" error for
the [^] part.

It's nice to know that Python is consistent with its error messages.

If this isn't valid regex syntax,

If? It definitely isn't valid syntax. The brackets should delimit a
character class. Either you are trying to cram a somewhat complicated
expression into a character class, or you should be using parentheses
instead. However, it's a bit hard to determine what you really meant
that part of the pattern to achieve.

how else would I match
a block of text that doesn't match the delimiter pattern?

Start from the top down:
A sentence is:
    anything (with some qualifications)
    followed by (but not including):
        a period
        followed by
            either
                1 or more whitespace characters then a capital letter
            or
                0 or more whitespace characters then end-of-string

So something like this might do the trick:

import re
sep = re.compile(r'\.(?:\s+(?=[A-Z])|\s*(?=\Z))')
print(sep.split('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. '))

['Hello', 'Mr', 'Chris X\nis here', 'IP addr 1.2.3.4', '']
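
For readers who want to map the pattern back onto the top-down description, the same regex can be written with re.VERBOSE so each piece carries a comment. This is only a restatement of the pattern above, not a different approach; the names sep_verbose and parts are illustrative.

```python
import re

sep_verbose = re.compile(r"""
    \.                  # a period
    (?:
        \s+ (?=[A-Z])   # 1 or more whitespace, then a capital (not consumed)
      |
        \s* (?=\Z)      # or 0 or more whitespace, then end-of-string
    )
""", re.VERBOSE)

parts = sep_verbose.split('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. ')
print(parts)
```

The output is the same list as above. It still shows problem (2): "Mr" is split off as its own "sentence". The trailing empty string comes from the separator matching at the very end of the text.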
