question about nasty regex

Peter · Apr 3, 2006

I'm wondering if someone can tell me whether the following set of
regex substitutions is possible. I want to convert parallel legal
citations into single citations. What I mean is, I want to change, e.g.:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
S. Ct. 394, 397, 96 L.Ed. 475 (1952)."

into:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."

Generally, the beginning pattern would consist of:

1. Two names, consisting of one or more words, always separated by a
"v."

2. One, two, or three citations, each of which always has a volume
number ("342") followed by a name, consisting of one or two word
units always ending with "." ("U.S."), followed by a page number ("429")

3. Each citation may contain a comma and a second page number (", 434")

4. Optionally, a parenthesized year ("(1952)")

5. A final "."

I am thinking this is impossible, but I thought that if it were
possible to translate this into Python code, someone here could put
me on the right track.

Thanks.

Tim Chase · Apr 3, 2006

Peter said:
What I mean is, I want to change, e.g.:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
S. Ct. 394, 397, 96 L.Ed. 475 (1952)."

into:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."

Generally, the beginning pattern would consist of:

1. Two names, consisting of one or more words, always separated by a
"v."

2. One, two, or three citations, each of which always has a volume
number ("342") followed by a name, consisting of one or two word
units always ending with "." ("U.S."), followed by a page number ("429")

3. Each citation may contain a comma and a second page number (", 434")

4. Optionally, a parenthesized year ("(1952)")

5. A final "."

>>> import re
>>> tests = [
...     'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
...     '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).',
...     'Joe v. Volcano, Fork, 123 Internet, et. al, 314 U.S. 123, 43, '
...     '88 S. Ct. 394, 397, 97 L.Ed. 459 (2005).',
...     'Grandma v. RIAA, 314 U.S. 123, 43, 88 S. Ct. 394, 397, '
...     '97 L.Ed. 459.']
>>> # groups: party 1, party 2, volume, pages, other citations, optional year
>>> r = re.compile(r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')
>>> results = [r.match(x) for x in tests]
>>> for x in range(0, 3):
...     print("Test %i" % x)
...     print("=" * 20)
...     print("\n".join(["%s: %s" % (a, results[x].group(b))
...                      for a, b in zip(["Party1", "Party2", "Court", "Pages",
...                                       "Extra", "Year"], range(1, 7))]))
...
Test 0
====================
Party1: Doremus
Party2: Board of Education of Hawthorne,
Court: 342
Pages: 429, 434,
Extra: 72 S. Ct. 394, 397, 96 L.Ed. 475
Year: (1952)
Test 1
====================
Party1: Joe
Party2: Volcano, Fork, 123 Internet, et. al,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: (2005)
Test 2
====================
Party1: Grandma
Party2: RIAA,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: None

Things get a little messy if one of the parties has digits
followed by whitespace, followed by "U.S." in their name,
such as a fictitious "99 U.S. Luftballoons". Caveat
regextor. There are also some places where trailing commas
end up in items if there are multiple parties. You may want
to strip them off too before reassembling them.

Reassemble the pieces as needed. Season to taste. Bake at
350 for 20-25 minutes until golden brown.
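
One possible way to do the reassembly, as a minimal sketch rather than part of the original post: it reuses the pattern above, strips the trailing commas mentioned earlier, and puts back only the "U.S." citation. The function name `to_single_citation` is hypothetical, and the check for a letter in the "Extra" group is an added assumption so that a citation which is already single is returned unchanged.

```python
import re

# Same pattern as in the session above: party 1, party 2, volume,
# pages, remaining citations, optional "(year)".
CITE = re.compile(
    r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')


def to_single_citation(text):
    """Keep only the "U.S." citation; return text unchanged if unsure."""
    m = CITE.match(text)
    if m is None:
        return text
    party1, party2, volume, pages, extra, year = m.groups()
    # Assumption: a parallel citation always contains a reporter name,
    # so "Extra" without any letter means there is nothing to drop.
    if not re.search(r'[A-Za-z]', extra):
        return text
    # Strip the trailing commas and spaces the pattern leaves behind.
    party2 = party2.rstrip(', ')
    pages = pages.rstrip(', ')
    result = "%s v. %s, %s U.S. %s" % (party1, party2, volume, pages)
    if year:
        result += " " + year
    return result + "."


if __name__ == "__main__":
    print(to_single_citation(
        'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
        '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'))
```

Like the pattern itself, this only recognizes citations whose first reporter is "U.S."; anything else is passed through untouched.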

HTH, or at least gets you on the path to regexp mangling.

-tkc

Peter · Apr 4, 2006

[snip regular expressions lesson]
Whoa. That is super-duper extra cool. Thank you *very* much.

Paul Rubin · Apr 4, 2006

Peter said:
[snip regular expressions lesson]
Whoa. That is super-duper extra cool. Thank you *very* much.

"Some people, when confronted with a problem, think ``I know, I'll use
regular expressions.'' Now they have two problems." --JWZ

Lawrence D'Oliveiro · Apr 4, 2006

Paul Rubin said:
"Some people, when confronted with a problem, think ``I know, I'll use
regular expressions.'' Now they have two problems." --JWZ

Regexes are good if you need a solution quickly, and you're not
processing large amounts of data on a regular basis. (How large is
large? When you're chewing through appreciable amounts of CPU time doing
it.)

Once you get to that point, it would be more efficient to hand-code your
own state machine to do the parsing. Of course, doing it in an (even
partially) interpreted language like Python or Perl would defeat the
point...

Peter Hansen · Apr 4, 2006

Lawrence said:
Regexes are good if you need a solution quickly, and you're not
processing large amounts of data on a regular basis. (How large is
large? When you're chewing through appreciable amounts of CPU time doing
it.)

But "need a solution quickly" in this group is usually interpreted as
saving programmer time, not CPU time. I wouldn't have been able to come
up with that monstrosity nearly as quickly as Tim did, and I wouldn't
even be able to understand it without significant study, and I
definitely would have trouble maintaining it a few months later when I
found a test case which it didn't handle properly. I also wouldn't even
have confidence that it worked perfectly without throwing a dozen test
cases at it...

On the other hand, I could code a hybrid or entirely non-regex solution
in five or ten minutes (with tests!), and it would be quite readable.
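
For illustration only, a hybrid of that kind might look like the sketch below. It is not code posted in this thread, and the names `first_citation_only`, `CITATION`, and `YEAR` are made up for the example. It assumes the citation format described in the original question: split on commas, use one small regex to recognize a "volume reporter page" piece, and keep the first such piece plus an optional second page number.

```python
import re

# A piece is a full citation if it is: volume, one or two word units
# ending in ".", then a page number (e.g. "342 U.S. 429", "72 S. Ct. 394").
CITATION = re.compile(r'^\d+\s+(?:\S+\.\s*){1,2}\d+$')
# Optional parenthesized year at the very end, e.g. " (1952)".
YEAR = re.compile(r'\s*(\(\d{4}\))$')


def first_citation_only(text):
    """Drop every citation after the first; return text unchanged if unsure."""
    body = text.strip()
    if not body.endswith('.'):
        return text
    body = body[:-1]                      # drop the final "."
    year = ''
    m = YEAR.search(body)
    if m:
        year = ' ' + m.group(1)
        body = body[:m.start()]
    pieces = [p.strip() for p in body.split(',')]
    for i, piece in enumerate(pieces):
        if CITATION.match(piece):
            break
    else:
        return text                       # no citation found
    kept = pieces[:i + 1]                 # names plus the first citation
    if i + 1 < len(pieces) and pieces[i + 1].isdigit():
        kept.append(pieces[i + 1])        # the optional second page number
    return ', '.join(kept) + year + '.'


if __name__ == "__main__":
    print(first_citation_only(
        'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
        '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'))
```

Unlike the single large pattern, this keeps whichever citation comes first rather than looking for "U.S." specifically, and it does not check for the "v." between the names.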

Lawrence said:
Once you get to that point, it would be more efficient to hand-code your
own state machine to do the parsing. Of course, doing it in an (even
partially) interpreted language like Python or Perl would defeat the
point...

The number of problems for which Python and Perl aren't fast enough is
far smaller than most people think, as is the number of problems for
which regular expressions are really a suitable solution.

-Peter
