question about nasty regex

Peter

I'm wondering if someone can tell me whether the following set of
regex substitutions is possible. I want to convert parallel legal
citations into single citations. What I mean is, I want to change, e.g.:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
S. Ct. 394, 397, 96 L.Ed. 475 (1952)."

into:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."

Generally, the beginning pattern would consist of:

1. Two names, consisting of one or more words, always separated by a
"v."

2. One, two, or three citations, each of which always has a volume
number ("342") followed by a name, consisting of one or two word
units always ending with "." ("U.S."), followed by a page number ("429")

3. Each citation may contain a comma and a second page number (", 434")

4. Optionally, a parenthesized year ("(1952)")

5. A final "."

I am thinking this is impossible, but I thought that if it were
possible to translate this into Python code, someone here could put
me on the right track.

Thanks.
 
Tim Chase

Peter said:
What I mean is, I want to change, e.g.:
"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
S. Ct. 394, 397, 96 L.Ed. 475 (1952)."

into:

"Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."

Generally, the beginning pattern would consist of:

1. Two names, consisting of one or more words, always separated by a
"v."

2. One, two, or three citations, each of which always has a volume
number ("342") followed by a name, consisting of one or two word
units always ending with "." ("U.S."), followed by a page number ("429")

3. Each citation may contain a comma and a second page number (", 434")

4. Optionally, a parenthesized year ("(1952)")

5. A final "."
>>> import re
>>> tests = [
...     'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
...     '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).',
...     'Joe v. Volcano, Fork, 123 Internet, et. al, 314 U.S. 123, 43, '
...     '88 S. Ct. 394, 397, 97 L.Ed. 459 (2005).',
...     'Grandma v. RIAA, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459.',
... ]
>>> r = re.compile(r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')
>>> names = ["Party1", "Party2", "Court", "Pages", "Extra", "Year"]
>>> for i, test in enumerate(tests):
...     m = r.match(test)
...     print("Test %i" % i)
...     print("=" * 20)
...     # Pair each label with the corresponding captured group.
...     for name, value in zip(names, m.groups()):
...         print("%s: %s" % (name, value))
...
Test 0
====================
Party1: Doremus
Party2: Board of Education of Hawthorne,
Court: 342
Pages: 429, 434,
Extra: 72 S. Ct. 394, 397, 96 L.Ed. 475
Year: (1952)
Test 1
====================
Party1: Joe
Party2: Volcano, Fork, 123 Internet, et. al,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: (2005)
Test 2
====================
Party1: Grandma
Party2: RIAA,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: None


Things get a little messy if one of the parties has digits
followed by whitespace, followed by "U.S." in their name,
such as a fictitious "99 U.S. Luftballoons". Caveat
regextor. There are also some places where trailing commas
end up in the captured items (see Party2 and Pages above).
You may want to strip them off before reassembling the pieces.

Reassemble the pieces as needed. Season to taste. Bake at
350 for 20-25 minutes until golden brown.
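
One way the reassembly might look, as a minimal sketch. It reuses the pattern from the session above unchanged; `to_single_citation` is a hypothetical helper name, not something from the thread. It assumes the citation to keep is always the "U.S." one, since that is what the pattern hard-codes.

```python
import re

# Pattern copied from the session above; the groups are
# party1, party2, volume, pages, extra citations, year.
CITATION = re.compile(
    r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$'
)


def to_single_citation(text):
    """Keep only the U.S. citation; return text unchanged if the pattern fails."""
    m = CITATION.match(text)
    if m is None:
        return text
    party1, party2, volume, pages, _extra, year = m.groups()
    # The page list is captured with a trailing ", " that has to go.
    pages = pages.strip().rstrip(',')
    # party2 keeps its trailing comma, which separates it from the citation.
    result = '%s v. %s %s U.S. %s' % (party1, party2, volume, pages)
    if year:
        result += ' ' + year
    return result + '.'


print(to_single_citation(
    'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
    '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'
))
```

For the Doremus example this prints the target string from the original question. Input that does not match is returned untouched, so it is easy to spot what the pattern missed.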

HTH, or at least gets you on the path to regexp mangling.

-tkc
 
Peter

[snip regular expressions lesson]
Whoa. That is super-duper extra cool. Thank you *very* much.
 
Paul Rubin

Peter said:
[snip regular expressions lesson]
Whoa. That is super-duper extra cool. Thank you *very* much.

"Some people, when confronted with a problem, think ``I know, I'll use
regular expressions.'' Now they have two problems." --JWZ
 
Lawrence D'Oliveiro

Paul Rubin said:
"Some people, when confronted with a problem, think ``I know, I'll use
regular expressions.'' Now they have two problems." --JWZ

Regexes are good if you need a solution quickly, and you're not
processing large amounts of data on a regular basis. (How large is
large? When you're chewing through appreciable amounts of CPU time doing
it.)

Once you get to that point, it would be more efficient to hand-code your
own state machine to do the parsing. Of course, doing it in an (even
partially) interpreted language like Python or Perl would defeat the
point...
 
Peter Hansen

Lawrence said:
Regexes are good if you need a solution quickly, and you're not
processing large amounts of data on a regular basis. (How large is
large? When you're chewing through appreciable amounts of CPU time doing
it.)

But "need a solution quickly" in this group is usually interpreted as
saving programmer time, not CPU time. I wouldn't have been able to come
up with that monstrosity nearly as quickly as Tim did, and I wouldn't
even be able to understand it without significant study, and I
definitely would have trouble maintaining it a few months later when I
found a test case which it didn't handle properly. I also wouldn't even
have confidence that it worked perfectly without throwing a dozen test
cases at it...

On the other hand, I could code a hybrid or entirely non-regex solution
in five or ten minutes (with tests!), and it would be quite readable.
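
A non-regex version could look roughly like the sketch below. This is only one possible reading of the idea, not code from the thread: `is_citation` and `first_citation_only` are hypothetical names, and the sketch assumes items are separated by a comma plus a single space and that a reporter name is one or two words each ending in ".".

```python
def is_citation(piece):
    """True for 'volume reporter page', e.g. '342 U.S. 429' or '72 S. Ct. 394'."""
    words = piece.split()
    if len(words) < 3:
        return False
    volume, reporter, page = words[0], words[1:-1], words[-1]
    return (volume.isdigit() and page.isdigit()
            and len(reporter) <= 2
            and all(word.endswith('.') for word in reporter))


def first_citation_only(text):
    """Drop every citation after the first; return text unchanged if none is found."""
    body = text.strip()
    if not body.endswith('.'):
        return text
    body = body[:-1].rstrip()

    # Peel off an optional trailing "(1952)".
    year = ''
    head, sep, tail = body.rpartition('(')
    if sep and tail.endswith(')') and tail[:-1].isdigit():
        year = ' (' + tail
        body = head.rstrip()

    pieces = body.split(', ')
    for i, piece in enumerate(pieces):
        if is_citation(piece):
            keep = pieces[:i + 1]
            # A bare number right after the citation is its second page number.
            if i + 1 < len(pieces) and pieces[i + 1].isdigit():
                keep.append(pieces[i + 1])
            return ', '.join(keep) + year + '.'
    return text


print(first_citation_only(
    'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
    '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'
))
```

Because it keeps whichever citation comes first rather than looking for "U.S." specifically, this sketch also accepts a lone citation with no second page number, such as "A v. B, 342 U.S. 429 (1952).".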

Lawrence said:
Once you get to that point, it would be more efficient to hand-code your
own state machine to do the parsing. Of course, doing it in an (even
partially) interpreted language like Python or Perl would defeat the
point...

The number of problems for which Python and Perl aren't fast enough is
far smaller than most people think, as is the number of problems for
which regular expressions are really a suitable solution. :)

-Peter
 
