NewB question on text manipulation

ProvoWallis

I'm totally stumped by this problem so I'm hoping someone can give me a
little advice or point me in the right direction.

I have a file that looks like this:

<SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 <SC>PROC
GUIDE<XC>92<LT>1(b)(1)

(i.e., <SC>[chapter name]<XC>[multiple or single book page
ranges]<SC>[chapter name]<XC>[multiple or single book page
ranges]<LT>[code])

but I want to change it so that it looks like this:

<1><SC>APPEAL<XC>40-24<LT>1(b)(1)
<1><SC>APPEAL<XC>40-46<LT>1(b)(1)
<1><SC>APPEAL<XC>42-46<LT>1(b)(1)
<1><SC>APPEAL<XC>42-48<LT>1(b)(1)
<1><SC>APPEAL<XC>42-62<LT>1(b)(1)
<1><SC>APPEAL<XC>42-63<LT>1(b)(1)
<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)

but I'm not at all sure how to do it.

I've come up with a simple function that will change the order of the
text but I'm not sure how to break out each page range onto its own line.

     import re

     def Switch(m):
          # Swap the text found on either side of <LT>.
          return '%s<LT>%s' % (m.group(2), m.group(1))

     # data holds the text read from the file
     data = re.sub(r'''<1>(.*?)<LT>(.*?)\n''', Switch, data)

But I'm still a long way from what I need.

Any pointers would be greatly appreciated.

Thanks,

Greg
 
Steve R. Hastings

I'll show my code first, then explain it.

-- cut here -- cut here -- cut here -- cut here -- cut here --
import re

s = "<SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 " + \
    "<SC>PROC GUIDE<XC>92<LT>1(b)(1)"

s_space = " "  # a single space
s_empty = ""  # empty string

pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

lst = []

while True:
    m = pat.search(s)
    if not m:
        break

    title = m.group(1).strip()
    xc = m.group(2)
    xc = xc.replace(s_space, s_empty)
    tup = (title, xc)
    lst.append(tup)
    s = pat.sub(s_empty, s, 1)

lt = s.strip()

for title, xc in lst:
    lst_pp = xc.split(";")
    for pp in lst_pp:
        print("<1><SC>%s<XC>%s%s" % (title, pp, lt))
-- cut here -- cut here -- cut here -- cut here -- cut here --

My strategy here is to divide the problem into two separate parts: first,
I collect all the data we need; then, I reformat the collected data and
print it in the desired format.

"pat" is a compiled regular expression.  It recognizes the SC and XC
codes, and collects the strings enclosed by those codes:

([^<]+)

The above regular expression means "any character that is not a '<'", "one
or more of them", and since it's in parentheses it's remembered so we can
collect it later.
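
As a small illustration of what those parentheses capture, here is a sketch that runs the same pattern against one piece of the sample data. The names "title" and "pages" are only for this example.

```python
import re

# Same pattern as "pat" above, written as a raw string.
pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

m = pat.search("<SC>PROC GUIDE<XC>92<LT>1(b)(1)")

# group(1) is the first parenthesized part, group(2) the second.
title = m.group(1)
pages = m.group(2)

print(title)  # PROC GUIDE
print(pages)  # 92
```

Each group stops at the next '<', which is why the trailing <LT> code is not swallowed by the second group.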

So we collect title and the XC page ranges.  We tidy them up a bit:
title.strip() will remove any leading or trailing white space from the
title.  The replace() on the XC string gets rid of any spaces; I'm
assuming that the spaces are optional and the semicolons are the real
separators here.

Now, we could save the title and XC string in two lists, but that would be
silly in Python.  It's easier to pair them up in a tuple, and save the
tuple in a single list.  You can do it in one line, but I made the tuple
explicit ("tup").
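
For comparison, here is a short sketch of the explicit form next to the one-line form. The sample values are made up for the example.

```python
lst = []

title = "APPEAL"
xc = "40-24;40-46"

# Explicit version, as in the program above.
tup = (title, xc)
lst.append(tup)

# One-line version: build the tuple right inside the call.
lst.append(("PROC GUIDE", "92"))

# Unpacking in the for statement gives both values back by name.
for title, xc in lst:
    print("%s -> %s" % (title, xc))
```

Either way the list ends up holding one (title, xc) pair per chapter.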

After we collect them, we use a sub() to chop the collected data out of
the source string.

A while loop runs until all the SC and XC values are collected; anything
left over is assumed to be the LT.
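
The search-then-chop loop is not the only way to do this. As an alternative sketch (a different technique from the loop in the program above), finditer() can walk over all the matches, and a single sub() can then remove them all at once. This assumes the same input shape as the example, with the sample string shortened here.

```python
import re

s = "<SC>APPEAL<XC>40-24; 40-46 <SC>PROC GUIDE<XC>92<LT>1(b)(1)"

pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

# Collect a (title, xc) tuple for every match, left to right.
lst = [(m.group(1).strip(), m.group(2).replace(" ", ""))
       for m in pat.finditer(s)]

# Remove every SC/XC group; what remains is taken to be the LT code.
lt = pat.sub("", s).strip()

print(lst)
print(lt)
```

I kept the while loop in the program because it shows each step separately, but the result should be the same for data shaped like the example.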

Now, we have all the data; it's easy enough to rearrange it.

We can convert the XC string into a list of page ranges just by calling
.split(";"), which will split on semicolons.  Loop over this list,
printing each time, and there you go.
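
Here is that last step on its own, as a sketch with sample values filled in by hand. The list "out" is only there so the result can be inspected.

```python
title = "APPEAL"
xc = "40-24; 40-46; 42-46 "
lt = "<LT>1(b)(1)"

# Drop the optional spaces, then split on the semicolons.
lst_pp = xc.replace(" ", "").split(";")

# One output line per page range.
out = ["<1><SC>%s<XC>%s%s" % (title, pp, lt) for pp in lst_pp]

for line in out:
    print(line)
```

Note that "lt" still carries its own <LT> tag here, which is why the format string does not add one.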

I'll leave packaging these up into tidy functions, reading the data from
the file, etc. as exercises for the reader. :-)
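
To give a rough idea of what that packaging might look like, here is one possible sketch. The function name expand_entry is made up, and it assumes an entry shaped like the example: one or more SC/XC groups followed by a single trailing LT code.

```python
import re

PAT = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")


def expand_entry(entry):
    """Return one output line per page range in a single entry.

    Assumes one or more <SC>...<XC>... groups followed by a single
    trailing <LT> code, as in the sample data.
    """
    pairs = []
    while True:
        m = PAT.search(entry)
        if not m:
            break
        pairs.append((m.group(1).strip(), m.group(2).replace(" ", "")))
        # Chop the group we just collected out of the string.
        entry = PAT.sub("", entry, 1)

    lt = entry.strip()  # whatever is left, e.g. "<LT>1(b)(1)"

    return ["<1><SC>%s<XC>%s%s" % (title, pp, lt)
            for title, xc in pairs
            for pp in xc.split(";")]


lines = expand_entry(
    "<SC>APPEAL<XC>40-24; 40-46 <SC>PROC GUIDE<XC>92<LT>1(b)(1)")
for line in lines:
    print(line)
```

If the file holds one such entry per line, the function could be called once for each line read from the file.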

If you have any questions on how this works or why I did things the way I
did, ask away.

Good luck!
 
ProvoWallis

Thanks very much for this; I really appreciate it. I've pasted what I've
got now, thanks to you.

I only have one issue that I can't figure out. When I print the new
string I'm getting all of the values in the lt list rather than just
the one that corresponds to the original entry.

E.g.,

My original data looks like this:

<1><SC>FAM LAW ENF<XC>259-232<LT>-687

<1><SC>APPEAL<XC>40-38; 40-44; 44-18; 45-15<LT>1

I want my output to look like this:

<1><SC>FAM LAW ENF<XC>259-232<LT>-687
<1><SC>APPEAL<XC>40-38<LT>1
<1><SC>APPEAL<XC>40-44<LT>1
<1><SC>APPEAL<XC>44-18<LT>1
<1><SC>APPEAL<XC>45-15<LT>1

But instead I'm getting this: all of the entries in the lt list are
being added to my string when I just want one. I'm not sure how to
select just the entry in the lt list that I want.

<1><SC>FAM LAW ENF<XC>259-232<LT>-687<LT>1
<1><SC>APPEAL<XC>40-38<LT>-687<LT>1
<1><SC>APPEAL<XC>40-44<LT>-687<LT>1
<1><SC>APPEAL<XC>44-18<LT>-687<LT>1
<1><SC>APPEAL<XC>45-15<LT>-687<LT>1


###


Here's what I've got so far:


import re

s_space = " " # a single space
s_empty = "" # empty string

pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

lst = []

while True:
    m = pat.search(s)
    if not m:
        break

    title = m.group(1).strip()
    xc = m.group(2)
    xc = xc.replace(s_space, s_empty)
    tup = (title, xc)
    lst.append(tup)
    s = pat.sub(s_empty, s, 1)

lt = s.strip()

for title, xc in lst:
    lst_pp = xc.split(";")
    for pp in lst_pp:
        print("<1><SC>%s<XC>%s%s" % (title, pp, lt))
 
Steve R. Hastings

I did not realize that each entry would have its own LT value. I had
thought that there were several sets of <SC> and <XC> with one <LT>. You
only showed one example...

I have modified the program to collect LT values at the same time it
collects SC and XC values. Also, it now collects whatever code appears
before the first SC code. I don't know what this code is for so I just
called the variable "before".

Notes on the code:

* Instead of doing this:

title = m.group(2)
title = title.strip()


I just do this:

title = m.group(2).strip()


You can apply string methods on any string, and it's convenient to do it
all in one line. There are several lines like that.


* There are two patterns to detect the LT code. The first one is for
finding it, and the second one is only for removing it. The second one
uses '^' to anchor the pattern, so it will only remove the LT code if the
LT code is the first thing in the string. The first pattern does not have
the '^' anchor so it will look ahead, past any number of <SC> codes, to
find the next <LT> code.

* Otherwise this is pretty much like the first version. It collects data,
saves it in a list, and then prints its output from the list.
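
To make the difference between the two LT patterns in the notes above concrete, here is a small sketch. The sample strings are taken from the test data; the names "found", "unchanged" and "removed" are only for this example.

```python
import re

pat_lt = re.compile(r"<LT>([^<]+)")          # finds the next LT anywhere
pat_lt_remove = re.compile(r"^<LT>([^<]+)")  # matches only at the start

s = "<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)"

# The unanchored pattern looks ahead past the SC and XC codes.
found = pat_lt.search(s).group(1)

# The anchored pattern leaves this string alone: it starts with "<1>".
unchanged = pat_lt_remove.sub("", s, 1)

# Once the LT code is first in the string, the anchored pattern removes it.
removed = pat_lt_remove.sub("", "<LT>1(b)(1)<1><SC>FAM LAW ENF<XC>259-232", 1)

print(found)      # 1(b)(1)
print(unchanged)  # <1><SC>PROC GUIDE<XC>92<LT>1(b)(1)
print(removed)    # <1><SC>FAM LAW ENF<XC>259-232
```

This is what lets several SC/XC groups in a row share the LT code that follows them, while the LT code is only thrown away after the last group before it has been collected.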


I am busy now, so I won't have any time to make any more versions of this
for you. I hope you can study what I have done and understand how to apply
the ideas to your problems. Good luck!


-- cut here -- cut here -- cut here -- cut here -- cut here --
import re

s = "<1><SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 " + \
"<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)" + \
"<1><SC>FAM LAW ENF<XC>259-232<LT>-687" + \
"<1><SC>APPEAL<XC>40-38; 40-44; 44-18; 45-15<LT>1"

s_space = " " # a single space
s_empty = "" # empty string

pat_sc = re.compile(r"\s*(<[^<]+)<SC>([^<]+)<XC>([^<]+)")
pat_lt = re.compile(r"<LT>([^<]+)")
pat_lt_remove = re.compile(r"^<LT>([^<]+)")

lst = []
lt = None

while True:
    m = pat_sc.search(s)
    if not m:
        break

    before = m.group(1).strip()
    title = m.group(2).strip()
    xc = m.group(3).replace(s_space, s_empty)

    s = pat_sc.sub(s_empty, s, 1)

    m = pat_lt.search(s)
    if m:
        lt = m.group(1)
        lt = lt.strip()

    s = pat_lt_remove.sub(s_empty, s, 1)

    tup = (before, title, xc, lt)
    lst.append(tup)

for before, title, xc, lt in lst:
    lst_pp = xc.split(";")
    for pp in lst_pp:
        print("%s<SC>%s<XC>%s<LT>%s" % (before, title, pp, lt))
-- cut here -- cut here -- cut here -- cut here -- cut here --
 
ProvoWallis

Thanks again and sorry about the lack of examples. It didn't even occur
to me that my example wasn't comprehensive enough when I posted my
first message but I can see the issue now.

Your solution is really helpful for me to see. I can't tell you how
much I appreciate it. I thought that adding more values to the tuple
was the way to go but couldn't get my mind around how to capture the
info that I needed.

Thanks!
 
