Python Regular Expressions

A

Andy Barnes

Hi,

I am hoping someone here can help me with a problem I'd like to
resolve with Python. I have used it before for some other projects but
have never needed to use Regular Expressions before. It's quite
possible I am following completley the wrong tack for this task (so
any advice appreciated).

I have a source file in csv format that follows certain rules. I
basically want to parse the source file and spit out a second file
built from some rules and the content of the first file.

Source File Format:

Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

Distil Mana, Lore, n/a, 70, 38, 21
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
Talismantic Lore, Lore, n/a, 150, 100, 50
Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore,
Theurgic Lore

The input file I have has over 700 unique entries. I have tried to
cover the four main exceptions above. Before I detail them - this is
what I would like the above input file, to be output as (dot
diagramming language incase anyone recognises it):

Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

DistilMana [label="{ Distil Mana |{Type|7}|{70|38|21}}"];
TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;
TalismanticLore [label="{ Talismantic Lore |{Lore|n/a}|{150|100|
50}}"];
AdvanvedTalismanticLore [label="{ Advanced Talismantic Lore |{Lore|n/
a}|{100|60|30}}"];
TalismanticLore -> AdvanvedTalismanticLore;
TheurgicLore -> AdvanvedTalismanticLore;

It's quite a complicated find and replace operation that can be broken
down into some easy stages. The main thing the sample above showed was
that some of the entries won't list any prerequisits - these only need
the descriptor entry creating. Some of them have more than one
prerequisite. A line is needed for each prerequisite listed, linking
it to it's parent.

You can also see that the 'name' needs to have spaces removed and it's
repeated a few times in the process. I Hope it's easy to see what I am
trying to achieve from the above. I'd be very happy to accept
assistance in automating the conversion of my ever expanding csv file,
into the dot format described above.

Andy
 
A

Andy Barnes

to expand. I have parsed one of the lines manually to try and break
the process I'm trying to automate down.

source:
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana

output:
TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;

This is the steps I would take to do this conversion manually:

1) Take everything prior to the first comma and remove all the spaces,
insert it into a newline:

TheurgicLore

2) append the following string ' [label="{ '

TheurgicLore [label="{

3) append everything prior to the first comma (this time we don't need
to remove the spaces)

TheurgicLore [label="{ Theurgic Lore

4) append the following string ' |{'

TheurgicLore [label="{ Theurgic Lore |{

5) append everything between the 1st and 2nd comma of the source file
followed by a '|'

TheurgicLore [label="{ Theurgic Lore |{Lore|

6) append everything between the 2nd and 3rd comma of the source file
followed by a '}|{'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{

7) append everything between the 3rd and 4th comma of the source file
followed by a '|'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|

8) append everything between the 4th and 5th comma of the source file
followed by a '|'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|

9) append everything between the 5th and 6th comma of the source file
followed by a '}}"];'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];

Those 9 steps spit out my fist line of output file as above
"TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];" I
now have to parse the dependancies onto a newline.

# this next process needs to be repeated for each prerequisite, so if
there are two pre-requisites it would need to keep parsing for more
comma's.
1a) take everything between the 6th and 7th comma and put it at the
start of a new line (remove spaces)

DistilMana

2a) append '-> '

DistilMana ->

3a) append everything prior to the first comma, with spaces removed

DistilMana -> TheurgicLore

This should now be all the steps to spit out:

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;
 
N

Neil Cerutti

to expand. I have parsed one of the lines manually to try and break
the process I'm trying to automate down.

source:
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana

output:
TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;

This is the steps I would take to do this conversion manually:

It seems to me that parsing the file into an intermediate model
and then using that model to serialize your output would be
easier to understand and more robust than modifying the csv
entries in place. It decouples deciphering the meaning of the
data from emitting the data, which is more robust and expansable.

The amount of ingenuity required is less, though. ;)
 
P

Peter Otten

Andy said:
Hi,

I am hoping someone here can help me with a problem I'd like to
resolve with Python. I have used it before for some other projects but
have never needed to use Regular Expressions before. It's quite
possible I am following completley the wrong tack for this task (so
any advice appreciated).

I have a source file in csv format that follows certain rules. I
basically want to parse the source file and spit out a second file
built from some rules and the content of the first file.

Source File Format:

Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

Distil Mana, Lore, n/a, 70, 38, 21
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
Talismantic Lore, Lore, n/a, 150, 100, 50
Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore,
Theurgic Lore

The input file I have has over 700 unique entries. I have tried to
cover the four main exceptions above. Before I detail them - this is
what I would like the above input file, to be output as (dot
diagramming language incase anyone recognises it):

Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

DistilMana [label="{ Distil Mana |{Type|7}|{70|38|21}}"];
TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;
TalismanticLore [label="{ Talismantic Lore |{Lore|n/a}|{150|100|
50}}"];
AdvanvedTalismanticLore [label="{ Advanced Talismantic Lore |{Lore|n/
a}|{100|60|30}}"];
TalismanticLore -> AdvanvedTalismanticLore;
TheurgicLore -> AdvanvedTalismanticLore;

It's quite a complicated find and replace operation that can be broken
down into some easy stages. The main thing the sample above showed was
that some of the entries won't list any prerequisits - these only need
the descriptor entry creating. Some of them have more than one
prerequisite. A line is needed for each prerequisite listed, linking
it to it's parent.

You can also see that the 'name' needs to have spaces removed and it's
repeated a few times in the process. I Hope it's easy to see what I am
trying to achieve from the above. I'd be very happy to accept
assistance in automating the conversion of my ever expanding csv file,
into the dot format described above.

Forget about regexes. If there's any complexity it's in writing the output
rather than reading the input file. You can tackle that by putting your data
into a dictionary and using a format string:

import sys

def camelized(s):
return "".join(s.split())

template = """%(camel)s [label="{ %(name)s |{%(type)s|%(create)s}|
{%(study)s|%(read)s|%(teach)s}}"];"""

def process(instream, outstream):
instream = (line for line in instream if not (line.isspace() or
line.startswith("#")))
rows = (map(str.strip, line.split(",")) for line in instream)
headers = map(str.lower, next(rows))
for row in rows:
rowdict = dict(zip(headers, row))
camel = rowdict["camel"] = camelized(rowdict["name"])
print template % rowdict
for for_lack_of_better_name in row[len(headers)-1:]:
print "%s -> %s;" % (camelized(for_lack_of_better_name), camel)

if __name__ == "__main__":
from StringIO import StringIO
instream = StringIO("""\
Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

Distil Mana, Lore, n/a, 70, 38, 21
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
Talismantic Lore, Lore, n/a, 150, 100, 50
Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore,
Theurgic Lore
""")
process(instream, sys.stdout)
 
D

Dennis Lee Bieber

I have a source file in csv format that follows certain rules. I

So why not use the CSV module to read&split the fields...
It's quite a complicated find and replace operation that can be broken
down into some easy stages. The main thing the sample above showed was
that some of the entries won't list any prerequisits - these only need
the descriptor entry creating. Some of them have more than one
prerequisite. A line is needed for each prerequisite listed, linking
it to it's parent.
That's output formatting, I won't bother trying to create an
algorithm for that...
You can also see that the 'name' needs to have spaces removed and it's
repeated a few times in the process. I Hope it's easy to see what I am

strippedName = "".join(originalName.split())
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,053
Messages
2,570,431
Members
47,075
Latest member
TysonV438

Latest Threads

Top