Not homework... I'm not even in school (do any universities even teach
classes using Python?). I'm just not a programmer. Anyway, I should
probably be clearer about what I'm trying to do.
Ok, not homework.
Since I can't show the actual output file, let's say I had an output file
that looked like this:
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
Now I want to put this block (and every other block that starts with "Person: Jimmy")
Person: Jimmy
Current Location: Denver
Next Location: Chicago
in a file called jimmy.txt
and the same for Sarah in sarah.txt
The code I currently have looks something like this:
import re
import sys

person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
person_sarah = open('sarah.txt', 'w') #creates sarah.txt
f = open(sys.argv[1]) #opens output file

#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

#closes all files
person_jimmy.close()
person_sarah.close()
f.close()
However, this only produces output files that look like this:
jimmy.txt:
aaaaa bbbbb Person: Jimmy
sarah.txt:
aaaaa bbbbb Person: Sarah
My question is: what else do I need to add (such as a nested loop
where the if statements are?) so the files look like this:
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
and
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
Basically, I need to add statements that, after finding that line, copy
all the lines following it and stop when they reach
'----------------------------------------------'
Any help is greatly appreciated.
OK, I generalized on your theme of extracting file chunks to named files,
where the beginning line gives the file name. I hardcoded the '.txt' extension
and provided a way to direct the output to a directory (not necessarily a
subdirectory, I guess).
Not tested beyond what you see. Tweak to suit.
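
For the narrower question of what to add to the original loop, one option is to
remember which person's block is currently being copied and keep copying until
the dashed line shows up. The following is only a sketch, written in Python 3
syntax rather than the Python 2 of the script below. The name split_people and
the dict of collected lines are made up for illustration; they are not part of
the original code.

```python
import re

def split_people(lines, names=("Jimmy", "Sarah"), stop="-" * 30):
    """Collect each person's block of lines, keyed by lower-cased name.

    Hypothetical helper: it returns a dict instead of writing files so the
    logic is easy to check; replace the list appends with file writes as needed.
    """
    start_rx = re.compile(r"Person:\s+(%s)\b" % "|".join(map(re.escape, names)))
    chunks = {}
    current = None              # list currently receiving lines, or None
    for line in lines:
        if line.startswith(stop):
            current = None      # separator ends the block and is not copied
            continue
        match = start_rx.search(line)
        if match:
            # Repeated "Person: X" blocks are appended to the same list.
            current = chunks.setdefault(match.group(1).lower(), [])
        if current is not None:
            current.append(line)
    return chunks
```

Calling split_people(open(sys.argv[1])) and then writing chunks['jimmy'] to
jimmy.txt would be one way to wire it into the original program.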
----< extractfilesegs.py >--------------------------------------------------------
"""
Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
    where source is -tf for test file, a file name, or an open file
          outdir is a directory prefix that will be joined to output file names
          startpat is a regular expression with group 1 giving the extracted file name
          endpat is a regular expression whose match line is excluded and ends the segment
"""
import re, os

def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    if isinstance(linesrc, basestring): linesrc = open(linesrc)
    lineit = iter(linesrc)
    files = []
    for line in lineit:
        match = rxstart.search(line)
        if not match: continue
        name = match.group(1)
        filename = name.lower() + '.txt'
        filename = os.path.join(outdir, filename)
        #print 'opening file %r'%filename
        files.append(filename)
        fout = open(filename, 'a') # append in case repeats?
        fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
        for data_line in lineit:
            if rxstop.search(data_line):
                #print 'closing file %r'%filename
                fout.close() # don't write line with ending mark
                fout = None
                break
            else:
                fout.write(data_line)
        if fout:
            fout.close()
            print 'file %r ended with source file EOF, not stop mark'%filename
    return files

def get_testfile():
    from StringIO import StringIO
    return StringIO("""\
....irrelevant leading
stuff ...
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
irrelevant
trailing stuff ...

with a blank line
""")

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if not args: raise SystemExit(__doc__)
    tf = args.pop(0)
    if tf=='-tf': fin = get_testfile()
    else: fin = tf
    if not args:
        files = extractFileSegs(fin)
    elif len(args)==1:
        files = extractFileSegs(fin, args[0])
    elif len(args)==2:
        files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
    else:
        files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
    print '\nFiles created:'
    for fname in files:
        print ' "%s"'% fname
    if tf == '-tf':
        for fpath in files:
            print '====< %s >====\n%s============'%(fpath, open(fpath).read())
----------------------------------------------------------------------------------
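
The inner "for data_line in lineit" loop works because both loops draw from the
same iterator, so the outer loop resumes right after the stop line instead of
rescanning from the top. A tiny illustration of that behaviour, in Python 3
syntax with made-up data (the function name segments and the marker strings are
purely illustrative):

```python
def segments(items, start, stop):
    """Yield lists of the items found between a start and a stop marker."""
    it = iter(items)            # one iterator shared by both loops
    for item in it:
        if item != start:
            continue            # skip anything outside a segment
        seg = []
        for inner in it:        # continues where the outer loop left off
            if inner == stop:
                break           # stop marker is consumed but not kept
            seg.append(inner)
        yield seg

result = list(segments(["x", "BEGIN", 1, 2, "END", "y", "BEGIN", 3, "END"],
                       "BEGIN", "END"))
```

If the data runs out before a stop marker appears, the inner loop simply ends
and the partial segment is still yielded, which mirrors the "ended with source
file EOF" case in the script.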
Running on your test data:
[15:19] C:\pywk\clp>md extracteddata
[15:19] C:\pywk\clp>py24 extractfilesegs.py -tf
Files created:
"extracteddata\jimmy.txt"
"extracteddata\sarah.txt"
====< extracteddata\jimmy.txt >====
Person: Jimmy
Current Location: Denver
Next Location: Chicago
============
====< extracteddata\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============
[15:20] C:\pywk\clp>md xd
[15:20] C:\pywk\clp>py24 extractfilesegs.py -tf xd (Jimmy) ----
Files created:
"xd\jimmy.txt"
====< xd\jimmy.txt >====
Jimmy
Current Location: Denver
Next Location: Chicago
============
[15:21] C:\pywk\clp>py24 extractfilesegs.py -tf xd "Person: (Sarah)" ----
Files created:
"xd\sarah.txt"
====< xd\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============
[15:22] C:\pywk\clp>py24 extractfilesegs.py -tf xd "^(irrelevant)"
Files created:
"xd\irrelevant.txt"
====< xd\irrelevant.txt >====
irrelevant
trailing stuff ...
============
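
The first line of each extracted file differs between the runs above because
the script writes match.group(0), the whole matched text, while the file name
comes from match.group(1). A short check of that distinction, in Python 3
syntax, using a line from the test data (the variable names are only for
illustration):

```python
import re

line = "aaaaa bbbbb Person: Jimmy\n"

m_default = re.search(r"Person:\s+(\w+)", line)   # the script's default start pattern
m_name_only = re.search(r"(Jimmy)", line)         # the pattern from the second run

# group(0) is everything the pattern matched; group(1) is the first
# parenthesised group, which the script lower-cases into the file name.
whole_default, name_default = m_default.group(0), m_default.group(1)
whole_name_only, name_name_only = m_name_only.group(0), m_name_only.group(1)
```

If the "aaaaa bbbbb" prefix is wanted in the output, writing the full line
instead of match.group(0) would be one way to keep it.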
HTH, NO WARRANTIES ;-)
Regards,
Bengt Richter