Parsing text

sicvic

I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

Thanks,
Victor
 
Peter Hansen

sicvic said:
I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

That's a good start. Maybe you could post the code that you've already
got that does this, and people could comment on it and help you along.
(I'm suggesting that partly because this almost sounds like homework,
but you'll benefit more by doing it this way than just by having an
answer handed to you whether this is homework or not.)

-Peter
 
Noah

sicvic said:
I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"
...
Thanks,
Victor

You did not specify the "key phrase" that you are looking for, so for
the sake of this example I will assume that it is "key phrase".
I assume that you don't want "key phrase" or "---------------------" to
be returned as part of your match, so we use minimal group matching (.*?)
You also want your regular expression to use the re.DOTALL flag because
this is how you match across multiple lines. The simplest way to set this
flag is to put it at the front of your regular expression using the (?s)
notation.

This gives you something like this:

import re
print re.findall("(?s)key phrase(.*?)---------------------",
                 your_string_to_search)[0]

So what that basically says is:
1. Let "." match newlines too, so the match can span lines (?s)
2. Match "key phrase"
3. Capture the group matching everything in between, minimally (.*?)
4. Match "---------------------"
5. Print the first match in the list [0]

Yours,
Noah
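
For anyone trying this on Python 3, here is a minimal sketch of the same regex idea. It differs from the one-liner above in one assumed detail: the capture group starts at the key phrase, so the matching line is kept (which is what the original question asked for). The sample text and the names SEP, text and captured are made up for illustration.

```python
import re

# Hypothetical separator and sample text, standing in for the real file contents.
SEP = "-" * 21
text = (
    "header line\n"
    "key phrase: first line\n"
    "second line\n"
    + SEP + "\n"
    "trailer line\n"
)

# re.DOTALL lets "." match newlines; ".*?" is non-greedy, so the match
# stops at the first separator rather than the last one.
pattern = re.compile(r"(key phrase.*?)" + re.escape(SEP), re.DOTALL)

match = pattern.search(text)
captured = match.group(1) if match else ""
print(captured, end="")
```

Using search() and checking for None avoids the IndexError that indexing findall(...)[0] would raise when nothing matches.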
 
Bengt Richter

sicvic said:
I was wondering if there's a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of "---------------------"

Right now I can only have python write just the line the key phrase is
found in.

This sounds like homework, so just a (big) hint: have a look at itertools
dropwhile and takewhile. The solution is potentially a one-liner, depending
on your matching criteria (e.g., case-sensitive fixed string vs regular expression).

Regards,
Bengt Richter
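
To make the hint concrete, here is one possible reading of it as a Python 3 sketch: dropwhile skips lines until the key phrase shows up, and takewhile keeps lines until the separator. The helper name segment, the stop_prefix parameter and the sample lines are assumptions for illustration, and this only picks up the first matching block.

```python
from itertools import dropwhile, takewhile

# Hypothetical sample input, modelled on a dashed-separator report.
lines = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
    "-" * 46 + "\n",
]

def segment(lines, key, stop_prefix="-----"):
    """Return the first line containing key plus the lines before the next separator."""
    # Skip everything until a line contains the key phrase.
    from_key = dropwhile(lambda ln: key not in ln, lines)
    # Keep lines until one starts with the separator prefix.
    return list(takewhile(lambda ln: not ln.startswith(stop_prefix), from_key))

sarah = segment(lines, "Person: Sarah")
print("".join(sarah), end="")
```

Because both functions are lazy, the same call should work on an open file object as well as on a list.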
 
sicvic

Not homework...not even in school (do any universities even teach
classes using python?). I'm just not a programmer. Anyway, I should
probably be clearer about what I'm trying to do.

Since I can't show the actual output file, let's say I had an output file
that looked like this:

aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------

Now I want to put the following (and all recurrences of "Person: Jimmy")

Person: Jimmy
Current Location: Denver
Next Location: Chicago

in a file called jimmy.txt

and the same for Sarah in sarah.txt

The code I currently have looks something like this:

import re
import sys

person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
person_sarah = open('sarah.txt', 'w') #creates sarah.txt

f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

#closes all files

person_jimmy.close()
person_sarah.close()
f.close()

However, this only produces output files that look like this:

jimmy.txt:

aaaaa bbbbb Person: Jimmy

sarah.txt:

aaaaa bbbbb Person: Sarah

My question is what else do I need to add (such as an embedded loop
where the if statements are?) so the files look like this

aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago

and

aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York


Basically I need to add statements that, after finding that line, copy
all the lines following it and stop when they see
'----------------------------------------------'

Any help is greatly appreciated.
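
A minimal sketch of the "keep copying until the separator" idea, assuming Python 3. It keeps the shape of the loop above but remembers which output is currently active. The function name split_by_person and the in-memory io.StringIO outputs are assumptions made so the example runs without touching the disk; real files opened with open(..., 'w') could be passed in the same dictionary.

```python
import io

def split_by_person(lines, outputs, separator_prefix="-----"):
    """Copy each block to the output whose name appears on its 'Person:' line."""
    current = None  # output being written to, or None when between blocks
    for line in lines:
        if line.startswith(separator_prefix):
            current = None  # separator ends the block and is not copied
            continue
        for name, out in outputs.items():
            if "Person: " + name in line:
                current = out  # a new block starts on this line
        if current is not None:
            current.write(line)

# Hypothetical sample data shaped like the example file in this post.
sample = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
    "-" * 46 + "\n",
]

outputs = {"Jimmy": io.StringIO(), "Sarah": io.StringIO()}
split_by_person(sample, outputs)
print(outputs["Jimmy"].getvalue(), end="")
```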
 
rzed

Something like this, maybe?

"""
This iterates through a file, with subloops to handle the
special cases. I'm assuming that Jimmy and Sarah are not the
only people of interest. I'm also assuming (for no very good
reason) that you do want the separator lines, but do not want
the "Person:" lines in the output file. It is easy enough to
adjust those assumptions to taste.

Each "Person:" line will cause a file to be opened (if it is
not already open), and the subsequent lines will be written to
it until the separator is found. Be aware that all files remain
open until the loop at the end closes them all.
"""

outfs = {}
f = open('shouldBeDatabase.txt')
for line in f:
    if line.find('Person:') >= 0:
        # the text after 'Person:' names the output file
        ofkey = line[line.find('Person:')+7:].strip()
        if ofkey not in outfs:
            outfs[ofkey] = open('%s.txt' % ofkey, 'w')
        outf = outfs[ofkey]
        # note: f.next() raises StopIteration if the file ends
        # before a separator line is seen
        while line.find('-----------------------------') < 0:
            line = f.next()
            outf.write('%s' % line)
f.close()
for k,v in outfs.items():
    v.close()
 
Dennis Lee Bieber

sicvic said:
The code I currently have looks something like this:

import re

For a "non-programmer" you jumped into using a module I've never
made use of...

sicvic said:
import sys

person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
person_sarah = open('sarah.txt', 'w') #creates sarah.txt

This presupposes that only these two names are of interest.

sicvic said:
f = open(sys.argv[1]) #opens output file

Pardon, isn't that the input file?

sicvic said:
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)

Well, if you want all lines up to some terminator, shouldn't you be
writing them?

sicvic said:
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

#closes all files

person_jimmy.close()
person_sarah.close()
f.close()

I have not tested this; nor is it the most optimal coding -- I tried
to keep each line simple... (hope your font reads better... Agent uses
one in which lower-L and upper-I look alike: iIlL; ln is lower-L+n, fIn
is f+upper-I+n [best to cut&paste rather than type by hand])

-=-=-=-=-=-=-=-
import sys
import os.path

START_FLAG = "Person: "
END_FLAG = "----------------------------------------------"

def personFile(s):
    # everything after START_FLAG is taken as the person's name
    pName = s[s.find(START_FLAG) + len(START_FLAG):]
    pFID = pName + ".txt"
    if os.path.exists(pFID):
        pOut = open(pFID, "a")
    else:
        pOut = open(pFID, "w")
        pOut.write(START_FLAG)
        pOut.write(pName)
        pOut.write("\n")
    return pOut

def processFile(fIn):
    pOut = None
    for ln in fIn:
        ln = ln.strip() #get rid of trailing line ending, etc.
        if pOut and ln == END_FLAG:
            pOut.close()
            pOut = None
        elif not pOut and ln.find(START_FLAG) != -1:
            pOut = personFile(ln)
        elif pOut:
            pOut.write(ln)
            pOut.write("\n")
        else:
            # No output file, not a start flag... skip the line
            pass
    if pOut:
        # input ended without a closing END_FLAG; close the last file
        pOut.close()

if __name__ == "__main__":
    if len(sys.argv) > 1:
        dIn = open(sys.argv[1], "r")
        processFile(dIn)
        dIn.close()
    else:
        print "\n\nUsage: whatever Input_File_Name\n\n"
-=-=-=-=-=-=-=-=-
--
 
Gerard Flanagan

sicvic said:
Since I can't show the actual output file, let's say I had an output file
that looked like this:

aaaaa bbbbb Person: Jimmy
Current Location: Denver

It may be the output of another process but it's the input file as far
as the parsing code is concerned.

The code below gives the following output, if that's any help (just
adapting Noah's idea above). Note that it deals with the input as a
single string rather than line by line.


Jimmy
Jimmy.txt

Current Location: Denver
Next Location: Chicago

Sarah
Sarah.txt

Current Location: San Diego
Next Location: Miami
Next Location: New York

data='''
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
'''

import StringIO
import re


src = StringIO.StringIO(data)

for name in ['Jimmy', 'Sarah']:
    exp = "(?s)Person: %s(.*?)--" % name
    filename = "%s.txt" % name
    info = re.findall(exp, src.getvalue())[0]
    print name
    print filename
    print info



hth

Gerard
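
A possible variation on the whole-string approach, sketched for Python 3: instead of looping over a fixed list of names, re.finditer can pull the name out of each "Person:" line with a capture group. The pattern, the five-or-more-dashes separator rule and the names data and records are assumptions for illustration.

```python
import re

# Hypothetical input, rebuilt from the sample data in this thread.
SEP = "-" * 46
data = "\n".join([
    "aaaaa bbbbb Person: Jimmy",
    "Current Location: Denver",
    "Next Location: Chicago",
    SEP,
    "aaaaa bbbbb Person: Sarah",
    "Current Location: San Diego",
    "Next Location: Miami",
    "Next Location: New York",
    SEP,
]) + "\n"

# Group 1 is the name; group 2 is everything up to the next run of dashes.
# re.DOTALL lets "." cross line boundaries; ".*?" keeps each match minimal.
pattern = re.compile(r"Person: (\w+)\n(.*?)-{5,}", re.DOTALL)

records = [(m.group(1), m.group(2)) for m in pattern.finditer(data)]

for name, info in records:
    print(name, name.lower() + ".txt")
    print(info, end="")
```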
 
Scott David Daniels

sicvic said:
Not homework...not even in school (do any universities even teach
classes using python?).

Yup, at least 6, and 20 wouldn't surprise me.

sicvic said:
The code I currently have looks something like this:
...
f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

Using re here seems pretty excessive.
How about:
...
f = open(sys.argv[1]) # opens input file ### get comments right
source = iter(f) # files serve lines at their own pace. Let them
for line in source:
    if line.endswith('Person: Jimmy\n'):
        dest = person_jimmy
    elif line.endswith('Person: Sarah\n'):
        dest = person_sarah
    else:
        continue
    # the sample separator is longer than 15 dashes, so test the prefix
    while not line.startswith('---------------'):
        dest.write(line)
        line = source.next()
f.close()
person_jimmy.close()
person_sarah.close()

--Scott David Daniels
(e-mail address removed)
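
If the names are not known ahead of time and a person can show up more than once, one option (a Python 3 sketch, not anyone's posted solution) is to collect the blocks in a dictionary first and write the files afterwards. The function name collect_blocks and its parameters are hypothetical.

```python
from collections import defaultdict

def collect_blocks(lines, start_flag="Person: ", stop_prefix="-----"):
    """Group lines by the name after start_flag; separator lines end a block."""
    blocks = defaultdict(list)
    name = None  # name of the block being read, or None when between blocks
    for line in lines:
        if line.startswith(stop_prefix):
            name = None
        elif start_flag in line:
            name = line.split(start_flag, 1)[1].strip()
            blocks[name].append(line)
        elif name is not None:
            blocks[name].append(line)
    return dict(blocks)

# Hypothetical input in which Jimmy appears twice.
SEP = "-" * 46 + "\n"
lines = [
    "aaaaa bbbbb Person: Jimmy\n", "Current Location: Denver\n", SEP,
    "aaaaa bbbbb Person: Sarah\n", "Current Location: San Diego\n", SEP,
    "aaaaa bbbbb Person: Jimmy\n", "Current Location: Chicago\n", SEP,
]

blocks = collect_blocks(lines)
for person, body in blocks.items():
    print(person, len(body))
    # To write the files, something like:
    #     with open(person.lower() + ".txt", "w") as out:
    #         out.writelines(body)
```

Holding everything in memory keeps the file handling trivial, at the cost of not suiting very large inputs.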
 
sicvic

Thank you everyone!!!

I got a lot more information than I expected. You guys got my brain
thinking in the right direction and I'm starting to like programming.
You've got a great community here. Keep it up.

Thanks,
Victor
 
Bengt Richter

sicvic said:
Not homework...not even in school (do any universities even teach
classes using python?). Just not a programmer. Anyways I should
probably be more clear about what I'm trying to do.

Ok, not homework.
Ok, I generalized on your theme of extracting file chunks to named files,
where the beginning line has the file name. I hardcoded the '.txt' extension.
I provided a way to direct the output to a (I guess not necessarily sub) directory.
Not tested beyond what you see. Tweak to suit.

----< extractfilesegs.py >--------------------------------------------------------
"""
Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
    where source is -tf for test file, a file name, or an open file
          outdir is a directory prefix that will be joined to output file names
          startpat is a regular expression with group 1 giving the extracted file name
          endpat is a regular expression whose match line is excluded and ends the segment
"""
import re, os

def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    if isinstance(linesrc, basestring): linesrc = open(linesrc)
    lineit = iter(linesrc)
    files = []
    for line in lineit:
        match = rxstart.search(line)
        if not match: continue
        name = match.group(1)
        filename = name.lower() + '.txt'
        filename = os.path.join(outdir, filename)
        #print 'opening file %r'%filename
        files.append(filename)
        fout = open(filename, 'a') # append in case repeats?
        fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
        for data_line in lineit:
            if rxstop.search(data_line):
                #print 'closing file %r'%filename
                fout.close() # don't write line with ending mark
                fout = None
                break
            else:
                fout.write(data_line)
        if fout:
            fout.close()
            print 'file %r ended with source file EOF, not stop mark'%filename
    return files

def get_testfile():
    from StringIO import StringIO
    return StringIO("""\
....irrelevant leading
stuff ...
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
irrelevant
trailing stuff ...

with a blank line
""")

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if not args: raise SystemExit(__doc__)
    tf = args.pop(0)
    if tf=='-tf': fin = get_testfile()
    else: fin = tf
    if not args:
        files = extractFileSegs(fin)
    elif len(args)==1:
        files = extractFileSegs(fin, args[0])
    elif len(args)==2:
        files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
    else:
        files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
    print '\nFiles created:'
    for fname in files:
        print ' "%s"'% fname
    if tf == '-tf':
        for fpath in files:
            print '====< %s >====\n%s============'%(fpath, open(fpath).read())
----------------------------------------------------------------------------------

Running on your test data:

[15:19] C:\pywk\clp>md extracteddata

[15:19] C:\pywk\clp>py24 extractfilesegs.py -tf

Files created:
"extracteddata\jimmy.txt"
"extracteddata\sarah.txt"
====< extracteddata\jimmy.txt >====
Person: Jimmy
Current Location: Denver
Next Location: Chicago
============
====< extracteddata\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:20] C:\pywk\clp>md xd

[15:20] C:\pywk\clp>py24 extractfilesegs.py -tf xd (Jimmy) ----

Files created:
"xd\jimmy.txt"
====< xd\jimmy.txt >====
Jimmy
Current Location: Denver
Next Location: Chicago
============

[15:21] C:\pywk\clp>py24 extractfilesegs.py -tf xd "Person: (Sarah)" ----

Files created:
"xd\sarah.txt"
====< xd\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:22] C:\pywk\clp>py24 extractfilesegs.py -tf xd "^(irrelevant)"

Files created:
"xd\irrelevant.txt"
====< xd\irrelevant.txt >====
irrelevant
trailing stuff ...
============

HTH, NO WARRANTIES ;-)


Regards,
Bengt Richter
 
