Parsing text

Discussion in 'Python' started by sicvic, Dec 19, 2005.

  1. sicvic

    sicvic Guest

    I was wondering if there's a way for Python to read through the lines
    of a text file searching for a key phrase, then write that line and
    all lines following it up to a certain point, such as until it sees a
    string of "---------------------".

    Right now I can only have Python write just the line the key phrase is
    found in.

    Thanks,
    Victor
     
    sicvic, Dec 19, 2005
    #1

  2. Peter Hansen

    Peter Hansen Guest

    sicvic wrote:
    > I was wondering if theres a way where python can read through the lines
    > of a text file searching for a key phrase then writing that line and
    > all lines following it up to a certain point, such as until it sees a
    > string of "---------------------"
    >
    > Right now I can only have python write just the line the key phrase is
    > found in.


    That's a good start. Maybe you could post the code that you've already
    got that does this, and people could comment on it and help you along.
    (I'm suggesting that partly because this almost sounds like homework,
    but you'll benefit more by doing it this way than just by having an
    answer handed to you whether this is homework or not.)

    -Peter
     
    Peter Hansen, Dec 20, 2005
    #2

  3. Noah

    Noah Guest

    sicvic wrote:
    > I was wondering if theres a way where python can read through the lines
    > of a text file searching for a key phrase then writing that line and
    > all lines following it up to a certain point, such as until it sees a
    > string of "---------------------"
    >...
    > Thanks,
    > Victor


    You did not specify the "key phrase" that you are looking for, so for
    the sake of this example I will assume that it is "key phrase".
    I assume that you don't want "key phrase" or "---------------------" to
    be returned as part of your match, so we use minimal group matching, (.*?).
    You also want your regular expression to use the re.DOTALL flag because
    this is how you match across multiple lines. The simplest way to set this
    flag is to put it at the front of your regular expression using the
    (?s) notation.

    This gives you something like this:
    import re
    print re.findall("(?s)key phrase(.*?)---------------------",
                     your_string_to_search)[0]

    So what that basically says is:
    1. Match multiline -- that is, match across lines (?s)
    2. match "key phrase"
    3. Capture the group matching everything, non-greedily (.*?)
    4. Match "---------------------"
    5. Print the first match in the list [0]
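
    As a minimal sketch of the steps above in Python 3 syntax: the sample
    text and the names `sample`, `dashes` and `pattern` are made up for
    illustration, and the guard against an empty result is an addition that
    the one-liner does not have.

```python
import re

# Hypothetical markers and sample text, standing in for the real file.
dashes = "-" * 21
sample = (
    "header line\n"
    "key phrase first\n"
    "second\n"
    + dashes + "\n"
    "trailer\n"
)

# (?s) turns on DOTALL so "." also matches newlines; (.*?) is the
# non-greedy group, so the match stops at the first dashed line.
pattern = "(?s)key phrase(.*?)" + dashes

matches = re.findall(pattern, sample)
first = matches[0] if matches else None  # avoid IndexError when nothing matches
print(repr(first))
```

    Note that the captured group starts right after the key phrase, so the
    rest of the key phrase's own line is included in it.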

    Yours,
    Noah
     
    Noah, Dec 20, 2005
    #3
  4. On 19 Dec 2005 15:15:10 -0800, "sicvic" <> wrote:

    >I was wondering if theres a way where python can read through the lines
    >of a text file searching for a key phrase then writing that line and
    >all lines following it up to a certain point, such as until it sees a
    >string of "---------------------"
    >
    >Right now I can only have python write just the line the key phrase is
    >found in.
    >

    This sounds like homework, so just a (big) hint: have a look at itertools
    dropwhile and takewhile. The solution is potentially a one-liner, depending
    on your matching criteria (e.g., case-sensitive fixed string vs regular expression).
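
    One way the hint could be read, as a sketch only (Python 3 syntax; the
    function name `extract_block`, the `stop_prefix` parameter and the sample
    lines are assumptions, not part of the hint): dropwhile skips lines until
    the key phrase appears, and takewhile keeps lines until the separator.

```python
from itertools import dropwhile, takewhile

def extract_block(lines, key, stop_prefix="-----"):
    """Return lines from the first one containing `key` up to, but not
    including, the next line that starts with `stop_prefix`."""
    # dropwhile skips everything before the first line containing `key`.
    from_key = dropwhile(lambda ln: key not in ln, lines)
    # takewhile then stops at the first separator line.
    return list(takewhile(lambda ln: not ln.startswith(stop_prefix), from_key))

# Hypothetical input lines, for illustration only.
sample = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
]

block = extract_block(sample, "Person: Jimmy")
print("".join(block), end="")
```

    This only picks up the first occurrence of the key phrase; repeated
    blocks would need a loop around it.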

    Regards,
    Bengt Richter
     
    Bengt Richter, Dec 20, 2005
    #4
  5. sicvic

    sicvic Guest

    Not homework...not even in school (do any universities even teach
    classes using python?). Just not a programmer. Anyways I should
    probably be more clear about what I'm trying to do.

    Since I can't show the actual output file, let's say I had an output file
    that looked like this:

    aaaaa bbbbb Person: Jimmy
    Current Location: Denver
    Next Location: Chicago
    ----------------------------------------------
    aaaaa bbbbb Person: Sarah
    Current Location: San Diego
    Next Location: Miami
    Next Location: New York
    ----------------------------------------------

    Now I want to put this block (and every other block for "Person: Jimmy")

    Person: Jimmy
    Current Location: Denver
    Next Location: Chicago

    in a file called jimmy.txt

    and the same for Sarah in sarah.txt

    The code I currently have looks something like this:

    import re
    import sys

    person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
    person_sarah = open('sarah.txt', 'w') #creates sarah.txt

    f = open(sys.argv[1]) #opens output file
    #loop that goes through all lines and parses specified text
    for line in f.readlines():
        if re.search(r'Person: Jimmy', line):
            person_jimmy.write(line)
        elif re.search(r'Person: Sarah', line):
            person_sarah.write(line)

    #closes all files

    person_jimmy.close()
    person_sarah.close()
    f.close()

    However, this only produces output files that look like this:

    jimmy.txt:

    aaaaa bbbbb Person: Jimmy

    sarah.txt:

    aaaaa bbbbb Person: Sarah

    My question is what else do I need to add (such as an embedded loop
    where the if statements are?) so the files look like this

    aaaaa bbbbb Person: Jimmy
    Current Location: Denver
    Next Location: Chicago

    and

    aaaaa bbbbb Person: Sarah
    Current Location: San Diego
    Next Location: Miami
    Next Location: New York


    Basically I need to add statements that after finding that line copy
    all the lines following it and stopping when it sees
    '----------------------------------------------'

    Any help is greatly appreciated.
     
    sicvic, Dec 20, 2005
    #5
  6. rzed

    rzed Guest

    "sicvic" <> wrote in
    news::

    > Not homework...not even in school (do any universities even
    > teach classes using python?). Just not a programmer. Anyways I
    > should probably be more clear about what I'm trying to do.
    >
    > Since I cant show the actual output file lets say I had an
    > output file that looked like this:
    >
    > aaaaa bbbbb Person: Jimmy
    > Current Location: Denver
    > Next Location: Chicago
    > ----------------------------------------------
    > aaaaa bbbbb Person: Sarah
    > Current Location: San Diego
    > Next Location: Miami
    > Next Location: New York
    > ----------------------------------------------
    >
    > Now I want to put (and all recurrences of "Person: Jimmy")
    >
    > Person: Jimmy
    > Current Location: Denver
    > Next Location: Chicago
    >
    > in a file called jimmy.txt
    >
    > and the same for Sarah in sarah.txt
    >
    > The code I currently have looks something like this:
    >
    > import re
    > import sys
    >
    > person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
    > person_sarah = open('sarah.txt', 'w') #creates sarah.txt
    >
    > f = open(sys.argv[1]) #opens output file
    > #loop that goes through all lines and parses specified text
    > for line in f.readlines():
    > if re.search(r'Person: Jimmy', line):
    > person_jimmy.write(line)
    > elif re.search(r'Person: Sarah', line):
    > person_sarah.write(line)
    >
    > #closes all files
    >
    > person_jimmy.close()
    > person_sarah.close()
    > f.close()
    >
    > However this only would produces output files that look like
    > this:
    >
    > jimmy.txt:
    >
    > aaaaa bbbbb Person: Jimmy
    >
    > sarah.txt:
    >
    > aaaaa bbbbb Person: Sarah
    >
    > My question is what else do I need to add (such as an embedded
    > loop where the if statements are?) so the files look like this
    >
    > aaaaa bbbbb Person: Jimmy
    > Current Location: Denver
    > Next Location: Chicago
    >
    > and
    >
    > aaaaa bbbbb Person: Sarah
    > Current Location: San Diego
    > Next Location: Miami
    > Next Location: New York
    >
    >
    > Basically I need to add statements that after finding that line
    > copy all the lines following it and stopping when it sees
    > '----------------------------------------------'
    >
    > Any help is greatly appreciated.
    >


    Something like this, maybe?

    """
    This iterates through a file, with subloops to handle the
    special cases. I'm assuming that Jimmy and Sarah are not the
    only people of interest. I'm also assuming (for no very good
    reason) that you do want the separator lines, but do not want
    the "Person:" lines in the output file. It is easy enough to
    adjust those assumptions to taste.

    Each "Person:" line will cause a file to be opened (if it is
    not already open), and will write the subsequent lines to it
    until the separator is found. Be aware that all files remain
    open until the loop at the end closes them all.
    """

    outfs = {}
    f = open('shouldBeDatabase.txt')
    for line in f:
        if line.find('Person:') >= 0:
            ofkey = line[line.find('Person:')+7:].strip()
            if ofkey not in outfs:
                outfs[ofkey] = open('%s.txt' % ofkey, 'w')
            outf = outfs[ofkey]
            while line.find('-----------------------------') < 0:
                line = f.next()
                outf.write('%s' % line)
    f.close()
    for k,v in outfs.items():
        v.close()
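
    For readers on Python 3, here is a hedged sketch of the same idea. It
    keeps the assumptions stated in the docstring above (the "Person:" line
    is dropped, the separator line is kept). The function name
    `split_to_files`, its parameters and the temporary directory are
    illustrative only. contextlib.ExitStack is used so that every file that
    was opened is closed when the block exits, and next(lines, None) replaces
    f.next() so that input ending without a separator does not raise
    StopIteration.

```python
import contextlib
import os
import tempfile

def split_to_files(src_lines, outdir, marker="Person:", sep="-----"):
    """Write the lines following each marker line to <outdir>/<name>.txt."""
    lines = iter(src_lines)
    written = []
    with contextlib.ExitStack() as stack:
        outfs = {}
        for line in lines:
            pos = line.find(marker)
            if pos < 0:
                continue
            name = line[pos + len(marker):].strip()
            if name not in outfs:
                path = os.path.join(outdir, name + ".txt")
                # enter_context registers the file so it is closed when
                # the with-block exits, even if an exception is raised.
                outfs[name] = stack.enter_context(open(path, "w"))
                written.append(path)
            # Copy lines up to and including the separator line.
            while sep not in line:
                line = next(lines, None)  # None signals end of input
                if line is None:
                    break
                outfs[name].write(line)
    return written

# Hypothetical input mirroring the sample data in the thread.
sample = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
    "Next Location: Miami\n",
    "Next Location: New York\n",
    "-" * 46 + "\n",
]

with tempfile.TemporaryDirectory() as tmp:
    paths = split_to_files(sample, tmp)
    contents = {}
    for path in paths:
        with open(path) as fh:
            contents[os.path.basename(path)] = fh.read()

print(sorted(contents))
```

    The person's name is used as a file name without any sanitizing, which
    is fine for the sample data but is an assumption about the real input.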

    --
    rzed
     
    rzed, Dec 20, 2005
    #6
  7. On 20 Dec 2005 08:06:39 -0800, "sicvic" <>
    declaimed the following in comp.lang.python:

    > The code I currently have looks something like this:
    >
    > import re


    For a "non-programmer" you jumped into using a module I've never
    made use of...

    > import sys
    >
    > person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
    > person_sarah = open('sarah.txt', 'w') #creates sarah.txt
    >

    This presupposes that only these two names are of interest

    > f = open(sys.argv[1]) #opens output file


    Pardon, isn't that the input file?

    > #loop that goes through all lines and parses specified text
    > for line in f.readlines():
    > if re.search(r'Person: Jimmy', line):
    > person_jimmy.write(line)


    Well, if you want all lines up to some terminator, shouldn't you be
    writing them <G>

    > elif re.search(r'Person: Sarah', line):
    > person_sarah.write(line)
    >
    > #closes all files
    >
    > person_jimmy.close()
    > person_sarah.close()
    > f.close()
    >


    I have not tested this; nor is it the most optimal coding -- I tried
    to keep each line simple... (hope your font reads better... Agent uses
    one in which lower-L and upper-I look alike: iIlL; ln is lower-L+n, fIn
    is f+upper-I+n [best to cut&paste rather than type by hand])

    -=-=-=-=-=-=-=-
    import sys
    import os.path

    START_FLAG = "Person: "
    END_FLAG = "----------------------------------------------"

    def personFile(s):
        pName = s[s.find(START_FLAG) + len(START_FLAG):]
        pFID = pName + ".txt"
        if os.path.exists(pFID):
            pOut = open(pFID, "a")
        else:
            pOut = open(pFID, "w")
        pOut.write(START_FLAG)
        pOut.write(pName)
        pOut.write("\n")
        return pOut

    def processFile(fIn):
        pOut = None
        for ln in fIn:
            ln = ln.strip() #get rid of trailing line ending, etc.
            if pOut and ln == END_FLAG:
                pOut.close()
                pOut = None
            elif not pOut and ln.find(START_FLAG) != -1:
                pOut = personFile(ln)
            elif pOut:
                pOut.write(ln)
                pOut.write("\n")
            else:
                # No output file, not a start flag... skip the line
                pass

    if __name__ == "__main__":
        if len(sys.argv) > 1: # sys.argv[1] alone raises IndexError when no argument is given
            dIn = open(sys.argv[1], "r")
            processFile(dIn)
            dIn.close()
        else:
            print "\n\nUsage: whatever Input_File_Name\n\n"
    -=-=-=-=-=-=-=-=-
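
    The same start-flag/end-flag state machine can also be sketched as a
    Python 3 generator that yields one record per person and leaves the file
    writing to the caller. This is an illustration under assumptions, not a
    rewrite of the code above: `iter_records` and `is_end` are hypothetical
    names, and any line made up only of five or more dashes is treated as
    the end flag.

```python
START_FLAG = "Person: "

def is_end(ln):
    """Assumed end marker: a line consisting only of dashes (five or more)."""
    return len(ln) >= 5 and set(ln) == {"-"}

def iter_records(lines):
    """Yield (name, body_lines) for each START_FLAG ... end-marker block."""
    name, body = None, []
    for ln in lines:
        ln = ln.strip()  # drop the line ending and surrounding blanks
        if name is not None and is_end(ln):
            yield name, body
            name, body = None, []
        elif name is None and START_FLAG in ln:
            name = ln[ln.find(START_FLAG) + len(START_FLAG):]
        elif name is not None:
            body.append(ln)
        # Otherwise: not inside a record and not a start flag, so skip.
    if name is not None:
        # Input ended without a closing separator; emit what was collected.
        yield name, body

# Hypothetical input mirroring the sample data in the thread.
sample = [
    "aaaaa bbbbb Person: Jimmy\n",
    "Current Location: Denver\n",
    "Next Location: Chicago\n",
    "-" * 46 + "\n",
    "aaaaa bbbbb Person: Sarah\n",
    "Current Location: San Diego\n",
    "Next Location: Miami\n",
    "Next Location: New York\n",
    "-" * 46 + "\n",
]

records = list(iter_records(sample))
for person, body_lines in records:
    print(person, body_lines)
```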
    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
     
    Dennis Lee Bieber, Dec 20, 2005
    #7
  8. sicvic wrote:

    > Since I cant show the actual output file lets say I had an output file
    > that looked like this:
    >
    > aaaaa bbbbb Person: Jimmy
    > Current Location: Denver


    It may be the output of another process but it's the input file as far
    as the parsing code is concerned.

    The code below gives the following output, if that's any help (just
    adapting Noah's idea above). Note that it deals with the input as a
    single string rather than line by line.


    Jimmy
    Jimmy.txt

    Current Location: Denver
    Next Location: Chicago

    Sarah
    Sarah.txt

    Current Location: San Diego
    Next Location: Miami
    Next Location: New York

    >>>


    data='''
    aaaaa bbbbb Person: Jimmy
    Current Location: Denver
    Next Location: Chicago
    ----------------------------------------------
    aaaaa bbbbb Person: Sarah
    Current Location: San Diego
    Next Location: Miami
    Next Location: New York
    ----------------------------------------------
    '''

    import StringIO
    import re


    src = StringIO.StringIO(data)

    for name in ['Jimmy', 'Sarah']:
        exp = "(?s)Person: %s(.*?)--" % name
        filename = "%s.txt" % name
        info = re.findall(exp, src.getvalue())[0]
        print name
        print filename
        print info
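
    If the names are not known in advance, the pattern can capture them as
    well. The following is a sketch in Python 3 syntax, not a drop-in
    replacement: the group names `name` and `body`, the variable `record` and
    the rule "a line of five or more dashes ends a block" are assumptions.

```python
import re
from collections import defaultdict

data = """\
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
"""

# name: the rest of the "Person:" line.
# body: everything up to the next line made only of dashes.
# DOTALL lets "." cross newlines; MULTILINE makes ^ and $ match per line.
record = re.compile(
    r"Person: (?P<name>[^\n]+)\n(?P<body>.*?)^-{5,}$",
    re.DOTALL | re.MULTILINE,
)

# A list per name, so repeated blocks for the same person are all kept.
found = defaultdict(list)
for m in record.finditer(data):
    found[m.group("name")].append(m.group("body"))

for name, bodies in found.items():
    print(name, len(bodies))
```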



    hth

    Gerard
     
    Gerard Flanagan, Dec 20, 2005
    #8
  9. sicvic wrote:
    > Not homework...not even in school (do any universities even teach
    > classes using python?).

    Yup, at least 6, and 20 wouldn't surprise me.

    > The code I currently have looks something like this:
    > ...
    > f = open(sys.argv[1]) #opens output file
    > #loop that goes through all lines and parses specified text
    > for line in f.readlines():
    > if re.search(r'Person: Jimmy', line):
    > person_jimmy.write(line)
    > elif re.search(r'Person: Sarah', line):
    > person_sarah.write(line)

    Using re here seems pretty excessive.
    How about:
    ...
    f = open(sys.argv[1]) # opens input file ### get comments right
    source = iter(f) # files serve lines at their own pace. Let them
    for line in source:
        if line.endswith('Person: Jimmy\n'):
            dest = person_jimmy
        elif line.endswith('Person: Sarah\n'):
            dest = person_sarah
        else:
            continue
        while not line.startswith('---------------'): # match the separator by prefix, whatever its length
            dest.write(line)
            line = source.next()
    f.close()
    person_jimmy.close()
    person_sarah.close()

    --Scott David Daniels
     
    Scott David Daniels, Dec 20, 2005
    #9
  10. sicvic

    sicvic Guest

    Thank you everyone!!!

    I got a lot more information than I expected. You guys got my brain
    thinking in the right direction and starting to like programming.
    You've got a great community here. Keep it up.

    Thanks,
    Victor
     
    sicvic, Dec 20, 2005
    #10
  11. On 20 Dec 2005 08:06:39 -0800, "sicvic" <> wrote:

    >Not homework...not even in school (do any universities even teach
    >classes using python?). Just not a programmer. Anyways I should
    >probably be more clear about what I'm trying to do.

    Ok, not homework.

    >
    >Since I cant show the actual output file lets say I had an output file
    >that looked like this:
    >
    >aaaaa bbbbb Person: Jimmy
    >Current Location: Denver
    >Next Location: Chicago
    >----------------------------------------------
    >aaaaa bbbbb Person: Sarah
    >Current Location: San Diego
    >Next Location: Miami
    >Next Location: New York
    >----------------------------------------------
    >
    >Now I want to put (and all recurrences of "Person: Jimmy")
    >
    >Person: Jimmy
    >Current Location: Denver
    >Next Location: Chicago
    >
    >in a file called jimmy.txt
    >
    >and the same for Sarah in sarah.txt
    >
    >The code I currently have looks something like this:
    >
    >import re
    >import sys
    >
    >person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
    >person_sarah = open('sarah.txt', 'w') #creates sarah.txt
    >
    >f = open(sys.argv[1]) #opens output file
    >#loop that goes through all lines and parses specified text
    >for line in f.readlines():
    > if re.search(r'Person: Jimmy', line):
    > person_jimmy.write(line)
    > elif re.search(r'Person: Sarah', line):
    > person_sarah.write(line)
    >
    >#closes all files
    >
    >person_jimmy.close()
    >person_sarah.close()
    >f.close()
    >
    >However this only would produces output files that look like this:
    >
    >jimmy.txt:
    >
    >aaaaa bbbbb Person: Jimmy
    >
    >sarah.txt:
    >
    >aaaaa bbbbb Person: Sarah
    >
    >My question is what else do I need to add (such as an embedded loop
    >where the if statements are?) so the files look like this
    >
    >aaaaa bbbbb Person: Jimmy
    >Current Location: Denver
    >Next Location: Chicago
    >
    >and
    >
    >aaaaa bbbbb Person: Sarah
    >Current Location: San Diego
    >Next Location: Miami
    >Next Location: New York
    >
    >
    >Basically I need to add statements that after finding that line copy
    >all the lines following it and stopping when it sees
    >'----------------------------------------------'
    >
    >Any help is greatly appreciated.
    >

    Ok, I generalized on your theme of extracting file chunks to named files,
    where the beginning line has the file name. I hardcoded the '.txt' extension
    and provided a way to direct the output to a (I guess not necessarily sub) directory.
    Not tested beyond what you see. Tweak to suit.

    ----< extractfilesegs.py >--------------------------------------------------------
    """
    Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
        where source is -tf for test file, a file name, or an open file
              outdir is a directory prefix that will be joined to output file names
              startpat is a regular expression with group 1 giving the extracted file name
              endpat is a regular expression whose match line is excluded and ends the segment
    """
    import re, os

    def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
        rxstart = re.compile(start)
        rxstop = re.compile(stop)
        if isinstance(linesrc, basestring): linesrc = open(linesrc)
        lineit = iter(linesrc)
        files = []
        for line in lineit:
            match = rxstart.search(line)
            if not match: continue
            name = match.group(1)
            filename = name.lower() + '.txt'
            filename = os.path.join(outdir, filename)
            #print 'opening file %r'%filename
            files.append(filename)
            fout = open(filename, 'a') # append in case repeats?
            fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
            for data_line in lineit:
                if rxstop.search(data_line):
                    #print 'closing file %r'%filename
                    fout.close() # don't write line with ending mark
                    fout = None
                    break
                else:
                    fout.write(data_line)
            if fout:
                fout.close()
                print 'file %r ended with source file EOF, not stop mark'%filename
        return files

    def get_testfile():
        from StringIO import StringIO
        return StringIO("""\
    ....irrelevant leading
    stuff ...
    aaaaa bbbbb Person: Jimmy
    Current Location: Denver
    Next Location: Chicago
    ----------------------------------------------
    aaaaa bbbbb Person: Sarah
    Current Location: San Diego
    Next Location: Miami
    Next Location: New York
    ----------------------------------------------
    irrelevant
    trailing stuff ...

    with a blank line
    """)

    if __name__ == '__main__':
        import sys
        args = sys.argv[1:]
        if not args: raise SystemExit(__doc__)
        tf = args.pop(0)
        if tf=='-tf': fin = get_testfile()
        else: fin = tf
        if not args:
            files = extractFileSegs(fin)
        elif len(args)==1:
            files = extractFileSegs(fin, args[0])
        elif len(args)==2:
            files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
        else:
            files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
        print '\nFiles created:'
        for fname in files:
            print ' "%s"'% fname
        if tf == '-tf':
            for fpath in files:
                print '====< %s >====\n%s============'%(fpath, open(fpath).read())
    ----------------------------------------------------------------------------------

    Running on your test data:

    [15:19] C:\pywk\clp>md extracteddata

    [15:19] C:\pywk\clp>py24 extractfilesegs.py -tf

    Files created:
    "extracteddata\jimmy.txt"
    "extracteddata\sarah.txt"
    ====< extracteddata\jimmy.txt >====
    Person: Jimmy
    Current Location: Denver
    Next Location: Chicago
    ============
    ====< extracteddata\sarah.txt >====
    Person: Sarah
    Current Location: San Diego
    Next Location: Miami
    Next Location: New York
    ============

    [15:20] C:\pywk\clp>md xd

    [15:20] C:\pywk\clp>py24 extractfilesegs.py -tf xd (Jimmy) ----

    Files created:
    "xd\jimmy.txt"
    ====< xd\jimmy.txt >====
    Jimmy
    Current Location: Denver
    Next Location: Chicago
    ============

    [15:21] C:\pywk\clp>py24 extractfilesegs.py -tf xd "Person: (Sarah)" ----

    Files created:
    "xd\sarah.txt"
    ====< xd\sarah.txt >====
    Person: Sarah
    Current Location: San Diego
    Next Location: Miami
    Next Location: New York
    ============

    [15:22] C:\pywk\clp>py24 extractfilesegs.py -tf xd "^(irrelevant)"

    Files created:
    "xd\irrelevant.txt"
    ====< xd\irrelevant.txt >====
    irrelevant
    trailing stuff ...
    ============

    HTH, NO WARRANTIES ;-)


    Regards,
    Bengt Richter
     
    Bengt Richter, Dec 20, 2005
    #11
