Simple text parsing gets difficult when line continues to next line

Discussion in 'Python' started by Jacob Rael, Nov 28, 2006.

  1. Jacob Rael

    Jacob Rael Guest

    Hello,

    I have a simple script to parse a text file (a visual basic program)
    and convert key parts to tcl. Since I am only working on specific
    sections and I need it quick, I decided not to learn/try a full blown
    parsing module. My simple script works well until it runs into
    functions that straddle multiple lines. For example:

    Call mass_write(&H0, &HF, &H4, &H0, &H5, &H0, &H6, &H0, &H7, &H0,
    &H8, &H0, _
    &H9, &H0, &HA, &H0, &HB, &H0, &HC, &H0, &HD, &H0, &HE,
    &H0, &HF, &H0, -1)


    I read in each line with:

    for line in open(fileName).readlines():

    I would like to identify whether a line continues (if line.endswith('_'))
    and concatenate it with the next line:

    line = line + nextLine

    How can I get the next line when I am in a for loop using readlines?

    jr
    Jacob Rael, Nov 28, 2006
    #1
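A note on the `endswith('_')` test in the question: as far as the Visual Basic documentation describes it, the continuation marker is an underscore preceded by whitespace at the end of the line, so a bare `endswith('_')` would also match an identifier that happens to end in an underscore. A minimal Python 3 sketch of a stricter check follows; the names `is_continued` and `strip_continuation` are hypothetical, and tolerating whitespace after the underscore is an assumption made for scraped or hand-edited input.

```python
import re

# Assumption: a continuation is whitespace followed by "_" at the end of the
# line; any whitespace after the underscore (including the newline) is ignored.
_CONTINUATION = re.compile(r"\s_\s*$")

def is_continued(line):
    """Return True if the line ends with a VB-style continuation marker."""
    return _CONTINUATION.search(line) is not None

def strip_continuation(line):
    """Remove the continuation marker and everything after it."""
    return _CONTINUATION.sub("", line)

continued = is_continued("&H8, &H0, _\n")
identifier_only = is_continued("my_var_\n")
stripped = strip_continuation("&H8, &H0, _\n")
```

The replies below use the simpler `endswith('_')` test, which is adequate when no line ends in an identifier with a trailing underscore.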

  2. Jacob Rael

    Larry Bates Guest

    Re: Simple text parsing gets difficult when line continues to next line

    Jacob Rael wrote:
    [...]
    > How can I get the next line when I am in a for loop using readlines?

    Something like (not tested):

    fp = open(filename, 'r')
    for line in fp:
        while line.rstrip().endswith('_'):
            line += fp.next()
        # process the complete line here

    fp.close()

    -Larry
    Larry Bates, Nov 28, 2006
    #2

  3. Jacob Rael wrote:
    [...]
    > I would like to identify whether a line continues (if line.endswith('_'))
    > and concatenate it with the next line:
    >
    > line = line + nextLine
    >
    > How can I get the next line when I am in a for loop using readlines?


    Don't use readlines.

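    # NOT TESTED
    program = open(fileName)
    for line in program:
        while line.rstrip("\n").endswith("_"):
            # next() with a default instead of readline(): Python 2 raises
            # ValueError when readline() is mixed with file iteration
            line = line.rstrip("_ \n") + next(program, "")
        do_the_magic(line)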

    Cheers,
    --
    Roberto Bonvallet
    Roberto Bonvallet, Nov 28, 2006
    #3
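For readers who want something runnable, here is a minimal Python 3 sketch of the iterator-based idea suggested in the replies above. The function name `join_continued` and the sample input are illustrative assumptions, not code from the thread. It is deliberately lenient: `next(it, "")` returns an empty string at end of input, so a dangling continuation on the last line ends the loop instead of raising.

```python
import io

def join_continued(lines):
    """Yield logical lines, gluing on the next line while one ends in '_'.

    Hypothetical helper: `lines` is any iterable of strings, such as an
    open text file.
    """
    it = iter(lines)
    for line in it:
        line = line.rstrip()
        while line.endswith("_"):
            # Drop the underscore, then append the next physical line
            # (an empty string once the input is exhausted).
            line = line[:-1] + next(it, "").strip()
        yield line

# In-memory stand-in for a file containing one continued statement.
sample = io.StringIO(
    "Call mass_write(&H0, &HF, _\n"
    "&H9, &H0, -1)\n"
    "x = 1\n"
)
logical_lines = list(join_continued(sample))
```

Because the loop and the inner `next()` call share one iterator, the `for` statement resumes after the last line consumed by the inner `while`.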
  4. On 28 Nov 2006 09:59:41 -0800, "Jacob Rael"
    declaimed the following in comp.lang.python:

    >
    > I read in each line with:
    >
    > for line in open(fileName).readlines():
    >
    > I would like to identify whether a line continues (if line.endswith('_'))
    > and concatenate it with the next line:
    >
    > line = line + nextLine
    >
    > How can I get the next line when I am in a for loop using readlines?
    >

    Well, besides the stereotypical ("Doctor, it hurts when I do this";
    "Don't do that")...

    UNTESTED

    line = ""
    inCont = False
    for ln in open(whatever).readlines():
        ln = ln.rstrip("\n")    # strip the newline first
        if inCont:
            line = line + ln
        else:
            line = ln

        inCont = line.endswith("_")

        if inCont:
            line = line[:-1]    # drop the continuation underscore
        else:
            # process completed line
            line = ""


    Though I suspect newer versions of Python, where file objects can be
    used as iterators, would be better...

    REALLY UNTESTED

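    f = open(whatever)
    for ln in f:
        ln = ln.rstrip("\n")    # strip newline
        while ln.endswith("_"):
            # next(f, "") rather than f.readline(), which Python 2 will not
            # mix with file iteration; dropping the "_" also ends the loop at EOF
            ln = ln[:-1] + next(f, "").rstrip("\n")
        # process ln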

    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
    Dennis Lee Bieber, Nov 28, 2006
    #4
  5. Jacob Rael

    John Machin Guest

    Jacob Rael wrote:
    [...]
    > How can I get the next line when I am in a for loop using readlines?


    Don't do that. I'm rather dubious about approaches that try to grab the
    next line on the fly e.g. fp.next(). Here's a function that takes a
    list of lines and returns another with all trailing whitespace removed
    and the continued lines glued together. It uses a simple state machine
    approach.

    def continue_join(linesin):
        linesout = []
        buff = ""
        NORMAL = 0
        PENDING = 1
        state = NORMAL
        for line in linesin:
            line = line.rstrip()
            if state == NORMAL:
                if line.endswith('_'):
                    buff = line[:-1]
                    state = PENDING
                else:
                    linesout.append(line)
            else:
                if line.endswith('_'):
                    buff += line[:-1]
                else:
                    buff += line
                    linesout.append(buff)
                    buff = ""
                    state = NORMAL
        if state == PENDING:
            raise ValueError("last line is continued: %r" % line)
        return linesout

    import sys
    fp = open(sys.argv[1])
    rawlines = fp.readlines()
    cleanlines = continue_join(rawlines)
    for line in cleanlines:
        print repr(line)
    ===
    Tested with following files:
    C:\junk>type contlinet1.txt
    only one line

    C:\junk>type contlinet2.txt
    line 1
    line 2

    C:\junk>type contlinet3.txt
    line 1
    line 2a _
    line 2b _
    line 2c
    line 3

    C:\junk>type contlinet4.txt
    line 1
    _
    _
    line 2c
    line 3

    C:\junk>type contlinet5.txt
    line 1
    _
    _
    line 2c
    line 3 _

    C:\junk>

    HTH,
    John
    John Machin, Nov 28, 2006
    #5
  6. Jacob Rael

    Tim Hochberg Guest

    Re: Simple text parsing gets difficult when line continues to next line

    John Machin wrote:
    > Jacob Rael wrote:
    [...]
    >> How can I get the next line when I am in a for loop using readlines?

    >
    > Don't do that. I'm rather dubious about approaches that try to grab the
    > next line on the fly e.g. fp.next(). Here's a function that takes a
    > list of lines and returns another with all trailing whitespace removed
    > and the continued lines glued together. It uses a simple state machine
    > approach.


    I agree that mixing the line assembly and parsing is probably a mistake,
    although using next explicitly is fine as long as you're careful with it.
    For instance, I would be wary of using the mixed for-loop, next strategy
    that some of the previous posts suggested. Here's a different,
    generator-based implementation of the same idea that, for better or for
    worse, is considerably less verbose:

    def continue_join_2(linesin):
        getline = iter(linesin).next
        while True:
            buffer = getline().rstrip()
            try:
                while buffer.endswith('_'):
                    buffer = buffer[:-1] + getline().rstrip()
            except StopIteration:
                # report the unfinished text; there is no "line" variable here
                raise ValueError("last line is continued: %r" % buffer)
            yield buffer

    -tim

    [SNIP]
    Tim Hochberg, Nov 28, 2006
    #6
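On Python 3.7 and later (PEP 479), a StopIteration that escapes a generator body is converted into a RuntimeError, so a generator that relies on an uncaught StopIteration to finish needs an explicit end-of-input check when ported. The following is a minimal sketch of such an adaptation, not the code posted above; the name `continue_join_gen` and the sentinel object are assumptions made for illustration.

```python
def continue_join_gen(lines_in):
    """Yield joined logical lines; raise ValueError on a dangling continuation."""
    it = iter(lines_in)
    end = object()  # sentinel returned by next() once the input is exhausted
    while True:
        line = next(it, end)
        if line is end:
            return  # normal end of input: finish the generator cleanly
        buffer = line.rstrip()
        while buffer.endswith("_"):
            line = next(it, end)
            if line is end:
                raise ValueError("last line is continued: %r" % buffer)
            # Drop the underscore and append the next physical line.
            buffer = buffer[:-1] + line.rstrip()
        yield buffer

joined = list(continue_join_gen(["a _\n", "b\n", "c\n"]))
```

Using a sentinel with the two-argument form of `next()` keeps the end-of-input handling explicit and avoids wrapping every call in try/except.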
  7. Jacob Rael

    John Machin Guest

    Tim Hochberg wrote:
    [snip]
    > I agree that mixing the line assembly and parsing is probably a mistake,
    > although using next explicitly is fine as long as you're careful with it.
    > For instance, I would be wary of using the mixed for-loop, next strategy
    > that some of the previous posts suggested. Here's a different,
    > generator-based implementation of the same idea that, for better or for
    > worse, is considerably less verbose:
    >

    [snip]

    Here's a somewhat less verbose version of the state machine gadget.

    def continue_join_3(linesin):
        linesout = []
        buff = ""
        pending = 0
        for line in linesin:
            # remove *all* trailing whitespace
            line = line.rstrip()
            if line.endswith('_'):
                buff += line[:-1]
                pending = 1
            else:
                linesout.append(buff + line)
                buff = ""
                pending = 0
        if pending:
            raise ValueError("last line is continued: %r" % line)
        return linesout

    FWIW, it works all the way back to Python 2.1

    Cheers,
    John,
    John Machin, Nov 28, 2006
    #7
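The test files listed earlier in the thread are shown without their expected results. As a hedged illustration, the sketch below restates the flag-based joiner in Python 3 under the hypothetical name `continue_join_py3` and runs it on in-memory stand-ins for contlinet3.txt, contlinet4.txt and contlinet5.txt. The stand-ins assume the file contents are exactly as displayed above, which may not hold if whitespace was lost when the thread was archived.

```python
def continue_join_py3(lines_in):
    """Python 3 restatement of the flag-based joiner (illustrative only)."""
    lines_out = []
    buff = ""
    pending = False
    for line in lines_in:
        line = line.rstrip()       # remove all trailing whitespace
        if line.endswith("_"):
            buff += line[:-1]      # keep the text before the marker
            pending = True
        else:
            lines_out.append(buff + line)
            buff = ""
            pending = False
    if pending:
        raise ValueError("last line is continued: %r" % line)
    return lines_out

# Assumed contents of the three multi-line test files.
t3 = ["line 1\n", "line 2a _\n", "line 2b _\n", "line 2c\n", "line 3\n"]
t4 = ["line 1\n", "_\n", "_\n", "line 2c\n", "line 3\n"]
t5 = ["line 1\n", "_\n", "_\n", "line 2c\n", "line 3 _\n"]

result3 = continue_join_py3(t3)   # ['line 1', 'line 2a line 2b line 2c', 'line 3']
result4 = continue_join_py3(t4)   # ['line 1', 'line 2c', 'line 3']
try:
    continue_join_py3(t5)
    t5_error = None
except ValueError as exc:
    t5_error = str(exc)           # the file ends on a continued line
```

The third case shows the strict behaviour: a file whose last line is continued is reported as an error instead of being silently truncated.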
  8. Jacob Rael

    Jacob Rael Guest

    Thanks all. I think I'll follow the "don't do that" advice.

    jr

    Jacob Rael, Nov 28, 2006
    #8
