deleting texts between patterns

micklee74 · May 12, 2006

Hi,
Say I have a text file:

line1
line2
line3
line4
line5
line6
abc
line8 <---to be delete
line9 <---to be delete
line10 <---to be delete
line11 <---to be delete
line12 <---to be delete
line13 <---to be delete
xyz
line15
line16
line17
line18

I wish to delete the lines that are in between 'abc' and 'xyz' and print
the rest of the lines. What is the best way to do it? Should I read
everything into a list, get the indexes of abc and xyz, then pop the
elements out? Or is there a better method?
Thanks

Ravi Teja · May 12, 2006

In other words ...
lines = open('test.txt').readlines()
for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
    lines.remove(line)
for line in lines:
    print line,

Regular expressions are better in this case
import re
pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
print re.sub(pat, '', open('test.txt').read())

Duncan Booth · May 12, 2006

Something like this (untested code):

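def filtered(f, stop, restart):
    f = iter(f)
    # Pass everything through up to and including the stop marker.
    for line in f:
        yield line
        if line==stop:
            break
    # Swallow lines until the restart marker, which is passed through.
    for line in f:
        if line==restart:
            yield line
            break
    # Pass the remainder through unchanged.
    for line in f:
        yield line

for line in filtered(open('thefile'), "abc\n", "xyz\n"):
    print line,  # trailing comma: the line already ends with a newline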

Fredrik Lundh · May 12, 2006

what's wrong with a simple

emit = True
for line in open("q.txt"):
    if line == "xyz\n":
        emit = True
    if emit:
        print line,
    if line == "abc\n":
        emit = False

loop ? (this is also easy to tweak for cases where you don't want to include
the patterns in the output).

to print to a file instead of stdout, just replace the print line with a f.write call.

</F>
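The two tweaks mentioned above (leaving the marker lines out, and writing to a file instead of stdout) might look roughly like this. It is only a sketch, written for Python 3 rather than the Python 2 used in the thread; the name drop_between and the keep_markers parameter are made up for illustration, and an in-memory io.StringIO stands in for real files.

```python
import io

def drop_between(lines, start, stop, keep_markers=True):
    """Yield lines, omitting those strictly between a start and a stop line.

    With keep_markers=False every line equal to start or stop is dropped too,
    including a stray marker outside any region (an assumption of this sketch).
    """
    emit = True
    for line in lines:
        is_marker = line in (start, stop)
        if line == stop:
            emit = True
        if emit and (keep_markers or not is_marker):
            yield line
        if line == start:
            emit = False

# StringIO objects stand in for the input file and the output file.
sample = io.StringIO("line6\nabc\nline8\nline9\nxyz\nline15\n")
out = io.StringIO()
for line in drop_between(sample, "abc\n", "xyz\n"):
    out.write(line)  # the f.write call that replaces print
```

Because the function is a generator, the input is still read one line at a time, as in the loop above.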

bruno at modulix · May 12, 2006

I wish to delete lines that are in between 'abc' and 'xyz' and print
the rest of the lines. Which is the best way to do it? Should i get
everything into a list, get the index of abc and xyz, then pop the
elements out?

Would be somewhat inefficient IMHO - at least for big files, since it
implies reading the whole file in memory.

or any other better methods?

Don't know if it's better for your actual use case, but this avoids
reading up the whole file:

def skip(iterable, skipfrom, skipuntil):
    """ example usage : """
    skip = False
    for line in iterable:
        if skip:
            if line == skipuntil:
                skip = False
            continue
        else:
            if line == skipfrom:
                skip = True
                continue
        yield line

def main():
    lines = """
line1
line2
line3
line4
line5
line6
abc
line8 <---to be delete
line9 <---to be delete
line10 <---to be delete
line11 <---to be delete
line12 <---to be delete
line13 <---to be delete
xyz
line15
line16
line17
line18
""".strip().split()
    for line in skip(lines, 'abc', 'xyz'):
        print line

HTH

John Machin · May 12, 2006

Ravi Teja said:
In other words ...
lines = open('test.txt').readlines()
for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
    lines.remove(line)

I don't think that's what you really meant.

>>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
>>> for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
...     lines.remove(line)
...
>>> lines
['abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']

Uh-oh.

Try this:

>>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
>>> del lines[lines.index('abc\n') + 1:lines.index('xyz\n')]
>>> lines
['blah', 'fubar', 'abc\n', 'xyz\n', 'xyzzy']
>>>

Of course wrapping it in try/except would be a good idea, not for the
slicing, which behaves itself and does nothing if the 'abc\n' appears
AFTER the 'xyz\n', but for the index() in case the sought markers aren't
there. It might even be a good idea to do it carefully, one piece at a
time: is the abc there? is the xyz there? is the xyz after the abc?
Then del lines[index1+1:index2].
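A minimal sketch of that piece-by-piece check, in Python 3. The function name delete_between and its return value are illustrative only, and looking for the stop marker only after the start marker is one possible policy, not necessarily the one the OP wants.

```python
def delete_between(lines, start="abc\n", stop="xyz\n"):
    """Delete, in place, the items strictly between the first start marker
    and the first stop marker that follows it.

    Returns True if a deletion was attempted, False if either marker is
    missing or no stop marker appears after the start marker.
    """
    try:
        first = lines.index(start)             # is the abc there?
        last = lines.index(stop, first + 1)    # is an xyz there, after the abc?
    except ValueError:
        return False                           # leave the list untouched
    del lines[first + 1:last]
    return True

lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
changed = delete_between(lines)
```

list.index raises ValueError when the value is absent, so a single try/except covers both lookups.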

I wonder what the OP wants to happen in a case like this:

guff1 xyz guff2 abc guff2 xyz guff3
or this:
guff1 abc guff2 abc guff2 xyz guff3

for line in lines:
    print line,

Regular expressions are better in this case

Famous last words.

import re
pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
print re.sub(pat, '', open('test.txt').read())

I don't think you really meant that either.

>>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
>>> linestr = "".join(lines)
>>> linestr
'blahfubarabc\nblahfubarxyz\nxyzzy'
>>> import re
>>> pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
>>> print re.sub(pat, '', linestr)
blahfubarxyzzy
>>>

Uh-oh.

Try this:
'blahfubarabc\nxyz\nxyzzy'

.... and I can't imagine why you're using the confusing [IMHO]
undocumented [AFAICT] feature that the first arg of the module-level
functions like sub and friends can be a compiled regular expression
object. Why not use this:

One-liner fanboys might prefer this:
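The code for these three suggestions does not appear in this copy of the thread, so what follows is a guess and not John's actual code. It is a Python 3 sketch of a lookbehind/lookahead pattern that reproduces the 'blahfubarabc\nxyz\nxyzzy' result shown above, applied once through the compiled pattern's own sub method and once as a single re.sub call.

```python
import re

linestr = 'blahfubarabc\nblahfubarxyz\nxyzzy'

# Assumed pattern: the lookbehind and lookahead require the markers to be
# present but do not consume them, so only the text in between is removed.
pat = re.compile(r'(?<=abc\n).*?(?=xyz\n)', re.DOTALL)

# Calling the method on the compiled pattern, rather than passing the
# compiled object to the module-level re.sub.
result = pat.sub('', linestr)

# The same thing as one expression, with (?s) standing in for re.DOTALL.
one_liner = re.sub(r'(?s)(?<=abc\n).*?(?=xyz\n)', '', linestr)

print(repr(result))
```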

HTH,
John

bruno at modulix · May 12, 2006

Fredrik Lundh wrote:
(snip)

to print to a file instead of stdout, just replace the print line with a f.write call.

Or redirect stdout to a file when calling the program !-)

bruno at modulix · May 12, 2006

bruno said:
(e-mail address removed) wrote:
(snip)

Don't know if it's better for your actual use case, but this avoids
reading up the whole file:

def skip(iterable, skipfrom, skipuntil):
    """ example usage :
    """

(snip code)

Forgot to say this will also skip markers. If you want to keep them, see
the effbot answer...

Tim Chase · May 12, 2006

I wish to delete lines that are in between 'abc' and 'xyz' and print
the rest of the lines. Which is the best way to do it?

While this *is* the python list, you don't specify whether
this is the end goal, or whether it's part of a larger
program. If it *is* the end goal (namely, you just want the
filtered output someplace), and you're not averse to using
other tools, you can do something like

sed -n -e'1,/abc/p' -e'/xyz/,$p' file.txt

which is pretty straight-forward. It translates to

-n        don't print each line by default
-e        execute the following item
1,/abc/   from line 1 through the line where you match "abc"
p         print each line
and also
-e        execute the following item
/xyz/,$   from the line matching "xyz" through the last line
p         print each line

It assumes that
1) there's only one /abc/ & /xyz/ in the file (otherwise, it
defaults to the first one it finds in each case)
2) that they're in that order (otherwise, you'll get 2x each
line, rather than 0x each line)
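For readers who want to see the two ranges and both assumptions in Python terms, here is a rough Python 3 emulation of that sed command. The function name sed_like is made up; the sketch treats the patterns as regular expressions searched anywhere in the line, as sed does, and follows sed's rule that the end of a range is only looked for on lines after the one that opened it.

```python
import re

def sed_like(lines, first=r"abc", second=r"xyz"):
    """Emulate sed -n -e'1,/abc/p' -e'/xyz/,$p' on a list of lines."""
    out = []
    in_head = True    # range 1,/abc/ opens on line 1
    in_tail = False   # range /xyz/,$ opens on the first line matching xyz
    for number, line in enumerate(lines, start=1):
        if in_head:
            out.append(line)
            # The end address is not tested on the line that opened the range.
            if number > 1 and re.search(first, line):
                in_head = False
        if not in_tail and re.search(second, line):
            in_tail = True
        if in_tail:
            out.append(line)   # a line inside both ranges is printed twice
    return out

in_order = sed_like(["line6", "abc", "line8", "xyz", "line15"])
wrong_order = sed_like(["a", "xyz", "b", "abc", "c"])
```

With the markers in the expected order the lines between them are dropped; with the order reversed, the lines from xyz through abc land in both ranges and come out twice.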

However, it's a oneliner here, and seems to be a bit more
complex in python, so if you don't need to integrate the
results into further down-stream python processing, this
might be a nice way to go. If you need the python, others
on the list have offered a panoply of good answers already.

-tkc

Dan Sommers · May 12, 2006

While this *is* the python list, you don't specify whether
this is the end goal, or whether it's part of a larger
program. If it *is* the end goal (namely, you just want the
filtered output someplace), and you're not averse to using
other tools, you can do something like

sed -n -e'1,/abc/p' -e'/xyz/,$p' file.txt

Or even

awk '/abc/,/xyz/' file.txt

Excluding the abc and xyz lines is left as an exercise to the
interested reader.
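A note for anyone trying the awk version: a range pattern with no action prints the lines inside the range, markers included, which is the complement of what the original post asks for. The Python 3 sketch below models the range rule and shows both selections; awk_range is an illustrative name, not part of any library.

```python
import re

def awk_range(lines, start, end):
    """Yield (line, selected) pairs following awk's start,end range rule:
    the range opens on a line matching start and closes on the next line
    (possibly the same one) matching end, both inclusive."""
    inside = False
    for line in lines:
        opened = not inside and re.search(start, line) is not None
        selected = inside or opened
        if selected:
            # Stay inside the range unless this line matches the end pattern.
            inside = re.search(end, line) is None
        yield line, selected

sample = ["line6", "abc", "line8", "xyz", "line15"]
printed_by_range = [line for line, sel in awk_range(sample, r"abc", r"xyz") if sel]
everything_else = [line for line, sel in awk_range(sample, r"abc", r"xyz") if not sel]
```

Here printed_by_range corresponds to what the range pattern alone selects, while everything_else is the inverted selection with the marker lines removed as well.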

Regards,
Dan

Edward Elliott · May 12, 2006

Dan said:
Or even

awk '/abc/,/xyz/' file.txt

Excluding the abc and xyz lines is left as an exercise to the
interested reader.

Once again, us completely disinterested readers get the short end of the
stick.

Ravi Teja · May 12, 2006

I don't think that's what you really meant ^ 2

Right! That was very buggy. That's what I get for posting past 1 AM :-(.

John Savage · May 20, 2006

Tim Chase said:
sed -n -e'1,/abc/p' -e'/xyz/,$p' file.txt

which is pretty straight-forward.

While it looks neat, it will not work when /abc/ matches line 1.
Non-standard versions of sed, e.g., GNU, allow you to use 0,/abc/
to neatly step around this nuisance; but for standard sed you'll
need a more complicated sed script.
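The line 1 problem can be seen in a small Python 3 model of just the first range. This is a sketch under the assumption that head_range and its gnu_zero flag (both made-up names) mirror sed's 1,/abc/ and GNU sed's 0,/abc/ respectively.

```python
import re

def head_range(lines, pattern, gnu_zero=False):
    """Lines selected by 1,/pattern/, or by GNU 0,/pattern/ if gnu_zero."""
    selected = []
    for number, line in enumerate(lines, start=1):
        selected.append(line)
        # With a start address of 1, the end pattern is first tested on
        # line 2; the 0 start address lets it be tested on line 1 as well.
        may_end_here = gnu_zero or number > 1
        if may_end_here and re.search(pattern, line):
            break
    return selected

lines = ["abc", "line2", "xyz", "line4"]
standard = head_range(lines, r"abc")
gnu = head_range(lines, r"abc", gnu_zero=True)
```

When abc sits on line 1 and never occurs again, the standard form keeps going to the end of the input, so nothing is filtered out.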

Baoqiu Cui · Jun 4, 2006

John Machin said:
Uh-oh.

Try this:

'blahfubarabc\nxyz\nxyzzy'

This regexp still has a problem. It may remove the lines between two
lines like 'aaabc' and 'xxxyz' (and also removes the first two 'x's in
'xxxyz').

The following regexp works better:

>>> pattern = re.compile('(?<=^abc\n).*?(?=^xyz\n)', re.DOTALL | re.MULTILINE)
>>> s = '''line1
... abc
... line2
... xyz
... line3
... aaabc
... line4
... xxxyz
... line5'''
>>> print pattern.sub('', s)

line1
abc
xyz
line3
aaabc
line4
xxxyz
line5
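The same check can be run under Python 3 as follows. The anchored pattern is the one given above; the un-anchored pattern is an assumption added only for comparison, to show the 'aaabc' / 'xxxyz' problem being described.

```python
import re

sample = "line1\nabc\nline2\nxyz\nline3\naaabc\nline4\nxxxyz\nline5"

# Anchored: with re.MULTILINE, ^ forces abc and xyz to start their lines.
anchored = re.compile(r'(?<=^abc\n).*?(?=^xyz\n)', re.DOTALL | re.MULTILINE)

# Un-anchored variant (assumed here for comparison): also fires on lines
# that merely end in abc or xyz.
unanchored = re.compile(r'(?<=abc\n).*?(?=xyz\n)', re.DOTALL)

good = anchored.sub('', sample)
bad = unanchored.sub('', sample)

print(good)
```

In the un-anchored result, 'line4' and the first two characters of 'xxxyz' are gone, while the anchored pattern leaves that part of the text alone.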

- Baoqiu

John Machin · Jun 5, 2006

This regexp still has a problem. It may remove the lines between two
lines like 'aaabc' and 'xxxyz' (and also removes the first two 'x's in
'xxxyz').

The following regexp works better:

pattern = re.compile('(?<=^abc\n).*?(?=^xyz\n)', re.DOTALL | re.MULTILINE)

You are quite correct. Your reply, and the rejoinder below, only add
weight to the proposition that regexes are not necessarily the best
choice for every text-processing job.

Just in case the last line is 'xyz' but is not terminated by '\n':

pattern = re.compile('(?<=^abc\n).*?(?=^xyz$)', re.DOTALL | re.MULTILINE)
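A quick Python 3 comparison of the two lookaheads on text whose final 'xyz' has no trailing newline. The variable names are illustrative only.

```python
import re

flags = re.DOTALL | re.MULTILINE
with_newline = re.compile(r'(?<=^abc\n).*?(?=^xyz\n)', flags)
with_dollar = re.compile(r'(?<=^abc\n).*?(?=^xyz$)', flags)

unterminated = "keep\nabc\ndrop\nxyz"      # no '\n' after the last line
terminated = "keep\nabc\ndrop\nxyz\nmore"  # xyz followed by more lines

# The '\n' version finds nothing to remove when the newline is missing.
unchanged = with_newline.sub('', unterminated)

# The '$' version handles both inputs, since $ matches at the end of the
# string and, under re.MULTILINE, just before each newline.
fixed_end = with_dollar.sub('', unterminated)
fixed_mid = with_dollar.sub('', terminated)
```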

Cheers,
John
