[Newbie] Search-and-delete text processing problem...

Todd_Calhoun

I'm trying to learn about text processing in Python, and I'm trying to
tackle what should be a simple task.

I have long text files of books with a citation between each paragraph,
which might be like "Bill D. Smith, History through the Ages, p.5".

So, I need to search for every line that starts with a certain string (in
this example, "Bill D. Smith"), and delete the whole line.

I've tried a couple of different things, but none seem to work. Here's my
latest try. I apologize in advance for being so clueless.

##########################
#Text search and delete line tool

theInFile = open("test_f.txt", "r")
theOutFile = open("test_f_out.txt", "w")

allLines = theInFile.readlines()

for line in allLines:
    if line[3] == 'Bill':
        line == ' '


theOutFile.writelines(allLines)
#########################

I know I could do it in Word fairly easily, but I'd like to learn the Python
way to do things.

Thanks for any advice.
 
M.E.Farmer

Strings have many methods that are worth learning.
If you haven't already discovered dir(str), try it.
Also, I am not sure whether you were just typing in some pseudocode, but your
use of writelines is incorrect.
Py> help(file.writelines)
Help on built-in function writelines:

writelines(...)
    writelines(sequence_of_strings) -> None. Write the strings to the file.

    Note that newlines are not added. The sequence can be any iterable
    object producing strings. This is equivalent to calling write() for
    each string.
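The "newlines are not added" note in the help text is easy to check. The following is a minimal sketch, assuming Python 3 and using io.StringIO as a stand-in for a real file so nothing touches the disk; the sample strings are made up for illustration.

```python
import io

# io.StringIO behaves like a text file opened for writing,
# which keeps this sketch independent of any file on disk.
buf = io.StringIO()

# The first string has no trailing newline, so it runs straight
# into the second one: writelines() writes the strings as given.
buf.writelines(["first", "second\n", "third\n"])

written = buf.getvalue()
print(repr(written))
```

Because readlines() keeps the '\n' at the end of each line, passing its result back to writelines() reproduces the original line breaks.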

##########################
#Text search and delete line tool
theInFile = open("test_f.txt", "r")
theOutFile = open("test_f_out.txt", "w")
allLines = theInFile.readlines()
theInFile.close()

for line in allLines:
    if not line.startswith('Bill'):
        theOutFile.write(line)

theOutFile.close()
#########################

# You can also accumulate lines
# in a list then write them all at once
##########################
#Text search and delete line tool
theInFile = open("test_f.txt", "r")
theOutFile = open("test_f_out.txt", "w")
allLines = theInFile.readlines()
theInFile.close()

outLines = []

for line in allLines:
    if not line.startswith('Bill'):
        outLines.append(line)

theOutFile.writelines(outLines)
theOutFile.close()
#########################

hth,
M.E.Farmer
 
M.E.Farmer

My apologies, you did indeed use writelines correctly ;)
Doh!
I had a gut reaction to this:
Py> f = ['hij\n', 'efg\n', 'abc\n']
Py> for i in f:
...     if i.startswith('a'):
...         i == ''
...
Py> f
['hij\n', 'efg\n', 'abc\n']
Notice that it does not modify the list in any way.
Two things are going on here. First, == is a comparison, not an assignment,
so line == ' ' just computes a value and throws it away. Second, even if you
write line = ' ', that only rebinds the loop name to a new string; it does
not rebind the element in the list.
You are trying to loop through the list and modify the items in place, and
that just won't work this way.
It is also bug-prone to modify a list you are looping over.
M.E.Farmer
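
The difference between rebinding the loop name and changing the list can be sketched as follows. This is only an illustration using the same sample list as the session above; the names unchanged, blanked and kept are made up for the example.

```python
lines = ['hij\n', 'efg\n', 'abc\n']

# Rebinding the loop name: the list itself is left untouched.
for line in lines:
    if line.startswith('a'):
        line = ''
unchanged = list(lines)

# Assigning through an index does replace the element in place,
# but it leaves an empty string behind rather than removing the item.
blanked = list(lines)
for i, line in enumerate(blanked):
    if line.startswith('a'):
        blanked[i] = ''

# Building a new list is the simplest way to actually drop the lines.
kept = [line for line in lines if not line.startswith('a')]

print(unchanged)
print(blanked)
print(kept)
```

Building a new list also sidesteps the problem of modifying a list while looping over it.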
 
Bengt Richter

> I'm trying to learn about text processing in Python, and I'm trying to
> tackle what should be a simple task.
>
> I have long text files of books with a citation between each paragraph,

Most text files aren't long enough to worry about, but you can avoid
reading in the whole file by just iterating, one line at a time. That is
the way a file object iterates by default, so there's not much to that.

> which might be like "Bill D. Smith, History through the Ages, p.5".
>
> So, I need to search for every line that starts with a certain string (in
> this example, "Bill D. Smith"), and delete the whole line.

If you want to test what a string starts with, there's a string method for that.
E.g., if line is the string representing one line, line.startswith('Bill') would
return True or False.
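
For comparison, here is a small sketch of what indexing, slicing and startswith each give for the citation line from the question. The short string and the variable names are made up for illustration.

```python
line = "Bill D. Smith, History through the Ages, p.5\n"

fourth_char = line[3]                  # indexing gives ONE character: 'l'
first_four = line[:4]                  # slicing gives the first four: 'Bill'
starts = line.startswith("Bill")       # True

# Slicing and startswith are both safe on short lines,
# whereas short[3] would raise IndexError here.
short = "ab\n"
short_slice = short[:4]                # 'ab\n', no exception
short_starts = short.startswith("Bill")  # False

# startswith is case sensitive; lowercasing first makes the test case-insensitive.
ci_starts = "BILL D. SMITH".lower().startswith("bill")

print(fourth_char, first_four, starts, short_starts, ci_starts)
```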
> I've tried a couple of different things, but none seem to work. Here's my
> latest try. I apologize in advance for being so clueless.
>
> ##########################
> #Text search and delete line tool
>
> theInFile = open("test_f.txt", "r")
> theOutFile = open("test_f_out.txt", "w")
>
> allLines = theInFile.readlines()

This will create a list of lines, all (except perhaps the last, if
the file had no end-of-line character(s) at the very end) with '\n'
as the last character. There are various ways to strip the line ends,
but your use case doesn't appear to require it.

> for line in allLines:

    # line at this point contains each line successively as the loop proceeds,
    # but you don't know where in the sequence you are unless you provide for it,
    # e.g. by using
    for i, line in enumerate(allLines):

>     if line[3] == 'Bill':

The above line is comparing the 4th character of the line (indexing from 0) with 'Bill',
which is never going to be true, and it will raise an IndexError if the line is shorter than
4 characters. Not what you want to do.

        if line.startswith('Bill'): # note that this is case sensitive. Otherwise use line.lower().startswith('bill')

>         line == ' '

This is a comparison, not an assignment, so it changes nothing. The enumerate will give you
an index you can use to assign into the list, but I doubt you want an invisible space
without a line ending in place of 'Bill ... \n'.

            allLines[i] = '\n' # makes an actual blank line, but you want to delete the line, so this is not going to work either

> theOutFile.writelines(allLines)

UIAM (untested) you should be able to do the entire job of removing lines that start with 'Bill' thus:

theInFile = open("test_f.txt", "r")
theOutFile = open("test_f_out.txt", "w")
theOutFile.writelines(line for line in theInFile if not line.startswith('Bill'))

Or just the one line

open("test_f_out.txt", "w").writelines(L for L in open("test_f.txt") if not L.startswith('Bill'))
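
Neither version above closes its files explicitly. A possible variant, sketched here on the assumption that a with statement is available (Python 2.6 or later, or any Python 3), closes both files even if an error occurs. The function name remove_lines_starting_with and the temporary demo file are made up for illustration and are not part of the original suggestion.

```python
import os
import tempfile


def remove_lines_starting_with(in_path, out_path, prefix):
    """Copy in_path to out_path, skipping lines that start with prefix."""
    with open(in_path, "r") as in_file, open(out_path, "w") as out_file:
        # The file object yields one line at a time, so the whole
        # file is never held in memory.
        out_file.writelines(
            line for line in in_file if not line.startswith(prefix)
        )


# Small demonstration on a throwaway file with made-up contents.
work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, "test_f.txt")
out_path = os.path.join(work_dir, "test_f_out.txt")

with open(in_path, "w") as f:
    f.write("First paragraph.\n")
    f.write("Bill D. Smith, History through the Ages, p.5\n")
    f.write("Second paragraph.\n")

remove_lines_starting_with(in_path, out_path, "Bill")

with open(out_path) as f:
    result = f.read()
print(repr(result))
```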

If you need to remove lines starting with any name in a certain list, you can do that too, e.g.,

delStarts = ['Bill', 'Bob', 'Sue']
theInFile = open("test_f.txt", "r")
theOutFile = open("test_f_out.txt", "w")
for line in theInFile:
    for name in delStarts:
        if line.startswith(name): break
    else: # will happen if there is NO break, so line does not start with any delStarts name
        theOutFile.write(line) # write line out if not starting badly

(You could do that with a one-liner too, but it gets silly ;-)

If you have a LOT of names to check for, it could pay you to figure out a way to split off the name
from the front of a line, and check whether that is in a set instead of using a delStarts list.
If you do use delStarts, put the most popular names at the front.
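
Both ideas can be sketched briefly. str.startswith also accepts a tuple of prefixes, which replaces the inner loop, and the set-based approach needs some rule for splitting off the name. The rule used here (everything before the first comma is the author) is only an assumption based on the sample citation, and the sample lines, del_authors and author_of are made up for illustration.

```python
# str.startswith accepts a tuple of prefixes, so no inner loop is needed.
del_starts = ("Bill", "Bob", "Sue")

lines = [
    "Some paragraph text.\n",
    "Bill D. Smith, History through the Ages, p.5\n",
    "Sue Jones, Another Book, p.12\n",       # hypothetical citation
    "Billboards were common.\n",             # ordinary text that happens to start with 'Bill'
]

kept_by_prefix = [ln for ln in lines if not ln.startswith(del_starts)]

# Set-based variant: ASSUMES the author name is everything before the
# first comma, as in the sample citation from the question.
del_authors = {"Bill D. Smith", "Sue Jones"}


def author_of(line):
    """Return the text before the first comma, stripped of whitespace."""
    return line.split(",", 1)[0].strip()


kept_by_author = [ln for ln in lines if author_of(ln) not in del_authors]

print(kept_by_prefix)
print(kept_by_author)
```

Note that the short prefix 'Bill' also removes the "Billboards" line, while matching the full author name does not; a set lookup also stays fast no matter how many names are listed.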

> #########################
>
> I know I could do it in Word fairly easily, but I'd like to learn the Python
> way to do things.

Have fun.

> Thanks for any advice.

HTH (nothing tested, sorry ;-)

Regards,
Bengt Richter
 
