Need help in extracting lines from word using python


razinzamada

I'm currently trying to extract some data between 2 lines of an input file using Python. The infile is set up such that there is a line -START- where I need the next 10 lines of code if and only if the -END- condition occurs before the next -START-. The -START- line occurs many times before the -END-. Here's a general example of what I mean:

blah
blah
-START-
10 lines I DONT need
blah
-START-
10 lines I need
blah
blah
-END-
blah
blah
-START-
10 lines I dont need
blah
-START-

..... and so on and so forth

So far I have only been able to get the -START- + 10 lines for every iteration, but I am at a total loss when it comes to specifying the condition to only write if the -END- condition comes before another -START- condition. I'm a bit of a newb, so any help will be greatly appreciated.


Here's the code I have for printing the -START- + 10 lines:

in = open('input.log')
out = open('output.txt', 'a')

lines = in.readlines()
for i, line in enumerate(lines):
    if (line.find('START')) > -1:
        out.write(line)
        out.write(lines[i + 1])
        out.write(lines[i + 2])
        out.write(lines[i + 3])
        out.write(lines[i + 4])
        out.write(lines[i + 5])
        out.write(lines[i + 6])
        out.write(lines[i + 7])
        out.write(lines[i + 8])
        out.write(lines[i + 9])
        out.write(lines[i + 10])
 
Steven D'Aprano

> Here's the code I have for printing the -START- + 10 lines:
>
> in = open('input.log')

No it is not. "in" is a reserved word in Python, so that code cannot
possibly work; it will give a SyntaxError.
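
A quick way to check this without touching any files, as a minimal sketch: the standard keyword module reports whether a name is reserved, and compile() raises the same SyntaxError the interpreter would. The file name '<example>' passed to compile() is only a placeholder.

```python
import keyword

# "in" is a keyword, so it can never be used as a variable name.
is_reserved = keyword.iskeyword("in")

# Compiling the offending line reproduces the error without running anything.
try:
    compile("in = open('input.log')", "<example>", "exec")
    raised = False
except SyntaxError:
    raised = True

print(is_reserved, raised)
```

Renaming the variables (for example to infile and outfile) is enough to get past this particular error.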


Try this code. Untested, but it should do what you want.


infile = open('input.log')
outfile = open('output.txt', 'a')
# Accumulate lines between START and END lines, ignoring everything else.
accum = []       # Start empty so an END before any START is harmless.
collect = False  # Initially we start by ignoring lines.
for line in infile:
    if '-START-' in line:
        # Ignore any lines already seen, and start collecting.
        accum = []
        collect = True
    elif '-END-' in line:
        # Write the first ten accumulated lines.
        outfile.writelines(accum[:10])
        # Clear the accumulated lines.
        accum = []
        # and stop collecting until the next START line
        collect = False
    elif collect:
        accum.append(line)

outfile.close()
infile.close()
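
The same idea can be wrapped in a function so it is easy to try on a small sample before pointing it at a real log. This is only a sketch: the name first_lines_before_end and the limit parameter are made up for illustration, the sample data is invented, and it assumes the markers appear literally as -START- and -END- somewhere in the line.

```python
def first_lines_before_end(lines, limit=10):
    """Return up to `limit` lines after the last -START- before each -END-."""
    result = []
    accum = []       # lines seen since the most recent -START-
    collect = False  # True only while inside a -START- block
    for line in lines:
        if '-START-' in line:
            # A new -START- discards whatever the previous one collected.
            accum = []
            collect = True
        elif '-END-' in line:
            # Keep only the first `limit` lines of the block that just closed.
            result.extend(accum[:limit])
            accum = []
            collect = False
        elif collect:
            accum.append(line)
    return result


# Invented sample: only the block closed by -END- should survive.
sample = [
    "blah\n", "-START-\n", "not wanted\n", "-START-\n",
    "wanted 1\n", "wanted 2\n", "-END-\n", "blah\n",
    "-START-\n", "not wanted either\n",
]
print(first_lines_before_end(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly and the result handed to outfile.writelines().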
 
Dave Angel

> I'm currently trying to extract some data between 2 lines of an input file

Your subject line says "from word". I'm only guessing that you might
mean Microsoft Word, a proprietary program that does not, by default,
save text files. The following code and description assumes a text
file, so there's a contradiction.

> using Python. The infile is set up such that there is a line -START- where I need the next 10 lines of code if and only if the -END- condition occurs before the next -START-. The -START- line occurs many times before the -END-. Here's a general example of what I mean:

In other words, you want to scan for -END-, then go backwards to -START-
and use the first ten of the lines between? Try coding it that way, and
perhaps it'll be easier.

You also need to consider (and specify behavior for) the possibility
that start and end are less than 10 lines apart.
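
One possible reading of that approach, as a rough sketch rather than a finished solution: the function name lines_before_each_end and the limit parameter are hypothetical, the whole file is assumed to fit in memory as a list, and two behaviours are my own assumptions. A block shorter than ten lines is kept as-is, and an -END- with no open -START- before it is skipped.

```python
def lines_before_each_end(lines, limit=10):
    """For every -END-, walk back to the nearest -START- and keep the
    first `limit` lines that follow that -START-."""
    result = []
    for end_index, line in enumerate(lines):
        if '-END-' not in line:
            continue
        # Walk backwards from the -END- line looking for the nearest -START-.
        start_index = end_index - 1
        while start_index >= 0 and '-START-' not in lines[start_index]:
            if '-END-' in lines[start_index]:
                # An earlier -END- comes first, so no block is open here.
                start_index = -1
                break
            start_index -= 1
        if start_index < 0:
            continue  # assumption: an unmatched -END- is ignored
        block = lines[start_index + 1:end_index]
        # Slicing keeps whatever is there when the block is short.
        result.extend(block[:limit])
    return result


# Invented sample data for a quick check.
sample = ["blah", "-START-", "a", "-START-", "b", "c", "-END-", "x", "-END-", "-START-", "d"]
print(lines_before_each_end(sample))
```

With a real file, lines would come from infile.readlines() and the result would go to outfile.writelines().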

> Here's the code I have for printing the -START- + 10 lines:
>
> in = open('input.log')
> out = open('output.txt', 'a')
>
> lines = in.readlines()
> for i, line in enumerate(lines):
>     if (line.find('START')) > -1:
>         out.write(line)
>         out.write(lines[i + 1])
>         out.write(lines[i + 2])
>         out.write(lines[i + 3])
>         out.write(lines[i + 4])
>         out.write(lines[i + 5])
>         out.write(lines[i + 6])
>         out.write(lines[i + 7])
>         out.write(lines[i + 8])
>         out.write(lines[i + 9])
>         out.write(lines[i + 10])

or just out.writelines(lines[i:i+11]) to write out all 11 of them.
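
One more point in favour of the slice, shown as a small sketch with made-up data: a slice that runs past the end of the list is simply cut short, whereas indexing lines[i + 10] directly raises IndexError when -START- is near the end of the file. A slice is a list, so it goes to writelines() rather than write(). Here io.StringIO stands in for the real output file.

```python
import io

lines = ["-START-\n", "one\n", "two\n"]  # made-up data: only 2 lines follow -START-
i = 0

out = io.StringIO()              # stands in for the real output file
out.writelines(lines[i:i + 11])  # the slice stops at the end of the list
written = out.getvalue()

try:
    lines[i + 10]                # direct indexing past the end fails
    index_failed = False
except IndexError:
    index_failed = True

print(repr(written), index_failed)
```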
 
razinzamada

Thanks, Steven.

 
razinzamada

Thanks, Dave.

 