iterating over a file with two pointers

nikhil Pandey

Hi,
I want to iterate over the lines of a file, and when I find certain lines, I need another loop that starts at the line after that "CERTAIN" line and runs for a few (say 20) lines.
So basically I need two pointers to lines: one for the outer loop (for each line in the file) and one for the inner loop. How can I do that in Python?
Please help. I am stuck on this.
 
Chris Angelico

hi,
I want to iterate over the lines of a file and when i find certain lines, i need another loop starting from the next of that "CERTAIN" line till a few (say 20) lines later.
so, basically i need two pointers to lines (one for outer loop(for each line in file)) and one for inner loop. How can i do that in python?
please help. I am stuck up on this.

After the inner loop finishes, do you want to go back to where the
outer loop left off, or should the outer loop continue from the point
where the inner loop stopped? In other words, do you want to locate
overlapping sections, or not? Both are possible, but the solutions
will look somewhat different.

ChrisA
 
Dave Angel

After the inner loop finishes, do you want to go back to where the
outer loop left off, or should the outer loop continue from the point
where the inner loop stopped? In other words, do you want to locate
overlapping sections, or not? Both are possible, but the solutions
will look somewhat different.

In addition, is this really a text file? For binary files, you could
use seek(), and manage things yourself. But that's not strictly legal
in a text file, and may work on one system, not on another.

I'd suggest you open the file twice, and get two file objects. Then you
can iterate over them independently.

Or if the file is under a few hundred meg, just do a readlines, and do
the two iterators over that. That way, the inner loop could just
iterate over a simple slice.



infile = open(.... "rb")
lines = infile.readlines()
infile.close()

for index, line in enumerate(lines):
    # the (up to) 20 lines that follow the current one
    for inner in lines[index+1:index+21]:
        ...
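
A minimal runnable sketch of the readlines-plus-slice idea, in Python 3. The helper name `following_lines`, the `is_match` predicate and the sample data are made up for illustration; they are not from the post above.

```python
def following_lines(lines, is_match, count=20):
    """Yield (matching_line, next `count` lines) for every match in `lines`."""
    for index, line in enumerate(lines):
        if is_match(line):
            # The slice starts just after the match; slicing past the end
            # of the list is safe and simply gives fewer lines.
            yield line, lines[index + 1:index + 1 + count]


# Hypothetical sample data standing in for infile.readlines().
sample = ["alpha\n", "*beta\n", "*gamma\n", "delta\n", "epsilon\n", "zeta\n", "*eta\n"]
windows = list(following_lines(sample, lambda s: "*" in s, count=3))
for match, window in windows:
    print(match.strip(), "->", [w.strip() for w in window])
```

Because the inner loop works on a slice, the outer loop's position is never disturbed, so overlapping sections come for free.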
 
Peter Otten

nikhil said:
hi,
I want to iterate over the lines of a file and when i find certain lines,
i need another loop starting from the next of that "CERTAIN" line till a
few (say 20) lines later. so, basically i need two pointers to lines (one
for outer loop(for each line in file)) and one for inner loop. How can i
do that in python? please help. I am stuck up on this.

Here's an example that prints the three lines following a line containing a
'*':

Example data:

$ cat tmp.txt
alpha
*beta
*gamma
delta
epsilon
zeta
*eta

The python script:

$ cat tmp.py
from itertools import islice, tee

with open("tmp.txt") as f:
    while True:
        for outer in f:
            print outer,
            if "*" in outer:
                f, g = tee(f)
                for inner in islice(g, 3):
                    print " ", inner,
                break
        else:
            break

The script's output:

$ python tmp.py
alpha
*beta
  *gamma
  delta
  epsilon
*gamma
  delta
  epsilon
  zeta
delta
epsilon
zeta
*eta
$

As you can see the general logic is relatively complex; it is likely that we
can come up with a simpler solution if you describe your actual requirement
in more detail.
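
For readers on Python 3, here is a sketch of the same tee/islice logic wrapped in a function that works on any iterable of lines. The function name `scan`, its parameters and the in-memory sample list are assumptions made for this illustration.

```python
from itertools import islice, tee


def scan(lines, marker="*", count=3):
    """Return each line, followed (indented) by up to `count` lines after a match."""
    out = []
    it = iter(lines)
    while True:
        for outer in it:
            out.append(outer)
            if marker in outer:
                # Split the iterator: `it` keeps the outer position,
                # `ahead` runs forward for the inner loop.
                it, ahead = tee(it)
                for inner in islice(ahead, count):
                    out.append("  " + inner)
                break  # restart the for loop on the new `it`
        else:
            break  # input exhausted
    return out


rows = scan(["alpha", "*beta", "*gamma", "delta", "epsilon", "zeta", "*eta"])
print("\n".join(rows))
```

The `break` followed by the surrounding `while True` is needed because the running `for` loop is still bound to the old iterator; rebinding `it` only takes effect when a new `for` loop starts.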
 
nikhil Pandey

Chris Angelico said:
After the inner loop finishes, do you want to go back to where the
outer loop left off, or should the outer loop continue from the point
where the inner loop stopped? In other words, do you want to locate
overlapping sections, or not? Both are possible, but the solutions
will look somewhat different.

Hi Chris,
After the inner loop finishes, I want to go back to the line after the one where the outer loop left off, i.e. the lines of the inner loop will be traversed again in the outer loop.
1>> I iterate over the lines of the file.
2>> When I find a match in a certain line, I start another loop that runs until some condition is met in the subsequent lines.
3>> Then I come back to where I left off and repeat 1. (Ideally I want to delete the line in the inner loop where that condition is met, but if it is not deleted, that's OK.)
 
nikhil Pandey

hi,
I want to iterate in the inner loop by reading each line until some condition is met. How can I do that? Thanks for this code.
 
Peter Otten

I want to iterate in the inner loop by reading each line till some
condition is met.how can i do that. Thanks for this code.

That's not what I had in mind when I asked you to

Anyway, change

[...][...]

to

        f, g = tee(f)
        for inner in g:
            if some condition:
                break
            print " ", inner,
        break

in my example.
 
Roy Smith


Dave Angel said:
In addition, is this really a text file? For binary files, you could
use seek(), and manage things yourself. But that's not strictly legal
in a text file, and may work on one system, not on another.

Why is seek() not legal on a text file? The only issue I'm aware of is
the note at http://docs.python.org/2/library/stdtypes.html, which says:

"On Windows, tell() can return illegal values (after an fgets()) when
reading files with Unix-style line-endings. Use binary mode ('rb') to
circumvent this problem."

so, don't do that (i.e. read unix-line-terminated files on windows).
But assuming you're not in that situation, it seems like something like
this should work:

Dave Angel said:
I'd suggest you open the file twice, and get two file objects. Then you
can iterate over them independently.

and use tell() to keep them in sync. Something along the lines of (not
tested):

f1 = open("my_file")
f2 = open("my_file")

while True:
    where = f1.tell()
    line = f1.readline()
    if not line:
        break
    if matches_pattern(line):
        f2.seek(where)
        for i in range(20):
            line = f2.readline()
            print line

Except for the specific case noted above (i.e. reading a unix file on a
windows box, so don't do that), it doesn't matter that seek() does funny
things with windows line endings, because tell() does the same funny
things. Doing f2.seek(f1.tell()) will get the two file pointers into
the same place in both files.
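
A Python 3 sketch of the two-file-object approach, assuming a small text file that both objects open with the same default settings. The function name `lines_after_matches`, the predicate and the temporary sample file are hypothetical. Unlike the untested snippet above, it calls tell() after reading the matching line, so the inner loop starts on the line that follows the match.

```python
import os
import tempfile


def lines_after_matches(path, is_match, count=20):
    """For each matching line, collect up to `count` following lines
    using a second file object positioned with tell()/seek()."""
    results = []
    with open(path) as f1, open(path) as f2:
        while True:
            # readline() rather than `for line in f1`, so tell() stays usable.
            line = f1.readline()
            if not line:
                break
            if is_match(line):
                # f1 now sits just past the matching line; send f2 there.
                f2.seek(f1.tell())
                window = []
                for _ in range(count):
                    nxt = f2.readline()
                    if not nxt:
                        break
                    window.append(nxt)
                results.append((line, window))
    return results


# Hypothetical sample file, written to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\n*beta\n*gamma\ndelta\nepsilon\nzeta\n*eta\n")
found = lines_after_matches(tmp.name, lambda s: "*" in s, count=3)
os.remove(tmp.name)
print(found)
```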
 
Oscar Benjamin

hi,
I want to iterate over the lines of a file and when i find certain lines,
i need another loop starting from the next of that "CERTAIN" line till a
few (say 20) lines later.
so, basically i need two pointers to lines (one for outer loop(for each
line in file)) and one for inner loop. How can i do that in python?
please help. I am stuck up on this.
[...]

Dave Angel said:
In addition, is this really a text file? For binary files, you could
use seek(), and manage things yourself. But that's not strictly legal
in a text file, and may work on one system, not on another.

Why is seek() not legal on a text file? The only issue I'm aware of is
the note at http://docs.python.org/2/library/stdtypes.html, which says:

"On Windows, tell() can return illegal values (after an fgets()) when
reading files with Unix-style line-endings. Use binary mode ('rb') to
circumvent this problem."

so, don't do that (i.e. read unix-line-terminated files on windows).
But assuming you're not in that situation, it seems like something like
this should work:
I'd suggest you open the file twice, and get two file objects. Then you
can iterate over them independently.

There's no need to use OS resources by opening the file twice or to
screw up the IO caching with seek(). Peter's version holds just as
many lines as is necessary in an internal Python buffer and performs
the minimum possible amount of IO. I would expect this to be more
efficient as well as less error-prone on Windows.


Oscar
 
Roy Smith

There's no need to use OS resources by opening the file twice or to
screw up the IO caching with seek().

There's no reason NOT to use OS resources. That's what the OS is there for; to make life easier on application programmers. Opening a file twice costs almost nothing. File descriptors are almost as cheap as whitespace.
Peter's version holds just as many lines as is necessary in an
internal Python buffer and performs the minimum possible
amount of IO.

I believe by "Peter's version", you're talking about:
from itertools import islice, tee

with open("tmp.txt") as f:
    while True:
        for outer in f:
            print outer,
            if "*" in outer:
                f, g = tee(f)
                for inner in islice(g, 3):
                    print " ", inner,
                break
        else:
            break


There's this note from http://docs.python.org/2.7/library/itertools.html#itertools.tee:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().


I have no idea how that interacts with the pattern above where you call tee() serially. You're basically doing

with open("my_file") as f:
    while True:
        f, g = tee(f)

Are all of those g's just hanging around, eating up memory, while waiting to be garbage collected? I have no idea. But I do know that no such problems exist with the two file descriptor versions.
 
Travis Griggs

Hi Chris,
After the inner loop finishes, I want to go back to the next line from where the outer loop was left i.e the lines of the inner loop will be traversed again in the outer loop.
1>>I iterate over lines of the file
2>> when i find a match in a certain line, i start another loop till some condition is met in the subsequent lines
3>> then i come back to where i left and repeat 1(ideally i want to delete that line in inner loop where that condition is met, but even if it is not deleted, its OK)


Just curious, do you really need two loops and file handles? You haven't given many details about what you're really doing, but from what you've provided so far, it seems to me that just iterating over the lines of the file, and using a latch boolean to indicate when you should do additional processing on lines, might be easier. I modified Peter's example input to look like:

alpha
*beta
gamma+
delta
epsilon
zeta
*eta
kappa
tau
pi+
omicron

And then shot it with the following:

#!/usr/bin/env python3
with open("samplein.txt") as file:
    reversing = False
    for line in (raw.strip() for raw in file):
        if reversing:
            print('____', line[::-1], '____')
            reversing = not line.endswith('+')
        else:
            print(line)
            reversing = line.startswith('*')
Which begins reversing lines as it's working through them, until a different condition is met.

Travis Griggs
 
Dave Angel

There's no reason NOT to use OS resources. That's what the OS is there for; to make life easier on application programmers. Opening a file twice costs almost nothing. File descriptors are almost as cheap as whitespace.


I believe by "Peter's version", you're talking about:



There's this note from http://docs.python.org/2.7/library/itertools.html#itertools.tee:



I have no idea how that interacts with the pattern above where you call tee() serially. You're basically doing

with open("my_file") as f:
    while True:
        f, g = tee(f)

Are all of those g's just hanging around, eating up memory, while waiting to be garbage collected? I have no idea. But I do know that no such problems exist with the two file descriptor versions.

And if you're willing to ignore the possibility that the text file has
unix line endings, I'm willing to ignore the possibility that the text
file has a huge number of lines. Everything is MUCH simpler if one
assumes readlines() will work. Most of these other approaches are much
more complex than the OP probably needs, if he ever gets around to
actually describing his requirements.

BTW, please post in text, all that html is really annoying.
 
Steven D'Aprano

I want to iterate in the inner loop by reading each line till some
condition is met.how can i do that. Thanks for this code.

while not condition:
    read line


Re-write using Python syntax, and you are done.
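
One possible rewrite of that pseudocode, as a sketch. The helper name `read_until` and the sample condition are invented for illustration, and `io.StringIO` stands in for a real file object.

```python
import io


def read_until(f, condition):
    """Read lines from file object `f` until `condition(line)` is true or EOF."""
    collected = []
    while True:
        line = f.readline()
        # An empty string means end of file; the line that satisfies the
        # condition is consumed but not collected.
        if not line or condition(line):
            break
        collected.append(line)
    return collected


# io.StringIO stands in for a real file here.
f = io.StringIO("delta\nepsilon\npi+\nomicron\n")
block = read_until(f, lambda s: s.rstrip().endswith("+"))
print(block)
```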
 
Steven D'Aprano

hi,
I want to iterate over the lines of a file and when i find certain
lines, i need another loop starting from the next of that "CERTAIN" line
till a few (say 20) lines later. so, basically i need two pointers to
lines (one for outer loop(for each line in file)) and one for inner
loop. How can i do that in python? please help. I am stuck up on this.

No, you don't "need" two pointers to lines. That is just one way to solve
this problem. You can solve it many ways.

One way, for small files (say, under one million lines), is to read the
whole file into a list, then have two pointers to a line:

lines = file.readlines()
p = q = 0

while p < len(lines):
    print(lines[p])
    p += 1


then advance the pointers p and q as needed. This is the most flexible
way to do it: you can have as many pointers as needed, you can
backtrack, jump forward, jump back, and it is all high-speed
random-access memory accesses. Except for the initial readlines, none of
it is slow I/O processing.
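
A sketch of how the two indices might be advanced for the requirement described earlier in the thread (the inner loop runs until a condition is met, then the outer loop resumes on the line after the match). The names `process`, `is_start` and `is_end` and the sample data are assumptions for this example.

```python
def process(lines, is_start, is_end):
    """Outer index p visits every line; inner index q scans ahead from
    p + 1 until is_end(line) is true, then the outer loop resumes at p + 1."""
    events = []
    p = 0
    while p < len(lines):
        events.append(("outer", lines[p]))
        if is_start(lines[p]):
            q = p + 1
            while q < len(lines) and not is_end(lines[q]):
                events.append(("inner", lines[q]))
                q += 1
        p += 1  # the inner lines are visited again by the outer loop
    return events


sample = ["alpha", "*beta", "gamma", "delta+", "epsilon"]
events = process(sample, lambda s: s.startswith("*"), lambda s: s.endswith("+"))
for kind, text in events:
    print(kind, text)
```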


Another solution is to use a state-machine:


for line in somefile:
    if state == SCANNING:
        do_something()
    elif state == PROCESSING:
        do_something_else()
    elif state == WOBBLING:
        wobble()
    state = adjust_state(line)


You can combine the two, of course, and have a state machine with
multiple pointers to a list of lines.
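
To make the state-machine outline concrete, here is a small sketch with two states. The transition rules (a leading "*" starts processing, a trailing "+" ends it) are borrowed from the sample data elsewhere in this thread and are only an assumption about what the real conditions might be.

```python
SCANNING, PROCESSING = "scanning", "processing"


def run(lines):
    """Tag each line with the state it was handled in."""
    state = SCANNING
    tagged = []
    for line in lines:
        tagged.append((state, line))
        # Decide the state for the *next* line.
        if state == SCANNING and line.startswith("*"):
            state = PROCESSING
        elif state == PROCESSING and line.endswith("+"):
            state = SCANNING
    return tagged


tagged = run(["alpha", "*beta", "gamma+", "delta", "*eta", "kappa"])
for state, line in tagged:
    print(state, line)
```

A single pass like this never revisits a line, so it suits the non-overlapping case; overlapping sections need one of the buffered or indexed approaches.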

Using itertools.tee, you can potentially combine these solutions with the
straightforward for-loop over a list. The danger of itertools.tee is that
it may use as much memory as reading the entire file into memory at once,
but the benefit is that it may use much less. But personally, I find
list-based processing with random-access by index much easier to understand
than itertools.tee solutions.
 
Joshua Landau

Although "tee" is most certainly preferable because IO is far slower
than the small amounts of memory "tee" will use, you do have this
option:

def iterate_file_lines(file):
    """
    Iterate over lines in a file, unlike normal
    iteration this allows seeking.
    """
    while True:
        # use the parameter, not the global "thefile"
        line = file.readline()
        if not line:
            break

        yield line


thefile = open("/tmp/thefile")
thelines = iterate_file_lines(thefile)

for line in thelines:
    print("Outer:", repr(line))

    if is_start(line):
        outer_position = thefile.tell()

        for line in thelines:
            print("Inner:", repr(line))

            if is_end(line):
                break

        thefile.seek(outer_position)

It's simpler than having two files but probably not faster. "tee" will
almost certainly be a much better choice (unless the subsections can't
fit in memory), and this approach forfeits being able to change up the
order of these things.

If you want to change up the order to another defined order, you can
think about storing the subsections, but if you want to support
independent iteration you'll need to seek before every "readline"
which is a bit silly.

Basically, read it all into memory like Steven D'Aprano suggested. If
you really don't want to, use "tee". If you can't handle non-constant
memory usage (really? You're reading lines, man) I'd suggest my
method. If you can't handle the inflexibility there, use multiple
files.

There, is that enough choices?
 
Peter Otten

Roy said:
There's no reason NOT to use OS resources. That's what the OS is there
for; to make life easier on application programmers. Opening a file twice
costs almost nothing. File descriptors are almost as cheap as whitespace.


I believe by "Peter's version", you're talking about:



There's this note from
http://docs.python.org/2.7/library/itertools.html#itertools.tee:



I have no idea how that interacts with the pattern above where you call
tee() serially.

As I understand it the above says that

items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass

stores 1000 items and can go on forever, but

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory and that if items is finite using a list
instead of tee is more efficient. The documentation says nothing about

items = infinite()
a, b = tee(items)
del a
for item in b:
    pass

so you have to trust Mr Hettinger or come up with a test case...
You're basically doing

with open("my_file") as f:
    while True:
        f, g = tee(f)

Are all of those g's just hanging around, eating up memory, while waiting
to be garbage collected? I have no idea.

I'd say you've just devised a nice test to find out ;)
But I do know that no such
problems exist with the two file descriptor versions.

The trade-offs are different. My version works with arbitrary iterators
(think stdin), but will consume unbounded amounts of memory when the inner
loop doesn't stop.
 
Oscar Benjamin

This is referring to the case where your two iterators get out of sync
by a long way. If you only consume 3 extra items it will just store
those 3 items in a list.

Fair point.
As I understand it the above says that

items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass

stores 1000 items and can go on forever, but

items = infinite()
a, b = tee(items)
for item in a:
    pass

will consume unbounded memory and that if items is finite using a list
instead of tee is more efficient. The documentation says nothing about

items = infinite()
a, b = tee(items)
del a
for item in b:
    pass

so you have to trust Mr Hettinger or come up with a test case...


I'd say you've just devised a nice test to find out ;)

$ cat tee.py
#!/usr/bin/env python

import sys
from itertools import tee

items = iter(range(int(sys.argv[1])))

while True:
    for x in items:
        items, discard = tee(items)
        break
    else:
        break

print(x)

$ time py -3.3 ./tee.py 100000000
99999999

real 1m47.711s
user 0m0.015s
sys 0m0.000s

While running the above python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works as follows
(which I made up but it's how I imagine it).

When you call tee(iterator) it creates two _tee objects and one
_teelist object. The _teelist object stores all of the items that have
been seen by only one of _tee1 and _tee2, a reference to iterator and
a flag indicating which _tee object has seen more items. When say
_tee2 is deallocated the _teelist becomes singly owned and no longer
needs to ever accumulate items (so it doesn't). So the dereferenced
discard will not cause an arbitrary growth in memory usage.

There is a separate problem which is that if you call tee() multiple
times then you end up with a chain of tees and each next call would go
through each one of them. This would cause a linear growth in the time
taken to call next() leading to quadratic time performance overall.
However, this does not occur with the script I showed above. In
principle it's possible for a _tee object to realise that there is a
chain of singly owned _tee and _teelist objects and bypass them
calling next() on the original iterator but I don't know if this is
what happens.

However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran slower so maybe this depends
on optimisations that were introduced in 3.x.

Here's an alternate iterator recipe that doesn't depend on these optimisations:

from itertools import islice
from collections import deque

class Peekable(object):

    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.peeked = deque()

    def __iter__(self):
        while True:
            # hand out lines that peek() already pulled in before reading more
            while self.peeked:
                yield self.peeked.popleft()
            try:
                val = next(self.iterator)
            except StopIteration:
                # end the generator cleanly; under PEP 479 (Python 3.7+) a
                # StopIteration escaping here would become a RuntimeError
                return
            yield val

    def peek(self):
        for p in self.peeked:
            yield p
        for val in self.iterator:
            self.peeked.append(val)
            yield val

with open("tmp.txt") as f:
    f = Peekable(f)
    for outer in f:
        print outer,
        if "*" in outer:
            for inner in islice(f.peek(), 3):
                print " ", inner,


Oscar
 
Peter Otten

Oscar said:
$ cat tee.py
#!/usr/bin/env python

import sys
from itertools import tee

items = iter(range(int(sys.argv[1])))

while True:
    for x in items:
        items, discard = tee(items)
        break
    else:
        break

print(x)

$ time py -3.3 ./tee.py 100000000
99999999

real 1m47.711s
user 0m0.015s
sys 0m0.000s

While running the above python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works as follows
(which I made up but it's how I imagine it).
[...]

However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran slower so maybe this depends
on optimisations that were introduced in 3.x.

Did you use xrange()?
 
Oscar Benjamin

While running the above python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works as follows
(which I made up but it's how I imagine it).
[...]

However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran slower so maybe this depends
on optimisations that were introduced in 3.x.

Did you use xrange()?

No I didn't. :)

Okay so it only uses 4.6MB of memory and it runs at the same speed:
there's no problem with chaining tee objects as long as you discard
them. If you don't discard them then a script like the one I wrote
would quickly blow all the system memory.


Oscar
 
