use fileinput to read a specific line

jo3c

Hi everybody,
I'm a newbie in Python.
I need to read line 4 from a header file.
Using linecache will crash my computer because it loads the files into
memory, and I am working on 2000 files of 8 MB each.

fileinput doesn't load the file into memory first.
How do I use the fileinput module to read a specific line from a file?

    for line in fileinput.FileInput('sample.txt'):
        ????
 
Russ P.

> Hi everybody,
> I'm a newbie in Python.
> I need to read line 4 from a header file.
> Using linecache will crash my computer because it loads the files into
> memory, and I am working on 2000 files of 8 MB each.
>
> fileinput doesn't load the file into memory first.
> How do I use the fileinput module to read a specific line from a file?
>
>     for line in fileinput.FileInput('sample.txt'):
>         ????

Assuming it's a text file, you could use something like this:

    lnum = 0  # line number

    for line in open("sample.txt"):
        lnum += 1
        if lnum >= 4: break

The variable "line" should end up with the contents of line 4 if I am
not mistaken. To handle multiple files, just wrap that code like this:

    for file0 in files:

        lnum = 0  # line number

        for line in open(file0):
            lnum += 1
            if lnum >= 4: break

        # do something with "line"

where "files" is a list of the files to be read.

That's not tested.
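
The counting loop above can also be written with itertools.islice, which skips ahead lazily and stops reading once it reaches the requested line. This is a minimal sketch in Python 3 syntax; the helper name nth_line and its default parameter are made up for illustration and are not part of any library.

```python
from itertools import islice

def nth_line(path, n, default=None):
    """Return line number n (1-based) of the file at path.

    Hypothetical helper: only the first n lines are read, and
    `default` is returned when the file has fewer than n lines.
    """
    with open(path) as f:
        # islice(f, n - 1, n) lazily skips n - 1 lines and yields at most one
        return next(islice(f, n - 1, n), default)
```

Unlike the loop, which leaves the last line of a short file in "line", this returns the default when the file has fewer than n lines.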
 
Dennis Lee Bieber

> for file0 in files:
>
>     lnum = 0  # line number
>
>     for line in open(file0):
>         lnum += 1
>         if lnum >= 4: break
>
>     # do something with "line"
>
> where "files" is a list of the files to be read.

Given that the OP is talking 2000 files to be processed, I think I'd
recommend explicit open() and close() calls to avoid having lots of I/O
structures floating around...

    for fid in file_list:
        fin = open(fid)
        jnk = fin.readline()
        jnk = fin.readline()
        jnk = fin.readline()
        ln = fin.readline()
        fin.close()

Yes, coding three junk reads does mean maintenance will be a pain
(we now need the 5th line, not the fourth -- and would need to add
another jnk = line)... I'd maybe consider replacing all four readline()
with:

    for cnt in xrange(4):
        ln = fin.readline()

since it doesn't need the overhead of a separate line counter/test and
will leave the fourth input line in "ln" on exit.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
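
Combining the explicit open()/close() calls with the counted readline() loop might look like the sketch below. It is written in Python 3 syntax (range instead of xrange); the function name read_line_from_each and the lineno parameter are illustrative assumptions, not something from the post.

```python
def read_line_from_each(file_list, lineno=4):
    """Return a list of (filename, line) pairs, one per file."""
    results = []
    for fid in file_list:
        fin = open(fid)
        ln = ""
        for _ in range(lineno):
            ln = fin.readline()   # "" once the file is exhausted
        fin.close()               # explicit close before opening the next file
        results.append((fid, ln))
    return results
```

Needing the 5th line instead of the 4th then only means passing lineno=5. A file with fewer lines than requested gives an empty string.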
 
Russ P.

> Given that the OP is talking 2000 files to be processed, I think I'd
> recommend explicit open() and close() calls to avoid having lots of I/O
> structures floating around...

Good point. I didn't think of that. It could also be done as follows:

    for fileN in files:

        lnum = 0  # line number
        input = open(fileN)

        for line in input:
            lnum += 1
            if lnum >= 4: break

        input.close()

        # do something with "line"

Six of one or half a dozen of the other, I suppose.
 
Russ P.

> Given that the OP is talking 2000 files to be processed, I think I'd
> recommend explicit open() and close() calls to avoid having lots of I/O
> structures floating around...
>
>     for fid in file_list:
>         fin = open(fid)
>         jnk = fin.readline()
>         jnk = fin.readline()
>         jnk = fin.readline()
>         ln = fin.readline()
>         fin.close()
>
> Yes, coding three junk reads does mean maintenance will be a pain
> (we now need the 5th line, not the fourth -- and would need to add
> another jnk = line)... I'd maybe consider replacing all four readline()
> with:
>
>     for cnt in xrange(4):
>         ln = fin.readline()
>
> since it doesn't need the overhead of a separate line counter/test and
> will leave the fourth input line in "ln" on exit.

On second thought, I wonder if the reference counting mechanism would
be "smart" enough to automatically close the previous file on each
iteration of the outer loop. If so, the files don't need to be
explicitly closed.
 
jo3c

> Good point. I didn't think of that. It could also be done as follows:
>
>     for fileN in files:
>
>         lnum = 0  # line number
>         input = open(fileN)
>
>         for line in input:
>             lnum += 1
>             if lnum >= 4: break
>
>         input.close()
>
>         # do something with "line"
>
> Six of one or half a dozen of the other, I suppose.

This is what I did using glob:

    import glob
    for files in glob.glob('/*.txt'):
        x = open(files)
        x.readline()
        x.readline()
        x.readline()
        y = x.readline()
        # do something with y
        x.close()
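
A variation on the loop above, as a sketch in Python 3 syntax: the with statement closes each file as soon as its block ends, and the glob pattern is passed in rather than hard-coded. The function name header_lines and the lineno parameter are assumptions made for this example.

```python
import glob

def header_lines(pattern, lineno=4):
    """Return {filename: line} for every file matching the glob pattern."""
    result = {}
    for filename in sorted(glob.glob(pattern)):
        with open(filename) as x:
            for _ in range(lineno - 1):
                x.readline()      # skip the lines before the one we want
            result[filename] = x.readline()
    return result
```

Each value keeps its trailing newline, so a date field would typically be passed through strip() before going into a database.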
 
Dennis Lee Bieber

> On second thought, I wonder if the reference counting mechanism would
> be "smart" enough to automatically close the previous file on each
> iteration of the outer loop. If so, the files don't need to be
> explicitly closed.

Hard to tell... Eventually I'd expect them to fade away, but I'm a
bit old-school... Explicit file control seems better than implied.
 
Martin Marcher

jo3c said:
> I need to read line 4 from a header file.

http://docs.python.org/lib/module-linecache.html

~/2delete $ cat data.txt
L1
L2
L3
L4

~/2delete $ python
Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

--
http://noneisyours.marcher.name
http://feeds.feedburner.com/NoneIsYours

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.
 
Fredrik Lundh

jo3c said:
> Hi everybody,
> I'm a newbie in Python.
> I need to read line 4 from a header file.
> Using linecache will crash my computer because it loads the files into
> memory, and I am working on 2000 files of 8 MB each.
>
> fileinput doesn't load the file into memory first.
> How do I use the fileinput module to read a specific line from a file?
>
>     for line in fileinput.FileInput('sample.txt'):
>         ????

I could have sworn that I posted working code (including an explanation
why linecache wouldn't work) the last time you asked about this... yes,
here it is again:
> i have a 2000 files with header and data
> i need to get the date information from the header
> then insert it into my database
> i am doing it in batch so i use glob.glob('/mydata/*/*/*.txt')
> to get the date on line 4 in the txt file i use
> linecache.getline('/mydata/myfile.txt', 4)
>
> but if i use
> linecache.getline('glob.glob('/mydata/*/*/*.txt', 4) won't work

glob.glob returns a list of filenames, so you need to call getline once
for each file in the list.

but using linecache is absolutely the wrong tool for this; it's designed
for *repeated* access to arbitrary lines in a file, so it keeps all the
data in memory. that is, all the lines, for all 2000 files.

if the files are small, and you want to keep the code short, it's easier
to just grab the file's content and use indexing on the resulting list:

    for filename in glob.glob('/mydata/*/*/*.txt'):
        line = list(open(filename))[4-1]
        ... do something with line ...

(note that line numbers usually start with 1, but Python's list indexing
starts at 0).

if the files might be large, use something like this instead:

    for filename in glob.glob('/mydata/*/*/*.txt'):
        f = open(filename)
        # skip first three lines
        f.readline(); f.readline(); f.readline()
        # grab the line we want
        line = f.readline()
        ... do something with line ...

</F>
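
To illustrate the point about linecache holding on to file contents, here is a sketch of what that approach would need in order to keep memory bounded: linecache.getline() reads and caches the whole file, so the cache has to be emptied with linecache.clearcache() between files. The function name getline_uncached is hypothetical, and each file is still read in full, which is why the readline() loop above remains the better fit for large files.

```python
import linecache

def getline_uncached(filename, lineno):
    """Fetch one line via linecache, then drop the cached file contents."""
    try:
        # getline() loads every line of the file into linecache's cache
        # and returns '' if the line (or the file) does not exist
        return linecache.getline(filename, lineno)
    finally:
        # without this, the lines of all files seen so far stay in memory
        linecache.clearcache()
```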
 
Steven D'Aprano

> On second thought, I wonder if the reference counting mechanism would
> be "smart" enough to automatically close the previous file on each
> iteration of the outer loop. If so, the files don't need to be
> explicitly closed.

Python guarantees[1] that files will be closed, but doesn't specify when
they will be closed. I understand that Jython doesn't automatically close
files until the program terminates, so even if you could rely on the ref
counter to close the files in CPython, it won't be safe to do so in
Jython. I don't know about IronPython or PyPy or the semi-mythical Parrot.

Given how little effort it is to explicitly close the files yourself, I
don't see any reason to not close them, rather than relying on an
implementation-dependent feature.



[1] Guarantee void under any circumstance that prevents files from being
closed.
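
As a sketch of how little effort that is: a try/finally block closes the file on every code path without depending on when the garbage collector runs. The code uses Python 3 syntax, and the function name read_header_line and its return shape are made up for this example.

```python
def read_header_line(filename, lineno=4):
    """Return (line, file_was_closed) without relying on garbage collection."""
    f = open(filename)
    try:
        line = ""
        for _ in range(lineno):
            line = f.readline()
    finally:
        f.close()          # runs whether or not readline() raised
    return line, f.closed  # f.closed reports the state after close()
```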
 
Scott David Daniels

Russ said:
> Given that the OP is talking 2000 files to be processed, I think I'd
> recommend explicit open() and close() calls to avoid having lots of I/O
> structures floating around...
> [effectively]
>     for fid in file_list:
>         fin = open(fid)
>         for cnt in xrange(4):
>             ln = fin.readline()
>         fin.close()
>
> On second thought, I wonder if the reference counting mechanism would
> be "smart" enough to automatically close the previous file on each
> iteration of the outer loop. If so, the files don't need to be
> explicitly closed.

I _hate_ relying on that, but context managers mean you don't have to.
There are good reasons to close as early as you can. For example,
readers of files from zip files will eventually either be slower or
not work until the other readers close.

Here is what I imagine you want (2.5 or better):

    from __future__ import with_statement
    import glob

    def pairing(names, position):
        for filename in names:
            with open(filename) as f:
                for n, line in enumerate(f):
                    if n == position:
                        break
                else:
                    line = None  # indicate a short file
            yield filename, line
    ...
    for name, line in pairing(glob.glob('*.txt'), 3):
        do_something(name, line)

--Scott David Daniels
(e-mail address removed)
 
Fredrik Lundh

Steven said:
> Python guarantees[1] that files will be closed, but doesn't specify when
> they will be closed. I understand that Jython doesn't automatically close
> files until the program terminates, so even if you could rely on the ref
> counter to close the files in CPython, it won't be safe to do so in
> Jython.

From what I can tell, Java's GC automatically closes file streams, so
Jython will behave pretty much like CPython in most cases. I sure
haven't been able to make Jython run out of file handles by opening tons
of files and discarding the file objects without closing them. Has anyone?

</F>
 
Martin Marcher

Fredrik said:
> I guess you missed the "using linecache will crash my computer due to
> memory loading, because i am working on 2000 files each is 8mb" part.

oops sorry indeed

still the enumerate version seems fine:

    ... print no, line
    ...

someone posted this already i think (or was it another thread?)

 
Hrvoje Niksic

Fredrik Lundh said:
> From what I can tell, Java's GC automatically closes file streams,
> so Jython will behave pretty much like CPython in most cases.

The finalizer does close the reclaimed streams, but since it is
triggered by GC, you have to wait for GC to occur for the stream to
get closed. That means that something like:

open('foo', 'w').write(some_contents)

may leave 'foo' empty until the next GC. Fortunately this pattern is
much rarer than open('foo').read(), but both work equally well in
CPython, and will continue to work, despite many people's dislike for
them. (For the record, I don't use them in production code, but
open(...).read() is great for throwaway scripts and one-liners.)

> I sure haven't been able to make Jython run out of file handles by
> opening tons of files and discarding the file objects without
> closing them.

Java's generational GC is supposed to be quick to reclaim recently
discarded objects. That might lead to swift finalization of open
files similar to what CPython's reference counting does in practice.

It could also be that Jython internally allocates so many Java objects
that the GC is triggered frequently, again resulting in swift
reclamation of file objects. It would be interesting to monitor (at
the OS level) the number of open files maintained by the process at
any given time during the execution of such a loop.
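
One way to do that kind of monitoring from inside the process, sketched for Linux only: count the entries in /proc/self/fd at chosen points in the loop. The /proc path is Linux-specific, the helper name count_open_fds is an assumption, and on systems without that directory the function simply returns None.

```python
import os

def count_open_fds():
    """Return the number of open file descriptors, or None if unknown."""
    fd_dir = "/proc/self/fd"  # Linux-specific; an assumption of this sketch
    if not os.path.isdir(fd_dir):
        return None
    # the listing briefly opens a descriptor of its own, so compare
    # counts against each other rather than reading them as absolutes
    return len(os.listdir(fd_dir))
```

Calling it before and after a loop that opens files without closing them shows whether descriptors are piling up on a given implementation.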
 
