iglob performance no better than glob

Discussion in 'Python' started by Kyp, Jan 31, 2010.

  1. Kyp

    Kyp Guest

    I have a dir with a large number of files that I need to perform
    operations on, but I only need to access a subset of the files, i.e.
    the first 100 files.

    Using glob is very slow, so I looked at iglob, which returns an
    iterator and seemed like just what I wanted: I could iterate over
    the files that I wanted without having to read the entire dir.

    So the iglob was faster, but accessing the first file took about the
    same time as glob.glob.

    Here's some code to compare glob vs. iglob performance. It outputs
    the time before/after a glob.iglob('*.*') files.next() sequence and a
    glob.glob('*.*') sequence.

    #!/usr/bin/env python

    import glob,time
    print '\nTest of glob.iglob'
    print 'before iglob:', time.asctime()
    files = glob.iglob('*.*')
    print 'after iglob:',time.asctime()
    print files.next()
    print 'after files.next():', time.asctime()

    print '\nTest of glob.glob'
    print 'before glob:', time.asctime()
    files = glob.glob('*.*')
    print 'after glob:',time.asctime()


    Here are the results:

    Test of glob.iglob
    before iglob: Sun Jan 31 11:09:08 2010
    after iglob: Sun Jan 31 11:09:08 2010
    foo.bar
    after files.next(): Sun Jan 31 11:09:59 2010

    Test of glob.glob
    before glob: Sun Jan 31 11:09:59 2010
    after glob: Sun Jan 31 11:10:51 2010

    The results are about the same for the 2 approaches, both took about
    51 seconds. Am I doing something wrong with iglob?

    Is there a way to get the first X # of files from a dir with lots of
    files, that does not take a long time to run?

    thanx, mark
    Kyp, Jan 31, 2010
    #1
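A finer-grained version of the comparison in post #1 might look like the sketch below. It is written for Python 3 (the thread itself uses Python 2), and the helper name time_first_vs_all is made up for illustration. time.asctime() only resolves to one second, so this uses time.perf_counter() instead.

```python
import glob
import os
import time


def time_first_vs_all(pattern):
    """Time the first iglob() result against a full glob() call."""
    start = time.perf_counter()
    first = next(glob.iglob(pattern), None)  # None if nothing matches
    first_seconds = time.perf_counter() - start

    start = time.perf_counter()
    everything = glob.glob(pattern)
    all_seconds = time.perf_counter() - start

    return first, first_seconds, len(everything), all_seconds


if __name__ == "__main__":
    first, t_first, count, t_all = time_first_vs_all(os.path.join(".", "*.*"))
    print("first match %r after %.3f s" % (first, t_first))
    print("all %d matches after %.3f s" % (count, t_all))
```

Keep in mind that the second measurement may benefit from the directory already being in the OS cache, so it is worth running the script more than once or swapping the order.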

  2. > So the iglob was faster, but accessing the first file took about the
    > same time as glob.glob.


    I'll wager most of the time required to access the first file is due
    to filesystem overhead, not any inherent limitation in Python.

    Skip Montanaro
    Skip Montanaro, Jan 31, 2010
    #2

  3. John Bokma

    John Bokma Guest

    Kyp <> writes:

    > Is there a way to get the first X # of files from a dir with lots of
    > files, that does not take a long time to run?


    Assuming Linux: what does

    time ls thedir | head

    give, with thedir being the name of the actual dir?

    Also, how many is "many files"?

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
    John Bokma, Jan 31, 2010
    #3
  4. Peter Otten

    Peter Otten Guest

    Kyp wrote:

    > The results are about the same for the 2 approaches, both took about
    > 51 seconds. Am I doing something wrong with iglob?


    No, but iglob() being lazy is pointless in your case because it uses
    os.listdir() and fnmatch.filter() underneath which both read the whole
    directory before returning anything.

    > Is there a way to get the first X # of files from a dir with lots of
    > files, that does not take a long time to run?


    Here's my attempt. It turned out to be more work than expected, so I cut a
    few corners. It's Linux-only "works on my machine" code, but may give you
    some hints on how to proceed.

    from ctypes import *
    import fnmatch
    import glob
    import os
    import re
    from itertools import ifilter, imap

    class dirent(Structure):
        "works on my machine ;)"
        _fields_ = [
            ("d_ino", c_long),
            ("d_off", c_long),
            ("d_reclen", c_ushort),
            ("d_type", c_ubyte),
            ("d_name", c_char*256)]


    direntp = POINTER(dirent)

    LIBC = "libc.so.6"
    cdll.LoadLibrary(LIBC)
    libc = CDLL(LIBC)
    # ctypes assumes C int for undeclared return and argument types, which
    # can truncate the DIR* handle on 64-bit systems, so declare pointers.
    libc.opendir.restype = c_void_p
    libc.readdir.argtypes = [c_void_p]
    libc.readdir.restype = direntp
    libc.closedir.argtypes = [c_void_p]


    def diriter(dir):
        "lazy partial replacement for os.listdir()"
        # errors? what errors?
        dirp = libc.opendir(dir)
        if not dirp:
            return
        try:
            while True:
                ep = libc.readdir(dirp)
                if not ep:
                    break
                yield ep.contents.d_name
        finally:
            libc.closedir(dirp)


    def filter(names, pattern):
        "lazy partial replacement for fnmatch.filter()"
        import posixpath

        pattern = os.path.normcase(pattern)
        r = fnmatch.translate(pattern)
        r = re.compile(r)

        if os.path is not posixpath:
            names = imap(os.path.normcase, names)

        return ifilter(r.match, names)

    def globiter(path):
        "lazy partial replacement for glob.glob()"
        dir, filename = os.path.split(path)
        if glob.has_magic(dir):
            raise ValueError("wildcards in directory not supported")
        return filter(diriter(dir), filename)


    if __name__ == "__main__":
        import sys
        [pattern] = sys.argv[1:]
        for name in globiter(pattern):
            print name

    Peter
    Peter Otten, Jan 31, 2010
    #4
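On current Python versions the same idea as in post #4 can probably be had without ctypes. The sketch below assumes Python 3.6 or later, where os.scandir() returns a lazy iterator over directory entries and can be used as a context manager; it did not exist when this thread was written. The function name first_matches is hypothetical.

```python
import fnmatch
import itertools
import os


def first_matches(directory, pattern, limit):
    """Return up to `limit` names in `directory` that match `pattern`."""
    with os.scandir(directory) as entries:
        # Generators keep the pipeline lazy: entries are pulled from the
        # directory only until `limit` matching names have been found.
        names = (entry.name for entry in entries)
        matching = (n for n in names if fnmatch.fnmatch(n, pattern))
        return list(itertools.islice(matching, limit))


if __name__ == "__main__":
    for name in first_matches(".", "*.*", 100):
        print(name)
```

As with the ctypes version, the names come back in whatever order the filesystem yields them, so "first 100" means the first 100 matches encountered, not the first 100 alphabetically.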
  5. Kyp <kyp <at> stsci.edu> writes:

    > So the iglob was faster, but accessing the first file took about the
    > same time as glob.glob.


    That would be because glob is implemented in terms of iglob.
    Benjamin Peterson, Jan 31, 2010
    #5
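Posts #4 and #5 together explain the timings: glob() drains iglob() into a list, and iglob() does its whole directory read on the first next(). The toy model below illustrates that behaviour in Python 3. It is not the standard library's code; toy_iglob, toy_glob and the log list are stand-ins invented for this sketch.

```python
import fnmatch


def toy_iglob(all_names, pattern, log):
    """Toy stand-in for iglob(): lazy on the outside, eager on the inside."""
    # Nothing in this body runs until the first next() on the generator.
    log.append("reading %d names" % len(all_names))  # the expensive step
    for name in fnmatch.filter(list(all_names), pattern):
        yield name


def toy_glob(all_names, pattern, log):
    """Toy stand-in for glob(): drains the iterator into a list."""
    return list(toy_iglob(all_names, pattern, log))


if __name__ == "__main__":
    log = []
    it = toy_iglob(["a.txt", "b.txt", "README"], "*.*", log)
    print(log)       # still empty: creating the iterator is cheap
    print(next(it))  # the first item arrives only after the full read
    print(log)
```

Creating the iterator is instant, which matches the "after iglob" timestamp in post #1, while the first next() pays for the full listing.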
  6. Kyp

    Kyp Guest

    On Jan 31, 1:06 pm, John Bokma <> wrote:
    > Kyp <> writes:
    > > Is there a way to get the first X # of files from a dir with lots of
    > > files, that does not take a long time to run?

    >
    > Assuming Linux: what does time
    >
    >  ls thedir | head
    >
    > give?
    >
    > with thedir the name of the actual dir

    about 3 seconds.

    3.086u 0.201s 0:03.32 98.7% 0+0k 0+0io 0pf+0w

    >
    > Also how many is many files?

    over 100K (I know I should not do that, but it's a temp dir holding
    files to be transferred)
    thanx, mark
    Kyp, Feb 1, 2010
    #6
  7. Kyp

    Kyp Guest

    On Jan 31, 2:44 pm, Peter Otten <> wrote:
    > Here's my attempt. It turned out to be more work than expected, so I cut a
    > few corners. It's Linux-only "works on my machine" code, but may give you
    > some hints on how to proceed.


    I'll give it a try, thanx for the reply.
    mark
    Kyp, Feb 1, 2010
    #7
  8. On 31Jan2010 16:23, Kyp <> wrote:
    | On Jan 31, 2:44 pm, Peter Otten <> wrote:
    | > No, but iglob() being lazy is pointless in your case because it uses
    | > os.listdir() and fnmatch.filter() underneath which both read the whole
    | > directory before returning anything.
    | >
    | > > Is there a way to get the first X # of files from a dir with lots of
    | > > files, that does not take a long time to run?
    | >
    | > Here's my attempt. [...open directory and read native format...]

    I'd be inclined first to time os.listdir('.') versus glob.glob('*.*').

    Glob routines tend to lstat() every matching name to ensure the path
    exists. That's very slow. If you just do os.listdir() and choose your
    100 names, you only need to stat (or just try to open) them.

    So time glob.glob("*.*") versus os.listdir(".") first.

    Generally, with a large directory, stat time will change performance
    immensely.
    --
    Cameron Simpson <> DoD#743
    http://www.cskk.ezoshosting.com/cs/

    Usenet is essentially a HUGE group of people passing notes in class. --R. Kadel
    Cameron Simpson, Feb 14, 2010
    #8
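The measurement suggested in post #8 could be sketched as follows, in Python 3. The function name compare_listdir_and_glob and its default arguments are assumptions for illustration. Whether glob actually stats each name depends on the implementation and the pattern, so measuring is the reliable way to find out.

```python
import glob
import os
import time


def compare_listdir_and_glob(directory, pattern="*.*", limit=100):
    """Time os.listdir() against glob.glob() on the same directory."""
    start = time.perf_counter()
    names = os.listdir(directory)
    listdir_seconds = time.perf_counter() - start

    start = time.perf_counter()
    matches = glob.glob(os.path.join(directory, pattern))
    glob_seconds = time.perf_counter() - start

    # Take the first `limit` names from the plain listing; only these
    # would then need to be opened or stat()ed.
    chosen = names[:limit]
    return listdir_seconds, glob_seconds, chosen, len(matches)


if __name__ == "__main__":
    t_list, t_glob, chosen, n_matches = compare_listdir_and_glob(".")
    print("os.listdir: %.3f s, glob.glob: %.3f s" % (t_list, t_glob))
    print("picked %d names, glob matched %d" % (len(chosen), n_matches))
```

The second call may be faster simply because the directory is cached after the first, so it is worth repeating the run with the two measurements swapped.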