search speed

anders

Hi!
I have written a Python program that searches for a specific customer in
files (around 1000 files); the trigger is LF01 + CUSTOMERNO.

So I read all the file names with the dircache module.

Then I loop through all the files; each file is read with readlines() and
after that scanned.

Today this works fine and saves me a lot of manual work, but a search
takes around 5 minutes, so my question is: is there another way of
searching in a file?
(Today I step line by line and check.)
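
Roughly, my current loop looks like this (a simplified sketch of what
the program does):

import os

def find_customer(dirname, customerno):
    trigger = "LF01" + customerno
    hits = []
    for fname in os.listdir(dirname):
        path = os.path.join(dirname, fname)
        f = open(path)
        for line in f:
            if trigger in line:
                hits.append(fname)
                break  # one match is enough to report the file
        f.close()
    return hits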

What I'd like to find is just the filenames of the files with the
customer data in them; there can be, and often is, more than one.

English is not my first language and I hope someone understands my
beginner question. What I am looking for is something like

if file.findInFile("LF01"):
........

Is there any library like this?

Best Regards
Anders
 
alex23

Diez B. Roggisch

anders said:
Hi!
I have written a Python program that searches for a specific customer in
files (around 1000 files); the trigger is LF01 + CUSTOMERNO.
[snip]
What I am looking for is something like

if file.findInFile("LF01"):
.......

Is there any library like this?

No. Because nobody can automagically infer whatever structure your files
have.

alex23 gave you a set of tools that you can use for full-text-search.
However, that's not necessarily the best thing to do if things have a
record-like structure. The canonical answer to this is then to use a
database to hold the data, instead of flat files. So if you have any
chance to do that, you should try & stuff things in there.
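
For example, with the sqlite3 module (a minimal untested sketch; the
table and column names are invented for illustration, and you'd fill
the table once from your files):

import sqlite3

conn = sqlite3.connect("customers.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (customerno TEXT, filename TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_cust ON records (customerno)")
# ... INSERT one row per LF01+CUSTOMERNO occurrence found, then lookups are fast:
rows = conn.execute(
    "SELECT DISTINCT filename FROM records WHERE customerno = ?",
    ("12345",)).fetchall()
conn.close()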


Diez
 
D'Arcy J.M. Cain

$ find <path_to_dirs_containing_files> -name "*" -exec grep -nH "LF01" {} \;
| cut -d ":" -f 1 | sort | uniq

I know this isn't a Unix group, but please allow me to suggest instead:

$ grep -lR LF01 <path_to_dirs_containing_files>
 
Tim Rowe

2009/1/30 Diez B. Roggisch said:
No. Because nobody can automagically infer whatever structure your files
have.

Just so. But even without going to a full database solution it might
be possible to make use of the flat file structure. For example, does
the "LF01" have to appear at a specific position in the input line? If
so, there's no need to search for it in the complete line. *If* there
is any such structure then a compiled regexp search is likely to be
faster than just 'if "LF01" in line', and (provided it's properly
designed) provides a bit of extra insurance against false positives.
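
For instance, if the record marker always starts the line, something
like this (an untested sketch; adjust the pattern to the real record
layout):

import re

trigger = re.compile(r"LF01(\d+)")  # match() anchors this at the start of the line

def files_with_customer(paths, customerno):
    for path in paths:
        for line in open(path):
            m = trigger.match(line)
            if m and m.group(1) == customerno:
                yield path
                break  # no need to scan the rest of this file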
 
John Machin

D'Arcy J.M. Cain said:
I know this isn't a Unix group, but please allow me to suggest instead:

$ grep -lR LF01 <path_to_dirs_containing_files>


and if the OP is on Windows: an alternative to Cygwin is the GnuWin32 collection
of GNU utilities ported to Windows. See http://gnuwin32.sourceforge.net/ ...
you'll want the Grep package, but I'd suggest the CoreUtils package is worth a
detailed look too, and do scan through the whole list of packages while you're there.

HTH,
John
 
Stefan Behnel

D'Arcy J.M. Cain said:
I know this isn't a Unix group, but please allow me to suggest instead:

$ grep -lR LF01 <path_to_dirs_containing_files>

That's very good advice. I had to pull some statistics from a couple of
log files recently, some of which were gzip compressed. The obvious Python
program just eats your first CPU's cycles parsing data into strings while
the disk runs idle, but using the subprocess module to spawn a couple of
gzgreps in parallel that find the relevant lines, and then using Python to
extract and aggregate the relevant information from them, does the job in
no time.
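
Something along these lines (a sketch from memory; the filenames and the
pattern are placeholders, and gzgrep must be on the PATH):

import subprocess

files = ["access.log.1.gz", "access.log.2.gz"]  # placeholders
procs = [subprocess.Popen(["gzgrep", "LF01", fname],
                          stdout=subprocess.PIPE)
         for fname in files]
for p in procs:
    for line in p.stdout:
        # extract and aggregate the interesting fields here
        pass
    p.wait()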

Stefan
 
Stefan Behnel

Diez said:
that's not necessarily the best thing to do if things have a
record-like structure. The canonical answer to this is then to use a
database to hold the data, instead of flat files. So if you have any
chance to do that, you should try & stuff things in there.

It's worth mentioning to the OP that Python has a couple of database
libraries in the stdlib, notably simple things like the various dbm
flavoured modules (see the anydbm module) that provide fast
string-to-string hash mappings (which might well be enough in this case),
but also a pretty powerful SQL database called sqlite3 which allows much
more complex (and complicated) ways to find the needle in the haystack.

http://docs.python.org/library/persistence.html
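
To give an idea of the dbm route: it behaves like a persistent
string-to-string dictionary (a tiny sketch; the key/value layout is
invented for illustration):

import anydbm

db = anydbm.open("custindex", "c")    # "c" creates the file if it's missing
db["12345"] = "fileA.txt\nfileB.txt"  # customer number -> files it appears in
print db["12345"].splitlines()
db.close()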

Stefan
 
Tim Chase

I have written a Python program that searches for a specific customer in
files (around 1000 files); the trigger is LF01 + CUSTOMERNO.

While most of the solutions folks have offered involve scanning
all the files each time you search, if the content of those files
doesn't change much, you can build an index once and then query
the resulting index multiple times. Because I was bored, I threw
together the code below (after the "-------" divider) which does
what you detail as best I understand, allowing you to do

python tkc.py 31415

to find the files containing CUSTOMERNO=31415. The first time,
it's slow because it needs to create the index file. However,
subsequent runs should be pretty speedy. You can also specify
multiple customers on the command-line:

python tkc.py 31415 1414 77777

and it will search for each of them. I presume, based on your
description, that they're found by the regexp "LF01(\d+)" and that the
file can be sensibly broken into lines; the code allows for multiple
results on the same line. Adjust accordingly if that's not the pattern
you want or the conditions you expect.

If your source files change, you can reinitialize the database with

python tkc.py -i

You can also change the glob pattern used for indexing -- by
default, I assumed they were "*.txt". But you can either
override the default with

python tkc.py -i -p "*.dat"

or you can change the source to default differently (or even skip
the glob-check completely...look for the fnmatch() call). There
are a few more options. Just use

python tkc.py --help

as usual. It's also a simple demo of the optparse module if
you've never used it.

Enjoy!

-tkc

PS: as an aside, how do I import just the fnmatch function? I
tried both of the following and neither worked:

from glob.fnmatch import fnmatch
from glob import fnmatch.fnmatch

I finally resorted to the contortion coded below in preference to

import glob
fnmatch = glob.fnmatch.fnmatch

-----------------------------------------------------------------


#!/usr/bin/env python
import dbm
import os
import re
from glob import fnmatch
fnmatch = fnmatch.fnmatch
from optparse import OptionParser

customer_re = re.compile(r"LF01(\d+)")

def build_parser():
    parser = OptionParser(
        usage="%prog [options] [cust#1 [cust#2 ... ]]"
        )
    parser.add_option("-i", "--index", "--reindex",
        action="store_true",
        dest="reindex",
        default=False,
        help="Reindex files found in the current directory "
            "in the event any files have changed",
        )
    parser.add_option("-p", "--pattern",
        action="store",
        dest="pattern",
        default="*.txt",
        metavar="GLOB_PATTERN",
        help="Index files matching GLOB_PATTERN",
        )
    parser.add_option("-d", "--db", "--database",
        action="store",
        dest="indexfile",
        default=".index",
        metavar="FILE",
        help="Use the index stored at FILE",
        )
    parser.add_option("-v", "--verbose",
        action="count",
        dest="verbose",
        default=0,
        help="Increase verbosity"
        )
    return parser

def reindex(options, db):
    if options.verbose: print "Indexing..."
    for path, dirs, files in os.walk('.'):
        for fname in files:
            if fname == options.indexfile:
                # ignore our database file
                continue
            if not fnmatch(fname, options.pattern):
                # ensure that it matches our pattern
                continue
            fullname = os.path.join(path, fname)
            if options.verbose: print fullname
            f = file(fullname)
            found_so_far = set()
            for line in f:
                for customer_number in customer_re.findall(line):
                    if customer_number in found_so_far: continue
                    found_so_far.add(customer_number)
                    try:
                        val = '\n'.join([
                            db[customer_number],
                            fullname,
                            ])
                        if options.verbose > 1:
                            print "Appending %s" % customer_number
                    except KeyError:
                        if options.verbose > 1:
                            print "Creating %s" % customer_number
                        val = fullname
                    db[customer_number] = val
            f.close()

if __name__ == "__main__":
    parser = build_parser()
    opt, args = parser.parse_args()
    reindexed = False
    if opt.reindex or not os.path.exists("%s.db" % opt.indexfile):
        db = dbm.open(opt.indexfile, 'n')
        reindex(opt, db)
        reindexed = True
    else:
        db = dbm.open(opt.indexfile, 'r')
    if not (args or reindexed):
        parser.print_help()
    for arg in args:
        print "%s:" % arg,
        try:
            val = db[arg]
            print
            for item in val.splitlines():
                print " %s" % item
        except KeyError:
            print "Not found"
    db.close()
 
rdmurray

Quoth Tim Chase:
PS: as an aside, how do I import just the fnmatch function? I
tried both of the following and neither worked:

from glob.fnmatch import fnmatch
from glob import fnmatch.fnmatch

I finally resorted to the contortion coded below in favor of
import glob
fnmatch = glob.fnmatch.fnmatch

What you want is:

from fnmatch import fnmatch

fnmatch is its own module; it just happens to be in the (non-__all__)
namespace of the glob module because glob uses it.

--RDM
 
Tim Chase

What you want is:

from fnmatch import fnmatch

Oh, that's head-smackingly obvious now...thanks!

My thought process usually goes something like

"""
I want to do some file-name globbing

there's a glob module that looks like a good place to start

hmm, dir(glob) tells me there's a fnmatch thing that looks like
what I want according to help(glob.fnmatch)

oh, the fnmatch() function is inside this glob.fnmatch thing

so, I want glob.fnmatch.fnmatch()
"""

It never occurred to me that fnmatch was its own importable
module. <sheepish grin>

-tkc
 
rdmurray

Quoth Tim Chase:
Oh, that's head-smackingly obvious now...thanks!
[snip]
It never occurred to me that fnmatch was its own importable
module. <sheepish grin>

I did a help(glob), saw that fnmatch wasn't in there, did a dir(glob),
saw fnmatch and was puzzled, so I looked up the glob page on
docs.python.org. There was a cross-reference at the bottom of the page
to fnmatch, and it was only at that point that I went:
"oh, duh" :)

--RDM
 
Tim Rowe

2009/1/30 Scott David Daniels said:
Be careful with your assertion that a regex is faster, it is certainly
not always true.

I was careful *not* to assert that a regex would be faster, merely
that it was *likely* to be in this case.
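
When in doubt, measure on the actual data. A quick sketch with the
timeit module (the sample line is made up, and timeit.timeit() is new
in Python 2.6):

import timeit

setup = 'import re; pat = re.compile("LF01"); line = "LF01 12345 some record data"'
print timeit.timeit('pat.match(line)', setup=setup, number=1000000)
print timeit.timeit('"LF01" in line', setup=setup, number=1000000)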
 
Aaron Watters

alex23 gave you a set of tools that you can use for full-text-search.
However, that's not necessarily the best thing to do if things have a
record-like structure.

In Nucular (and others, I think) you can do searches for terms anywhere
(full text), searches for terms within fields, searches for prefixes in
fields, searches based on field inequality, or searches for exact field
value. I would argue this subsumes the standard "fielded approach".
-- Aaron Watters
===
Oh, I'm a lumberjack and I'm O.K...
 
