how to organize a module that requires a data file


Steven Bethard

Ok, so I have a module that is basically a Python wrapper around a big
lookup table stored in a text file[1]. The module needs to provide a
few functions::

    get_stem(word, pos, default=None)
    stem_exists(word, pos)
    ...

Because there should only ever be one lookup table, I feel like these
functions ought to be module globals. That way, you could just do
something like::

    import morph
    assist = morph.get_stem('assistance', 'N')
    ...

My problem is with the text file. Where should I keep it? If I want to
keep the module simple, I need to be able to identify the location of
the file at module import time. That way, I can read all the data into
the appropriate Python structure, and all my module-level functions will
work immediately after import.

I can only think of a few obvious places where I could find the text
file at import time -- in the same directory as the module (e.g.
lib/site-packages), in the user's home directory, or in a directory
indicated by an environment variable. The first seems weird because the
text file is large (about 10MB) and I don't really see any other
packages putting data files into lib/site-packages. The second seems
weird because it's not a per-user configuration - it's a data file
shared by all users. And the third seems weird because, in my
experience, a configuration that depends heavily on environment
variables is difficult to maintain.

If I don't mind complicating the module functions a bit (e.g. by
starting each function with "if _lookup_table is not None"), I could
allow users to specify a location for the file after the module is
imported, e.g.::

    import morph
    morph.setfile(r'C:\resources\morph_english.flat')
    ...

Then all the module-level functions would have to raise Exceptions until
setfile() was called. I don't like that the user would have to
configure the module each time they wanted to use it, but perhaps that's
unavoidable.
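
For concreteness, the setfile() idea might look roughly like the sketch below (Python 3). Only get_stem, stem_exists, setfile and _lookup_table come from the description above; the _require_table helper and the one-triple-per-line "word pos stem" file format are assumptions made purely for illustration, since the real file layout isn't described here.

```python
# morph.py (sketch): module-level functions backed by one shared table.
# Assumed file format (hypothetical): one "word pos stem" triple per line.

_lookup_table = None  # filled in by setfile()


def setfile(path):
    """Load the lookup table from `path`, replacing any table loaded earlier."""
    global _lookup_table
    table = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3:
                word, pos, stem = fields
                table[word, pos] = stem
    _lookup_table = table


def _require_table():
    """Return the table, or raise if setfile() has not been called yet."""
    if _lookup_table is None:
        raise RuntimeError("call morph.setfile(path) before using this module")
    return _lookup_table


def get_stem(word, pos, default=None):
    return _require_table().get((word, pos), default)


def stem_exists(word, pos):
    return (word, pos) in _require_table()
```

The cost is the one already mentioned: every caller has to remember to call setfile() before anything else works.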

Any suggestions? Is there an obvious place to put the text file that
I'm missing?

Thanks in advance,

STeVe

[1] In case you're curious, the file is a list of words and their
morphological stems provided by the University of Pennsylvania.
 

Paul Boddie

Steven Bethard wrote:

[Text file for a module's internal use.]
My problem is with the text file. Where should I keep it? If I want to
keep the module simple, I need to be able to identify the location of
the file at module import time. That way, I can read all the data into
the appropriate Python structure, and all my module-level functions will
work immediately after import.

I tend to make use of the __file__ attribute available in every module.
For example:

    import os
    resource_dir = os.path.join(os.path.split(__file__)[0], "Resources")

This assigns to resource_dir the path to the Resources directory
alongside the module itself in the filesystem. Of course, if you just
wanted the text file to reside alongside the module, rather than a
whole directory of stuff, you'd replace "Resources" with the name of
your file (and change the variable name, of course). For example:

    filename = os.path.join(os.path.split(__file__)[0],
                            "morph_english.flat")
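
A slightly more defensive variant of the same idea, as a sketch: it wraps the lookup in a helper and calls os.path.abspath() first, on the assumption that __file__ may be a relative path (it can be in some Python versions when a file is run as a script). The helper name path_beside is made up for this example.

```python
import os


def path_beside(module_file, filename):
    """Return the absolute path of `filename` in the directory holding `module_file`."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, filename)


# Inside morph.py this would be called as:
#     filename = path_beside(__file__, "morph_english.flat")
```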

Having posted this solution, and in the tradition of Usenet, I'd be
interested to hear whether this is a particularly bad idea.

Paul
 

Terry Hancock

Steven said:
My problem is with the text file. Where should I keep it?

I can only think of a few obvious places where I could
find the text file at import time -- in the same
directory as the module (e.g. lib/site-packages), in the
user's home directory, or in a directory indicated by an
environment variable.

Why don't you search those places in order for it?

Check ~/.mymod/myfile, then /etc/mymod/myfile, then
/lib/site-packages/mymod/myfile or whatever. It won't take
long, just do the existence checks on import of the module.
If you don't find it after checking those places, *then*
raise an exception.
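
As a rough sketch, the search could look like this (Python 3). The function name find_data_file and the SEARCH_PATH list are made up for illustration; the paths are just the placeholders from the paragraph above.

```python
import os


def find_data_file(candidates):
    """Return the first path in `candidates` that exists as a file.

    Raises FileNotFoundError if none of them do.
    """
    for path in candidates:
        path = os.path.expanduser(path)  # expand a leading "~"
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(
        "data file not found in any of: %s" % ", ".join(candidates))


# Search order: per-user copy first, then system-wide, then the copy
# installed with the module. "mymod" and "myfile" are placeholder names.
SEARCH_PATH = [
    "~/.mymod/myfile",
    "/etc/mymod/myfile",
    "/lib/site-packages/mymod/myfile",
]
```

At import time the module would call find_data_file(SEARCH_PATH) once and load whatever it returns.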

You don't say what this data file is or whether it is
subject to change or customization. If it is, then there is
a real justification for this approach, because an
individual user might want to shadow the system install with
his own version of the data.

That's pretty typical behavior for configuration files on
any Posix system.

Cheers,
Terry
 

Steven Bethard

Terry said:
Why don't you search those places in order for it?

Check ~/.mymod/myfile, then /etc/mymod/myfile, then
/lib/site-packages/mymod/myfile or whatever. It won't take
long, just do the existence checks on import of the module.
If you don't find it after checking those places, *then*
raise an exception.

You don't say what this data file is or whether it is
subject to change or customization. If it is, then there is
a real justification for this approach, because an
individual user might want to shadow the system install with
his own version of the data.

The file is a lookup table of word stems distributed by the University
of Pennsylvania. It doesn't really make sense for users to customize
it, because it's not a configuration file, but it is possible that UPenn
would distribute a new version at some point. That's what I meant when
I said "it's not a per-user configuration - it's a data file shared by
all users". So there should be exactly one copy of the file, so I
shouldn't have to deal with shadowing.

Of course, even with only one copy of the file, that doesn't mean that I
couldn't search a few places. Maybe I could by default put it in
lib/site-packages, but allow an option to setup.py to put it somewhere
else for anyone who was worried about putting 10MB into
lib/site-packages. Those folks would then have to use an environment
variable, say $MORPH_FLAT, to identify the directory they chose. At module
import I would just check both locations...
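
A sketch of that two-location check (Python 3), assuming $MORPH_FLAT names a directory and the file keeps the name morph_english.flat. The function name locate_flat_file and its environ parameter (there only so the lookup can be exercised without touching the real environment) are assumptions, not part of any existing module.

```python
import os


def locate_flat_file(module_file, filename="morph_english.flat", environ=None):
    """Look in $MORPH_FLAT first, then next to the module; return the path found."""
    if environ is None:
        environ = os.environ
    candidates = []
    env_dir = environ.get("MORPH_FLAT")
    if env_dir:
        candidates.append(os.path.join(env_dir, filename))
    # Default location: the directory that contains the module itself.
    module_dir = os.path.dirname(os.path.abspath(module_file))
    candidates.append(os.path.join(module_dir, filename))
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(
        "could not find %s; tried %s" % (filename, ", ".join(candidates)))
```

Inside the module the call would be locate_flat_file(__file__).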

I'll have to think about this some more...

STeVe
 

Larry Bates

Personally I would do this as a class and pass a path to where
the file is stored as an argument to instantiate it (maybe try
to help user if they don't pass it). Something like:

    class morph:
        def __init__(self, pathtodictionary=None):
            if pathtodictionary is None:
                #
                # Insert code here to see if it is in the current
                # directory and/or look in other directories.
                #
                pass

            try:
                self.fp = open(pathtodictionary, 'r')
            except (IOError, TypeError):
                # TypeError covers pathtodictionary still being None
                print("unable to locate dictionary at: %s" % pathtodictionary)
            else:
                #
                # Insert code here to load data from .txt file
                #
                self.fp.close()

        def get_stem(self, arg1, arg2):
            #
            # Code for get_stem method
            #
            pass

The other way I've done this is to have a .INI file that always lives
in the same directory as the class with an entry in it that points me
to where the .txt file lives.
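
A sketch of the .INI variant using the standard configparser module (Python 3). The file name morph.ini, the [morph] section, and the datafile option are hypothetical names chosen for this example.

```python
import configparser

# morph.ini (hypothetical), stored in the same directory as the class:
#
#   [morph]
#   datafile = C:\resources\morph_english.flat


def data_path_from_ini(ini_path, section="morph", option="datafile"):
    """Read the data file location from an INI file and return it as a string."""
    # interpolation=None so a "%" in a path is taken literally
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(ini_path):  # read() returns the list of files it parsed
        raise FileNotFoundError("cannot read config file: %s" % ini_path)
    return parser.get(section, option)
```

Only the small .INI file has to sit next to the code; the large text file can then live wherever the entry points.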

Hope this helps.

-Larry Bates

 

Steven Bethard

Larry said:
Personally I would do this as a class and pass a path to where
the file is stored as an argument to instantiate it (maybe try
to help user if they don't pass it). Something like:

    class morph:
        def __init__(self, pathtodictionary=None):
            if pathtodictionary is None:
                # Insert code here to see if it is in the current
                # directory and/or look in other directories.
                pass
            try:
                self.fp = open(pathtodictionary, 'r')
            except (IOError, TypeError):
                print("unable to locate dictionary at: %s" % pathtodictionary)
            else:
                # Insert code here to load data from .txt file
                self.fp.close()

        def get_stem(self, arg1, arg2):
            # Code for get_stem method
            pass

Actually, this is basically what I have right now. It bothers me a
little because you can get two instances of "morph", with two separate
dictionaries loaded. Since they're all loading the same file, it
doesn't seem like there should be multiple instances. I know I could
use a singleton pattern, but aren't modules basically the singletons of
Python?

Larry said:
The other way I've done this is to have a .INI file that always lives
in the same directory as the class with an entry in it that points me
to where the .txt file lives.

That's a thought. Thanks.

Steve
 

manuelg

I have tried several ways, and this is the one I like best (I develop on
Windows, but this technique should work on *NIX for your application):

:: \whereever\whereever\        (the directory your module is in, obviously
                                somewhere where PYTHONPATH can see it)

:::: stevemodule.py             (your module)

:::: stevemodule_workfiles\     (a subdirectory in the same directory as
                                your module)

:::::: __init__.py              (an empty file in stevemodule_workfiles\,
                                only here to make stevemodule_workfiles\
                                look like a package)

:::::: stevelargetextfile.txt   (your large textfile in
                                stevemodule_workfiles\)

Now, to load the large textfile, I agree that it should be done with
module functions, so if it gets used several times in the same process,
it is only loaded once. The Python module itself follows the
"singleton" pattern, so you get that behavior for free.

Here is the Python code for loading the file:

    import os.path
    import stevemodule_workfiles

    # Directory that holds the stevemodule_workfiles package
    workfiles_path = os.path.split(stevemodule_workfiles.__file__)[0]

    stevelargetextfile_fullpath = os.path.join(
        workfiles_path, 'stevelargetextfile.txt')

    stevelargetextfile_file = open(stevelargetextfile_fullpath)
 
