File to dict

mrkafk · Dec 7, 2007

Hello everyone,

I have written this small utility function for transforming a legacy
file into a Python dict:

def lookupdmo(domain):
    lines = open('/etc/virtual/domainowners','r').readlines()
    lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in lines]
    lines = [ x for x in lines if len(x) == 2 ]
    d = dict()
    for line in lines:
        d[line[0]]=line[1]
    return d[domain]

The /etc/virtual/domainowners file contains colon-separated
entries:
domain1.tld: owner1
domain2.tld: own2
domain3.another: somebody
....

Now, the above lookupdmo function works. However, it's rather tedious
to transform files into dicts this way and I have quite a lot of such
files to transform (like custom 'passwd' files for virtual email
accounts etc).

Is there a cleverer / more pythonic way of parsing files like
this? Say, I would like to transform a file containing entries like
the following into a list of lists with colons treated as
separators, i.e. this:

tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin

would get transformed into this:

[ ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm', '/sbin/nologin'], [...], [...] ]

Chris · Dec 7, 2007

For the first one, you are parsing the entire file every time you want
to look up just one domain. If the data is reused several times
while your code runs, consider storing it so that each lookup is
just a dictionary access, e.g.

_domain_dict = dict()
def generate_dict(input_file):
    finput = open(input_file, 'rb')
    global _domain_dict
    for each_line in enumerate(finput):
        line = each_line.strip().split(':')
        if len(line)==2: _domain_dict[line[0]] = line[1]

    finput.close()

def domain_lookup(domain_name):
    global _domain_dict
    try:
        return _domain_dict[domain_name]
    except KeyError:
        return 'Unknown.Domain'

Your second parsing example would be a simple case of:

finput = open('input_file.ext', 'rb')
results_list = []
for each_line in enumerate(finput.readlines()):
    results_list.append( each_line.strip().split(':') )
finput.close()
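
For readers who want something runnable, here is a minimal sketch of the same idea without enumerate(), which yields (index, line) tuples rather than lines. It assumes Python 3; the function name parse_colon_lines and the in-memory sample are hypothetical stand-ins for a real file.

```python
import io

def parse_colon_lines(lines):
    """Split each non-empty line on ':' and return a list of field lists."""
    rows = []
    for line in lines:
        line = line.rstrip('\n')
        if line:  # skip blank lines
            rows.append(line.split(':'))
    return rows

# Hypothetical sample standing in for a real passwd-style file.
sample = io.StringIO(
    "tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin\n"
)
rows = parse_colon_lines(sample)
print(rows)
```

Any iterable of lines works here, so an open file object can be passed in directly.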

Duncan Booth · Dec 7, 2007

Just some minor points without changing the basis of what you have done
here:

Don't bother with 'readlines'; file objects are directly iterable.

Why are you calling both lstrip and rstrip? The strip method strips
whitespace from both ends for you.

It is usually a good idea with code like this to limit the split method to
a single split in case there is more than one colon on the line: i.e.
x.split(':',1)

When you have a sequence whose elements are sequences with two elements
(which is what you have here), you can construct a dict directly from the
sequence.
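
A small illustration of the last two points, assuming Python 3. The sample strings are made up for the example and are not taken from the real file.

```python
# maxsplit=1 splits only at the first colon, so later colons stay in the value.
fields = "domain1.tld: owner:with:colons".split(':', 1)
print(fields)

# dict() accepts any iterable of two-element sequences (lists or tuples).
pairs = [["domain1.tld", "owner1"], ("domain2.tld", "own2")]
owners = dict(pairs)
print(owners["domain2.tld"])
```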

But why do you construct a dict from that input data simply to throw it
away? If you only want 1 domain from the file just pick it out of the list.
If you want to do multiple lookups build the dict once and keep it around.

So something like the following (untested code):

from __future__ import with_statement

def loaddomainowners(domain):
    with open('/etc/virtual/domainowners','r') as infile:
        pairs = [ line.split(':',1) for line in infile if ':' in line ]
        pairs = [ (domain.strip(), owner.strip())
                  for (domain,owner) in pairs ]
        return dict(lines)

DOMAINOWNERS = loaddomainowners()

def lookupdmo(domain):
    return DOMAINOWNERS[domain]

Matt Nordhoff · Dec 7, 2007

What about this?

_domain_dict = dict()
def generate_dict(input_file):
    global _domain_dict
    # If it's already been run, do nothing. You might want to change
    # this.
    if _domain_dict:
        return
    fh = open(input_file, 'rb')
    try:
        for line in fh:
            line = line.strip().split(':', 1)
            if len(line) == 2:
                _domain_dict[line[0]] = line[1]
    finally:
        fh.close()

def domain_lookup(domain_name):
    return _domain_dict.get(domain_name)

I changed generate_dict to do nothing if it's already been run. (You
might want it to run again with a fresh dict, or throw an error or
something.)

I removed enumerate() because it's unnecessary (and wrong -- you were
trying to split a tuple of (index, line)).

I also changed the split to only split once, like Duncan Booth suggested.

The try-finally is to ensure that the file is closed if an exception is
thrown for some reason.

domain_lookup doesn't need to declare _domain_dict as global because
it's not assigning to it. .get() returns None if the key doesn't exist,
so now the function returns None. You might want to use a different
value or throw an exception (use _domain_dict[domain_name] and not catch
the KeyError if it doesn't exist, perhaps).
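
The three lookup behaviours mentioned above can be sketched side by side. This assumes Python 3, and the function names and sample data are hypothetical. None of the functions needs a global declaration, because they only read the module-level dict.

```python
_domain_dict = {"domain1.tld": "owner1"}  # hypothetical sample data

def lookup_or_none(name):
    # .get() returns None when the key is missing.
    return _domain_dict.get(name)

def lookup_or_default(name, default="Unknown.Domain"):
    # .get() with a second argument returns that value instead of None.
    return _domain_dict.get(name, default)

def lookup_strict(name):
    # Plain indexing raises KeyError for a missing key.
    return _domain_dict[name]

print(lookup_or_none("missing.tld"))
print(lookup_or_default("missing.tld"))
```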

Other than that, I just reformatted it and renamed variables, because I
do that.

Matt Nordhoff · Dec 7, 2007

Using two list comprehensions means you construct two lists, which sucks
if it's a large file.

Also, you could pass the list comprehension (or better yet a generator
expression) directly to dict() without saving it to a variable:

with open('/etc/virtual/domainowners','r') as fh:
    return dict(line.strip().split(':', 1) for line in fh)

(Argh, that doesn't .strip() the key and value, which means it won't
work, but it's so simple and elegant and I'm tired enough that I'm not
going to add that. Just use another genexp. Makes for a line
complicated enough that it could be turned into a for loop, though.)

Chris · Dec 7, 2007

Ta Matt, wasn't paying attention to what I typed.

And I didn't know that about .get() and not having to declare the
global.
Thanks for my mandatory new thing for the day.

Bruno Desthuilliers · Dec 7, 2007

mrkafk said:

Hello everyone,
(snip)

Say, I would like to transform a file containing entries like
the following into a list of lists with colons treated as
separators, i.e. this:

tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin

would get transformed into this:

[ ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm', '/sbin/nologin'], [...], [...] ]

The csv module is your friend.
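
As a rough sketch of what that could look like (assuming Python 3): csv.reader takes a delimiter argument, and csv.QUOTE_NONE turns off quote handling so that quote characters in a field are kept literally. The in-memory sample is a stand-in for the real file; with a real file, the csv documentation recommends opening it with newline=''.

```python
import csv
import io

# Hypothetical sample standing in for a real passwd-style file.
sample = io.StringIO(
    "tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin\n"
)

# Each row comes back as a list of strings; the empty field becomes ''.
reader = csv.reader(sample, delimiter=':', quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows)
```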

Matt Nordhoff · Dec 7, 2007

Chris said:
Ta Matt, wasn't paying attention to what I typed.
And didn't know that about .get() and not having to declare the
global.
Thanks for my mandatory new thing for the day

mrkafk · Dec 7, 2007

Duncan said:
Just some minor points without changing the basis of what you have done
here:

All good points, thanks. Phew, there's nothing like peer review for
your code...

But why do you construct a dict from that input data simply to throw it
away?

Because comparing strings for equality in a loop is writing C in
Python, and that's exactly what I'm trying to unlearn.

The proper way to do it is to produce a dictionary and look up a value
using a key.

If you only want 1 domain from the file just pick it out of the list.

for item in list:
    if item == 'searched.domain':
        return item...

Yuck.

with open('/etc/virtual/domainowners','r') as infile:
    pairs = [ line.split(':',1) for line in infile if ':' in line ]

Didn't think about doing it this way. Good point. Thx

mrkafk · Dec 7, 2007

The csv module is your friend.

(slapping forehead) why the Holy Grail didn't I think about this? That
should be much simpler than using SimpleParse or SPARK.

Thx Bruno & everyone.

Marc 'BlackJack' Rintsch · Dec 7, 2007

Because comparing strings for equality in a loop is writing C in
Python, and that's exactly what I'm trying to unlearn.

The proper way to do it is to produce a dictionary and look up a value
using a key.

for item in list:
    if item == 'searched.domain':
        return item...

Yuck.

I guess Duncan's point wasn't the construction of the dictionary but the
throw it away part. If you don't keep it, the loop above is even more
efficient than building a dictionary with *all* lines of the file, just to
pick one value afterwards.
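
A one-off lookup of that kind might look like the sketch below, assuming Python 3. The name find_owner and the in-memory sample are hypothetical. The loop returns at the first match, so the rest of the file is never parsed.

```python
import io

def find_owner(lines, domain):
    """Return the owner for `domain`, or None, stopping at the first match."""
    for line in lines:
        key, sep, value = line.partition(':')
        # sep is '' when the line has no colon; such lines are skipped.
        if sep and key.strip() == domain:
            return value.strip()
    return None

# Hypothetical sample in the domainowners format.
text = "domain1.tld: owner1\ndomain2.tld: own2\ndomain3.another: somebody\n"
owner = find_owner(io.StringIO(text), "domain2.tld")
print(owner)
```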

Ciao,
Marc 'BlackJack' Rintsch

Bruno Desthuilliers · Dec 7, 2007

mrkafk said:

(slapping forehead) why the Holy Grail didn't I think about this?

If it makes you feel any better, a few years ago, I spent two days
writing my own (SquaredWheel(tm) of course) csv reader/writer... before
realizing there was such a thing as the csv module :-/

Should have known better...

Duncan Booth · Dec 7, 2007

Matt Nordhoff said:
Using two list comprehensions mean you construct two lists, which sucks
if it's a large file.

Only if it is very large. You aren't duplicating the data except for
entries with whitespace round them. If there isn't a lot of whitespace then
the extra overhead for duplicating the list is unlikely to be significant.

Also, you could pass the list comprehension (or better yet a generator
expression) directly to dict() without saving it to a variable:

with open('/etc/virtual/domainowners','r') as fh:
    return dict(line.strip().split(':', 1) for line in fh)

(Argh, that doesn't .strip() the key and value, which means it won't
work, but it's so simple and elegant and I'm tired enough that I'm not
going to add that. Just use another genexp. Makes for a line
complicated enough that it could be turned into a for loop, though.)

It isn't hard to convert my lists to generators keeping the structure
exactly the same (and fixing the typo):

def loaddomainowners(domain):
    with open('/etc/virtual/domainowners','r') as infile:
        pairs = (line.split(':',1) for line in infile if ':' in line)
        pairs = ((domain.strip(), owner.strip())
                 for (domain,owner) in pairs)
        return dict(pairs)
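
For anyone who wants to try this out, here is a self-contained sketch along the same lines, assuming Python 3. It differs from the code above in taking the file path as a parameter (the name load_domain_owners is hypothetical), and it uses a temporary file in place of /etc/virtual/domainowners. Note that dict(pairs) has to run inside the with block, because the generators read from the file lazily.

```python
import os
import tempfile

def load_domain_owners(path):
    """Build a {domain: owner} dict from a 'domain: owner' file."""
    with open(path, 'r') as infile:
        pairs = (line.split(':', 1) for line in infile if ':' in line)
        pairs = ((domain.strip(), owner.strip()) for domain, owner in pairs)
        # Consume the generators while the file is still open.
        return dict(pairs)

# Hypothetical sample data written to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("domain1.tld: owner1\nnot a pair\ndomain2.tld: own2\n")
try:
    owners = load_domain_owners(tmp.name)
finally:
    os.remove(tmp.name)
print(owners)
```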

mrkafk · Dec 7, 2007

I guess Duncan's point wasn't the construction of the dictionary but the
throw it away part. If you don't keep it, the loop above is even more
efficient than building a dictionary with *all* lines of the file, just to
pick one value afterwards.

Sure, but I have two options here, neither of them nice: either "write C
in Python" or do it in an inefficient and still elaborate way.

Anyway, I found my nirvana at last:

...     return x.replace(' ','').strip('\n').split(':',1)
...
>>> ownerslist = [ shelper(x)[1] for x in it if len(shelper(x)) == 2 and shelper(x)[0] == domain ]
>>> ownerslist
['da2']

Python rulez.

mrkafk · Dec 7, 2007

... return x.replace(' ','').strip('\n').split(':',1)

Argh, typo, should be def shelper(x) of course.

Neil Cerutti · Dec 7, 2007

from __future__ import with_statement

def loaddomainowners(domain):
    with open('/etc/virtual/domainowners','r') as infile:

I've been thinking I have to use contextlib.closing for
auto-closing files. Is that not so?

Neil Cerutti · Dec 7, 2007

Bruno Desthuilliers said:

If that can make you feel better, a few years ago, I spent two
days writing my own (SquaredWheel(tm) of course) csv
reader/writer... before realizing there was such a thing as the
csv module :-/

Should have known better...

But probably it has made you a better person.

Duncan Booth · Dec 7, 2007

Neil Cerutti said:
I've been thinking I have to use contextlib.closing for
auto-closing files. Is that not so?

That is not so.

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
...     print len(list(f))
...
40
<closed file 'diffs.txt', mode 'r' at 0x00AA0698>
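
The same check can be reproduced with the file object's closed attribute. This is a small sketch assuming Python 3, where the with statement needs no __future__ import; the temporary file is only there to have something to open.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on any real path.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path) as f:
        inside = f.closed   # still open inside the block
    after = f.closed        # closed automatically once the block exits
finally:
    os.remove(path)
print(inside, after)
```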

Glauco · Dec 7, 2007

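cache = None

def lookup( domain ):
    # Needed because cache is assigned below; without it the first
    # read of cache raises UnboundLocalError.
    global cache
    if not cache:
        cache = dict( [map( lambda x: x.strip(), x.split(':')) for x in
                       open('/etc/virtual/domainowners','r').readlines()])
    return cache.get(domain)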

Glauco

Neil Cerutti · Dec 7, 2007

Duncan Booth said:
Neil Cerutti said:
I've been thinking I have to use contextlib.closing for
auto-closing files. Is that not so?

That is not so.

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
...     print len(list(f))
...
40
<closed file 'diffs.txt', mode 'r' at 0x00AA0698>

Thanks. After seeing your answer I managed to find what I'd
overlooked before, in the docs for file.close:

As of Python 2.5, you can avoid having to call this method
explicitly if you use the with statement. For example, the
following code will automatically close f when the with block
is exited:

from __future__ import with_statement

with open("hello.txt") as f:
    for line in f:
        print line
