Where is the syntax for the dict() constructor?!

Captain Poutine

I'm simply trying to read a CSV into a dictionary.

(if it matters, it's ZIP codes and time zones, e.g.,
35983,CT
39161,CT
47240,EST)



Apparently the way to do this is:

import csv

dictZipZones = {}

reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    # Add the row to the dictionary


But how to do this?

Apparently there is no dict.append() nor dict.add()

But what is there? I see vague references to "the dict() constructor"
and some examples, and news that it has been recently improved. But
where is the full, current documentation for the dict() constructor?

Frustrated,
Captain Poutine
 
Chris Mellon

Captain Poutine said:
Apparently there is no dict.append() nor dict.add()

But what is there? I see vague references to "the dict() constructor"
and some examples, and news that it has been recently improved. But
where is the full, current documentation for the dict() constructor?

There's no dict.append or dict.add because a dict doesn't require
them. You insert keys via the normal indexing interface:

somedict[somekey] = somevalue

Depending on which direction your mapping goes (for example, from a time
zone to a list of ZIP codes), you may be interested in the setdefault method.

The dict constructor is described in section 2.1 of the Library Reference,
Built-in Functions (even though it's not a function anymore).

The other dict methods are described in section 3.8, Mapping Types.
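
As a rough illustration of the two suggestions above (plain item assignment, and setdefault for the reverse mapping), here is a minimal sketch. It assumes Python 3 and uses an in-memory sample in place of some.csv; the names sample, zip_to_zone and zone_to_zips are made up for the example.

```python
import csv
import io

# In-memory stand-in for some.csv (same sample rows as in the question).
sample = io.StringIO("35983,CT\n39161,CT\n47240,EST\n")

zip_to_zone = {}   # ZIP code -> time zone
zone_to_zips = {}  # time zone -> list of ZIP codes (the other direction)

for zip_code, zone in csv.reader(sample):
    # Plain item assignment inserts (or overwrites) a key.
    zip_to_zone[zip_code] = zone
    # setdefault returns the list stored for this zone, creating it first if needed.
    zone_to_zips.setdefault(zone, []).append(zip_code)

print(zip_to_zone)
print(zone_to_zips)
```

Item assignment is enough when each key has one value; setdefault only earns its keep when several rows share a key, as the ZIP codes per time zone do here.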
 
Neil Cerutti

Captain Poutine said:
I'm simply trying to read a CSV into a dictionary.

In addition to Chris's answer, the csv module can read and write
dictionaries directly. Look up csv.DictReader and csv.DictWriter.
 
Peter Otten

Neil said:
In addition to Chris's answer, the csv module can read and write
dictionaries directly. Look up csv.DictReader and csv.DictWriter.

DictReader gives one dict per row, with field names as keys. The OP is more
likely to want

dict(csv.reader(open("some.csv", "rb")))

which produces a dict that maps ZIP codes to time zones.

Peter
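
To make the difference between the two approaches concrete, here is a small sketch. It assumes Python 3, where a file for the csv module would be opened in text mode with newline="" rather than "rb"; an in-memory string stands in for some.csv, and the field names "zip" and "zone" are invented because the sample file has no header row.

```python
import csv
import io

data = "35983,CT\n39161,CT\n47240,EST\n"

# dict() accepts any iterable of two-item sequences, and csv.reader yields
# each row as a two-item list here, so the rows become key/value pairs.
zip_to_zone = dict(csv.reader(io.StringIO(data)))

# DictReader instead yields one dict per row, keyed by field name.
# The data has no header line, so the field names are supplied by hand.
rows = list(csv.DictReader(io.StringIO(data), fieldnames=["zip", "zone"]))

print(zip_to_zone)
print(rows[0])
```

With a real file under Python 3, the equivalent of the one-liner would be along the lines of dict(csv.reader(open("some.csv", newline=""))).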
 
Captain Poutine

Neil said:
In addition to Chris's answer, the csv module can read and write
dictionaries directly. Look up csv.DictReader and csv.DictWriter.

Yes, thanks. I was happy when I saw it at
http://www.python.org/doc/2.4/lib/node615.html


"Reader objects (DictReader instances and objects returned by the
reader() function) have the following public methods:

next( )
Return the next row of the reader's iterable object as a list,
parsed according to the current dialect."

But that's not enough information for me to use. Also, the doc says
basically "csv has dialects," but doesn't even enumerate them. Where is
the real documentation?


Also, when I do a print of row, it comes out as:
['12345', 'ET']

But there are no quotes around the number in the file. Why is Python
making it a string?
 
Captain Poutine

Peter said:
DictReader gives one dict per row, with field names as keys. The OP is more
likely to want

dict(csv.reader(open("some.csv", "rb")))

which produces a dict that maps ZIP codes to time zones.

Peter

Thanks Peter, that basically works, even if I don't understand it.

What does "rb" mean? (read binary?)
Why are the keys turned into strings (they are not quoted in the .csv file)?
 
Thomas Jollans

Captain Poutine said:
Thanks Peter, that basically works, even if I don't understand it.

What does "rb" mean? (read binary?)
Why are the keys turned into strings (they are not quoted in the .csv
file)?

"rb" is read, in binary mode. On DOS and its derivatives (including Windows),
text mode translates line endings, and binary mode prevents that translation
from corrupting data as it is read. (For plain ASCII text files, omitting the
b might be desirable...)

Think of csv.reader as a fancy variant of the following: (fancy in that it
supports things like non-comma separators and comma escaping)

def CSVReader(file):
    for line in file:
        # Drop the line ending, then split the line on commas.
        yield line.rstrip('\r\n').split(',')

--
Regards, Thomas Jollans
 
Neil Cerutti

Lucky for you and me, Peter Otten corrected my mistaken advice.

Captain Poutine said:
"Reader objects (DictReader instances and objects returned by
the reader() function) have the following public methods:

next( )
Return the next row of the reader's iterable object as a list,
parsed according to the current dialect."

But that's not enough information for me to use. Also, the doc
says basically "csv has dialects," but doesn't even enumerate
them. Where is the real documentation?

It's referring to the fact that csv has no standard to adhere to,
so sometimes slightly different ways of quoting and escaping
appear in csv files. Apart from that, through configuring a new
dialect you can use csv to parse many kinds of delimited data
files, not just csv as written by MS apps.

Mostly you can use the default 'excel' dialect and be quite
happy, since Excel is the main reason anybody still cares about
this unnecessarily hard to parse (it requires more than one
character of lookahead for no reason except bad design) data
format.

See Library Reference 9.1.2 Dialects and Formatting Parameters
for an explanation of what can be configured.

Captain Poutine said:
Also, when I do a print of row, it comes out as:
['12345', 'ET']

But there are no quotes around the number in the file. Why is
Python making it a string?

It's a string to start with, since it comes from a text file.
Besides, a string is an excellent representation for a zip code,
since arithmetic upon them is unthinkable.

I shared your frustration with the csv module docs when I first
read them. But happily you can skip them and just read
the easily adapted examples (9.1.5 Examples).
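
On the question of enumerating dialects: the csv module can list the registered dialect names at run time, and a new dialect is just a named set of formatting parameters. The sketch below assumes Python 3; the dialect name "pipes" and the sample data are invented for the example.

```python
import csv
import io

# The dialect names registered with the csv module can be listed at run time.
print(csv.list_dialects())

# A new dialect is a named bundle of formatting parameters.
# "pipes" is a made-up name for this example.
csv.register_dialect("pipes", delimiter="|", quotechar='"')

sample = io.StringIO('12345|ET\n67890|"PT|odd"\n')
rows = list(csv.reader(sample, dialect="pipes"))
print(rows)
```

For a one-off file it is usually simpler to skip registration and pass the parameters straight to the reader, as in csv.reader(f, delimiter="|").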
 
Nis Jørgensen

Neil Cerutti wrote:
Mostly you can use the default 'excel' dialect and be quite
happy, since Excel is the main reason anybody still cares about
this unnecessarily hard to parse (it requires more than one
character of lookahead for no reason except bad design) data
format.

I knew there had to be a reason why everyone is using xml these days ...

Nis
 
Wildemar Wildenburger

Nis said:
I knew there had to be a reason why everyone is using xml these days ...

Are you serious? You want to cram tabular data into some tree-oriented
data format? Now THAT is evangelism. Why is it a bad design choice to
separate values by a specific delimiter, thus emulating a tabular
arrangement? I really don't get it. That's the natural format for tabular
data. Period.

/W
 
John Machin

Neil Cerutti said:
Mostly you can use the default 'excel' dialect and be quite
happy, since Excel is the main reason anybody still cares about
this unnecessarily hard to parse (it requires more than one
character of lookahead for no reason except bad design) data
format.

One cares about this format because people create data files of
millions of rows (far exceeding the capacity of Excel (pre-2007)) in
many imaginative xSV dialects, some of which are not handled by the
Python csv module.

I don't know what you mean by "requires more than one
character of lookahead" -- any non-Mickey-Mouse implementation of a
csv reader will use a finite state machine with about half-a-dozen
states, and data structures no more complicated than (1) completed
rows received so far (2) completed fields in current row (3) bytes in
current field. When a new input byte arrives, what to do can be
determined based on only that byte and the current state; no look-
ahead into the input stream is required, nor is any look-back into
those data structures.
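
A stripped-down sketch of such a state machine, for illustration only: it is not how the csv module is implemented, it uses four states rather than half a dozen, and it assumes "\n" line endings. The function and state names are invented. Each character is handled using only the current state, with no lookahead.

```python
# States of the machine.
START_FIELD, IN_UNQUOTED, IN_QUOTED, QUOTE_SEEN = range(4)

def parse_csv(text, delimiter=",", quote='"'):
    """Parse CSV text one character at a time, without any lookahead."""
    rows, row, field = [], [], []
    state = START_FIELD

    def end_field():
        row.append("".join(field))
        del field[:]

    def end_row():
        end_field()
        rows.append(list(row))
        del row[:]

    for ch in text:
        if state == IN_QUOTED:
            if ch == quote:
                state = QUOTE_SEEN      # closing quote, or first half of ""
            else:
                field.append(ch)        # delimiters and newlines are literal here
        elif state == QUOTE_SEEN and ch == quote:
            field.append(quote)         # "" inside quotes is one literal quote
            state = IN_QUOTED
        elif ch == delimiter:
            end_field()
            state = START_FIELD
        elif ch == "\n":
            end_row()
            state = START_FIELD
        elif state == START_FIELD and ch == quote:
            state = IN_QUOTED
        else:
            field.append(ch)
            state = IN_UNQUOTED
    # Flush a final record that has no trailing newline.
    if state != START_FIELD or field or row:
        end_row()
    return rows

print(parse_csv('35983,CT\n"a,""b",c\n'))
```

The doubled quote is resolved by the QUOTE_SEEN state: after a quote inside a quoted field, the next character decides whether the quote was an escape or the end of the field.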
 
Alex Martelli

Neil Cerutti said:
Besides, a string is an excellent representation for a zip code,
since arithmetic upon them is unthinkable.

Absolutely! Excel, unless you remedied that later with a column
operation, would turn some East Coast zipcodes into 3- and 4-digit
numbers (dropping the leading 0s), exactly because it's so obtuse as to
try to "treat as numbers" entries that are all-digits.

Funny enough, that very issue took at least 5 minutes of lecture time at
a recent lecture in a course on "statistical approaches to data mining"
that I'm following (hopefully it will turn to slightly more advanced
issues soon:): Excel (!) and R are the two "recommended" programs for
the course, and substantial parts of the lecture times so far have been
spent illustrating various foibles of each. Still, another guy who's
taking the class had a funny and relevant war story: at one point a
company he was working for did a mass mailing (paper mail)... and a big
bag of the mail was returned as undeliverable by a pretty peeved US
Post Office... the "mail merge" they had done apparently involved Excel
(or some other spreadsheet program) at some point, and many East Coast
zipcodes had indeed been truncated (which messes with USPO's automatic
system and thus is NOT tolerated, at least in bulk mail)...!-)


Alex
 
Dan Bishop

Alex Martelli said:
Absolutely! Excel, unless you remedied that later with a column
operation, would turn some East Coast zipcodes into 3- and 4-digit
numbers (dropping the leading 0s), exactly because it's so obtuse as to
try to "treat as numbers" entries that are all-digits.

What's even worse is having it treat credit card numbers as numbers.
Which wouldn't be so much of a problem if they were 15 digits. But
they're 16 digits.
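
The arithmetic behind both complaints can be sketched in Python. This is only an analogy, not a description of what any spreadsheet does internally: Python's float is an IEEE 754 double, which cannot represent every 16-digit integer exactly, and the 16-digit number below is made up.

```python
# A ZIP code converted to a number loses its leading zero.
zip_text = "02134"
print(int(zip_text))

# A made-up 16-digit number. Doubles carry 53 bits of precision, so integers
# above 2**53 (about 9.007e15) are not all representable.
card_text = "9999999999999999"
as_float = float(card_text)
print(int(as_float))                    # no longer the original digits
print(int(as_float) == int(card_text))  # False
```

Keeping such identifiers as strings, as the csv module does by default, sidesteps both problems.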
 
Hendrik van Rooyen

John Machin said:
I don't know what you mean by "requires more than one
character of lookahead" -- any non-Mickey-Mouse implementation of a
csv reader will use a finite state machine with about half-a-dozen
states, and data structures no more complicated than (1) completed
rows received so far (2) completed fields in current row (3) bytes in
current field. When a new input byte arrives, what to do can be
determined based on only that byte and the current state; no look-
ahead into the input stream is required, nor is any look-back into
those data structures.

True.

You can even do it more simply - by writing a GetField() that
scans for either the delimiter or end of line or end of file, and
returns the "field" found, along with the delimiter that caused
it to exit, and then writing a GetRecord() that repetitively calls
the GetField and assembles the row record until the delimiter
returned is either the end of line or the end of file, remembering
that the returned field may be empty, and handling the cases based
on the delimiter returned when it is.

This also makes all the decisions based on the current character
read, no lookahead as far as I can see.

Also no state variables, no switch statements...

Is this the method that you would call "Mickey Mouse"?

Actually I lie about the no state variables - you have to keep track
of where you are in the file - but calling read(1) will do it for you,
so no worries, mate...

*wondering if someone will call him on the current row number
as state variable*

- Hendrik
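
A minimal sketch of the GetField()/GetRecord() scheme described above, assuming Python 3 and a text stream read with read(1). The snake_case names are adaptations, and like the description it deliberately ignores quoting and escaping.

```python
import io

def get_field(stream, delimiter=","):
    """Read one character at a time; return (field, terminator).

    The terminator is the delimiter, a newline, or "" at end of file.
    """
    chars = []
    while True:
        ch = stream.read(1)
        if ch in (delimiter, "\n", ""):
            return "".join(chars), ch
        chars.append(ch)

def get_record(stream, delimiter=","):
    """Collect fields until end of line or end of file; None when exhausted."""
    record = []
    while True:
        field, terminator = get_field(stream, delimiter)
        if terminator == "" and not field and not record:
            return None          # nothing left to read
        record.append(field)
        if terminator != delimiter:
            return record

stream = io.StringIO("35983,CT\n47240,,EST")
first = get_record(stream)
second = get_record(stream)
third = get_record(stream)
print(first, second, third)
```

The only state carried between calls is the stream position, which read(1) advances on its own.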
 
Marc 'BlackJack' Rintsch

Hendrik van Rooyen said:
True.

You can even do it more simply - by writing a GetField() that
scans for either the delimiter or end of line or end of file, and
returns the "field" found, along with the delimiter that caused
it to exit, and then writing a GetRecord() that repetitively calls
the GetField and assembles the row record until the delimiter
returned is either the end of line or the end of file, remembering
that the returned field may be empty, and handling the cases based
on the delimiter returned when it is.

This also makes all the decisions based on the current character
read, no lookahead as far as I can see.

Also no state variables, no switch statements...

Is this the method that you would call "Mickey Mouse"?

Maybe, because you've left out all handling of quoting and escape
characters here. Consider this:

erik,viking,"ham, spam and eggs","He said ""Ni!""","line one
line two"

That's 5 elements:

1: erik
2: viking
3: ham, spam and eggs
4: He said "Ni!"
5: line one
line two

Ciao,
Marc 'BlackJack' Rintsch
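
For anyone who wants to check the example above, here is a short sketch that feeds it to the csv module with the default dialect. It assumes Python 3 and keeps the data in a string instead of a file.

```python
import csv
import io

text = 'erik,viking,"ham, spam and eggs","He said ""Ni!""","line one\nline two"\n'

# newline="" keeps the embedded line break exactly as written.
rows = list(csv.reader(io.StringIO(text, newline="")))

for number, item in enumerate(rows[0], 1):
    print(number, repr(item))
```

The whole input comes back as a single record, with the line break preserved inside the fifth item.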
 
Neil Cerutti

John Machin said:
One cares about this format because people create data files of
millions of rows (far exceeding the capacity of Excel (pre-2007)) in
many imaginative xSV dialects, some of which are not handled by the
Python csv module.

I don't know what you mean by "requires more than one
character of lookahead"

It's because of the silly way that quotes are quoted in quoted
fields.

"a,""b",c

But I'm not a parsing expert by any means.
 
Neil Cerutti

Neil Cerutti said:
It's because of the silly way that quotes are quoted in quoted
fields.

"a,""b",c

But I'm not a parsing expert by any means.

Moreover, the most common version of csv uses both escape and
shift codes, when only escape codes were really needed, and then
compounds this stupidity by using the same character for escaping
and shifting.
 
Hendrik van Rooyen

Marc 'BlackJack' Rintsch said:
Maybe, because you've left out all handling of quoting and escape
characters here. Consider this:

erik,viking,"ham, spam and eggs","He said ""Ni!""","line one
line two"

That's 5 elements:

1: erik
2: viking
3: ham, spam and eggs
4: He said "Ni!"
5: line one
line two

Also true - What can I say - I can only wriggle and mutter...

I see that you escaped the quotes by doubling them up -
What would the following parse to?:

erik,viking,ham, spam and eggs,He said "Ni!",line one
line two

- Hendrik
 
Marc 'BlackJack' Rintsch

Hendrik van Rooyen said:
Also true - What can I say - I can only wriggle and mutter...

I see that you escaped the quotes by doubling them up -

That's how Excel and the `csv` module do it.

Hendrik van Rooyen said:
What would the following parse to?:

erik,viking,ham, spam and eggs,He said "Ni!",line one
line two

Why don't you try yourself? The `csv` module returns two records, the
first has six items:

1: erik
2: viking
3: ham
4: spam and eggs
5: He said "Ni!"
6: line one

'line two' is the only item in the next record then.

Ciao,
Marc 'BlackJack' Rintsch
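
Trying it takes only a few lines. This sketch assumes Python 3 and the default 'excel' dialect, with the unquoted input held in a string.

```python
import csv
import io

text = 'erik,viking,ham, spam and eggs,He said "Ni!",line one\nline two\n'
rows = list(csv.reader(io.StringIO(text, newline="")))

for row in rows:
    print(len(row), row)
```

One detail the listing above glosses over: with the default settings the fourth item keeps its leading space, because skipinitialspace is off unless the dialect or the reader call turns it on.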
 
