Elementary string-parsing

Odysseus

I'm writing my first 'real' program, i.e. that has a purpose aside from
serving as a learning exercise. I'm posting to solicit comments about my
efforts at translating strings from an external source into useful data,
regarding efficiency and 'pythonicity' both. My only significant
programming experience is in PostScript, and I feel that I haven't yet
'found my feet' concerning the object-oriented aspects of Python, so I'd
be especially interested to know where I may be neglecting to take
advantage of them.

My input is in the form of correlated lists of strings, which I want to
merge (while ignoring some extraneous items). I populate a dictionary
called "found" with these data, still in string form. It contains
sub-dictionaries of various items keyed to strings extracted from the
list "names"; these sub-dictionaries in turn contain the associated
items I want from "cells". After loading in the strings (I have omitted
the statements that pick up strings that require no further processing,
some of them coming from a third list), I convert selected items in
place. Here's the function I wrote:

def extract_data():
    i = 0
    while i < len(names):
        name = names[i][6:] # strip off "Name: "
        found[name] = {'epoch1': cells[10 * i + na],
                       'epoch2': cells[10 * i + na + 1],
                       'time': cells[10 * i + na + 5],
                       'score1': cells[10 * i + na + 6],
                       'score2': cells[10 * i + na + 7]}
###
Following is my first parsing step, for those data that represent real
numbers. The two obstacles I'm contending with here are that the figures
have commas grouping the digits in threes, and that sometimes the data
are non-numeric -- I'll deal with those later. Is there a more elegant
way of removing the commas than the split-and-rejoin below?
###
        for k in ('time', 'score1', 'score2'):
            v = found[name][k]
            if v != "---" and v != "n/a": # skip non-numeric data
                v = ''.join(v.split(",")) # remove commas between 000s
                found[name][k] = float(v)
###
The next one is much messier. A couple of the strings represent times,
which I think will be most useful in 'native' form, but the input is in
the format "DD Mth YYYY HH:MM:SS UTC". Near the beginning of my program
I have "from calendar import timegm". Before I can feed the data to this
function, though, I have to convert the month abbreviation to a number.
I couldn't come up with anything more elegant than look-up from a list:
the relevant part of my initialization is
'''
m_abbrevs = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
'''
I'm also rather unhappy with the way I kluged the seventh and eighth
values in the tuple passed to timegm, the order of the date in the week
and in the year respectively. (I would hate to have to calculate them.)
The function doesn't seem to care what values I give it for these -- as
long as I don't omit them -- so I guess they're only there for the sake
of matching the output of the inverse function. Is there a version of
timegm that takes a tuple of only six (or seven) elements, or any better
way to handle this situation?
###
        for k in ('epoch1', 'epoch2'):
            dlist = found[name][k].split(" ")
            m = 0
            while m < 12:
                if m_abbrevs[m] == dlist[1]:
                    dlist[1] = m + 1
                    break
                m += 1
            tlist = dlist[3].split(":")
            found[name][k] = timegm((int(dlist[2]), int(dlist[1]),
                                     int(dlist[0]), int(tlist[0]),
                                     int(tlist[1]), int(tlist[2]),
                                     -1, -1, 0))
        i += 1

The function appears to be working OK as is, but I would welcome any &
all suggestions for improving it or making it more idiomatic.
 
Dennis Lee Bieber

My input is in the form of correlated lists of strings, which I want to
merge (while ignoring some extraneous items). I populate a dictionary
called "found" with these data, still in string form. It contains
sub-dictionaries of various items keyed to strings extracted from the
list "names"; these sub-dictionaries in turn contain the associated
items I want from "cells". After loading in the strings (I have omitted
the statements that pick up strings that require no further processing,
some of them coming from a third list), I convert selected items in
place. Here's the function I wrote:
Rather complicated description... A sample of the real/actual input
/file/ would be useful.

def extract_data():
    i = 0
    while i < len(names):

Replace that "while" loop with

    for i in range(len(names)):

        name = names[i][6:] # strip off "Name: "

    cellRoot = 10 * i + na    # where did na come from?
                              # heck, where do names and cells
                              # come from? Globals? Not recommended..

use

    def extract_data(names, na, cells):

and

    return <something>

        found[name] = {'epoch1': cells[10 * i + na],
                       'epoch2': cells[10 * i + na + 1],
                       'time': cells[10 * i + na + 5],
                       'score1': cells[10 * i + na + 6],
                       'score2': cells[10 * i + na + 7]}

which becomes

    found[name] = { "epoch1" : cells[cellRoot],
                    "epoch2" : cells[cellRoot + 1],
                    ... }
    # same modification on rest -- avoid recomputing the 10*i+na
###
Following is my first parsing step, for those data that represent real
numbers. The two obstacles I'm contending with here are that the figures
have commas grouping the digits in threes, and that sometimes the data
are non-numeric -- I'll deal with those later. Is there a more elegant
way of removing the commas than the split-and-rejoin below?
###
for k in ('time', 'score1', 'score2'):
    v = found[name][k]
    if v != "---" and v != "n/a": # skip non-numeric data
        v = ''.join(v.split(",")) # remove commas between 000s
        found[name][k] = float(v)
###

I'd suggest splitting this into a short function, and invoking it in
the preceding... say it is called "parsed"

"time" : parsed(cells[cellRoot + 5]),
The next one is much messier. A couple of the strings represent times,
which I think will be most useful in 'native' form, but the input is in
the format "DD Mth YYYY HH:MM:SS UTC". Near the beginning of my program
I have "from calendar import timegm". Before I can feed the data to this
function, though, I have to convert the month abbreviation to a number.
I couldn't come up with anything more elegant than look-up from a list:
the relevant part of my initialization is

Did you check the library for time/date parsing/formatting
operations?

>>> import time
>>> time.strptime("03 Feb 2008 20:35:46 UTC", "%d %b %Y %H:%M:%S %Z")
(2008, 2, 3, 20, 35, 46, 6, 34, 0)

Note that the %Z is a problematic entry... Per the help file
"""
Support for the %Z directive is based on the values contained in tzname
and whether daylight is true. Because of this, it is platform-specific
except for recognizing UTC and GMT which are always known (and are
considered to be non-daylight savings timezones).
"""

>>> time.strptime("03 Feb 2008 20:35:46 PST", "%d %b %Y %H:%M:%S %Z")
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "E:\Python24\lib\_strptime.py", line 293, in strptime
    raise ValueError("time data did not match format: data=%s fmt=%s" %
ValueError: time data did not match format: data=03 Feb 2008 20:35:46
PST fmt=%d %b %Y %H:%M:%S %Z
 
Marc 'BlackJack' Rintsch

def extract_data():
    i = 0
    while i < len(names):
        name = names[i][6:] # strip off "Name: "
        found[name] = {'epoch1': cells[10 * i + na],
                       'epoch2': cells[10 * i + na + 1],
                       'time': cells[10 * i + na + 5],
                       'score1': cells[10 * i + na + 6],
                       'score2': cells[10 * i + na + 7]}


Here and in later code you use a ``while`` loop although it is known at
loop start how many times the loop body will be executed. That's a job
for a ``for`` loop. If possible not over an integer that is used later
just as index into list, but the list itself. Here you need both, index
and objects from `names`. There's the `enumerate()` function for creating
an iterable of (index, name) from `names`.

I'd put all the relevant information that describes a field of the
dictionary that is put into `found` into tuples and loop over it. There
is the cell name, the index of the cell and function that converts the
string from that cell into an object that is stored in the dictionary.
This leads to (untested):

def extract_data(names, na, cells):
    found = dict()
    for i, name in enumerate(names):
        data = dict()
        cells_index = 10 * i + na
        for cell_name, index, parse in (('epoch1', 0, parse_date),
                                        ('epoch2', 1, parse_date),
                                        ('time', 5, parse_number),
                                        ('score1', 6, parse_number),
                                        ('score2', 7, parse_number)):
            data[cell_name] = parse(cells[cells_index + index])
        assert name.startswith('Name: ')
        found[name[6:]] = data
    return found

The `parse_number()` function could look like this:

def parse_number(string):
    try:
        return float(string.replace(',', ''))
    except ValueError:
        return string

Indeed the commas can be replaced a bit more elegantly. :)

`parse_date()` is left as an exercise for the reader.
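
If you want a starting point, a rough sketch (untested, and assuming every
value really has the form "DD Mth YYYY HH:MM:SS UTC") could be built on
`time.strptime()` and `calendar.timegm()`:

    from calendar import timegm
    from time import strptime

    def parse_date(string):
        # Chop off the trailing " UTC" and let `strptime` do the month
        # lookup; `timegm` then treats the resulting struct_time as UTC.
        # Caveat: %b is locale dependent.
        return timegm(strptime(string[:-4], '%d %b %Y %H:%M:%S'))

Like `parse_number()`, you could wrap the call in try/except ValueError if
some of those cells can turn out not to be dates at all.
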
for k in ('epoch1', 'epoch2'):
    dlist = found[name][k].split(" ")
    m = 0
    while m < 12:
        if m_abbrevs[m] == dlist[1]:
            dlist[1] = m + 1
            break
        m += 1
    tlist = dlist[3].split(":")
    found[name][k] = timegm((int(dlist[2]), int(dlist[1]),
                             int(dlist[0]), int(tlist[0]),
                             int(tlist[1]), int(tlist[2]),
                             -1, -1, 0))
i += 1

The function appears to be working OK as is, but I would welcome any &
all suggestions for improving it or making it more idiomatic.

As already said, that ``while`` loop should be a ``for`` loop. But if you
put `m_abbrevs` into a `list` you can replace the loop with a single call
to its `index()` method: ``dlist[1] = m_abbrevs.index(dlist[1]) + 1``.
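
A quick interactive check of that, for illustration:

    >>> m_abbrevs = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
    ...              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    >>> m_abbrevs.index("Feb") + 1
    2
    >>> m_abbrevs.index("Dec") + 1
    12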

Ciao,
Marc 'BlackJack' Rintsch
 
Odysseus

Here and in later code you use a ``while`` loop although it is known at
loop start how many times the loop body will be executed. That's a job
for a ``for`` loop. If possible not over an integer that is used later
just as index into list, but the list itself. Here you need both, index
and objects from `names`. There's the `enumerate()` function for creating
an iterable of (index, name) from `names`.

Thanks, that will be very useful. I was casting about for a replacement
for PostScript's "for" loop, and the "while" loop (which PS lacks -- and
which I've never missed there) was all I could come up with.
I'd put all the relevant information that describes a field of the
dictionary that is put into `found` into tuples and loop over it. There
is the cell name, the index of the cell and function that converts the
string from that cell into an object that is stored in the dictionary.
This leads to (untested):

def extract_data(names, na, cells):
    found = dict()

The problem with initializing the 'super-dictionary' within this
function is that I want to be able to add to it in further passes, with
a new set of "names" & "cells" each time.

BTW what's the difference between the above and "found = {}"?
for i, name in enumerate(names):
    data = dict()
    cells_index = 10 * i + na
    for cell_name, index, parse in (('epoch1', 0, parse_date),
                                    ('epoch2', 1, parse_date),
                                    ('time', 5, parse_number),
                                    ('score1', 6, parse_number),
                                    ('score2', 7, parse_number)):
        data[cell_name] = parse(cells[cells_index + index])

This looks a lot more efficient than my version, but what about the
strings that don't need parsing? Would it be better to define a
'pass-through' function that just returns its input, so they can be
handled by the same loop, or to handle them separately with another loop?
assert name.startswith('Name: ')

I looked up "assert", but all I could find relates to debugging. Not
that I think debugging is something I can do without ;) but I don't
understand what this line does.
found[name[6:]] = data
return found

The `parse_number()` function could look like this:

def parse_number(string):
    try:
        return float(string.replace(',', ''))
    except ValueError:
        return string

Indeed the commas can be replaced a bit more elegantly. :)

Nice, but I'm somewhat intimidated by the whole concept of
exception-handling (among others). How do you know to expect a
"ValueError" if the string isn't a representation of a number? Is there
a list of common exceptions somewhere? (Searching for "ValueError"
turned up hundreds of passing mentions, but I couldn't find a definition
or explanation.)

As already said, that ``while`` loop should be a ``for`` loop. But if you
put `m_abbrevs` into a `list` you can replace the loop with a single call
to its `index()` method: ``dlist[1] = m_abbrevs.index(dlist[1]) + 1``.

I had gathered that lists shouldn't be used for storing constants. Is
that more of a suggestion than a rule? I take it tuples don't have an
"index()" method.

Thanks for the detailed advice. I'll post back if I have any trouble
implementing your suggestions.
 
John Machin

BTW what's the difference between the above and "found = {}"?

{} takes 4 fewer keystrokes, doesn't have the overhead of a function
call, and works with Pythons at least as far back as 1.5.2 -- apart
from that, it's got absolutely nothing going for it ;-)
 
Odysseus

Rather complicated description... A sample of the real/actual input
/file/ would be useful.

Sorry, I didn't want to go on too long about the background, but I guess
more context would have helped. The data actually come from a web page;
I use a class based on SGMLParser to do the initial collection. The
items in the "names" list were originally "title" attributes of anchor
tags and are obtained with a "start_a" method, while "cells" holds the
contents of the <td> tags, obtained by a "handle_data" method according
to the state of a flag that's set to True by a "start_td" method and to
False by an "end_td". I don't care about anything else on the page, so I
didn't define most of the tag-specific methods available.

cellRoot = 10 * i + na    # where did na come from?
                          # heck, where do names and cells
                          # come from? Globals? Not recommended..

The variable "na" is the number of 'not applicable' items (headings and
whatnot) preceding the data I'm interested in.

I'm not clear on what makes an object global, other than appearing as an
operand of a "global" statement, which I don't use anywhere. But "na" is
assigned its value in the program body, not within any function: does
that make it global? Why is this not recommended? If I wrap the
assignment in a function, making "na" a local variable, how can
"extract_data" then access it?

The lists of data are attributes (?) of my SGMLParser class; in my
misguided attempt to pare irrelevant details from "extract_data" I
obfuscated this aspect. I have a "parse_page(url)" function that returns
an instance of the class, as "captured", and the lists in question are
actually called "captured.names" and "captured.cells". The
"parse_page(url)" function is called in the program body; does that make
its output global as well?
use

def extract_data(names, na, cells):

and

return <something>

What should it return? A Boolean indicating success or failure? All the
data I want should have been stored in the "found" dictionary by the time
the function finishes traversing the list of names.
for k in ('time', 'score1', 'score2'):
    v = found[name][k]
    if v != "---" and v != "n/a": # skip non-numeric data
        v = ''.join(v.split(",")) # remove commas between 000s
        found[name][k] = float(v)

I'd suggest splitting this into a short function, and invoking it in
the preceding... say it is called "parsed"

"time" : parsed(cells[cellRoot + 5]),

Will do. I guess part of my problem is that being unsure of myself I'm
reluctant to attempt too much in a single complex statement, finding it
easier to take small and simple (but inefficient) steps. I'll have to
learn to consolidate things as I go.
Did you check the library for time/date parsing/formatting
operations?

>>> time.strptime("03 Feb 2008 20:35:46 UTC", "%d %b %Y %H:%M:%S %Z")
(2008, 2, 3, 20, 35, 46, 6, 34, 0)

I looked at the documentation for the "time" module, including
"strptime", but I didn't realize the "%b" directive would match the
month abbreviations I'm dealing with. It's described as "Locale's
abbreviated month name"; if someone were to run my program on a French
system e.g., wouldn't it try to find a match among "jan", "fév", ...,
"déc" (or whatever) and fail? Is there a way to declare a "locale" that
will override the user's settings? Are the locale-specific strings
documented anywhere? Can one assume them to be identical in all
English-speaking countries, at least? Now it's pretty unlikely in this
case that such an 'international situation' will arise, but I didn't
want to burn any bridges ...

I was also somewhat put off "strptime" on reading the caveat "Note: This
function relies entirely on the underlying platform's C library for the
date parsing, and some of these libraries are buggy. There's nothing to
be done about this short of a new, portable implementation of
strptime()." If it works, however, it'll be a lot tidier than what I was
doing. I'll make a point of testing it on its own, with a variety of
inputs.
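
For instance, a rough sketch of the kind of check I have in mind (assuming
I strip the zone suffix off first):

    import time, calendar

    t = time.strptime("03 Feb 2008 20:35:46", "%d %b %Y %H:%M:%S")
    # Round-trip through timegm()/gmtime(); the fields should come back
    # unchanged if the value really is being treated as UTC.
    print time.gmtime(calendar.timegm(t))[:6]   # expect (2008, 2, 3, 20, 35, 46)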
Note that the %Z is a problematic entry...
ValueError: time data did not match format: data=03 Feb 2008
20:35:46 PST fmt=%d %b %Y %H:%M:%S %Z

All the times are UTC, so fortunately this is a non-issue for my
purposes of the moment. May I assume that leaving the zone out will
cause the time to be treated as UTC?

Thanks for your help, and for bearing with my elementary questions and
my fumbling about.
 
Marc 'BlackJack' Rintsch

I'm not clear on what makes an object global, other than appearing as an
operand of a "global" statement, which I don't use anywhere. But "na" is
assigned its value in the program body, not within any function: does
that make it global?

Yes. The term "global" usually means "module global" in Python.
Why is this not recommended?

Because the functions depend on some magic data coming from "nowhere" and
it's much harder to follow the data flow in a program. If you work with
globals you can't be sure what the following will print:

def spam():
    global x
    x = 42
    beep()
    print x

`beep()` might change `x` or any function called by `beep()` and so on.

Another issue is testing. If you rely on global names it's harder to test
individual functions. If I want to test your `extract_data()` I first have
to look through the whole function body and search all the global
references and bind those names to values before I can call the function.
This might not be enough: any function called by `extract_data()` might
need some global assignments too. This way you'll quite soon get to a
point where the individual parts of a program can't be tested in isolation
and aren't reusable in other programs.

In programs without such global names you see quite clearly in the
``def`` line what the function expects as input.
If I wrap the assignment in a function, making "na" a local variable, how
can "extract_data" then access it?

Give it as an argument. As a rule of thumb values should enter a function
as arguments and leave it as return values.

It's easy to "enforce" if you have minimal code on the module level. The
usual idiom is:

def main():
    # Main program comes here.

if __name__ == '__main__':
    main()

Then main() is called when the script is run as a program, but not if you
just import the script as a module -- for example to test functions or to
reuse the code from other scripts.
What should it return? A Boolean indicating success or failure? All the
data I want should have been stored in the "found" dictionary by the time
the function finishes traversing the list of names.

Then create the `found` dictionary in that function and return it at the
end.

Ciao,
Marc 'BlackJack' Rintsch
 
Marc 'BlackJack' Rintsch

The problem with initializing the 'super-dictionary' within this
function is that I want to be able to add to it in further passes, with
a new set of "names" & "cells" each time.

Then you can either pass in `found` as argument instead of creating it
here, or you collect the passes in the calling code with the `update()`
method of `dict`. Something like this:

found = dict()
for single_pass in passes:   # note: ``pass`` itself is a reserved word
    # ...
    found.update(extract_data(names, na, cells))
BTW what's the difference between the above and "found = {}"?

I find it more "explicit". ``dict`` and ``list`` are easier to
distinguish than ``{}`` and ``[]`` after a loooong coding session or when
printed/displayed in a small font. It's just a matter of taste.
for i, name in enumerate(names):
    data = dict()
    cells_index = 10 * i + na
    for cell_name, index, parse in (('epoch1', 0, parse_date),
                                    ('epoch2', 1, parse_date),
                                    ('time', 5, parse_number),
                                    ('score1', 6, parse_number),
                                    ('score2', 7, parse_number)):
        data[cell_name] = parse(cells[cells_index + index])

This looks a lot more efficient than my version, but what about the
strings that don't need parsing? Would it be better to define a
'pass-through' function that just returns its input, so they can be
handled by the same loop, or to handle them separately with another loop?

I'd handle them in the same loop. A "pass-through" function for strings
already exists:

In [255]: str('hello')
Out[255]: 'hello'
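
So in the tuple-driven loop above, a field that needs no conversion can
simply name `str` (or a trivial identity function) as its parser -- for
instance (the 'status' field and its index 8 here are made up):

    for cell_name, index, parse in (('epoch1', 0, parse_date),
                                    ('time', 5, parse_number),
                                    ('status', 8, str)):   # untouched string field
        data[cell_name] = parse(cells[cells_index + index])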
I looked up "assert", but all I could find relates to debugging. Not
that I think debugging is something I can do without ;) but I don't
understand what this line does.

It checks if `name` really starts with 'Name: '. This way I turned the
comment into code that checks the assertion in the comment.
Nice, but I'm somewhat intimidated by the whole concept of
exception-handling (among others). How do you know to expect a
"ValueError" if the string isn't a representation of a number?

Experience. I just tried what happens if I feed `float()` a string
that is not a number:

In [256]: float('abc')
---------------------------------------------------------------------------
<type 'exceptions.ValueError'> Traceback (most recent call last)

/home/bj/<ipython console> in <module>()

<type 'exceptions.ValueError'>: invalid literal for float(): abc

Is there a list of common exceptions somewhere? (Searching for
"ValueError" turned up hundreds of passing mentions, but I couldn't find
a definition or explanation.)

The definition is quite vague. The type of an argument is correct, but
there's something wrong with the value.

See http://docs.python.org/lib/module-exceptions.html for an overview of
the built in exceptions.
As already said, that ``while`` loop should be a ``for`` loop. But if
you put `m_abbrevs` into a `list` you can replace the loop with a
single call to its `index()` method: ``dlist[1] =
m_abbrevs.index(dlist[1]) + 1``.

I had gathered that lists shouldn't be used for storing constants. Is
that more of a suggestion than a rule?

Some suggest this. Others say tuples are for data where the position of
an element has a "meaning" and lists are for elements that all have the
same "meaning" for some definition of meaning. As an example ('John',
'Doe', 'Dr.') vs. ['Peter', 'Paul', 'Mary']. In the first example we have
name, surname, title and in the second example all elements are just
names. Unless the second example models a relation like child, father,
mother, or something like that. Anyway, if you can make the source simpler
and easier to understand by using the `index()` method, use a list. :)

Ciao,
Marc 'BlackJack' Rintsch
 
Dennis Lee Bieber

Thanks, that will be very useful. I was casting about for a replacement
for PostScript's "for" loop, and the "while" loop (which PS lacks -- and
which I've never missed there) was all I could come up with.
Have you read the language reference manual yet? It is a rather
short document given that the language syntactic elements are not that
complex -- but would have exposed you to the "for" statement (along with
"return" and passing arguments). If your only other programming
experience is base PostScript you wouldn't really be familiar with
passing arguments or returning values -- as an RPN stack-based language,
argument passing is just listing the arguments before a function call
(putting a copy of them on the stack), and returns are whatever the
function left on the stack at the end; hence they appear sort of global.

After the language reference manual, the library reference manual
chapter on built-ins and data types would be next for study -- the rest
can usually be handled via search functions (working with time
conversions, look for modules with date or time <G>).
The problem with initializing the 'super-dictionary' within this
function is that I want to be able to add to it in further passes, with
a new set of "names" & "cells" each time.
Well, you could initialize it in the "pass" loop, and pass it in
along with everything else...

Though (since I'm responding halfway in on a dialog, and don't have
the full prior level quoted)...

It looked a bit like you were using a SAX-style parser to collect
"names" and "cells" -- and then passing the "bunch" to another function
to trim out and convert data... It would take me a bit to restudy the
SAX parsing scheme (I did it once, back in the days of v1.5 or so) but
the way I'd /try/ to do it is to have the stream handler keep track of
which cell (<td> tag) is currently being parsed, and convert the string
data at that level. You'd initialize the record dictionary to {} (and
cell position to 0) on the <tr> tag, and return the populated record on
the </tr> tag.

Might want to check into making a class/instance of the parser so
you can make the record dictionary and column (cell) position instance
attributes (avoiding globals).
Nice, but I'm somewhat intimidated by the whole concept of
exception-handling (among others). How do you know to expect a
"ValueError" if the string isn't a representation of a number? Is there

Read the library reference for the function in question? Though it
appears the reference doesn't list the error raised for an invalid
string representation -- in which case just try one in the interactive
shell...

a list of common exceptions somewhere? (Searching for "ValueError"
turned up hundreds of passing mentions, but I couldn't find a definition
or explanation.)

Library reference -- under "Exception" (note the cap). {Section 2.4
Built-in Exceptions in my version of Python -- same chapter I mentioned
above about data types}
I had gathered that lists shouldn't be used for storing constants. Is
that more of a suggestion than a rule? I take it tuples don't have an
"index()" method.
Tuples are "read-only" whereas a list can be modified in-place...
But many look on tuples as a data "record" where each position has a
different meaning; lists are collections where each position has a
different value but the same meaning...

Tuple: (name, street, city, state, zip)
List: [name1, name2, name3, name...]

for name in List:
    # makes sense as you would do the same process on each name

for field in Tuple:
    # does NOT make sense; why would you do the same
    # process on street and zip, for example

 
Odysseus

Dennis Lee Bieber said:
Have you read the language reference manual yet? It is a rather
short document given that the language syntactic elements are not that
complex -- but would have exposed you to the "for" statement (along with
"return" and passing arguments).

Sorry, translation problem: I am acquainted with Python's "for" -- if
far from fluent with it, so to speak -- but the PS operator that's most
similar (traversing a compound object, element by element, without any
explicit indexing or counting) is called "forall". PS's "for" loop is
similar to BASIC's (and ISTR Fortran's):

start_value increment end_value {procedure} for

I don't know the proper generic term -- "indexed loop"? -- but at any
rate it provides a counter, unlike Python's command of the same name.
If your only other programming experience is base PostScript you
wouldn't really be familiar with passing arguments or returning
values -- as an RPN stack-based language, argument passing is just
listing the arguments before a function call (putting a copy of them
on the stack), and returns are whatever the function left on the
stack at the end; hence they appear sort of global.

Working directly in the operand stack is efficient, but can make
interpretation by humans -- and debugging -- very difficult. So for the
sake of coder-friendliness it's generally advisable to use variables
(i.e. assign values to keys in a dictionary) in most cases instead of
passing values 'silently' via the stack. I'm beginning to realize that
for Python the situation is just about the opposite ...

Anyway, I have been reading the documentation on the website, but much
of the terminology is unfamiliar to me. When looking things up I seem to
get an inordinate number of 404 errors from links returned by the search
function, and often the language-reference or tutorial entries (if any)
are buried several pages down. In general I'm finding the docs rather
frustrating to navigate.
After the language reference manual, the library reference manual
chapter on built-ins and data types would be next for study -- the rest
can usually be handled via search functions (working with time
conversions, look for modules with date or time <G>).

As I mentioned elsethread, I did look at the "time" documentation; it
was there that I found a reference to the "calendar.timegm" function I
used in my first attempt.
It looked a bit like you were using a SAX-style parser to collect
"names" and "cells" -- and then passing the "bunch" to another function
to trim out and convert data... It would take me a bit to restudy the
SAX parsing scheme (I did it once, back in the days of v1.5 or so) but
the way I'd /try/ to do it is to have the stream handler keep track of
which cell (<td> tag) is currently being parsed, and convert the string
data at that level. You'd initialize the record dictionary to {} (and
cell position to 0) on the <tr> tag, and return the populated record on
the </tr> tag.

This is what my setup looks like -- mostly cribbed from _Dive Into
Python_ -- where "PageParser" is a class based on "SGMLParser":

from sgmllib import SGMLParser
from urllib import urlopen

# ...

def parse_page(url):
    usock = urlopen(url)
    parser = PageParser()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    return parser

# ...

captured = parse_page(base_url + suffix)

I only use "parse_page" the once at this stage, but my plan was to call
it repeatedly while varying "suffix" (depending on the data found by the
previous pass). On each pass the class will initialize itself, which is
why I was collecting the data into a 'standing' (global) dictionary. Are
you suggesting essentially that I'd do better to make the text-parsing
function into a method of "PageParser"? Can one add, to such a derived
class, methods that don't have prototypes in the parent?
Might want to check into making a class/instance of the parser so
you can make the record dictionary and column (cell) position instance
attributes (avoiding globals).

AFAICT my "captured" is an instance of "PageParser", but I'm unclear on
how I would add attributes to it -- and as things stand it will get
rebuilt from scratch each time a page is read in.
[...] I'm somewhat intimidated by the whole concept of
exception-handling (among others). How do you know to expect a
"ValueError" if the string isn't a representation of a number?

Read the library reference for the function in question? Though it
appears the reference doesn't list the error raised for an invalid
string representation -- in which case just try one in the interactive
shell...

Under "2.1 Built-in Functions"

<http://docs.python.org/lib/built-in-funcs.html>

"""float([x])
Convert a string or a number to floating point. If the argument is a
string, it must contain a possibly signed decimal or floating point
number, possibly embedded in whitespace. Otherwise, the argument may be
a plain or long integer or a floating point number, and a floating point
number with the same value (within Python's floating point precision) is
returned. If no argument is given, returns 0.0.
Note: When passing in a string, values for NaN and Infinity may be
returned, depending on the underlying C library. The specific set of
strings accepted which cause these values to be returned depends
entirely on the C library and is known to vary.
"""

Not a word about errors. If I understand the note correctly,
"float('---')" might cause an error or might happily return "NaN", so it
appears experimentation is the only way to go. (On my system I get the
error, but if I wanted to run the program elsewhere, or share it, I
suppose the code would have to be tested in each environment.)
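
To be safe I suppose I could guard against that case explicitly -- a sketch
of what I have in mind (relying on the fact that NaN compares unequal to
itself, so no math.isnan is needed):

    def to_number(s):
        try:
            value = float(''.join(s.split(',')))
        except ValueError:
            return None    # non-numeric marker such as "---" or "n/a"
        # Some C libraries let float() accept "nan"/"inf" strings; NaN is
        # the only value unequal to itself, and inf - inf is also NaN, so
        # this rejects both.
        if value != value or value - value != 0:
            return None
        return value
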
Library reference -- under "Exception" (note the cap). {Section 2.4
Built-in Exceptions in my version of Python -- same chapter I mentioned
above about data types}

Thanks -- one would think my search should have directed me there ...
Tuples are "read-only" whereas a list can be modified in-place...
But many look on tuples as a data "record" where each position has a
different meaning; lists are collections where each position has a
different value but the same meaning...

Tuple: (name, street, city, state, zip)

Why wouldn't one use a dictionary for that?
List: [name1, name2, name3, name...]

for name in List:
    # makes sense as you would do the same process on each name

for field in Tuple:
    # does NOT make sense; why would you do the same
    # process on street and zip, for example

I see the distinction. Thanks again ...
 
Odysseus

The term "global" usually means "module global" in Python.

Because they're like the objects obtained from "import"?
[T]he functions depend on some magic data coming from "nowhere" and
it's much harder to follow the data flow in a program. If you work
with globals you can't be sure what the following will print:

def spam():
    global x
    x = 42
    beep()
    print x

`beep()` might change `x` or any function called by `beep()` and so on.

I think I get the general point, but couldn't "beep()" get at "x" even
without the "global" statement, since they're both in "spam()"?

It seems natural to me to give the most important objects in a program
persistent names: I guess this something of a 'security blanket' I need
to wean myself from. I can appreciate the benefits of
context-independence when it comes to reusing code.
Another issue is testing. If you rely on global names it's harder to test
individual functions. [...]

In programs without such global names you see quite clearly in the
``def`` line what the function expects as input.

Good points, although thorough commenting can go a long way to help on
both counts. In theory, at least ...

It's easy to "enforce" if you have minimal code on the module level. The
usual idiom is:

def main():
    # Main program comes here.

if __name__ == '__main__':
    main()

Then main is called when the script is called as program, but not called if
you just import the script as module. For example to test functions or to
reuse the code from other scripts.

I'm using "if __name__ == 'main'" now, but only for test inputs (which
will eventually be read from a config file or passed by the calling
script -- or something). I hadn't thought of putting code that actually
does something there. As for writing modules, that's way beyond where I
want to go at this point: I don't know any C and am not sure I would
want to ...

[consolidating]

Then you can either pass in `found` as argument instead of creating it
here, or you collect the passes in the calling code with the `update()`
method of `dict`. Something like this:

found = dict()
for single_pass in passes:   # note: ``pass`` itself is a reserved word
    # ...
    found.update(extract_data(names, na, cells))

Cool. I'll have to read more about dictionary methods.

It checks if `name` really starts with 'Name: '. This way I turned the
comment into code that checks the assertion in the comment.

Good idea to check, although this is actually only one of many
assumptions I make about the data -- but what happens if the assertion
fails? The program stops and the interpreter reports an AssertionError
on line whatever?

If you can make the source simpler and easier to understand by
using the `index()` method, use a list. :)


Understood; thanks for all the tips.
 
Dennis Lee Bieber

Sorry, translation problem: I am acquainted with Python's "for" -- if
far from fluent with it, so to speak -- but the PS operator that's most
similar (traversing a compound object, element by element, without any
explicit indexing or counting) is called "forall". PS's "for" loop is
similar to BASIC's (and ISTR Fortran's):

start_value increment end_value {procedure} for

I don't know the proper generic term -- "indexed loop"? -- but at any
rate it provides a counter, unlike Python's command of the same name.
The convention in Python is to use range() (or xrange()) to
generate a sequence of "index" values for the for statement to loop
over:

for i in range([start], end, [step]):

with the caveat that "end" will not be one of the values, start defaults
to 0, so if you supply range(4) the values become 0, 1, 2, 3 [ie, 4
values starting at 0].
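
A quick illustration in the interactive interpreter:

    >>> range(4)
    [0, 1, 2, 3]
    >>> range(2, 10, 3)
    [2, 5, 8]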
Working directly in the operand stack is efficient, but can make
interpretation by humans -- and debugging -- very difficult. So for the
sake of coder-friendliness it's generally advisable to use variables
(i.e. assign values to keys in a dictionary) in most cases instead of
passing values 'silently' via the stack. I'm beginning to realize that
for Python the situation is just about the opposite ...
You should see the confusions available in RPL (HP calculator
programming language).
Anyway, I have been reading the documentation on the website, but much
of the terminology is unfamiliar to me. When looking things up I seem to

You should have a full documentation set available from any standard
Python installer. Hitting the web shouldn't be needed. The ActiveState
2.4 Windows installer has all the documents in "Help" format (including
the text of "Dive into Python").
This is what my setup looks like -- mostly cribbed from _Dive Into
Python_ -- where "PageParser" is a class based on "SGMLParser":

from sgmllib import SGMLParser
from urllib import urlopen

# ...

def parse_page(url):
    usock = urlopen(url)
    parser = PageParser()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    return parser

Why return a reference to the parser here, when you've already
closed it two lines above?
# ...

captured = parse_page(base_url + suffix)

I only use "parse_page" the once at this stage, but my plan was to call
it repeatedly while varying "suffix" (depending on the data found by the
previous pass). On each pass the class will initialize itself, which is
why I was collecting the data into a 'standing' (global) dictionary. Are
you suggesting essentially that I'd do better to make the text-parsing
function into a method of "PageParser"? Can one add, to such a derived
class, methods that don't have prototypes in the parent?
The whole idea behind the SGML parser is that YOU add methods to
handle each tag type you need... Also, FYI, there IS an HTML parser (in
module htmllib) that is already derived from sgmllib.

class PageParser(SGMLParser):
    def __init__(self):
        # need to call the parent __init__, and then
        # initialize any needed attributes -- like someplace to collect
        # the parsed out cell data
        self.row = {}
        self.all_data = []

    def start_table(self, attrs):
        self.inTable = True
        .....

    def end_table(self):
        self.inTable = False
        .....

    def start_tr(self, attrs):
        if self.inRow:
            # unclosed row!
            self.end_tr()
        self.inRow = True
        self.cellCount = 0
        ...

    def end_tr(self):
        self.inRow = False
        # add/append collected row data to master stuff
        self.all_data.append(self.row)
        ...

    def start_td(self, attrs):
        if self.inCell:
            self.end_td()
        self.inCell = True
        ...

    def end_td(self):
        self.cellCount = self.cellCount + 1
        ...

    def handle_data(self, text):
        if self.inTable and self.inRow and self.inCell:
            if self.cellCount == 0:
                # first column stuff
                self.row["Epoch1"] = convert_if_needed(text)
            elif self.cellCount == 1:
                # second column stuff
                ...


Hope you don't have nested tables -- it could get ugly as this style
of parser requires the start_tag()/end_tag() methods to set instance
attributes for the purpose of tracking state needed in later methods
(notice the complexity of the handle_data() method just to ensure that
the text is from a table cell, and not some random text).

And somewhere before you close the parser, get a handle on the
collected data...


parsed_data = parser.all_data
parser.close()
return parsed_data

Why wouldn't one use a dictionary for that?
The overhead may not be needed... Tuples can also be used as the
keys /in/ a dictionary.
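
For example (made-up data):

    >>> zone = {("Springfield", "IL"): 5, ("Springfield", "MA"): 7}
    >>> zone[("Springfield", "MA")]
    7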

 
Dennis Lee Bieber

I think I get the general point, but couldn't "beep()" get at "x" even
without the "global" statement, since they're both in "spam()"?
No... beep is CALLED from spam, but does not have access to stuff
local to spam. The global statement says that "x" is the "x" that is
defined at the module level, not a local-only "x"
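
A minimal illustration (made-up functions):

    y = "module level"

    def beep():
        print y               # this is the module-level "y" -- beep()
                              # can NOT see spam()'s locals

    def spam():
        y = "local to spam"   # without "global", this is a brand-new local name
        beep()                # prints "module level", not "local to spam"

    spam()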

I'm using "if __name__ == 'main'" now, but only for test inputs (which

That won't work... It has to be "__main__"
 
Marc 'BlackJack' Rintsch

Marc 'BlackJack' Rintsch said:
Another issue is testing. If you rely on global names it's harder to test
individual functions. [...]

In programs without such global names you see quite clearly in the
``def`` line what the function expects as input.

Good points, although thorough commenting can go a long way to help on
both counts. In theory, at least ...

Won't work in practice so well. Say we have function `f()` and
document that it expects global name `a` to be set to something before
calling it. `f()` is used by other functions so we have to document `a` in
all other functions too. If we change `f()` to rely on a global name `b`
as well, we have to hunt down every function that calls `f()` and add the
documentation for `b` there too. That's a lot of work and error prone;
it's easy to end up with inconsistent or missing documentation this way.

To write or check documentation for a function you have to scan the whole
function body for data in global names and calls to other functions and
repeat the search there. If you don't let functions communicate via global
names you just have to look at the argument list to see the input sources.
I'm using "if __name__ == 'main'" now, but only for test inputs (which
will eventually be read from a config file or passed by the calling
script -- or something). I hadn't thought of putting code that actually
does something there. As for writing modules, that's way beyond where I
want to go at this point: I don't know any C and am not sure I would
want to ...

What does this have to do with C!? There's no specific C knowledge
involved here.
Good idea to check, although this is actually only one of many
assumptions I make about the data -- but what happens if the assertion
fails? The program stops and the interpreter reports an AssertionError
on line whatever?

Yes, you get an `AssertionError`:

In [314]: assert True

In [315]: assert False
---------------------------------------------------------------------------
<type 'exceptions.AssertionError'> Traceback (most recent call last)

/home/bj/<ipython console> in <module>()

<type 'exceptions.AssertionError'>:

Ciao,
Marc 'BlackJack' Rintsch
 
Steve Holden

Marc said:
Marc 'BlackJack' Rintsch said:
Another issue is testing. If you rely on global names it's harder to test
individual functions. [...]

In programs without such global names you see quite clearly in the
``def`` line what the function expects as input.
Good points, although thorough commenting can go a long way to help on
both counts. In theory, at least ...

Won't work in practice so well. Say we have function `f()` and
document that it expects global name `a` to be set to something before
calling it. `f()` is used by other functions so we have to document `a` in
all other functions too. If we change `f()` to rely on global name `b`
too we have to hunt down every function that calls `f()` and add the
documentation for `b` there too. It's much work and error prone. Easy to
get inconsistent or missing documentation this way.
Essentially what Marc is saying is that you want your functions to be as
loosely coupled to their environment as practically possible. See

http://en.wikipedia.org/wiki/Coupling_(computer_science)

[...]

regards
Steve
 
Steve Holden

Dennis said:
Sorry, translation problem: I am acquainted with Python's "for" -- if
far from fluent with it, so to speak -- but the PS operator that's most
similar (traversing a compound object, element by element, without any
explicit indexing or counting) is called "forall". PS's "for" loop is
similar to BASIC's (and ISTR Fortran's):

start_value increment end_value {procedure} for

I don't know the proper generic term -- "indexed loop"? -- but at any
rate it provides a counter, unlike Python's command of the same name.
The convention in Python is to use range() (or xrange()) to
generate a sequence of "index" values for the for statement to loop
over:

for i in range([start], end, [step]):

with the caveat that "end" will not be one of the values, start defaults
to 0, so if you supply range(4) the values become 0, 1, 2, 3 [ie, 4
values starting at 0].
If you have a sequence of values s and you want to associate each with
its index value as you loop over the sequence the easiest way to do this
is the enumerate built-in function:
>>> for x in enumerate(['this', 'is', 'a', 'list']):
...     print x
...
(0, 'this')
(1, 'is')
(2, 'a')
(3, 'list')

It's usually (though not always) much more convenient to bind the index
and the value to separate names, as in
>>> for i, v in enumerate(['this', 'is', 'a', 'list']):
...     print i, v
...
0 this
1 is
2 a
3 list

[...]
The whole idea behind the SGML parser is that YOU add methods to
handle each tag type you need... Also, FYI, there IS an HTML parser (in
module htmllib) that is already derived from sgmllib.

class PageParser(SGMLParser):
    def __init__(self):
        # need to call the parent __init__, and then
        # initialize any needed attributes -- like someplace to collect
        # the parsed out cell data
        self.row = {}
        self.all_data = []

    def start_table(self, attrs):
        self.inTable = True
        .....

    def end_table(self):
        self.inTable = False
        .....

    def start_tr(self, attrs):
        if self.inRow:
            # unclosed row!
            self.end_tr()
        self.inRow = True
        self.cellCount = 0
        ...

    def end_tr(self):
        self.inRow = False
        # add/append collected row data to master stuff
        self.all_data.append(self.row)
        ...

    def start_td(self, attrs):
        if self.inCell:
            self.end_td()
        self.inCell = True
        ...

    def end_td(self):
        self.cellCount = self.cellCount + 1
        ...

    def handle_data(self, text):
        if self.inTable and self.inRow and self.inCell:
            if self.cellCount == 0:
                # first column stuff
                self.row["Epoch1"] = convert_if_needed(text)
            elif self.cellCount == 1:
                # second column stuff
                ...


Hope you don't have nested tables -- it could get ugly as this style
of parser requires the start_tag()/end_tag() methods to set instance
attributes for the purpose of tracking state needed in later methods
(notice the complexity of the handle_data() method just to ensure that
the text is from a table cell, and not some random text).
There is, of course, nothing to stop you building a recursive data
structure, so that encountering a new opening tag such as <table> adds
another level to some stack-like object, and the corresponding closing
tag pops it off again, but this *does* add to the complexity somewhat.

It seems natural that more complex input possibilities lead to more
complex parsers.
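
A very rough sketch of that idea (illustrative only -- the class and
attribute names are invented, and plenty of edge cases are ignored):

    from sgmllib import SGMLParser

    class NestedTableParser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            self.table_stack = []       # one list of rows per currently open <table>
            self.finished_tables = []   # completed outermost tables end up here

        def start_table(self, attrs):
            self.table_stack.append([])          # push a new level

        def end_table(self):
            table = self.table_stack.pop()       # pop the level
            if self.table_stack and self.table_stack[-1]:
                # nested table: attach it to the last row of its parent
                self.table_stack[-1][-1].append(table)
            else:
                self.finished_tables.append(table)

        def start_tr(self, attrs):
            if self.table_stack:
                self.table_stack[-1].append([])  # start a new row in the current table

        def handle_data(self, text):
            # collect cell text into the current row, if any
            if self.table_stack and self.table_stack[-1]:
                self.table_stack[-1][-1].append(text)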
And somewhere before you close the parser, get a handle on the
collected data...


parsed_data = parser.all_data
parser.close()
return parsed_data


The overhead may not be needed... Tuples can also be used as the
keys /in/ a dictionary.
regards
Steve
 
