Need better string methods

David MacQuigg

I'm considering Python as a replacement for the highly specialized
scripting languages used in the electronics design industry. Design
engineers are typically not programmers, and they avoid working with
these complex proprietary languages, preferring instead to use GUI
tools that are poorly implemented and very limited in the problems
they can solve.

I am convinced that Python can do anything that can be done by these
CPL's, but I know it will be an uphill battle getting design engineers
to learn yet another scripting language. The pitch will be 1) What you
need to solve most of your design problems can be learned in two days.
Then you can decide if you want to learn the full language. 2) Learn
this one and you will have a language applicable to not just
controlling one company's EDA tools, but almost any scripting or
computational problem you may encounter. 3) Python may well be the
ultimate computer language for non-programmer technical professionals.
You won't have to learn another in the future.

The resistance will come from people who throw at us little bits and
pieces of code that can be done more easily in their chosen CPL.
String processing, for example, is one area where we may face some
difficulty. Here is a typical line of garbage from a statefile
revision control system (simplified to eliminate some items that pose
no new challenges):

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

The problem is to break this into its component parts, and eliminate
spaces and other gradoo. The cleaned-up list should look like:

['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']

# Ruby:
# clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)

This is pretty straightforward once you know what each of the methods
does.

# Current best Python:
clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

This is too much to expect of a non-programmer, even one who
understands the methods. The usability problems are 1) the three
variations in syntax (methods, a list comprehension, and what *looks
like* a join function prefixed by some odd punctuation), 2) the
order in which each step is entered at the keyboard (I can show
this in step-by-step detail if anyone doesn't understand what I mean),
and 3) proper placement of parens, which can be confusing.

What we need is a syntax that flows in the same order you have to
think about the problem, stopping at each step to visualize an
intermediate result, then typing the next operation, not mousing back
to insert a function or the start of a comprehension, and not screwing
up the parentheses. (My initial version had the closing paren of
the join method *after* the following strip, which lucky-for-me popped
an attribute error ... not-so-lucky could work OK on this example, but
mess up in subtle ways on future data.)

# Subclassing a list:
clean = [MyList(t.split()).join().strip('.') for t in line.split('|')]

The MyList.join method works as expected. I haven't figured out yet
how to add a map method to MyList, but already I can guess this is not
leading to a clean syntax. Having to insert 'MyList' everywhere is as
bad as the original syntax. Maybe someone can help me with the
Python. I would love it if there were a simple solution not requiring
changes to Python.

# Possible future Python:
# clean = line.split('|').map().split().join().strip('.')

The map method takes a list in the "front door" and feeds items from
the list one-at-a-time to the method waiting at its "back door". The
join method expects a list of strings at its front door and delivers a
single string at its back door. If something other than a space is
needed to join the strings, that can be provided via the (side-door)
of the join method.
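
A rough sketch of how the front-door/back-door idea could be approximated in today's Python, without changing the language. The names Chain, _Each and done are hypothetical, made up for this illustration, and the chain still has to start with a wrapper call and end with done():

```python
class Chain(list):
    """A list whose methods can be chained left to right (hypothetical)."""

    def map(self):
        # Hand the items to a proxy that applies the following calls to each one.
        return _Each(self)

    def join(self, sep=' '):
        # Front door: a list of strings. Back door: a single string.
        return sep.join(self)


class _Each:
    """Proxy that forwards every method call to each item in turn."""

    def __init__(self, items):
        self._items = list(items)

    def __getattr__(self, name):
        # Only reached for names not defined on _Each itself, e.g. split or strip.
        def call(*args, **kwargs):
            results = []
            for item in self._items:
                result = getattr(item, name)(*args, **kwargs)
                # Re-wrap plain lists so that .join() stays available.
                if isinstance(result, list):
                    result = Chain(result)
                results.append(result)
            return _Each(results)
        return call

    def done(self):
        # Leave the chain and return an ordinary list.
        return list(self._items)


line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"
clean = Chain(line.split('|')).map().split().join().strip('.').done()
print(clean)
```

The steps are typed in the order they are thought of, but the wrapper name at the front is the same wart as inserting 'MyList' everywhere.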

-- Dave
 
William Park

David MacQuigg said:
The resistance will come from people who throw at us little bits and
pieces of code that can be done more easily in their chosen CPL.
String processing, for example, is one area where we may face some
difficulty. Here is a typical line of garbage from a statefile
revision control system (simplified to eliminate some items that pose
no new challenges):

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

The problem is to break this into its component parts, and eliminate
spaces and other gradoo. The cleaned-up list should look like:

['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']

# Ruby:
# clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)

This is pretty straight-forward once you know what each of the methods
do.

# Current best Python:
clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

Both Bash shell and Python can split based on regular expression.
However, shell is not a bad alternative here:
tr -s ' \t' ' ' | sed -E -e 's/ ?\| ?/|/g' -e 's/^ //' -e 's/ $//' |
while IFS='|' read -a clean; do
    ...
done
 
Christian Tismer

William Park wrote:

....
# Current best Python:
clean = [' '.join(t.split()).strip('.') for t in line.split('|')]


Both Bash shell and Python can split based on regular expression.
However, shell is not a bad alternative here:
tr -s ' \t' ' ' | sed -E -e 's/ ?\| ?/|/g' -e 's/^ //' -e 's/ $//' |
while IFS='|' read -a clean; do
    ...
done

But isn't that regex expression much harder to understand
for part-time programmers than the few Python methods?

(Quoting David's post)
"""
clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

This is too much to expect of a non-programmer, even one who
understands the methods. The usability problems are 1) the three
variations in syntax ( methods, a list comprehension, and what *looks
like* a join function prefixed by some odd punctuation), and 2) The
order in which each step is entered at the keyboard. ( I can show
this in step-by-step detail if anyone doesn't understand what I mean.)
3) Proper placement of parens can be confusing.
"""

Right. That is quite a few concepts in one line. It might be short
and efficient, but it is obfuscated for the non-programmer.
Isn't this more readable?

pieces = line.split("|")  # break at the bars
nodots = [piece.strip(".") for piece in pieces]  # remove leading or trailing dots
clean = [" ".join(words.split()) for words in nodots]  # normalise spaces

Well, there is still some complexity with the join/split mess.
But still more readable than the regex?

--
Christian Tismer :^) <mailto:[email protected]>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
 
William Park

Christian Tismer said:
But isn't that regex expression much harder to understand
for part-time programmers than the few Python methods?

But, OP's audience is not part-time programmers. My guess is that they
immediately abandon shell and jump to proprietary languages. OP may
have better luck if they stick with shell a bit longer, and then jump to
Python as last resort.

As for regex... it's usually easier to set up the data to be cut,
instead of cutting first and then patching up the pieces.
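
In Python terms, that "set up first, cut once" order might look like the sketch below. The variable name tidy is only illustrative, and the pattern is borrowed from the Ruby example earlier in the thread:

```python
import re

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

# Step 1: tidy the whole line first (collapse whitespace, drop leading dots).
tidy = ' '.join(line.split()).lstrip('.')

# Step 2: a single cut on the bars, swallowing any spaces around them.
clean = re.split(r'\s*\|\s*', tidy)
print(clean)
```

Nothing has to be patched up after the split, because every field is already clean when it is cut.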
 
rzed

[...] Here is a typical line of garbage from a
statefile revision control system (simplified to eliminate some
items that pose no new challenges):

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

The problem is to break this into its component parts, and
eliminate spaces and other gradoo.

[UTTERLY OFFTOPIC QUESTION]:
This "gradoo" of which you speak ... where did you learn the word? I
only ask, because I know (and use) "gradeau", pronounced like I
imagine "gradoo" would be pronounced, to mean miscellaneous cruft or
garbage ... but I've only ever heard it used by a small group of
people who, as far as I know, originated the use of the word in Green
Bay, Wisconsin, in the mid-1970's.

[returning now to regular programming...]
 
Garry Knight

This "gradoo" of which you speak ... where did you learn the word? I
only ask, because I know (and use) "gradeau", pronounced like I
imagine "gradoo" would be pronounced, to mean miscellaneous cruft or
garbage

For your interest:

http://www.urbandictionary.com/define.php?term=gradeau
"gradeau
anything nasty that is small and slimy on you or anything else, ie., eye
slime from a dog or mucous of any sort
ewww, there's gradeau on my arm ewww"

http://www.collectivecopies.com/informative.htm
"Funky Gradoo
Colorful Southern term for schmutz (Yiddish), crud (Yankee) or other
unwanted marks, spots or streaks on a copy."

[And now, back to your regular programming...]
 
rzed

For your interest:

http://www.urbandictionary.com/define.php?term=gradeau
"gradeau
anything nasty that is small and slimy on you or anything
else, ie., eye
slime from a dog or mucous of any sort
ewww, there's gradeau on my arm ewww"

http://www.collectivecopies.com/informative.htm
"Funky Gradoo
Colorful Southern term for schmutz (Yiddish), crud (Yankee) or
other unwanted marks, spots or streaks on a copy."


Thank you. Yes, I saw the urbandictionary entry, and I've seen
about a hundred uses of the term in the sense I'm talking about by
Googling around, mostly in Google news.

I'm interested in finding out where it came from and when it
originated. I saw one post that claimed it was a Cajun term, which
I suppose could be the "colorful Southern term" mentioned above,
but I am not sure how to verify that. I haven't seen any dated use
before the early 1990's.
[And now, back to your regular programming...]
 
Christian Tismer

William said:
But, OP's audience is not part-time programmers. My guess is that they
immediately abandon shell and jump to proprietary languages. OP may
have better luck if they stick with shell a bit longer, and then jump to
Python as last resort.

I have no idea what OP is.
As for regex... it's usually easier to set up the data to be cut,
instead of cutting first and then patching up the pieces.

Why? One big, undecipherable regex is better than a stepwise
reduction of the problem? Not mentioning that the latter is
probably faster, but...
Can you enlighten me why you think you can claim that,
or is this going to become a thread like "PHP is better
than Python for web/database stuff"?

yes-I-meant-to-be-friendly -- chris

 
Skip Montanaro

David> I am convinced that Python can do anything that can be done by
David> these CPL's, but I know it will be an uphill battle getting
David> design engineers to learn yet another scripting language....

David> The resistance will come from people who throw at us little bits
David> and pieces of code that can be done more easily in their chosen
David> CPL.

Then throw little bits and pieces of code back at them that can be done more
easily in Python. <0.5 wink>

David> String processing, for example, is one area where we may face
David> some difficulty.

...

David> # Ruby:
David> # clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)

David> This is pretty straight-forward once you know what each of the
David> methods do.

David> # Current best Python:
David> clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

David> This is too much to expect of a non-programmer, even one who
David> understands the methods.

...

My arguments from the "Zen of Python" would be:

Beautiful is better than ugly.
Simple is better than complex.
Sparse is better than dense.
Readability counts.

These aphorisms are especially important for non-programmers. They simply
aren't going to be able to remember what the above Ruby or Python code does
in six months without at least a little bit of study, especially if it's
buried in other similar code. That study will distract them, however
momentarily, from the actual task at hand, breaking their concentration
and lowering their productivity.

To that end, my proposed solution for your string smashing problem would be
something like:

import csv

for row in csv.reader(open("gradoo.csv"), delimiter='|'):
    print(row)
    # elide spaces
    row = [" ".join(s.split()) for s in row]
    print(row)
    # trim leading dots
    row = [s.lstrip(".") for s in row]
    print(row)

given that gradoo.csv contains the line from your example. The advantages
that I see are:

* it's got some simple comments which identify the work being done

* it's easier to add new operations if needed in the future

* avoiding long chains of string methods makes the code easier to read

Skip
 
Stephen Horne

# Ruby:
# clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)

This is pretty straight-forward once you know what each of the methods
do.

# Current best Python:
clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

So what you are saying is that non-programmers just naturally
understand what "/\s*\|\s*/" means!

I kind of agree with you about the join method - I far prefer the now
deprecated function. But it's not much of a problem - you don't _have_
to use method-call syntax for Python, just get the unbound method from
the class and call it with the object as the first parameter...
>>> str.join(' ', ['a', 'b', 'c'])
'a b c'

I guess I see the advantage in the Ruby form. It can of course be
replicated in Python using a library, but being able to handle the
task as neatly by default would be a plus.

So, how about this...
'/bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n'
'/bgref/stats.stf| SPICE | 3.2.7 | John Anderson'
['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']



Using ';' and '_', you can chain any functions or methods you want.
The downsides are (1) it only works at the command line, and (2) you
get intermediate results displayed.

A temporary variable can handle both issues, of course...
['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']
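
The input lines of that session are not shown above. As a sketch only, assuming the steps were strip, lstrip, squeeze and split in that order, a temporary-variable version could read as follows (the name tmp is illustrative):

```python
import re

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

tmp = line.strip()              # drop the trailing space and newline
tmp = tmp.lstrip('.')           # drop the leading dots
tmp = re.sub(' +', ' ', tmp)    # squeeze runs of spaces into one
clean = re.split(r' ?\| ?', tmp)  # cut at the bars, eating one space each side
print(clean)
```

This works in a script as well as at the prompt, and only the final result is displayed.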


or, to save some hassle...
.... return re.sub (' +', ' ', p)
....['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']


On this basis, perhaps it would be useful to support the '_' variable
outside of the command line, and maybe to suppress all but the last
result when ';' is used on the command line.

OTOH, as you suggest, maybe we could use some extra string methods.
With an equivalent to the Ruby 'squeeze' and support for regular
expression methods, we could write...

line.strip().lstrip('.').squeeze().resplit(' ?\| ?')

Which is very much like the Ruby example.
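
Neither squeeze nor resplit exists on Python strings, so the line above is hypothetical. A minimal sketch of how it could be tried out with a str subclass, where ChainStr is an invented name and strip/lstrip are overridden only so that the result stays chainable:

```python
import re


class ChainStr(str):
    """str with two extra, hypothetical methods: squeeze and resplit."""

    def strip(self, chars=None):
        # Re-wrap the result; plain str.strip would return a plain str.
        return ChainStr(str.strip(self, chars))

    def lstrip(self, chars=None):
        return ChainStr(str.lstrip(self, chars))

    def squeeze(self):
        # Collapse runs of spaces into one (a narrow take on Ruby's squeeze).
        return ChainStr(re.sub(' +', ' ', self))

    def resplit(self, pattern):
        # Split on a regular expression instead of a fixed separator.
        return re.split(pattern, self)


line = ChainStr("..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n")
clean = line.strip().lstrip('.').squeeze().resplit(r' ?\| ?')
print(clean)
```

The cost is the same as with any wrapper: the string has to be converted to ChainStr before the chain can start.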

Finally, it seems to me that this kind of tidy-and-split is probably a
common requirement. The split is easy enough, but after pondering
Robert Brewer's argument I wondered if maybe a specialised tidying
class could do the job...

import re

class cleaner:
    def __init__(self):
        # one list of steps per instance, so cleaners don't share state
        self.steps = []

    def lstrip(self, *args):
        self.steps.append(lambda s: s.lstrip(*args))
        return self

    def rstrip(self, *args):
        self.steps.append(lambda s: s.rstrip(*args))
        return self

    def strip(self, *args):
        self.steps.append(lambda s: s.strip(*args))
        return self

    def squeeze(self):
        # collapse runs of spaces into a single space
        pat = re.compile(' +')
        self.steps.append(lambda s: pat.sub(' ', s))
        return self

    def resub(self, regex, rep):
        pat = re.compile(regex)
        self.steps.append(lambda s: pat.sub(rep, s))
        return self

    def clean(self, p):
        # apply the recorded steps in the order they were added
        for step in self.steps:
            p = step(p)
        return p

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

mycleaner = cleaner().lstrip(".").strip() \
    .squeeze().resub(r' ?\| ?', '|')

print(mycleaner.clean(line).split("|"))
 
Stephen Horne

My arguments from the "Zen of Python" would be:

Beautiful is better than ugly.
Simple is better than complex.
Sparse is better than dense.
Readability counts.

Sparse can certainly be better than dense, but it is not an absolute.
With any style rule there is a need to balance issues and to use
common sense. If code can be made denser while still being readable
then more functionality can be viewed on screen at once - a major
benefit in readability and understanding as the more you can see, the
less you have to remember.

The Ruby code was IMO easier to understand than David's 'best' Python
(except for the regular expression). The left-to-right sequencing is
really no different than top-to-bottom sequencing in readability
terms. And adding comments is pointless when those comments just
duplicate what a standard method name already tells you - worse than
pointless, in fact, as it obscures the code that you're trying to
read. Good names are better than compensatory comments, and anyone
claiming to be a programmer should know the everyday names that are
used in his chosen language.

I know that isn't what your comments did, but my point is that the
Ruby example really doesn't need them. The nearest equivalent Python
code requires a temporary variable and either semicolons or splitting
over a few lines - the latter is probably better, though I adopted the
former in my earlier post. Simply breaking the code up, though,
provides no real readability benefits.

Put it this
way. How
much am I
improving
the
readability
of this
paragraph
by making
it stupidly
narrow like
this?

Splitting a perfectly clear line of code over several lines is exactly
the same thing and, as I said, the only readability issue that I could
see in the Ruby code was the regular expression.
 
benjamin schollnick

David MacQuigg said:
The resistance will come from people who throw at us little bits and
pieces of code that can be done more easily in their chosen CPL.
String processing, for example, is one area where we may face some
difficulty. Here is a typical line of garbage from a statefile
revision control system (simplified to eliminate some items that pose
no new challenges):

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

The problem is to break this into its component parts, and eliminate
spaces and other gradoo. The cleaned-up list should look like:

['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']

# Ruby:
# clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)

This is pretty straight-forward once you know what each of the methods
do.

# Current best Python:
clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

This is too much to expect of a non-programmer, even one who
understands the methods. The usability problems are 1) the three
variations in syntax ( methods, a list comprehension, and what *looks
like* a join function prefixed by some odd punctuation), and 2) The
order in which each step is entered at the keyboard. ( I can show
this in step-by-step detail if anyone doesn't understand what I mean.)
3) Proper placement of parens can be confusing.

David,

I think you're coming at this too much like a programmer... |-)

You're right, this is too complex for a non-programmer to expect
to simply use...

So redefine the problem, or look at it from a 90 degree angle.

If making the users understand the syntax is too complex, then
redefine the syntax.

Define a set of commands, and make them function wrappers around
your code.
line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

I am assuming you're running into these lines on a regular basis, so
make a wrapper around your Python function... Call it "Cleanup" or
"Parse_bar_line_string" or something that makes sense to your
users, and have them call that function....
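
Something along these lines, as a sketch. The function name parse_bar_line and its junk parameter are just placeholders, and the body is the "current best Python" line from the original post:

```python
def parse_bar_line(line, junk='.'):
    """Split a '|'-separated line and return a list of cleaned-up fields."""
    fields = line.split('|')
    # Collapse whitespace inside each field, then trim junk characters
    # (dots by default) from both ends.
    return [' '.join(field.split()).strip(junk) for field in fields]


line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"
print(parse_bar_line(line))
```

The users only ever see the one call, and the one-liner they would struggle with stays hidden inside it.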

- Benjamin
 
