Need better string methods

Discussion in 'Python' started by David MacQuigg, Mar 6, 2004.

  1. I'm considering Python as a replacement for the highly specialized
    scripting languages used in the electronics design industry. Design
    engineers are typically not programmers, and they avoid working with
    these complex proprietary languages, preferring instead to use GUI
    tools that are poorly implemented and very limited in the problems
    they can solve.

    I am convinced that Python can do anything that can be done by these
    CPL's, but I know it will be an uphill battle getting design engineers
    to learn yet another scripting language. The pitch will be 1) What you
    need to solve most of your design problems can be learned in two days.
    Then you can decide if you want to learn the full language. 2) Learn
    this one and you will have a language applicable to not just
    controlling one company's EDA tools, but almost any scripting or
    computational problem you may encounter. 3) Python may well be the
    ultimate computer language for non-programmer technical professionals.
    You won't have to learn another in the future.

    The resistance will come from people who throw at us little bits and
    pieces of code that can be done more easily in their chosen CPL.
    String processing, for example, is one area where we may face some
    difficulty. Here is a typical line of garbage from a statefile
    revision control system (simplified to eliminate some items that pose
    no new challenges):

    line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

    The problem is to break this into its component parts, and eliminate
    spaces and other gradoo. The cleaned-up list should look like:

    ['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']

    # Ruby:
    # clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)

    This is pretty straight-forward once you know what each of the methods
    do.

    # Current best Python:
    clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

    This is too much to expect of a non-programmer, even one who
    understands the methods. The usability problems are 1) the three
    variations in syntax (methods, a list comprehension, and what *looks
    like* a join function prefixed by some odd punctuation), 2) the
    order in which each step is entered at the keyboard (I can show
    this in step-by-step detail if anyone doesn't understand what I mean),
    and 3) proper placement of parens can be confusing.

    What we need is a syntax that flows in the same order you have to
    think about the problem, stopping at each step to visualize an
    intermediate result, then typing the next operation, not mousing back
    to insert a function or the start of a comprehension, and not screwing
    up the parentheses. (My initial version had the closing paren of
    the join method *after* the following strip, which lucky-for-me popped
    an attribute error ... not-so-lucky could work OK on this example, but
    mess up in subtle ways on future data.)
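
    To make the parenthesis hazard concrete, here is a small sketch
    (Python 3 syntax, variable names illustrative only) that puts the
    working expression next to the misplaced-paren variant described
    above. The variant fails because strip ends up being called on the
    list returned by split() rather than on a string.

```python
line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

# Correct: join(...) closes before .strip, so strip is applied to a string.
clean = [' '.join(t.split()).strip('.') for t in line.split('|')]
print(clean)

# Misplaced paren: .strip is now applied to the list returned by split().
try:
    broken = [' '.join(t.split().strip('.')) for t in line.split('|')]
except AttributeError as err:
    print("AttributeError:", err)
```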

    # Subclassing a list:
    clean = [MyList(t.split()).join().strip('.') for t in line.split('|')]

    The MyList.join method works as expected. I haven't figured out yet
    how to add a map method to MyList, but already I can guess this is not
    leading to a clean syntax. Having to insert 'MyList' everywhere is as
    bad as the original syntax. Maybe someone can help me with the
    Python. I would love it if there was a simple solution not requiring
    changes to Python.
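
    The post does not show how MyList is defined, so the following is
    only a guess at what such a subclass could look like, with a
    hypothetical map method added (Python 3 syntax). Both the class body
    and the way map is used are assumptions for illustration.

```python
class MyList(list):
    """Hypothetical list subclass with chainable helpers."""

    def join(self, sep=' '):
        # Same result as sep.join(self), but reads left to right.
        return sep.join(self)

    def map(self, func):
        # Apply func to every item and keep the result chainable.
        return MyList(func(item) for item in self)


line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

clean = MyList(line.split('|')).map(
    lambda t: MyList(t.split()).join().strip('.'))
print(clean)
```

    Note that 'MyList' still has to be written out at every step where a
    plain list comes back, which is the objection raised above.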

    # Possible future Python:
    # clean = line.split('|').map().split().join().strip('.')

    The map method takes a list in the "front door" and feeds items from
    the list one-at-a-time to the method waiting at its "back door". The
    join method expects a list of strings at its front door and delivers a
    single string at its back door. If something other than a space is
    needed to join the strings, that can be provided via the (side-door)
    of the join method.
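
    No such map method exists on Python lists. As a rough sketch of the
    "front door / back door" idea, the proxy class below forwards each
    method call to every item it holds. The name Each, its items
    attribute, and the decision to make join a special case are all
    assumptions made for this illustration (Python 3 syntax).

```python
class Each:
    """Hypothetical proxy: forwards each method call to every item."""

    def __init__(self, items):
        self.items = list(items)

    def __getattr__(self, name):
        # Only reached for names Each does not define itself, e.g. split.
        def call(*args):
            return Each(getattr(item, name)(*args) for item in self.items)
        return call

    def join(self, sep=' '):
        # Items are lists of words at this point; join each into a string.
        return Each(sep.join(words) for words in self.items)


line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

clean = Each(line.split('|')).split().join().strip('.').items
print(clean)
```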

    -- Dave
     
    David MacQuigg, Mar 6, 2004
    #1

  2. David MacQuigg

    William Park Guest

    David MacQuigg <> wrote:
    > The resistance will come from people who throw at us little bits and
    > pieces of code that can be done more easily in their chosen CPL.
    > String processing, for example, is one area where we may face some
    > difficulty. Here is a typical line of garbage from a statefile
    > revision control system (simplified to eliminate some items that pose
    > no new challenges):
    >
    > line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"
    >
    > The problem is to break this into its component parts, and eliminate
    > spaces and other gradoo. The cleaned-up list should look like:
    >
    > ['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']
    >
    > # Ruby:
    > # clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)
    >
    > This is pretty straight-forward once you know what each of the methods
    > do.
    >
    > # Current best Python:
    > clean = [' '.join(t.split()).strip('.') for t in line.split('|')]


    Both Bash shell and Python can split based on regular expression.
    However, shell is not a bad alternative here:
    tr -s ' \t' ' ' | sed -e 's/ *| */|/g' -e 's/^ //' -e 's/ $//' |
    while IFS='|' read -a clean; do
    ...
    done
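
    For comparison, the regular-expression split on the Python side
    might look like the sketch below (Python 3 syntax). It borrows the
    pattern from the Ruby example; stripping the line before splitting
    is an assumption about the order of operations.

```python
import re

line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

# Trim outer whitespace and leading dots, then cut at each bar,
# swallowing any whitespace around it.
clean = re.split(r'\s*\|\s*', line.strip().lstrip('.'))
print(clean)
```

    Unlike the tr step above, this does not collapse runs of spaces
    inside a field.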

    --
    William Park, Open Geometry Consulting, <>
    Linux solution for data processing and document management.
     
    William Park, Mar 6, 2004
    #2

  3. William Park wrote:

    ....

    >># Current best Python:
    >>clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

    >
    >
    > Both Bash shell and Python can split based on regular expression.
    > However, shell is not a bad alternative here:
    > tr -s ' \t' ' ' | sed -e 's/ *| */|/g' -e 's/^ //' -e 's/ $//' |
    > while IFS='|' read -a clean; do
    > ...
    > done


    But isn't that regex expression much harder to understand
    for part-time programmers than the few Python methods?

    (Quoting David's post)
    """
    clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

    This is too much to expect of a non-programmer, even one who
    understands the methods. The usability problems are 1) the three
    variations in syntax ( methods, a list comprehension, and what *looks
    like* a join function prefixed by some odd punctuation), and 2) The
    order in which each step is entered at the keyboard. ( I can show
    this in step-by-step detail if anyone doesn't understand what I mean.)
    3) Proper placement of parens can be confusing.
    """

    Right. That is quite a few concepts in one line, and it
    might be short and efficient, but it is obfuscated for the
    non-programmer.
    Isn't this more readable? :

    pieces = line.split("|")   # break at the bars
    # remove leading or trailing dots
    nodots = [piece.strip(".") for piece in pieces]
    clean = [" ".join(words.split()) for words in nodots]   # normalise spaces

    Well, there is still some complexity with the join/split mess.
    But still more readable than the regex?

    --
    Christian Tismer :^) <mailto:>
    Mission Impossible 5oftware : Have a break! Take a ride on Python's
    Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
    14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
    work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
    PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
    whom do you want to sponsor today? http://www.stackless.com/
     
    Christian Tismer, Mar 6, 2004
    #3
  4. David MacQuigg

    William Park Guest

    Christian Tismer <> wrote:
    > William Park wrote:
    > > Both Bash shell and Python can split based on regular expression.
    > > However, shell is not a bad alternative here:
    > > tr -s ' \t' ' ' | sed -e 's/ *| */|/g' -e 's/^ //' -e 's/ $//' |
    > > while IFS='|' read -a clean; do
    > > ...
    > > done

    >
    > But isn't that regex expression much harder to understand
    > for part-time programmers than the few Python methods?


    But, OP's audience is not part-time programmers. My guess is that they
    immediately abandon shell and jump to proprietary languages. OP may
    have better luck if they stick with shell a bit longer, and then jump to
    Python as last resort.

    As for regex... it's usually easier to set up the data to be cut,
    instead of cutting first and then patching up the pieces.

    --
    William Park, Open Geometry Consulting, <>
    Linux solution for data processing and document management.
     
    William Park, Mar 6, 2004
    #4
  5. David MacQuigg

    rzed Guest

    Utterly OT:[Was Re: Need better string methods]

    David MacQuigg <> wrote in
    news::

    > [...] Here is a typical line of garbage from a
    > statefile revision control system (simplified to eliminate some
    > items that pose no new challenges):
    >
    > line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson
    > \n"
    >
    > The problem is to break this into its component parts, and
    > eliminate spaces and other gradoo.


    [UTTERLY OFFTOPIC QUESTION]:
    This "gradoo" of which you speak ... where did you learn the word? I
    only ask, because I know (and use) "gradeau", pronounced like I
    imagine "gradoo" would be pronounced, to mean miscellaneous cruft or
    garbage ... but I've only ever heard it used by a small group of
    people who, as far as I know, originated the use of the word in Green
    Bay, Wisconsin, in the mid-1970's.

    [returning now to regular programming...]

    --
    rzed
     
    rzed, Mar 6, 2004
    #5
  6. David MacQuigg

    Garry Knight Guest

    Re: Utterly OT:[Was Re: Need better string methods]

    In message <Xns94A4C17566601rzed@63.223.5.95>, rzed wrote:

    > This "gradoo" of which you speak ... where did you learn the word? I
    > only ask, because I know (and use) "gradeau", pronounced like I
    > imagine "gradoo" would be pronounced, to mean miscellaneous cruft or
    > garbage


    For your interest:

    http://www.urbandictionary.com/define.php?term=gradeau
    "gradeau
    anything nasty that is small and slimy on you or anything else, ie., eye
    slime from a dog or mucous of any sort
    ewww, there's gradeau on my arm ewww"

    http://www.collectivecopies.com/informative.htm
    "Funky Gradoo
    Colorful Southern term for schmutz (Yiddish), crud (Yankee) or other
    unwanted marks, spots or streaks on a copy."

    [And now, back to your regular programming...]

    --
    Garry Knight
    ICQ 126351135
    Linux registered user 182025
     
    Garry Knight, Mar 7, 2004
    #6
  7. David MacQuigg

    rzed Guest

    Re: Utterly OT:[Was Re: Need better string methods]

    Garry Knight <> wrote in
    news::

    > In message <Xns94A4C17566601rzed@63.223.5.95>, rzed wrote:
    >
    >> This "gradoo" of which you speak ... where did you learn the
    >> word? I only ask, because I know (and use) "gradeau",
    >> pronounced like I imagine "gradoo" would be pronounced, to mean
    >> miscellaneous cruft or garbage

    >
    > For your interest:
    >
    > http://www.urbandictionary.com/define.php?term=gradeau
    > "gradeau
    > anything nasty that is small and slimy on you or anything
    > else, ie., eye
    > slime from a dog or mucous of any sort
    > ewww, there's gradeau on my arm ewww"
    >
    > http://www.collectivecopies.com/informative.htm
    > "Funky Gradoo
    > Colorful Southern term for schmutz (Yiddish), crud (Yankee) or
    > other unwanted marks, spots or streaks on a copy."



    Thank you. Yes, I saw the urbandictionary entry, and I've seen
    about a hundred uses of the term in the sense I'm talking about by
    Googling around, mostly in Google news.

    I'm interested in finding out where it came from and when it
    originated. I saw one post that claimed it was a Cajun term, which
    I suppose could be the "colorful Southern term" mentioned above,
    but I am not sure how to verify that. I haven't seen any dated use
    before the early 1990's.

    >
    > [And now, back to your regular programming...]
    >



    --
    rzed
     
    rzed, Mar 7, 2004
    #7
  8. William Park wrote:

    > Christian Tismer <> wrote:
    >
    >>William Park wrote:
    >>
    >>>Both Bash shell and Python can split based on regular expression.
    >>>However, shell is not a bad alternative here:
    >>> tr -s ' \t' ' ' | sed -e 's/ *| */|/g' -e 's/^ //' -e 's/ $//' |
    >>> while IFS='|' read -a clean; do
    >>> ...
    >>> done

    >>
    >>But isn't that regex expression much harder to understand
    >>for part-time programmers than the few Python methods?

    >
    >
    > But, OP's audience is not part-time programmers. My guess is that they
    > immediately abandon shell and jump to proprietary languages. OP may
    > have better luck if they stick with shell a bit longer, and then jump to
    > Python as last resort.


    I have no idea what OP is.

    > As for regex... it's usually easier to set up the data to be cut,
    > instead of cutting first and then patching up the pieces.


    Why? One big, undecipherable regex is better than a stepwise
    reduction of the problem? Not mentioning that the latter is
    probably faster, but...
    Can you enlighten me why you think you can claim that,
    or is this going to become a thread like "PHP is better
    than Python for web/database stuff"?

    yes-I-meant-to-be-friendly -- chris

    --
    Christian Tismer :^) <mailto:>
    Mission Impossible 5oftware : Have a break! Take a ride on Python's
    Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
    14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
    work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
    PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
    whom do you want to sponsor today? http://www.stackless.com/
     
    Christian Tismer, Mar 7, 2004
    #8
  9. David MacQuigg

    Terry Reedy Guest

    "Christian Tismer" <> wrote in message
    news:...
    > I have no idea what OP is.


    Original Poster.
     
    Terry Reedy, Mar 7, 2004
    #9
  10. David MacQuigg

    William Park Guest

    Christian Tismer <> wrote:
    > I have no idea what OP is.


    OP = Original Poster of a given thread

    --
    William Park, Open Geometry Consulting, <>
    Linux solution for data processing and document management.
     
    William Park, Mar 7, 2004
    #10
  11. David> I am convinced that Python can do anything that can be done by
    David> these CPL's, but I know it will be an uphill battle getting
    David> design engineers to learn yet another scripting language....

    David> The resistance will come from people who throw at us little bits
    David> and pieces of code that can be done more easily in their chosen
    David> CPL.

    Then throw little bits and pieces of code back at them that can be done more
    easily in Python. <0.5 wink>

    David> String processing, for example, is one area where we may face
    David> some difficulty.

    ...

    David> # Ruby:
    David> # clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)

    David> This is pretty straight-forward once you know what each of the
    David> methods do.

    David> # Current best Python:
    David> clean = [' '.join(t.split()).strip('.') for t in line.split('|')]

    David> This is too much to expect of a non-programmer, even one who
    David> understands the methods.

    ...

    My arguments from the "Zen of Python" would be:

    Beautiful is better than ugly.
    Simple is better than complex.
    Sparse is better than dense.
    Readability counts.

    These aphorisms are especially important for non-programmers. They simply
    aren't going to be able to remember what the above Ruby or Python code does
    in six months without at least a little bit of study, especially if it's
    buried in other similar code. That study will distract them, however
    momentarily, from the actual task at hand. That breaks their chain of
    concentration on the actual task at hand and lowers their productivity.

    To that end, my proposed solution for your string smashing problem would be
    something like:

    import csv

    for row in csv.reader(file("gradoo.csv"), delimiter='|'):
        print row
        # elide spaces
        row = [" ".join(s.split()) for s in row]
        print row
        # trim leading ...
        row = [s.lstrip(".") for s in row]
        print row

    given that gradoo.csv contains the line from your example. The advantages
    that I see are:

    * it's got some simple comments which identify the work being done

    * it's easier to add new operations if needed in the future

    * avoiding long chains of string methods makes the code easier to read
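
    The same approach in Python 3 syntax, as a sketch: file() and the
    print statement are not available there, and an in-memory
    io.StringIO stands in for gradoo.csv (an assumption, so that the
    example is self-contained). The rows list is added only to collect
    the results.

```python
import csv
import io

# Stand-in for gradoo.csv, holding the example line from the thread.
data = io.StringIO("..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n")

rows = []
for row in csv.reader(data, delimiter='|'):
    # elide spaces
    row = [" ".join(s.split()) for s in row]
    # trim leading dots
    row = [s.lstrip(".") for s in row]
    rows.append(row)

print(rows)
```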

    Skip
     
    Skip Montanaro, Mar 7, 2004
    #11
  12. On Sat, 06 Mar 2004 12:01:16 -0700, David MacQuigg <>
    wrote:

    ># Ruby:
    ># clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)
    >
    >This is pretty straight-forward once you know what each of the methods
    >do.
    >
    ># Current best Python:
    >clean = [' '.join(t.split()).strip('.') for t in line.split('|')]


    So what you are saying is that non-programmers just naturally
    understand what "/\s*\|\s*/" means!

    I kind of agree with you about the join method - I far prefer the now
    deprecated function. But it's not much of a problem - you don't _have_
    to use method-call syntax for Python, just get the unbound method from
    the class and call it with the object as the first parameter...

    >>> str.join (' ', ['a', 'b', 'c'])

    'a b c'

    I guess I see the advantage in the Ruby form. It can of course be
    replicated in Python using a library, but being able to handle the
    task as neatly by default would be a plus.

    So, how about this...

    >>> line.lstrip ('.'); re.sub (' +', ' ', _).strip (); re.split (' ?\| ?', _)

    '/bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n'
    '/bgref/stats.stf| SPICE | 3.2.7 | John Anderson'
    ['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']



    Using ';' and '_', you can chain any functions or methods you want.
    The downsides are (1) it only works at the command line, and (2) you
    get intermediate results displayed.

    A temporary variable can handle both issues, of course...

    >>> t=line.lstrip('.'); t=re.sub(' +', ' ', t).strip(); re.split(' ?\| ?', t)

    ['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']


    or, to save some hassle...

    >>> def squeeze (p) :
    ...     return re.sub (' +', ' ', p)
    ...
    >>> t=line.lstrip('.'); t=squeeze(t).strip(); re.split(' ?\| ?', t)

    ['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']


    On this basis, perhaps it would be useful to support the '_' variable
    outside of the command line, and maybe to suppress all but the last
    result when ';' is used on the command line.

    OTOH, as you suggest, maybe we could use some extra string methods.
    With an equivalent to the Ruby 'squeeze' and support for regular
    expression methods, we could write...

    line.strip().lstrip('.').squeeze().resplit(' ?\| ?')

    Which is very much like the Ruby example.
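
    Neither squeeze nor resplit exists on Python strings. A rough
    sketch of how they could be tried out today with a str subclass
    follows (Python 3 syntax). The class name Text is an assumption,
    and the method bodies are just one plausible reading of the
    proposal.

```python
import re


class Text(str):
    """Hypothetical str subclass with the two proposed methods."""

    def squeeze(self):
        # Collapse each run of spaces into a single space.
        return Text(re.sub(' +', ' ', self))

    def resplit(self, pattern):
        # Split on a regular expression instead of a fixed string.
        return re.split(pattern, self)


line = Text("..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n")

# Built-in str methods return plain str, so re-wrap before squeeze().
clean = Text(line.strip().lstrip('.')).squeeze().resplit(r' ?\| ?')
print(clean)
```

    The re-wrapping step is the price of doing this in a subclass: the
    chain only reads cleanly if the methods live on str itself.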

    Finally, it seems to me that this kind of tidy-and-split is probably a
    common requirement. The split is easy enough, but after pondering
    Robert Brewer's argument I wondered if maybe a specialised tidying
    class could do the job...

    import re

    class cleaner :
        def __init__ (self) :
            # per-instance list; a class-level "steps = []" would be
            # shared by every cleaner object
            self.steps = []

        def lstrip (self, *args) :
            self.steps.append (lambda s : s.lstrip (*args))
            return self

        def rstrip (self, *args) :
            self.steps.append (lambda s : s.rstrip (*args))
            return self

        def strip (self, *args) :
            self.steps.append (lambda s : s.strip (*args))
            return self

        def squeeze (self) :
            pat = re.compile (' +')
            self.steps.append (lambda s : pat.sub (' ', s))
            return self

        def resub (self, regex, rep) :
            pat=re.compile (regex)
            self.steps.append (lambda s : pat.sub (rep, s))
            return self

        def clean (self, p) :
            # apply the recorded steps in the order they were added
            for i in self.steps :
                p = i (p)
            return p

    line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"

    mycleaner = cleaner().lstrip(".").strip() \
                         .squeeze().resub(' ?\| ?','|')

    print mycleaner.clean(line).split("|")


    --
    Steve Horne

    steve at ninereeds dot fsnet dot co dot uk
     
    Stephen Horne, Mar 7, 2004
    #12
  13. On Sun, 7 Mar 2004 08:29:21 -0600, Skip Montanaro <>
    wrote:

    >My arguments from the "Zen of Python" would be:
    >
    > Beautiful is better than ugly.
    > Simple is better than complex.
    > Sparse is better than dense.
    > Readability counts.


    Sparse can certainly be better than dense, but it is not an absolute.
    With any style rule there is a need to balance issues and to use
    common sense. If code can be made denser while still being readable
    then more functionality can be viewed on screen at once - a major
    benefit in readability and understanding as the more you can see, the
    less you have to remember.

    The Ruby code was IMO easier to understand than David's 'best' Python
    (except for the regular expression). The left-to-right sequencing is
    really no different than top-to-bottom sequencing in readability
    terms. And adding comments is pointless when those comments just
    duplicate what a standard method name already tells you - worse than
    pointless, in fact, as it obscures the code that you're trying to
    read. Good names are better than compensatory comments, and anyone
    claiming to be a programmer should know the everyday names that are
    used in his chosen language.

    I know that isn't what your comments did, but my point is that the
    Ruby example really doesn't need them. The nearest equivalent Python
    code requires a temporary variable and either semicolons or splitting
    over a few lines - the latter is probably better, though I adopted the
    former in my earlier post. Simply breaking the code up, though,
    provides no real readability benefits.

    Put it this
    way. How
    much am I
    improving
    the
    readability
    of this
    paragraph
    by making
    it stupidly
    narrow like
    this?

    Splitting a perfectly clear line of code over several lines is exactly
    the same thing and, as I said, the only readability issue that I could
    see in the Ruby code was the regular expression.


    --
    Steve Horne

    steve at ninereeds dot fsnet dot co dot uk
     
    Stephen Horne, Mar 7, 2004
    #13
  14. In article <>, David MacQuigg
    <> wrote:

    > The resistance will come from people who throw at us little bits and
    > pieces of code that can be done more easily in their chosen CPL.
    > String processing, for example, is one area where we may face some
    > difficulty. Here is a typical line of garbage from a statefile
    > revision control system (simplified to eliminate some items that pose
    > no new challenges):
    >
    > line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"
    >
    > The problem is to break this into its component parts, and eliminate
    > spaces and other gradoo. The cleaned-up list should look like:
    >
    > ['/bgref/stats.stf', 'SPICE', '3.2.7', 'John Anderson']
    >
    > # Ruby:
    > # clean = line.chomp.strip('.').squeeze.split(/\s*\|\s*/)
    >
    > This is pretty straight-forward once you know what each of the methods
    > do.
    >
    > # Current best Python:
    > clean = [' '.join(t.split()).strip('.') for t in line.split('|')]
    >
    > This is too much to expect of a non-programmer, even one who
    > understands the methods. The usability problems are 1) the three
    > variations in syntax ( methods, a list comprehension, and what *looks
    > like* a join function prefixed by some odd punctuation), and 2) The
    > order in which each step is entered at the keyboard. ( I can show
    > this in step-by-step detail if anyone doesn't understand what I mean.)
    > 3) Proper placement of parens can be confusing.


    David,

    I think you're coming at this too much like a programmer... |-)

    You're right, this is too complex for a non-programmer to expect
    to simply use...

    So redefine the problem, or look at it from a 90 degree angle.

    If making the users understand the syntax is too complex, then
    redefine the syntax.

    Define a set of commands, and make them function wrappers around
    your code.

    > line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"


    I am assuming you're running into these lines on a regular basis, so
    make a wrapper around your python function... Call it "Cleanup" or
    "Parse_bar_line_string" or something that makes sense to your
    users, and have them call that function....
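
    A minimal sketch of such a wrapper, assuming the one-liner from the
    start of the thread is the code being wrapped (Python 3 syntax).
    The function name is adapted from the "Parse_bar_line_string"
    suggestion and is otherwise arbitrary.

```python
def parse_bar_line(line):
    """Split a bar-delimited line into cleaned-up fields.

    Whitespace inside each field is normalised and leading or
    trailing dots are removed.
    """
    return [' '.join(field.split()).strip('.') for field in line.split('|')]


line = "..../bgref/stats.stf| SPICE | 3.2.7 | John Anderson \n"
print(parse_bar_line(line))
```

    The users then only ever type parse_bar_line(line), and the
    comprehension stays out of sight.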

    - Benjamin
     
    benjamin schollnick, Mar 7, 2004
    #14
