escaping/encoding/formatting in python

Steve Howell

One of the biggest nuisances for programmers, just beneath date/time
APIs in the pantheon of annoyances, is that we are constantly dealing
with escaping/encoding/formatting issues.

I wrote this little program as a cheat sheet for myself and others.
Hope it helps.

# escaping quotes
legal_string = ['"', "'", "'\"", '"\'', """ '" """]
for s in legal_string:
    print("[" + s + "]")

# formatting
print 'Hello %s' % 'world'
print "Hello %s" % 'world'
planet = 'world'
print "Hello {planet}".format(**locals())
print "Hello {planet}".format(planet=planet)
print "Hello {0}".format(planet)

# Unicode
s = u"\u0394"
print s # prints a triangle
print repr(s) == "u'\u0394'" # True
print s.encode("utf-8") == "\xce\x94" # True
# other examples/resources???

# Web encodings
import urllib
s = "~foo ~bar"
print urllib.quote_plus(s) == '%7Efoo+%7Ebar' # True
print urllib.unquote_plus(urllib.quote_plus(s)) == s # True
import cgi
s = "x < 4 & x > 5"
print cgi.escape(s) == 'x &lt; 4 &amp; x &gt; 5' # True

# JSON
import json
h = {'foo': 'bar'}
print json.dumps(h) == '{"foo": "bar"}' # True
try:
    bad_json = "{'foo': 'bar'}"
    json.loads(bad_json)
except ValueError:
    # json.loads raises ValueError on malformed input
    print 'Must use double quotes in your JSON'

It's tested under Python3.2. I didn't dare to cover regexes. It
would be great if somebody could flesh out the Unicode examples or
remind me (and others) of other common APIs that are useful to have in
your bag of tricks.
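
For readers on Python 3, here is a minimal sketch of the same checks. It is an illustration rather than part of the original cheat sheet: it assumes the Python 3 standard library, where `quote_plus` and `unquote_plus` live in `urllib.parse` and `html.escape` takes the place of `cgi.escape`. The sample strings and the helper name `is_valid_json` are made up for the example.

```python
import html
import json
from urllib.parse import quote_plus, unquote_plus

# Unicode: str holds text, bytes holds its encoded form.
delta = "\u0394"  # GREEK CAPITAL LETTER DELTA
encoded = delta.encode("utf-8")
print(encoded == b"\xce\x94")            # True
print(encoded.decode("utf-8") == delta)  # True

# Web encodings: spaces become '+', '&' is percent-encoded.
query = "foo bar&baz"
quoted = quote_plus(query)
print(quoted == "foo+bar%26baz")      # True
print(unquote_plus(quoted) == query)  # True

# HTML escaping of <, > and &.
markup = "x < 4 & x > 5"
escaped = html.escape(markup)
print(escaped == "x &lt; 4 &amp; x &gt; 5")  # True


def is_valid_json(text):
    """Hypothetical helper: report whether text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json('{"foo": "bar"}'))  # True
print(is_valid_json("{'foo': 'bar'}"))  # False: JSON needs double quotes
```

Note that `html.escape` also escapes single and double quotes unless called with `quote=False`, which differs from the default behaviour of the old `cgi.escape`.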
 
rusi

One of the biggest nuisances for programmers, just beneath date/time
APIs in the pantheon of annoyances, is that we are constantly dealing
with escaping/encoding/formatting issues.

[OT for this list]
If you run
$ find /usr/share/emacs/23.3/lisp/ -name '*.gz'|xargs zgrep '\\\\\\\\\\\\\\\\'
you can get quite a few results.

[Suitable assumptions: linux box with emacs installed]
 
Steve Howell

One of the biggest nuisances for programmers, just beneath date/time
APIs in the pantheon of annoyances, is that we are constantly dealing
with escaping/encoding/formatting issues.

[OT for this list]
If you run
$ find /usr/share/emacs/23.3/lisp/ -name '*.gz'|xargs zgrep '\\\\\\\\\\\\\\\\'
you can get quite a few results.

[Suitable assumptions: linux box with emacs installed]

You've one-upped me with 2-to-the-N backslash escaping. I've written
useful scripts before with "\\\\\\\\" (scripts that went through three
levels of interpretation), but four is setting a new bar. My use of
three exponentially increasing levels of backslashes back in the day
was like Beamon's jump in the Mexico City Olympics. An amazing feat
for its time, but every record eventually gets broken. Well done.
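
The 2-to-the-N growth can be sketched in a few lines of Python. This is only an illustration, under the assumption that each level of interpretation halves the backslashes it sees, so each level of protection has to double them; the function name `escape_backslashes` is made up for the example.

```python
def escape_backslashes(text, levels):
    """Double every backslash once per level of interpretation."""
    for _ in range(levels):
        text = text.replace("\\", "\\\\")
    return text


# One literal backslash, protected for 0 to 5 levels of interpretation.
for levels in range(6):
    escaped = escape_backslashes("\\", levels)
    print(levels, len(escaped), escaped)
```

Three levels give 8 backslashes, four give 16 and five give 32.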
 
rusi

[OT for this list]
If you run
$ find /usr/share/emacs/23.3/lisp/ -name '*.gz'|xargs zgrep '\\\\\\\\\\\\\\\\'
you can get quite a few results.
[Suitable assumptions: linux box with emacs installed]

You've one-upped me with 2-to-the-N backslash escaping. I've written
useful scripts before with "\\\\\\\\" (scripts that went through three
levels of interpretation), but four is setting a new bar. My use of
three exponentially increasing levels of backslashes back in the day
was like Beamon's jump in the Mexico City Olympics. An amazing feat
for its time, but every record eventually gets broken. Well done.

There was a competition here?!
If so I can break my own record -- double the number of backslashes
and you still get hits.
It's just that I was unsure of my ability at typing 32 backslashes (and
making a reasonable post).

On a more serious note this indicates that it is (may be?) a bad idea
for old-fashioned languages (like elisp and C) to have only 1 string-
quoter.

May-be-question-mark because programming language experience tells us
that avoiding recursion (in its infinite guises) by special-casing is
usually a bad idea.

All this mess would vanish if the string-literal-starter and ender
were different.
[You don't need to escape an open paren in a Lisp sexp]
 
Nobody

All this mess would vanish if the string-literal-starter and ender
were different.

You still need an escape character in order to be able to embed an
unbalanced end character.

Tcl and PostScript use mirrored string delimiters (braces for Tcl,
parentheses for PostScript), which results in the worst of both worlds:
they still need an escape character (backslash, in both cases) but now you
can't match tokens with a regexp/DFA.
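
To illustrate why a plain regexp/DFA is not enough here, the following Python sketch scans a PostScript-style parenthesised string. It is a simplified assumption of the rules (nested parentheses balance, backslash escapes the next character), not the full PostScript or Tcl grammar, and the name `scan_nested_string` is made up for the example. The point is the `depth` counter: the scanner has to count, which a finite automaton cannot do for unbounded nesting.

```python
def scan_nested_string(source, start=0):
    """Return the index just past the string literal opening at `start`."""
    if source[start] != "(":
        raise ValueError("expected '(' at position %d" % start)
    depth = 0
    i = start
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated string literal")


text = "(a (b) c) rest"
end = scan_nested_string(text)
print(text[:end])  # (a (b) c)
```

An unbalanced closing parenthesis still needs the escape, as in `(a \) b)`.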
 
rusi

You still need an escape character in order to be able to embed an
unbalanced end character.

Tcl and PostScript use mirrored string delimiters (braces for Tcl,
parentheses for PostScript), which results in the worst of both worlds:
they still need an escape character (backslash, in both cases) but now you
can't match tokens with a regexp/DFA.

Yes. I hand it to you that I missed the case of explicitly unbalanced
strings.
But are not such cases rare?
For example code such as:
print '"'
print str(something)
print '"'

could better be written as
print '"%s"' % str(something)
 
Nobody

But are not such cases rare?

They exist, therefore they have to be supported somehow.

For example code such as:
print '"'
print str(something)
print '"'

could better be written as
print '"%s"' % str(something)

Not if the text between the delimiters is large.

Consider:

print 'static const char * const data[] = {'
for line in infile:
    print '\t"%s",' % line.rstrip()
print '};'

Versus:

text = '\n'.join('\t"%s",' % line.rstrip() for line in infile)
print 'static const char * const data[] = {\n%s\n};' % text

C++11 solves the problem to an extent by providing raw strings with
user-defined delimiters (up to 16 printable characters excluding
parentheses and backslash), e.g.:

R"delim(quote: " backslash: \ rparen: ))delim"

evaluates to the string:

quote: " backslash: \ rparen: )

The only sequence which cannot appear in such a string is )delim" (i.e. a
right parenthesis followed by the chosen delimiter string followed by a
double quote). The delimiter can be chosen either by analysing the string
or by choosing a string at random and relying upon a collision
being statistically improbable.
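
The "analyse the string" option can be sketched in Python as a small generator of C++11 raw string literals. This is a rough illustration, not a complete solution: the function name `cpp_raw_literal` is made up, and it only tries delimiters made of repeated `x` characters up to the 16-character limit, so a pathological input could defeat it.

```python
def cpp_raw_literal(text, max_len=16):
    """Wrap text in a C++11 raw string literal with a safe delimiter."""
    for length in range(max_len + 1):
        delim = "x" * length
        # The only forbidden sequence is: ) + delimiter + "
        if ")" + delim + '"' not in text:
            return 'R"' + delim + "(" + text + ")" + delim + '"'
    raise ValueError("no suitable delimiter found")


print(cpp_raw_literal('say "hi"'))  # R"(say "hi")"
print(cpp_raw_literal('f(")")'))    # R"x(f(")"))x"
```

The first input needs no delimiter at all; the second contains `)"`, so the sketch falls back to the one-character delimiter `x`.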
 
rusi

On a (somewhat distantly) related note, found this old fortune:

Wouldn't the sentence 'I want to put a hyphen between the words Fish
and And and And and Chips in my Fish-And-Chips sign' have been clearer
if quotation marks had been placed before Fish, and between Fish and
and, and and and And, and And and and, and and and And, and And and
and, and and and Chips, as well as after Chips?
 
John Nagle

You've one-upped me with 2-to-the-N backslash escaping.

Early attempts at UNIX word processing, "nroff" and "troff",
suffered from that problem, due to a badly designed macro system.

A question in language design is whether to escape or quote.
Do you write

"X = %d" % (n,)

or

"X = " + str(n)

In general, for anything but output formatting, the second scales
better. Regular expressions have a bad case of the first.
For a quoted alternative to regular expression syntax, see
SNOBOL or Icon. SNOBOL allows naming patterns, and those patterns
can then be used as components of other patterns. SNOBOL
is obsolete, but that approach produced much more readable
code.
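
The idea of naming patterns and reusing them as components can be approximated in Python, although this is only a loose analogy and not how SNOBOL or Icon work. The sketch below assumes nothing beyond the standard `re` module; the pattern names are made up for the example.

```python
import re

# Small named pieces, each readable on its own.
SIGN = r"[+-]?"
DIGITS = r"[0-9]+"
INTEGER = SIGN + DIGITS
FRACTION = r"\." + DIGITS

# Larger patterns are built from the named components.
NUMBER = INTEGER + "(?:" + FRACTION + ")?"
ASSIGNMENT = r"(?P<name>[A-Za-z_]\w*)\s*=\s*(?P<value>" + NUMBER + ")"

assignment_re = re.compile(ASSIGNMENT)

match = assignment_re.fullmatch("X = -12.5")
print(match.group("name"), match.group("value"))  # X -12.5
```

Underneath it is still regular expression syntax with all its escaping, but the composition by name keeps each piece short.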

John Nagle
 
