Mandis Quotes (aka retiring """ and ''')

Russell Nelson

Jef Raskin (namedropping) has pointed me at a neat scheme for quoting
arbitrary textual matter called "Mandis quotes". Since google is
ignorant of the phrase, I presume that Jef made it up. It is
disgustingly simple, and very Pythonesque. Here's how it works: If
you have a string that doesn't have any single quotes in it, you
surround the string by a pair of doubled single quotes. ''Like
this''. No backslash interpolation. If you want a character in
there, you put it in there (yes, I know, stand down your armies).
Clearly, then, any character except a single quote can go into one of
these strings. If you need to put a single quote in, then you put
an arbitrary string in-between the single quotes which does NOT
appear in the string. For example, "Bill's house" becomes
'x'Bill's house'x'.

More formally, a mandis quote is a pair of tokens surrounding a
completely arbitrary sequence of bytes. These tokens are comprised of
a possibly null sequence of characters preceded by and followed by a
single quote.
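
As a rough illustration of the definition above, here is a minimal Python 3 sketch of a reader for one Mandis-quoted string. The name read_mandis and its return shape are made up for this example, and it works on str rather than raw bytes; nothing here comes from an actual implementation.

```python
def read_mandis(source, start=0):
    """Read one Mandis-quoted string that begins at source[start].

    Hypothetical helper: returns (payload, index_after_closing_token).
    """
    if source[start:start + 1] != "'":
        raise ValueError("expected an opening single quote")
    # The opening token runs up to and including the next single quote,
    # so the delimiter between the quotes may be empty ('') or not ('x').
    end_of_token = source.find("'", start + 1)
    if end_of_token == -1:
        raise ValueError("unterminated opening token")
    token = source[start:end_of_token + 1]
    body_start = end_of_token + 1
    # The payload is everything up to the next copy of the same token.
    # No backslash processing happens anywhere.
    body_end = source.find(token, body_start)
    if body_end == -1:
        raise ValueError("unterminated Mandis quote")
    return source[body_start:body_end], body_end + len(token)


print(read_mandis("''Like this''"))
print(read_mandis("'x'Bill's house'x'"))
```

Under these assumptions the second call hands back the payload Bill's house together with the index just past the closing 'x' token.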

To save time, here's why this pre-PEP proposal sucks in decreasing
order of severity:

o Python source is typically represented, not as an arbitrary string
of ASCII or Unicode characters, but instead as a sequence of lines
separated by the native line terminator (e.g. CRLF, LF, or CR).

o Editors are not all up to the task of inserting arbitrary
characters into strings (although they SHOULD).

o Email cannot withstand arbitrary strings of characters (although
quoted-printable suffices).

o Some distinct Unicode characters are represented using the same
glyph, so that information is lost when text gets printed (but
that's more of a Unicode stupidism.)

Obviously, the justification for it is that it eliminates ", ', r",
r', """, and ''' from the syntax, replacing them by a single 'x' that
suffices for everything. Makes the code easier to read (only one
visual element), easier to parse, and easier to write, because you
don't need to decide which literal method to use.
 
Jeff Epler

One problem I see is that certain 'x'-quoted strings are currently legal
Python program fragments. For instance,
xx = 'x'+'x'
and
xx = 'x' '' 'x'
(two silly ways to specify the string 'xx')

A real-life example where "Mandis quotes" would change the meaning of an
existing program:
def m(rows):
    m = max([len(str(s)) for row in rows for s in row])
    return "\n".join([r(row, "%%%ds" % m) for row in rows])

def r(row, fmt):
    return '|' + ' '.join([fmt % i for i in row]) + '|'

|  1   2   3|
|  4   5 100|

I didn't like that r'' strings were added, so I hope that "add one more
kind of string literal" is already dead-in-the-water as a proposal.
Especially since the removal of any kind of string literal is impossible
before Python 3000. The fact that this would change the meaning of
legitimate programs, well, that's even worse.

These quotes are similar to perl's q{} quoting:
Quote and Quote-like Operators

While we usually think of quotes as literal values, in Perl they func-
tion as operators, providing various kinds of interpolating and pattern
matching capabilities. Perl provides customary quote characters for
these behaviors, but also provides a way for you to choose your quote
character for any of them. In the following table, a "{}" represents
any pair of delimiters you choose.

Customary Generic Meaning Interpolates
'' q{} Literal no
"" qq{} Literal yes
[...]

Non-bracketing delimiters use the same character fore and aft, but the
four sorts of brackets (round, angle, square, curly) will all nest,
which means that

q{foo{bar}baz}

is the same as

'foo{bar}baz'

What advantage do mandis-quotes have over perl's q{}? The bracketing
delimiters rule seems like a handy one when dealing with strings that
might be program source code, mathematical expressions, or even just
paragraphs of text with parentheticals.

Do any major editors (Emacs and Vim to me; others can name their favorite)
treat Mandis-quotes better than they treat triple-quotes? For me,
the poor treatment in text editors of triple-quotes is their biggest
weakness, but parsing mandis-quotes seems no easier, and maybe harder.

Finally, I don't understand why
''here's the trick''
wouldn't be a single mandis-quoted string (equal to 'here\'s the trick'
and "here's the trick"), because the delimiter is '', and that
doesn't appear within the string.

Jeff

 
Jeff Epler

Responding to "replacing them by a single 'x' that suffices for everything":
No, it doesn't.
It replaces them with an arbitrary number of string delimiters. 'x'
is one, but 'y' is another, 'Would you like some spam?' is yet
another. You must still be careful in selecting your delimiter,
lest you accidentally use one included in your string literal. The
chances of doing so are greatly decreased, but they are still
non-zero.
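
The delimiter selection described here can be sketched as a brute-force search in Python 3. This is only an illustration: choose_token and mandis_quote are hypothetical names, and restricting delimiters to lowercase ASCII letters is an assumption. The check is slightly stricter than "token does not occur in the payload", because a payload ending in a single quote can combine with the closing token to form an early match.

```python
import itertools
import string


def choose_token(payload):
    """Return the shortest token ('' first, then 'a', 'b', ...) that is safe.

    Safe means the first place the token occurs in payload + token is
    exactly where the closing token starts.
    """
    for length in itertools.count(0):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            token = "'" + "".join(letters) + "'"
            if (payload + token).find(token) == len(payload):
                return token


def mandis_quote(payload):
    """Wrap payload in a pair of identical tokens chosen by choose_token."""
    token = choose_token(payload)
    return token + payload + token


print(mandis_quote("Bill's house"))
print(mandis_quote("it''s"))
print(mandis_quote("ab'"))
```

Note that the search has to look at the whole payload before it can commit to a token, which is the cost being pointed out.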
The idea doesn't seem all bad, but it doesn't seem like a great
enough improvement to justify very nearly breaking every Python
program ever written in dozens, hundreds, or even thousands of
different places.
Jp

Ouch. You're right. Here's another program whose meaning changes
radically with mandis-quotes:

print '-' * 72
do_the_real_work()
print '-' * 72

Any program that contains the same single-quoted string twice would change
meaning under this proposal.

On the other hand, does this program have any meaning left?
print 'x'
it must be an unterminated mandis-quoted string.
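
Both cases can be checked with a small Python 3 sketch that applies the formal rule literally. The helper first_mandis_string is a made-up name for this example, and the scan simply starts at the first single quote in the text; a real tokenizer would have more context.

```python
def first_mandis_string(source):
    """Return (token, payload) for the first Mandis quote found in source.

    Raises ValueError when the opening token is never repeated.
    """
    start = source.index("'")
    end_of_token = source.index("'", start + 1)
    token = source[start:end_of_token + 1]
    body_start = end_of_token + 1
    # str.index raises ValueError if the closing token is missing.
    body_end = source.index(token, body_start)
    return token, source[body_start:body_end]


program = "print '-' * 72\ndo_the_real_work()\nprint '-' * 72\n"
token, payload = first_mandis_string(program)
print(repr(token))
print(repr(payload))

try:
    first_mandis_string("print 'x'")
except ValueError:
    print("unterminated Mandis quote")
```

Read this way, everything between the two '-' tokens, including the call to do_the_real_work(), ends up inside one string.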

OK, shoot this proposal right now.

Jeff

 
Andrew Dalke

Russell said:
> "Mandis quotes".

Trying it out ...

'Dave said 'Let's go to Bill's house and shoot a round of
pool. Afterwards we'll watch a movie. 'Ere long 'twould
be 'ardly fittin' 'Dave said'

'quote
'When will we three three meet again?

'quote
'

'quote
'You see I had a spacesuit. How it happened was
this way.

'quote
'


The latter one will be tricky because of the spaces. I
mixed "quote \n", "quote \n", "quote \n" and "quote \n"
in the above, so that there's one string with the quotes
from Shakespeare and Heinlein rather than two.

Were this implemented I would suggest that whitespace
characters not be allowed, or at least be prohibited
from the terminal points of the Mandis quote indicators.

More formally, a mandis quote is a pair of tokens surrounding a
completely arbitrary sequence of bytes. These tokens are comprised of
a possibly null sequence of characters preceded by and followed by a
single quote.

I was going to say it precludes reading in a huge block
of bytes (>1GB in size) and quoting it because you'll need
to buffer everything in memory. Then I remembered string
concatenation. Process 1MB at a time.

To save time, here's why this pre-PEP proposal sucks in decreasing
order of severity:

o Python source is typically represented, not as an arbitrary string
of ASCII or Unicode characters, but instead as a sequence of lines
separated by the native line terminator (e.g. CRLF, LF, or CR).

o Editors are not all up to the task of inserting arbitrary
characters into strings (although they SHOULD).

One thought is that the actual quote identifier doesn't need
to be shown. To start the quote, press '. The computer
inserts '' and puts the cursor so the next character is
in between the two quotes. Everything between those two
characters is treated as a string. To stop the quote,
right arrow past the final quote or, in THE/HUMANE style,
LEAP to it.

When the text is saved, the editor is free to use an
arbitrarily created Mandis quote delimiter.

o Email cannot withstand arbitrary strings of characters (although
quoted-printable suffices).

But doesn't that mean email can "withstand arbitrary strings
of characters"?
o Some distinct Unicode characters are represented using the same
glyph, so that information is lost when text gets printed (but
that's more of a Unicode stupidism.)

When working with byte oriented data it's very helpful
to be able to see a text representation for non-printable
data. For example, seeing "\r\n" instead of "
" (actually, that's only a "\n"). Similarly there are
non-visible unicode characters, including
5760 OGHAM SPACE MARK
8192 EN QUAD
8193 EM QUAD
8194 EN SPACE
8195 EM SPACE
8196 THREE-PER-EM SPACE
8197 FOUR-PER-EM SPACE
8198 SIX-PER-EM SPACE
8199 FIGURE SPACE
8200 PUNCTUATION SPACE
8201 THIN SPACE
8202 HAIR SPACE
8203 ZERO WIDTH SPACE
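
For reference, the names in this list can be reproduced with the standard unicodedata module. This is a small Python 3 sketch (the thread's own snippets are Python 2); the helper name describe is made up for the example.

```python
import unicodedata

# Decimal code points from the list above: 5760 and 8192 through 8203.
CODE_POINTS = [5760] + list(range(8192, 8204))


def describe(code_points):
    """Return (decimal, hex label, Unicode name, escaped form) per code point."""
    rows = []
    for code_point in code_points:
        ch = chr(code_point)
        rows.append((code_point, "U+%04X" % code_point,
                     unicodedata.name(ch), ascii(ch)))
    return rows


for row in describe(CODE_POINTS):
    print(*row)
```

The escaped form from ascii() is one way to see exactly which character is present, since the characters themselves render as blank or as nothing at all.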


I would like to be able to see exactly what I've got.
For example, here's something I could do with Python
as it is now

if u"\N{EN QUAD}" in s:
    print "Has an 'en quad'"

How would I do that with Mandis quotes? Would it
use editor support to show special characters vs.
normal ones? How?


Any binary data can be inside the quote. When does the
program know that that binary data is a representation
of unicode characters? Not all binary data is valid
Unicode, and there are many possible encodings. Or
would there still be an indicator like

s'This is a character string'
b'This is a byte string'

Andrew
 
Jeff Epler

One thought is that the actual quote identifier doesn't need
to be shown. To start the quote, press '. The computer
inserts '' and puts the cursor so the next character is
in between the two quotes. Everything between those two
characters is treated as a string. To stop the quote,
right arrow past the final quote or, in THE/HUMANE style,
LEAP to it.

.... but if an editor is doing something fancy for display, the
representation of the string as bytes-in-a-file could include
backslashes, while the version onscreen would be unbackslashed and
surrounded by "special quotes"---something that can be drawn but not
typed. For instance, underlined quote marks, though you *can* type
those as unicode combining characters, I guess. Green quotes? Nope,
color blindness is a problem here. Flashing quote marks it is. (Unless
unicode has a COMBINING 1.27HZ FLASH code point I don't know about.)

Jeff

 
Bengt Richter

Jef Raskin (namedropping) has pointed me at a neat scheme for quoting
arbitrary textual matter called "Mandis quotes". Since google is
ignorant of the phrase, I presume that Jef made it up. It is
disgustingly simple, and very Pythonesque. Here's how it works: If
you have a string that doesn't have any single quotes in it, you
surround the string by a pair of doubled single quotes. ''Like
this''. No backslash interpolation. If you want a character in
there, you put it in there (yes, I know, stand down your armies).
Clearly, then, any character except a single quote can go into one of
these strings. If you need to put a single quote in, then you put
an arbitrary string in-between the single quotes which does NOT
appear in the string. For example, "Bill's house" becomes
'x'Bill's house'x'.

More formally, a mandis quote is a pair of tokens surrounding a
completely arbitrary sequence of bytes. These tokens are comprised of
a possibly null sequence of characters preceded by and followed by a
single quote.

I once started a thread with the same (quoting arbitrary text) goal, but
I made it a special case of Python string syntax, using a q or Q prefix:

q'x'Bill's housex

I thought about re-quoting the 'x' at the tail, but thought more typical usage
would use a special character for single-character delimiters, e.g.,
q'|'Bill's house|

See

http://groups.google.com/groups?group=comp.lang.python.*&[email protected]&rnum=2

And click on view complete thread to see all 36 posts ;-)

To save time, here's why this pre-PEP proposal sucks in decreasing
order of severity:

o Python source is typically represented, not as an arbitrary string
of ASCII or Unicode characters, but instead as a sequence of lines
separated by the native line terminator (e.g. CRLF, LF, or CR).

See Q'... in the above cited thread.

o Editors are not all up to the task of inserting arbitrary
characters into strings (although they SHOULD).

o Email cannot withstand arbitrary strings of characters (although
quoted-printable suffices).

o Some distinct Unicode characters are represented using the same
glyph, so that information is lost when text gets printed (but
that's more of a Unicode stupidism.)

Obviously, the justification for it is that it eliminates ", ', r",
r', """, and ''' from the syntax, replacing them by a single 'x' that
suffices for everything. Makes the code easier to read (only one
visual element), easier to parse, and easier to write, because you
don't need to decide which literal method to use.

IMO a special use case does not justify complicating ordinary usage,
but can be justified as a special syntax variant if it stays out of the way
and provides otherwise unavailable capability.

As others have pointed out, you couldn't just switch to Mandis Quotes as
a complete replacement, since it would break existing programs. But you
could prefix e.g. an 'm' for a special syntax a lot like mine ;-)

m'x'Bill's House'x'

Quoting "arbitrary" text also involves the issue of encoding, which is something
I hadn't thought through when I proposed my syntax. E.g., what happens when you
paste arbitrary text of possibly different encoding between some delimiters?

Do you depend on the editor's (if you are using an editor, not programmatically
concatenating text from various sources) ability to call for encoding transformations
from clipboard content to its current encoding? Does that lose information if the
current encoding is not unicode? It's a long discussion, involving what byte sequences
really mean in the various representations involved (in source files, memory, screen
presentations, etc.), and which are transient escaped byte representations and which
are abstract text entities. Another time ... ;-)

Regards,
Bengt Richter
 
Russell Nelson

Clearly you would have to run every program through a reMandiser.
OK, shoot this proposal right now.

The necessity for a flag day isn't really a problem, because the
translation would be automated. The real problem is, in my mind, that
Python programs are NOT strings of characters, but are instead a
sequence of lines. If you change the line terminator, you haven't
changed the meaning of a program.
-russ
 
Russell Nelson

Andrew Dalke said:
'Dave said '

You are cruel .... and vicious.
Were this implemented I would suggest that whitespace
characters not be allowed, or at least be prohibited
from the terminal points of the Mandis quote indicators.

Very likely a sanity-preserving requirement.
I was going to say it precludes reading in a huge block
of bytes (>1GB in size) and quoting it because you'll need
to buffer everything in memory. Then I remembered string
concatenation. Process 1MB at a time.

Sure, they're not hard to parse. LALR(1).
When the text is saved, the editor is free to use an
arbitrarily created Mandis quote delimiter.

No question but that an editor should be helpful.
When working with byte oriented data it's very helpful
to be able to see a text representation for non-printable
data. For example, seeing "\r\n" instead of "
" (actually, that's only a "\n"). Similarly there are
non-visible unicode characters, including

This is more of a text editor problem than anything else. In THE,
when you select something, the invisible characters get rendered
visibly. Other editors can do similar things. When we get to a 100%
Unicode world, they'll have to do something. Same thing for Unicode
glyphs that get presented identically.
Any binary data can be inside the quote. When does the
program know that that binary data is a representation
of unicode characters?

That's a good question, I'll ask Jef. He's an inventive guy, he may
have thought of a solution already.
-russ
 
Andrew Dalke

Russell said:
You are cruel .... and vicious.

Ummm, okay. I would say it's due to too many years
working with unforgiving computers and reading standards
meant for computers.
Sure, they're not hard to parse. LALR(1).

It isn't the lookahead I was worried about, it was
the requirement to keep a lot of data in memory
before being able to work on it.
No question but that an editor should be helpful.

Right. Though as Jeff Epler pointed out, that helpful
editor could even work with the current Python syntax.
There's nothing to say that what the user sees on
the screen must match the representation on disk. Leo,
and of course THE show that.
This is more of a text editor problem than anything else. In THE,
when you select something, the invisible characters get rendered
visibly. Other editors can do similar things. When we get to a 100%
Unicode world, they'll have to do something. Same thing for Unicode
glyphs that get presented identically.

Which is why I gave an example of 7 or so different
characters which can be considered whitespace. I could
have added the combining character to the list, or the
flags to switch direction (as with a mix of English and
Hebrew). Will all those be shown as different characters?
Or some other way?

To bring it around to THE/HUMANE. Suppose I have the
unicode character \N{SECTION SIGN}. That's the paragraph
symbol. I believe THE uses it to indicate the end of paragraph
during a LEAP search. How then do I search for that
character embedded in THE?

The problem is that any character you use to represent
one of the otherwise hidden characters may itself be the
target of a search. Given too the difficulty of actually
typing the SECTION SIGN character it's likely easier to
search based on the unicode name rather than the actual
character as typed via the keyboard. Perhaps the better
solution is that a LEAP search show the underlying unicode
name rather than the glyph. But that would depend on the
keyboard mode, because on a US keyboard I would like to
be able to search for "Göteborg" by typing "Goteborg"
(Noah Spurrier's "Unicode Hammer" approach) while a Swede
would prefer to type the ö directly and not have o and
ö match the same letter.
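
The "type Goteborg, find Göteborg" behaviour can be approximated in Python 3 with the standard unicodedata module. This is only a sketch of the general idea, not Noah Spurrier's code: it assumes that decomposing to NFD and dropping combining marks is an acceptable notion of "same letter", and fold_accents and find_text are hypothetical names.

```python
import unicodedata


def fold_accents(text):
    """Drop combining marks after canonical decomposition (o-umlaut -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_text(haystack, needle, fold=True):
    """Return the index of needle in haystack, or -1 if absent.

    fold=True is the US-keyboard mode; fold=False keeps o and o-umlaut
    distinct, as a Swedish user would want. With fold=True the index
    refers to the folded haystack.
    """
    if fold:
        haystack, needle = fold_accents(haystack), fold_accents(needle)
    return haystack.find(needle)


city = "Visit G\u00f6teborg"
print(find_text(city, "Goteborg"))              # folded search finds it
print(find_text(city, "Goteborg", fold=False))  # exact search does not
```

Making the fold a per-search switch mirrors the point that the right behaviour depends on the keyboard mode.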

Hmm... And as I recall THE already needs to know the keyboard
layout because of its LEAP key emulation via shift-space
keypresses. Because the shift key stays down it needs to
know that * and 8 are on the same key, while a Swedish
keyboard has ( and 8 on the same key. So maybe there's
already work done along this route? And it would need
to know about the Alt-Gr key for some keyboards. Grrr!

Tangenting here, the THE docs talk about doing a LEAP
search forwards. When it fails the computer beeps. The
docs are pretty emphatic about the beep saying that it
needs to be used by blind people. But what about deaf
people? Wouldn't a screen flash be more appropriate for
that case? I also couldn't figure out why a search
failure causes the search to abort. In EMACS when the
failure occurs I can backspace in case I made a typo
at the failure point.

Andrew
 
David Fraser

Russell said:
Jef Raskin (namedropping) has pointed me at a neat scheme for quoting
arbitrary textual matter called "Mandis quotes". Since google is
ignorant of the phrase, I presume that Jef made it up. It is
disgustingly simple, and very Pythonesque. Here's how it works: If
you have a string that doesn't have any single quotes in it, you
surround the string by a pair of doubled single quotes. ''Like
this''. No backslash interpolation. If you want a character in
there, you put it in there (yes, I know, stand down your armies).
Clearly, then, any character except a single quote can go into one of
these strings. If you need to put a single quote in, then you put
an arbitrary string in-between the single quotes which does NOT
appear in the string. For example, "Bill's house" becomes
'x'Bill's house'x'.

More formally, a mandis quote is a pair of tokens surrounding a
completely arbitrary sequence of bytes. These tokens are comprised of
a possibly null sequence of characters preceded by and followed by a
single quote.

To save time, here's why this pre-PEP proposal sucks in decreasing
order of severity:

o Python source is typically represented, not as an arbitrary string
of ASCII or Unicode characters, but instead as a sequence of lines
separated by the native line terminator (e.g. CRLF, LF, or CR).

o Editors are not all up to the task of inserting arbitrary
characters into strings (although they SHOULD).

o Email cannot withstand arbitrary strings of characters (although
quoted-printable suffices).

o Some distinct Unicode characters are represented using the same
glyph, so that information is lost when text gets printed (but
that's more of a Unicode stupidism.)

Obviously, the justification for it is that it eliminates ", ', r",
r', """, and ''' from the syntax, replacing them by a single 'x' that
suffices for everything. Makes the code easier to read (only one
visual element), easier to parse, and easier to write, because you
don't need to decide which literal method to use.

And distinctly ugly. I much prefer """ or ''' :)
 
gabriele renzi

Russell Nelson wrote:

I guess you could just use a shell-like here-document then:
foo << marker
string
marker

why invent something new?
 
Ville Vainio

gabriele> Russell Nelson wrote:
gabriele> I guess you could just use a shell-like here-document then:
gabriele> foo << marker
gabriele> string
gabriele> marker

gabriele> why invent something new?

I'm speculating that this thread is an elaborate joke. I hope so at
least.
 
Max M

Ville said:
gabriele> Russell Nelson wrote:
gabriele> I guess you could just use a shell-like here-document then:
gabriele> foo << marker
gabriele> string
gabriele> marker

gabriele> why invent something new?

I'm speculating that this thread is an elaborate joke. I hope so at
least.

Actually it is pretty interesting. Most likely it will not lead to
anything useful. But at least it will have been considered.

This kind of quoting is not exactly new. It is already used in
multi-part MIME messages, where you define a boundary, that is a unique
string and can look like so:

------_=_NextPart_001_01C4971F.E7E266DB

It makes it possible to send multiple binary files encoded as text, and
still find the boundaries between them.

It is defined like this:

This is a multi-part message in MIME format.

Content-Type: multipart/alternative;
boundary="----_=_NextPart_002_01C4971F.E7E266DB"
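
To see the boundary mechanism at work from Python, here is a small Python 3 sketch using the standard email package. The boundary string is the one from the header above; the two message parts are placeholder content invented for the example.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Boundary copied from the Content-Type header shown above.
BOUNDARY = "----_=_NextPart_002_01C4971F.E7E266DB"

message = MIMEMultipart("alternative", boundary=BOUNDARY)
message.attach(MIMEText("plain text part", "plain"))    # placeholder body
message.attach(MIMEText("<p>HTML part</p>", "html"))    # placeholder body

wire_text = message.as_string()

# Each part is introduced by "--" + boundary on its own line, and the
# whole message is closed by "--" + boundary + "--".
for line in wire_text.splitlines():
    if line.startswith("--" + BOUNDARY):
        print(line)
```

As with a Mandis delimiter, the scheme only works if the boundary string never appears inside any of the parts.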


So being able to do the same in Python might have some value. At least
you would be able to define all kinds of strings the same way, and it's
easy to understand.

But the 'delimiter' notation is a bit too clunky. Using ' and " is quite
simple, and doesn't leave as much typographic noise.

If there were a smarter (shorter) notation, I think it would definitely
be useful.


--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
 
