Writing a Carriage Return in Unicode

Discussion in 'Python' started by Doug, Nov 19, 2009.

  1. Doug

    Doug Guest

    Hi!

    I am trying to write a UTF-8 file of UNICODE strings with a carriage
    return at the end of each line (code below).

    filOpen = codecs.open("c:\\temp\\unicode.txt",'w','utf-8')

    str1 = u'This is a test.'
    str2 = u'This is the second line.'
    str3 = u'This is the third line.'

    strCR = u"\u240D"

    filOpen.write(str1 + strCR)
    filOpen.write(str2 + strCR)
    filOpen.write(str3 + strCR)

    filOpen.close()

    The output looks like
    This is a test.âThis is the second line.âThis is the third
    line.â when opened in Wordpad as a UNICODE file.

    Thanks for your help!!
     
    Doug, Nov 19, 2009
    #1

  2. Doug

    MRAB Guest

    Doug wrote:
    > Hi!
    >
    > I am trying to write a UTF-8 file of UNICODE strings with a carriage
    > return at the end of each line (code below).
    >
    > filOpen = codecs.open("c:\\temp\\unicode.txt",'w','utf-8')
    >
    > str1 = u'This is a test.'
    > str2 = u'This is the second line.'
    > str3 = u'This is the third line.'
    >
    > strCR = u"\u240D"
    >
    > filOpen.write(str1 + strCR)
    > filOpen.write(str2 + strCR)
    > filOpen.write(str3 + strCR)
    >
    > filOpen.close()
    >
    > The output looks like
    > This is a test.âThis is the second line.âThis is the third
    > line.â when opened in Wordpad as a UNICODE file.
    >
    > Thanks for your help!!


    u'\u240D' isn't a carriage return (that's u'\r') but a symbol (a visible
    "CR" graphic) for carriage return. Windows programs normally expect
    lines to end with '\r\n'; just use u'\n' in programs and open the text
    files in text mode ('r' or 'w').
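
    To see the difference between the two characters, a minimal sketch
    using only the standard unicodedata module may help. It assumes
    Python 3 (which still accepts u'' literals); the variable names are
    made up for illustration.

```python
import unicodedata

symbol = u"\u240D"   # what the original code wrote
control = u"\r"      # the actual carriage return, U+000D

# U+240D is a printable symbol (category "So"), not a control character.
symbol_name = unicodedata.name(symbol)
symbol_category = unicodedata.category(symbol)

# U+000D is a control character (category "Cc"). Control characters have
# no Unicode name, so a default is passed to avoid a ValueError.
control_name = unicodedata.name(control, "<control>")
control_category = unicodedata.category(control)

print(symbol_name, symbol_category)
print(control_name, control_category)
```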

    Some Windows programs won't recognise UTF-8 text as UTF-8 in files
    unless they start with a BOM; this will be handled automatically in
    Python if you specify the encoding as 'utf-8-sig'.
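
    Put together, the original script might look roughly like the sketch
    below. This assumes Python 3's io.open rather than codecs.open (which
    opens the file in binary mode and does no newline translation). The
    temp-directory path is a stand-in for c:\temp, and newline='\r\n' is
    passed explicitly only so the result is the same on every platform.

```python
import io
import os
import tempfile

lines = [
    u"This is a test.",
    u"This is the second line.",
    u"This is the third line.",
]

# Hypothetical output path; a temp directory stands in for c:\temp.
path = os.path.join(tempfile.mkdtemp(), "unicode.txt")

# 'utf-8-sig' writes a BOM at the start of the file. newline='\r\n'
# turns each '\n' written into CRLF regardless of platform; leaving it
# out would use the platform's default line ending instead.
with io.open(path, "w", encoding="utf-8-sig", newline="\r\n") as out:
    for line in lines:
        out.write(line + u"\n")

# Read the raw bytes back to inspect what actually landed on disk.
with open(path, "rb") as raw:
    data = raw.read()

print(repr(data[:24]))
```

    On Windows, plain text mode without the newline argument should give
    the same line endings, since '\n' is translated to the platform
    default on write.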
     
    MRAB, Nov 19, 2009
    #2

  3. Doug

    Doug Guest

    Hi! Thanks for clearing this up!!
     
    Doug, Nov 19, 2009
    #3
  4. Doug

    sturlamolden Guest

    On 19 Nov, 01:14, Doug wrote:

    > Thanks for your help!!


    A carriage return in unicode is

    u"\r"

    how this is written as bytes is dependent on the encoder.

    Don't try to outsmart the UTF-8 codec, it knows how to translate "\r"
    to UTF-8.
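
    As a quick illustration of the encoder dependence, here is a small
    sketch (Python 3 assumed; the variable names are only for
    illustration) that encodes the same carriage return with three
    standard codecs, plus the symbol from the original post for
    comparison.

```python
cr = u"\r"

utf8_bytes = cr.encode("utf-8")        # one byte, same as ASCII
utf16_bytes = cr.encode("utf-16-le")   # two bytes, little-endian, no BOM
utf32_bytes = cr.encode("utf-32-be")   # four bytes, big-endian, no BOM

# For comparison: the U+240D symbol takes three bytes in UTF-8.
symbol_bytes = u"\u240D".encode("utf-8")

for label, value in [
    ("utf-8", utf8_bytes),
    ("utf-16-le", utf16_bytes),
    ("utf-32-be", utf32_bytes),
    ("U+240D as utf-8", symbol_bytes),
]:
    print(label, repr(value))
```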


    Sturla Molden
     
    sturlamolden, Nov 20, 2009
    #4
  5. On Thu, 19 Nov 2009 23:22:22 -0800, Scott David Daniels
    declaimed the following in
    gmane.comp.python.general:

    > This is the one thing from standards that I believe Microsoft got right
    > where others did not. The ASCII (American Standard Code for Information
    > Interchange) standard end of line is _both_ carriage return (\r) _and_
    > line feed (\n) -- I believe in that order.
    >


    And so are most internet protocols (SMTP/NNTP, probably TELNET)

    > The Unix operating system, in its enthusiasm to make _everything_
    > simpler (against Einstein's advice, "Everything should be made as simple
    > as possible, but not simpler.") decided that end-of-line should be a
    > simple line feed and not carriage return line feed. Before they made
    > that decision, there was debate about the order of cr-lf or lf-cr, or
    > inventing a new EOL character ('\037' == '\x1F' was the candidate).
    >


    Ah well... then there are the systems that used <cr> as the line end
    <G>

    > If you've actually typed on a physical typewriter, you know that moving
    > the carriage back is a distinct operation from rolling the platen
    > forward; both operations are accomplished when you push the carriage
    > back using the bar, but you know they are distinct. Hell, MIT even had


    Of course, if you are describing a /real/ /manual/ typewriter, you
    would rapidly discover that the sequence is <lf><cr> -- since pushing
    the bar would often trigger the line feed before it would slide the
    carriage to the right.

    But on a teletype, it would be <cr><lf>, and maybe a few <rub-outs>
    for timing -- as the <cr> was the slower operation, and would complete
    while the other characters were operated upon...

    > Lots of people talk about "dos-mode files" and "windows files" as if
    > Microsoft got it wrong; it did not -- Unix made up a convenient fiction
    > and people went along with it. (And, yes, if Unix had been there first,
    > their convention was, in fact, better).
    >

    Pardon... but the Teletype beats both...
    --
    Wulfraed Dennis Lee Bieber KD6MOG
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Nov 21, 2009
    #5
  6. On Thu, 19 Nov 2009 23:22:22 -0800, Scott David Daniels wrote:

    > MRAB wrote:
    >> u'\u240D' isn't a carriage return (that's u'\r') but a symbol (a
    >> visible "CR" graphic) for carriage return. Windows programs normally
    >> expect lines to end with '\r\n'; just use u'\n' in programs and open
    >> the text files in text mode ('r' or 'w').

    >
    > <rant>
    > This is the one thing from standards that I believe Microsoft got right
    > where others did not.


    Oh please, that's historical revisionism -- \r\n wasn't invented by
    Microsoft. Microsoft didn't "get it right", they simply copied what CP/M
    did, on account of the original MS-DOS being essentially a clone of CP/M.

    And of course the use of \r\n predates computers -- CR+LF (Carriage
    Return + LineFeed) were necessary to instruct the print head on teletype
    printers to move down one line and return to the left. It was a physical
    necessity for the oldest computer operating systems, because the only
    printers available were teletypes.


    > The ASCII (American Standard Code for Information
    > Interchange) standard end of line is _both_ carriage return (\r) _and_
    > line feed (\n)


    I doubt that very much. Do you have a reference for this?

    It is true that the predecessor to ANSI (not ASCII), ASA, specified \r\n
    as the line terminator, but ISO specified that both \n and \r\n should be
    accepted.


    > I believe in that order.


    You "believe" in that order? But you're not sure?

    That's the trouble with \r\n, or \n\r -- it's an arbitrary choice, and
    therefore hard to remember which it is. I've even seen proprietary
    business-to-business software where the developers (apparently) couldn't
    remember which was the standard, so when exporting data to text, you had
    to choose which to use for line breaks.

    Of course, being Windows software, they didn't think that you might want
    to transfer the text file to a Unix system, or a Mac, and so didn't offer
    \n or \r alone as line terminators.


    > The Unix operating system, in its enthusiasm to make _everything_
    > simpler (against Einstein's advice, "Everything should be made as simple
    > as possible, but not simpler.") decided that end-of-line should be a
    > simple line feed and not carriage return line feed.


    Why is it "too simple" to have line breaks be a single character? What is
    the downside of the Unix way? Why is \r\n "better"? We're not using
    teletypes any more.

    Or for that matter, classic Mac OS, which used a single \r as newline.

    Likewise for other OSes, such as Commodore, Amiga, Multics...


    > Before they made
    > that decision, there was debate about the order of cr-lf or lf-cr, or
    > inventing a new EOL character ('\037' == '\x1F' was the candidate).


    IBM operating systems that use EBCDIC used the NEL (NExt Line) character
    for line breaks, keeping CR and LF for other uses.

    The Unicode standard also specifies that any of the following be
    recognised as line separators or terminators:

    LF, CR, CR+LF, NEL, FF (FormFeed, \f), LS (LineSeparator, U+2028) and PS
    (ParagraphSeparator, U+2029).
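
    For what it's worth, Python's own str.splitlines() follows this
    list. A minimal sketch, assuming Python 3 (the sample string is made
    up for illustration):

```python
# One sample string containing each separator from the list above.
text = u"LF\nCR\rCRLF\r\nNEL\x85FF\x0cLS\u2028PS\u2029end"

# splitlines() breaks on every one of them and drops the separators;
# CR+LF counts as a single line break, not two.
parts = text.splitlines()

print(parts)
```

    str.splitlines() also splits on a few more control characters (VT,
    FS, GS, RS), so it is slightly more generous than the list above.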


    > If you've actually typed on a physical typewriter, you know that moving
    > the carriage back is a distinct operation from rolling the platen
    > forward;


    I haven't typed on a physical typewriter for nearly a quarter of a
    century.

    If you've typed on a physical typewriter, you'll know that to start a new
    page, you have to roll the platen forward until the page ejects, then
    move the typewriter guide forward to leave space, then feed a new piece
    of paper into the typewriter by hand, then roll the platen again until
    the page is under the guide, then push the guide back down again. That's
    FIVE distinct actions, and if you failed to do them, you would type but
    no letters would appear on the (non-existent) page. Perhaps we should
    specify that text files need a five-character sequence to specify a new
    page too?


    > both operations are accomplished when you push the carriage
    > back using the bar, but you know they are distinct. Hell, MIT even had
    > "line starve" character that moved the cursor up (or rolled the platen
    > back).
    > </rant>
    >
    > Lots of people talk about "dos-mode files" and "windows files" as if
    > Microsoft got it wrong; it did not -- Unix made up a convenient fiction
    > and people went along with it. (And, yes, if Unix had been there first,
    > their convention was, in fact, better).


    This makes zero sense. If Microsoft "got it right", then why is the Unix
    convention "convenient" and "better"? Since we're not using teletype
    machines, I would say Microsoft is now using an *inconvenient* fiction.




    --
    Steven
     
    Steven D'Aprano, Nov 21, 2009
    #6
  7. Doug

    sturlamolden Guest

    On 21 Nov, 09:12, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:

    > Oh please, that's historical revisionism -- \r\n wasn't invented by
    > Microsoft. Microsoft didn't "get it right", they simply copied what CP/M
    > did, on account of the original MS-DOS being essentially a clone of CP/M.


    Actually \r\n goes back to early mechanical typewriters with
    typebars, such as the Hermes. The operator would hit CR to return the
    paper carriage and LF to move down to the next line.
     
    sturlamolden, Nov 21, 2009
    #7
  8. Doug

    sturlamolden Guest

    On 21 Nov, 08:10, Dennis Lee Bieber wrote:

    > Of course, if you are describing a /real/ /manual/ typewriter, you
    > would rapidly discover that the sequence is <lf><cr> -- since pushing
    > the bar would often trigger the line feed before it would slide the
    > carriage to the right.
    >
    > But on a teletype, it would be <cr><lf>, and maybe a few <rub-outs>
    > for timing -- as the <cr> was the slower operation, and would complete
    > while the other characters were operated upon...


    Ah, yes you are right :)

    The sequence is <lf><cr> on a typewriter.

    Which is why the RETURN button often had the symbol

    |
    <----|
     
    sturlamolden, Nov 21, 2009
    #8
  9. Doug

    Steve Howell Guest

    On Nov 21, 12:12 am, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Thu, 19 Nov 2009 23:22:22 -0800, Scott David Daniels wrote:
    >
    > > If you've actually typed on a physical typewriter, you know that moving
    > > the carriage back is a distinct operation from rolling the platen
    > > forward;

    >
    > I haven't typed on a physical typewriter for nearly a quarter of a
    > century.
    >
    > If you've typed on a physical typewriter, you'll know that to start a new
    > page, you have to roll the platen forward until the page ejects, then
    > move the typewriter guide forward to leave space, then feed a new piece
    > of paper into the typewriter by hand, then roll the platen again until
    > the page is under the guide, then push the guide back down again. That's
    > FIVE distinct actions, and if you failed to do them, you would type but
    > no letters would appear on the (non-existent) page. Perhaps we should
    > specify that text files need a five-character sequence to specify a new
    > page too?
    >
    > > both operations are accomplished when you push the carriage
    > > back using the bar, but you know they are distinct.  Hell, MIT even had
    > > "line starve" character that moved the cursor up (or rolled the platen
    > > back).
    > > </rant>

    >
    > > Lots of people talk about "dos-mode files" and "windows files" as if
    > > Microsoft got it wrong; it did not -- Unix made up a convenient fiction
    > > and people went along with it. (And, yes, if Unix had been there first,
    > > their convention was, in fact, better).

    >
    > This makes zero sense. If Microsoft "got it right", then why is the Unix
    > convention "convenient" and "better"? Since we're not using teletype
    > machines, I would say Microsoft is now using an *inconvenient* fiction.
    >
    > --
    > Steven


    It's been a long time since I have typed on a physical typewriter as
    well, but I still vaguely remember all the crazy things I had to do to
    get the tab key to produce a predictable indentation on the paper
    output.

    I agree with Steven that "\r\n" is completely insane. If you are
    going to couple character sets to their legacy physical
    implementations, you should also have a special extra character to dot
    your i's and cross your t's. Apparently neither Unix nor Microsoft got
    that right. I mean, think about it, dotting the i is a distinct
    operation from creating the undotted "i." ;)
     
    Steve Howell, Nov 22, 2009
    #9
  10. Steve Howell wrote:
    > If you are
    > going to couple character sets to their legacy physical
    > implementations, you should also have a special extra character to dot
    > your i's and cross your t's.


    No, no, no. For that device you need to output a series
    of motion vectors for the scribing point. Plus control
    characters for "dip nib" and "apply blotter", and
    possibly also "pluck goose" for when the print head
    becomes worn.

    --
    Greg
     
    Gregory Ewing, Nov 22, 2009
    #10
  11. Doug

    Steve Howell Guest

    On Nov 21, 11:33 pm, Gregory Ewing wrote:
    > Steve Howell wrote:
    > > If you are
    > > going to couple character sets to their legacy physical
    > > implementations, you should also have a special extra character to dot
    > > your i's and cross your t's.

    >
    > No, no, no. For that device you need to output a series
    > of motion vectors for the scribing point. Plus control
    > characters for "dip nib" and "apply blotter", and
    > possibly also "pluck goose" for when the print head
    > becomes worn.
    >


    Greg, at the first reading of your response, it sounded overly
    complicated for me to have to "dip nib" and "pluck goose" every time
    I just want to semantically indicate the ninth letter of the English
    alphabet, but that's easily solved with a wizard interface, I guess.
    Maybe every time I am trying to decide which letter to type in Word,
    there could be some kind of animated persona that helps me choose the
    character. There could be a visual icon of an "eye" that reminds me
    of the letter that I am trying to type, and I could configure the
    depth to which I dip the nib with some kind of slider interface. It
    actually sounds quite simple and elegant, the more that I think about
    it.
     
    Steve Howell, Nov 22, 2009
    #11
  12. Doug

    Aahz Guest

    Dennis Lee Bieber wrote:
    >On Thu, 19 Nov 2009 23:22:22 -0800, Scott David Daniels
    >declaimed the following in
    >gmane.comp.python.general:
    >>
    >> If you've actually typed on a physical typewriter, you know that moving
    >> the carriage back is a distinct operation from rolling the platen
    >> forward; both operations are accomplished when you push the carriage
    >> back using the bar, but you know they are distinct.

    >
    > Of course, if you are describing a /real/ /manual/ typewriter, you
    >would rapidly discover that the sequence is <lf><cr> -- since pushing
    >the bar would often trigger the line feed before it would slide the
    >carriage to the right.


    Often, but not always; it certainly was possible on most typewriters to
    return the carriage without a line feed -- and occasionally desirable for
    overstrike.
    --
    Aahz () <*> http://www.pythoncraft.com/

    The best way to get information on Usenet is not to ask a question, but
    to post the wrong information.
     
    Aahz, Nov 28, 2009
    #12