Unicode Support in Ruby, Perl, Python, Emacs Lisp

Discussion in 'Python' started by Xah Lee, Oct 7, 2010.

  1. Xah Lee

    Xah Lee Guest

    here's my experiences dealing with unicode in various langs.

    Unicode Support in Ruby, Perl, Python, Emacs Lisp

    Xah Lee, 2010-10-07

    I looked at Ruby 2 years ago. One problem i found is that it does not
    support Unicode well. I just checked today, it still doesn't. Just do
    a web search on blog and forums on “ruby unicode”. e.g.: Source,
    Source, Source, Source.

    Perl's exceedingly lousy unicode support hack is well known. In fact
    it is the primary reason i “switched” to python for my scripting needs
    in 2005. (See: Unicode in Perl and Python)

    Python 2.x's unicode support is also not ideal. You have to declare
    your source code with header like 「#-*- coding: utf-8 -*-」, and you
    have to declare your string as unicode with “u”, e.g. 「u"林花謝了春紅"」. In
    regex, you have to use unicode flag such as
    「re.search(r'\.html$',child,re.U)」. And when processing files, you have
    to read in with 「unicode(inF.read(),'utf-8')」, and printing out unicode
    you have to do 「outF.write(outtext.encode('utf-8'))」. If you are
    processing lots of files, and if one of the file contains a bad char or
    doesn't use encoding you expected, your python script chokes dead in
    the middle, you don't even know which file it is or which line unless
    your code print file names.
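
The "chokes dead in the middle" failure described above can be softened by catching the decode error per file. The following is a minimal sketch in Python 3 (not the Python 2.x code the post describes); the function name `read_text_files` and the sample file names are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def read_text_files(paths, encoding="utf-8"):
    """Decode each file, collecting failures instead of stopping at the first.

    Hypothetical helper for illustration; not from the original post.
    """
    decoded = {}
    failures = []
    for path in paths:
        with open(path, "rb") as handle:
            raw = handle.read()
        try:
            decoded[path] = raw.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte;
            # counting newlines before it gives a 1-based line number.
            line_no = raw.count(b"\n", 0, err.start) + 1
            failures.append((path, line_no, err.reason))
    return decoded, failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.txt")
        bad = os.path.join(tmp, "bad.txt")
        with open(good, "wb") as f:
            f.write("林花謝了春紅\n".encode("utf-8"))
        with open(bad, "wb") as f:
            f.write(b"first line\nsecond \xff line\n")
        texts, problems = read_text_files([good, bad])
        for path, line_no, reason in problems:
            print("%s: line %d: %s" % (os.path.basename(path), line_no, reason))
```

Reading the file as bytes first keeps the byte offset available, which is what makes it possible to report the file name and line of the bad character.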

    Also, if the output shell doesn't support unicode or doesn't match
    with the encoding specified in your python print, you get gibberish.
    It is often a headache to figure out the locale settings, what
    encoding the terminal support or is configured to handle, the encoding
    of your file, and which encoding the “print” is using. It gets more
    complex if you are going thru a network, such as ssh. (most shells,
    terminals, as of 2010-10, in practice, still have problems dealing
    with unicode. (e.g. Windows Console, PuTTY. Exception being Mac's
    Apple Terminal.))
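
One way to see which encoding the output stream is using, and to avoid an exception when the terminal cannot represent a character, is sketched below in Python 3. The helper name `encode_for_stream` is hypothetical, and falling back to ASCII when no encoding is reported is an assumption made for this sketch.

```python
import sys


def encode_for_stream(text, stream_encoding):
    """Encode text for an output stream, escaping what it cannot represent.

    Hypothetical helper for illustration; not from the original post.
    """
    # Assumption: treat a missing encoding (None) as plain ASCII.
    encoding = stream_encoding or "ascii"
    # "backslashreplace" emits escapes such as \u4e00 for characters the
    # encoding cannot represent, instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace")


if __name__ == "__main__":
    print("stdout encoding:", sys.stdout.encoding)
    data = encode_for_stream("林花謝了春紅", sys.stdout.encoding)
    sys.stdout.buffer.write(data + b"\n")
```

This does not fix a terminal that is configured for a different encoding than it reports; it only makes the mismatch visible rather than fatal.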

    Python 3 supposedly fixed the unicode problem, but i haven't used it.
    Last time i looked into whether i should adopt python 3, but
    apparently it isn't used much. (See: Python 3 Adoption) (and i'm quite
    pissed that Python is going more and more into OOP mumbo jumbo with
    lots ad hoc syntax (e.g. “views”, “iterators”, “list comprehension”.))

    I'll have to say, as far as text processing goes, the most beautiful
    lang with respect to unicode is emacs lisp. In elisp code (e.g.
    Generate a Web Links Report with Emacs Lisp ), i don't have to declare
    none of the unicode or encoding stuff. I simply write code to process
    string or buffer text, without even having to know what encoding it
    is. Emacs the environment takes care of all that.

    It seems that javascript and PHP also support unicode well, but i
    don't have extensive experience with them. I suppose that elisp, php,
    javascript, all support unicode well because these langs have to deal
    with unicode in practical day-to-day situations.


    --------------------------------------------------
    for links, see
    http://xahlee.blogspot.com/2010/10/unicode-support-in-ruby-perl-python.html

    Xah ∑ xahlee.org ☄
     
    Xah Lee, Oct 7, 2010
    #1

  2. Xah Lee

    Bigos Guest

    On Oct 7, 7:13 pm, Xah Lee <> wrote:
    > here's my experiences dealing with unicode in various langs.
    >
    > Unicode Support in Ruby, Perl, Python, Emacs Lisp
    >
    > Xah Lee, 2010-10-07
    >
    > I looked at Ruby 2 years ago. One problem i found is that it does not
    > support Unicode well. I just checked today, it still doesn't. Just do
    > a web search on blog and forums on “ruby unicode”. e.g.: Source,
    > Source, Source, Source.
    >
    > Perl's exceedingly lousy unicode support hack is well known. In fact
    > it is the primary reason i “switched” to python for my scripting needs
    > in 2005. (See: Unicode in Perl and Python)
    >
    > Python 2.x's unicode support is also not ideal. You have to declare
    > your source code with header like 「#-*- coding: utf-8 -*-」, and you
    > have to declare your string as unicode with “u”, e.g. 「u"林花謝了春紅"」. In
    > regex, you have to use unicode flag such as
    > 「re.search(r'\.html$',child,re.U)」. And when processing files, you have
    > to read in with 「unicode(inF.read(),'utf-8')」, and printing out unicode
    > you have to do 「outF.write(outtext.encode('utf-8'))」. If you are processing lots of
    > files, and if one of the file contains a bad char or doesn't use
    > encoding you expected, your python script chokes dead in the middle,
    > you don't even know which file it is or which line unless your code
    > print file names.
    >
    > Also, if the output shell doesn't support unicode or doesn't match
    > with the encoding specified in your python print, you get gibberish.
    > It is often a headache to figure out the locale settings, what
    > encoding the terminal support or is configured to handle, the encoding
    > of your file, and which encoding the “print” is using. It gets more
    > complex if you are going thru a network, such as ssh. (most shells,
    > terminals, as of 2010-10, in practice, still have problems dealing
    > with unicode. (e.g. Windows Console, PuTTY. Exception being Mac's
    > Apple Terminal.))
    >
    > Python 3 supposedly fixed the unicode problem, but i haven't used it.
    > Last time i looked into whether i should adopt python 3, but
    > apparently it isn't used much. (See: Python 3 Adoption) (and i'm quite
    > pissed that Python is going more and more into OOP mumbo jumbo with
    > lots ad hoc syntax (e.g. “views”, “iterators”, “list comprehension”.))
    >
    > I'll have to say, as far as text processing goes, the most beautiful
    > lang with respect to unicode is emacs lisp. In elisp code (e.g.
    > Generate a Web Links Report with Emacs Lisp ), i don't have to declare
    > none of the unicode or encoding stuff. I simply write code to process
    > string or buffer text, without even having to know what encoding it
    > is. Emacs the environment takes care of all that.
    >
    > It seems that javascript and PHP also support unicode well, but i
    > don't have extensive experience with them. I suppose that elisp, php,
    > javascript, all support unicode well because these langs have to deal
    > with unicode in practical day-to-day situations.
    >
    > --------------------------------------------------
    > for links, see http://xahlee.blogspot.com/2010/10/unicode-support-in-ruby-perl-pytho...
    >
    >  Xah ∑ xahlee.org ☄


    Maybe you have checked the wrong version. There are two versions of Ruby
    out there; one does support unicode and the other doesn't. The latest
    version, i.e. the 1.9.x branch, has made some progress in that regard.
    Please check the following links to see if they solve your problem.

    http://nuclearsquid.com/writings/ruby-1-9-encodings.html
    http://loopkid.net/articles/2008/07/07/ruby-1-9-utf-8-mostly-works
    http://stackoverflow.com/questions/1627767/rubys-stringgsub-unicode-and-non-word-characters

    I think latest recommended version of Ruby is ruby 1.9.2p0, please try
    it to see if it works for you. Of course it is not as good as Lisp,
    and in Rails code you see people writing the same sequences of
    characters over and over again, but some people like it because it is
    better than other languages they used before. If it's a stepping stone
    towards Lisp then it is a good thing imho.
     
    Bigos, Oct 9, 2010
    #2

  3. Xah Lee

    Xah Lee Guest

    2010-10-09

    On Oct 9, 3:45 pm, Sean McAfee <> wrote:
    > Xah Lee <> writes:
    > > Perl's exceedingly lousy unicode support hack is well known. In fact
    > > it is the primary reason i “switched” to python for my scripting needs
    > > in 2005. (See: Unicode in Perl and Python)

    >
    > I think your assessment is antiquated.  I've been doing Unicode
    > programming with Perl for about three years, and it's generally quite
    > wonderfully transparent.


    you are probably right. The last period i did serious perl is 1998 to
    2004. Since, have pretty much lost contact with perl community.

    i have like 5 years of 8 hours day experience with perl... the app we
    wrote is probably the largest perl web app at the time, say within the
    top 10 largest perl web apps, during the dot com days.

    spend 2 years with python about 2005, 2006, but mostly just personal
    dabbling.

    my dilemma is this... i am really tired of perl, so i thought python is
    my solution. Comparing the syntax, semantics, etc, i really do find
    python better, but to know python as well as i know perl, or, to know
    a lang really as an expert (e.g. intimately familiar with all the ins
    and outs of constructs, idioms, their speeds, libraries out there,
    their nature, which are used, their bugs etc), takes years. So,
    whenever i have this psychological urge to totally ditch perl and hug
    python 100% ... but it takes a huge amount of time to dig into a lang
    well again, so sometimes i thought of sticking with my perl due to my
    existing knowledge and forthwith stop wasting valuable time, but then,
    whenever i work in perl with its hack nature and crooked community
    (all those mongers ****), especially the syntax for nested list/hash
    that's more than 3 levels (and my code almost always rely on nested
    list/hash to do things since am a functional programer), and compare
    to python's syntax on nested structure, i ask my self again, is this
    shit really what i want to keep on at?

    and python 3 comes in, and over the years i learned, that Guido really
    hates functional programing (he understands it nil), and python is
    moving more into oop mumbo jumbo with more special syntaxes and
    special semantics. (and perl is trivially far more capable at
    functional programing than python) So, this puts a damnation in my
    mental struggle for python.

    in the end i really haven't decided on anything, as usual... it's not
    really concrete, answerable question anyway, it's just psy struggle on
    some fuzzy ideal about efficiency and perfect lang.

    and there's ruby... (among others) and because i'm such a douchbag for
    langs, now and then i suppose i waste my time to venture and read
    about ruby, the unconscious excuse is that maybe ruby will turn out to
    simply solve all my life's problems, but nagging in the back of my
    mind is the reality that, yeah, go spend 3 years 8 hours a day on
    ruby, then possibly it'll be practically useful to me as i do with
    perl already, and, no, it won't bring you anything extra as far as
    lang goes, for that you go to OCaml/F#, erlang, Mathematica ... and
    who knows what kinda hidden needle in the eye i'll discover on my road
    in ruby.

    btw, this is all just a geek's mental disorder, common with many who's
    into lang design and beauty etc type of shit. (high percentage of this
    crowd hang in newsgroups) But the reality is that, this psychological
    problem really don't have much practical justification ... it's just
    fret, fret, fret. Fret, fret, fret. Years of fretting, while others
    have written great apps all over the web.

    in practice, i do not even have a need for perl or python in my work
    since about 2006, except a few find/replace scripts for text
    processing that i've written in the past. And, since about 2007, i've
    been increasingly writing lots and lots more in elisp. (and this emacs
    beast, is really a true love more than anything) So these days, almost
    all of my scripts are in elisp. (and my job these days is mainly just
    text processing programing)

    • 〈Xah on Programing Languages〉
    http://xahlee.org/Periodic_dosage_dir/comp_lang.html

    > On the programmers' web site stackoverflow.com, I flag questions with
    > the "unicode" tag, and of questions that mention a specific language,
    > Python and C++ seem to come up the most often.
    >
    > > I'll have to say, as far as text processing goes, the most beautiful
    > > lang with respect to unicode is emacs lisp. In elisp code (e.g.
    > > Generate a Web Links Report with Emacs Lisp ), i don't have to declare
    > > none of the unicode or encoding stuff. I simply write code to process
    > > string or buffer text, without even having to know what encoding it
    > > is. Emacs the environment takes care of all that.

    >
    > It's not quite perfect, though.  I recently discovered that if I enter a
    > Chinese character using my Mac's Chinese input method, and then enter
    > the same character using a Japanese input method, Emacs regards them as
    > different characters, even though they have the same Unicode code point.
    > For example, from describe-char:
    >
    >   character: 一 (43323, #o124473, #xa93b, U+4E00)
    >   character: 一 (55404, #o154154, #xd86c, U+4E00)


    that's because you are using pre emacs 23. Try to switch to emacs 23,
    it uses utf-8 to represent chars internally.

    > On saving and reverting a file containing such text, the characters are
    > "normalized" to the Japanese version.
    >
    > I suppose this might conceivably be the correct behavior, but it sure
    > was a surprise that (equal "一" "一") can be nil.


    (equal "一" "一")

    with emacs 23.*, this eval to true.

    • 〈New Features in Emacs 23〉
    http://xahlee.org/emacs/emacs23_features.html

    • 〈Emacs and Unicode Tips〉
    http://xahlee.org/emacs/emacs_n_unicode.html

    • 〈All about Unicode〉
    http://xahlee.org/Periodic_dosage_dir/unicode.html

    Xah ∑ xahlee.org ☄
     
    Xah Lee, Oct 10, 2010
    #3
  4. On Sat, 09 Oct 2010 13:06:32 -0700, Bigos wrote:
    [...]
    > Maybe you have checked wrong version. There two versions of Ruby out
    > there one does support unicode and the other doesn't.


    Please don't feed the trolls. Xah Lee is a known troll who cross-posts to
    irrelevant newsgroups with his blatherings. He is not interested in
    learning anything which challenges his opinions, and rarely if ever
    engages in dialog with those who respond.

    Since your reply has little or nothing to do with the newsgroups you have
    sent it to, it is also spamming. While we're all extremely impressed by
    your assertion that Lisp is the bestest programming language evar, please
    keep your fan-boy gushing to comp.lang.lisp and don't cross-post again.

    Followups to /dev/null.


    --
    Steven
     
    Steven D'Aprano, Oct 10, 2010
    #4
  5. Sean McAfee <> writes:

    > Xah Lee <> writes:
    >> Perl's exceedingly lousy unicode support hack is well known. In fact
    >> it is the primary reason i “switched” to python for my scripting needs
    >> in 2005. (See: Unicode in Perl and Python)

    >
    > I think your assessment is antiquated. I've been doing Unicode
    > programming with Perl for about three years, and it's generally quite
    > wonderfully transparent.
    >
    > On the programmers' web site stackoverflow.com, I flag questions with
    > the "unicode" tag, and of questions that mention a specific language,
    > Python and C++ seem to come up the most often.
    >
    >> I'll have to say, as far as text processing goes, the most beautiful
    >> lang with respect to unicode is emacs lisp. In elisp code (e.g.
    >> Generate a Web Links Report with Emacs Lisp ), i don't have to declare
    >> none of the unicode or encoding stuff. I simply write code to process
    >> string or buffer text, without even having to know what encoding it
    >> is. Emacs the environment takes care of all that.

    >
    > It's not quite perfect, though. I recently discovered that if I enter a
    > Chinese character using my Mac's Chinese input method, and then enter
    > the same character using a Japanese input method, Emacs regards them as
    > different characters, even though they have the same Unicode code point.
    > For example, from describe-char:
    >
    > character: 一 (43323, #o124473, #xa93b, U+4E00)
    > character: 一 (55404, #o154154, #xd86c, U+4E00)
    >
    > On saving and reverting a file containing such text, the characters are
    > "normalized" to the Japanese version.
    >
    > I suppose this might conceivably be the correct behavior, but it sure
    > was a surprise that (equal "一" "一") can be nil.


    Your headers state:

    User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)

    That's an old version of Emacs, more than 2 years old. 23.1 has been
    released more than a year ago. The current version is 23.2.

    --
    David Kastrup
     
    David Kastrup, Oct 10, 2010
    #5
  6. Xah Lee

    Nobody Guest

    On Sat, 09 Oct 2010 15:45:42 -0700, Sean McAfee wrote:

    >> I'll have to say, as far as text processing goes, the most beautiful
    >> lang with respect to unicode is emacs lisp. In elisp code (e.g.
    >> Generate a Web Links Report with Emacs Lisp ), i don't have to declare
    >> none of the unicode or encoding stuff. I simply write code to process
    >> string or buffer text, without even having to know what encoding it
    >> is. Emacs the environment takes care of all that.

    >
    > It's not quite perfect, though. I recently discovered that if I enter a
    > Chinese character using my Mac's Chinese input method, and then enter
    > the same character using a Japanese input method, Emacs regards them as
    > different characters, even though they have the same Unicode code point.
    > For example, from describe-char:
    >
    > character: 一 (43323, #o124473, #xa93b, U+4E00)
    > character: 一 (55404, #o154154, #xd86c, U+4E00)
    >
    > On saving and reverting a file containing such text, the characters are
    > "normalized" to the Japanese version.


    I don't know about GNU Emacs, but XEmacs doesn't use Unicode internally,
    it uses byte-strings with associated encodings. Some of us like it that
    way, as converting to Unicode may not be reversible, and it's often
    important to preserve exact byte sequences.
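
The point about reversibility can be illustrated outside Emacs. The sketch below uses Python 3 (an assumption of this example; the post itself is about XEmacs) and a hypothetical helper `roundtrips` to compare two standard error handlers: "replace" loses the original bytes, while "surrogateescape" preserves them.

```python
def roundtrips(raw, encoding, errors):
    """Return True if decoding then re-encoding raw yields the same bytes.

    Hypothetical helper for illustration; not from the original post.
    """
    text = raw.decode(encoding, errors=errors)
    return text.encode(encoding, errors=errors) == raw


if __name__ == "__main__":
    raw = b"caf\xe9"  # Latin-1 bytes; not valid UTF-8
    # "replace" substitutes U+FFFD, so the original byte cannot be recovered.
    print("replace:", roundtrips(raw, "utf-8", "replace"))
    # "surrogateescape" smuggles the raw byte through as a lone surrogate
    # and restores it on encoding.
    print("surrogateescape:", roundtrips(raw, "utf-8", "surrogateescape"))
```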

    FWIW, I'd expect Ruby to have worse support for Unicode, as its creator is
    Japanese. Unicode is still far more popular in locales which historically
    used ASCII or "almost ASCII" (e.g. ISO-646-*, ISO-8859-*) encodings than
    in locales which had to use a radically different encoding.
     
    Nobody, Oct 10, 2010
    #6
  7. On Sun, 10 Oct 2010 11:34:02 +0200, David Kastrup wrote:
    [unnecessary quoting removed]
    > Your headers state:
    >
    > User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)


    Please stop spamming multiple newsgroups. I'm sure this is of great
    interest to the Emacs newsgroup, but not to the Python one.

    Followups to /dev/null.

    --
    Steven
     
    Steven D'Aprano, Oct 10, 2010
    #7