Extended ASCII character handling

Discussion in 'Ruby' started by Don Norcott, Nov 17, 2010.

  1. Don Norcott

    Don Norcott Guest

    "200 Millionen Jahre später # 17.39
    \n",
    "200 Millionen Jahre später # 9.87
    3404211707 \n",
    "A l'assaut de l'invisible 1977 # 4.91
    \n",
    "A l'assaut de l'invisible 1990 # 5.18
    226603779 \n",

    The above 4 lines are data I was attempting to load into an array to
    test some code. I was getting what I thought were strange results until
    I realized not all characters were being loaded into the element
    resulting in column alignment problems.

    The data above was cut from a file that had been manipulated a dozen
    times in ruby arrays before being written to a file. So it appears the
    default way ruby handles extended ASCII(?) is fine.

    I have two questions
    1) Should I ever have to worry about data being scraped from web pages
    not being handled correctly by ruby.

    2)How do I flag this data to allow me to manipulate it properly. That is
    load it into an array or write to a file.

    Tried playing with the following but even if the code below is correct
    the extended ascii characters are lost by the time it gets to IRB

    str = String.new
    str.encode(("US-ASCII")
    str = "Millionen Jahre später"

    Any suggestions where I might find some insight.

    Thanks Don

    --
    Posted via http://www.ruby-forum.com/.
     
    Don Norcott, Nov 17, 2010
    #1

  2. On 17.11.2010 17:01, Don Norcott wrote:
    > "200 Millionen Jahre später # 17.39
    > \n",
    > "200 Millionen Jahre später # 9.87
    > 3404211707 \n",
    > "A l'assaut de l'invisible 1977 # 4.91
    > \n",
    > "A l'assaut de l'invisible 1990 # 5.18
    > 226603779 \n",
    >
    > The above 4 lines are data I was attempting to load into an array to
    > test some code. I was getting what I thought were strange results until
    > I realized not all characters were being loaded into the element
    > resulting in column alignment problems.
    >
    > The data above was cut from a file that had been manipulated a dozen
    > times in ruby arrays before being written to a file. So it appears the
    > default way ruby handles extended ASCII(?) is fine.
    >
    > I have two questions
    > 1) Should I ever have to worry about data being scraped from web pages
    > not being handled correctly by ruby.


    Depends how you read the data from webpages.

    > 2)How do I flag this data to allow me to manipulate it properly. That is
    > load it into an array or write to a file.


    You need to set encodings properly. You can do that when opening the
    file. Example:

    irb(main):001:0> io = File.open "x","r"
    => #<File:x>
    irb(main):002:0> io.external_encoding
    => #<Encoding:UTF-8>
    irb(main):003:0> io.internal_encoding
    => nil
    irb(main):004:0> io.read.encoding
    => #<Encoding:UTF-8>
    irb(main):005:0> io.close
    => nil

    irb(main):006:0> io = File.open "x","r:ASCII"
    => #<File:x>
    irb(main):007:0> io.external_encoding
    => #<Encoding:US-ASCII>
    irb(main):008:0> io.internal_encoding
    => nil
    irb(main):009:0> io.read.encoding
    => #<Encoding:US-ASCII>
    irb(main):010:0> io.close
    => nil
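
    A minimal sketch of the same idea as a standalone script, assuming a
    file whose bytes are ISO-8859-1 (the temp file and its "latin1" prefix
    are made up for this illustration). It shows the difference between
    only tagging the data on read and transcoding it to UTF-8 on read:

```ruby
require "tempfile"

# Write the bytes of "später" as ISO-8859-1 (0xE4 is "ä" in that encoding).
# The "latin1" prefix is only a placeholder name for this sketch.
file = Tempfile.new("latin1")
file.binmode
file.write("sp\xE4ter".b)
file.close

# "r:ISO-8859-1" tags what is read as ISO-8859-1 without changing the bytes.
tagged = File.open(file.path, "r:ISO-8859-1") { |io| io.read }

# "r:ISO-8859-1:UTF-8" additionally transcodes to UTF-8 while reading.
converted = File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| io.read }

puts tagged.encoding      # ISO-8859-1
puts tagged.bytesize      # 6
puts converted.encoding   # UTF-8
puts converted.bytesize   # 7

file.unlink
```

    The second form is usually the convenient one if the rest of the
    program works in UTF-8, since every string then carries the same
    encoding.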

    See http://blog.grayproductions.net/articles/understanding_m17n

    > Tried playing with the following but even if the code below is correct
    > the extended ascii characters are lost by the time it gets to IRB
    >
    > str = String.new
    > str.encode(("US-ASCII")
    > str = "Millionen Jahre später"


    This won't work - ever. String#encode returns a new string instead of
    changing str, and you then reassign str to point to yet another
    instance, so all your encoding settings are lost. Also, there is no
    "ä" (or "ü") in ASCII, which is 7-bit!

    irb(main):011:0> s="a"
    => "a"
    irb(main):012:0> s.encoding
    => #<Encoding:UTF-8>
    irb(main):013:0> t = s.encode "ASCII"
    => "a"
    irb(main):014:0> t.encoding
    => #<Encoding:US-ASCII>

    Now with "ü":

    irb(main):015:0> s="ü"
    => "ü"
    irb(main):016:0> s.encoding
    => #<Encoding:UTF-8>
    irb(main):017:0> t = s.encode "ASCII"
    Encoding::UndefinedConversionError: "\xC3\xBC" from UTF-8 to US-ASCII
    from (irb):17:in `encode'
    from (irb):17
    from /usr/local/bin/irb19:12:in `<main>'
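
    As a hedged sketch of the options here (assuming the source file is
    UTF-8, which is the default from Ruby 2.0 on): encode to an 8-bit
    encoding that does contain the character, or tell encode to substitute
    a placeholder instead of raising.

```ruby
s = "später"   # UTF-8 literal, assuming a UTF-8 source file

# String#encode returns a new string; the receiver keeps its encoding.
latin1 = s.encode("ISO-8859-1")
puts s.encoding             # UTF-8
puts latin1.encoding        # ISO-8859-1
puts latin1.bytes.inspect   # [115, 112, 228, 116, 101, 114]

# US-ASCII has no "ä", so a plain encode raises.
begin
  s.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "cannot convert: #{e.message}"
end

# The :undef and :replace options substitute a character instead of raising.
ascii = s.encode("US-ASCII", undef: :replace, replace: "?")
puts ascii                  # sp?ter
```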

    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Nov 17, 2010
    #2

  3. Don Norcott

    Don Norcott Guest

    I am using nokogiri (with Mechanize) to scrape the data and the data I
    am concerned with is extracted only from displayable fields <table
    class="result"> .... </table>

    The code set/language references I see are
    <meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
    Which is I believe, what I am calling Extended ASCII(8 bit 0 - 255)

    AND

    //<![CDATA[ var awsDomain = 'xxxxxxxx.xxx';
    var surveyLink = "sm=93_2fjk6BaUHEqrn2qpdbknQ_3d_d"
    var twoLetterISOCode = 'en'; //]]>

    The scraped data has never caused a problem within the ruby program
    (would have been very obvious). Can I safely assume that code sets will
    never present a problem for this specific application as long as the
    retrieval methods do not change?
    =========================
    =========================

    That being said, when I open the file with io it reports
    #<Encoding:IBM437>, which would contain the characters giving problems
    (but not their correct representation). That is to say, in IBM437
    character E4 is a graphic character, not the accented 'ä' in
    "später". The graphic is what is also being displayed in the IRB
    console.
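
    The mismatch described above can be reproduced with a single byte. This
    is only a sketch, assuming the byte in question really is 0xE4: the
    same byte is relabelled with two different encodings and then converted
    to UTF-8 for comparison.

```ruby
byte = "\xE4".b   # one raw byte, tagged as binary (ASCII-8BIT)

# force_encoding only relabels the bytes; encode then converts them to UTF-8.
as_latin1 = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_ibm437 = byte.dup.force_encoding("IBM437").encode("UTF-8")

puts as_latin1    # ä
puts as_ibm437    # Σ

# Going the other way, "ä" lives at a different byte value in IBM437.
puts "ä".encode("IBM437").bytes.inspect   # [132]
```

    So the data is intact; it is only being interpreted (and displayed)
    with the wrong code page.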

    I have gone through most of the Shades of Gray link and the only thing
    that I thought might have been of value is the LC_TYPE but either UTF-8
    or ISO-8859-1 both work identically in my situation. I have removed
    LC_TYPE since there is no problem with internal data and it might cause
    a problem down the line when I have forgotten about it.

    Also tried saving code & data to a file and running the file (ruby
    xxx.rb) and still reports a multibyte error.

    Played with ruby command line encoding settings (ruby -E XXX) and still
    received errors regardless of the code set I picked. This may be
    related to LC_TYPE: I did not reboot, so it may still be in effect?

    Error is
    CodeSet.rb:4: invalid multibyte char (US-ASCII) which is 7 bit.

    Extended ASCII code sets ISO-8859 & IBM437 are 8 bit but I cannot seem
    to set this.
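
    For reference, that error comes from the source file itself: Ruby 1.9
    reads a .rb file as US-ASCII unless told otherwise. A minimal sketch of
    the usual fix, assuming the editor saves the file as UTF-8 (if it saves
    ISO-8859-1, the comment would name that encoding instead):

```ruby
# encoding: utf-8
# The magic comment above must be the first line of the file (or the second,
# after a shebang line). Ruby 2.0 and later default to UTF-8 source anyway.

str = "Millionen Jahre später"
puts __ENCODING__     # source encoding of this file
puts str.encoding     # encoding tagged on the literal
puts str.length       # 22 characters
puts str.bytesize     # 23 bytes, because "ä" takes two bytes in UTF-8
```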


    =======================
    =======================

    I can edit the data file externally and read the data into an array
    without problems.
    So will assume no need to pursue the code set settings at this time.

    Will not update unless I have a revelation.

    By the way, the recommended link was excellent; I will save the URL as
    a resource.

    Don Norcott, Nov 17, 2010
    #3
  4. Don Norcott wrote in post #962171:
    > I have two questions
    > 1) Should I ever have to worry about data being scraped from web pages
    > not being handled correctly by ruby.


    In ruby 1.9, you have to worry about this very much.

    Strings in ruby 1.9 are two-dimensional: they have a sequence of bytes,
    and they have an encoding. There are additional 'dimensions' based on
    the string's content - empty, ascii_compatible, valid_encoding.
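
    A small sketch of what that means in practice: the same bytes, tagged
    with two different encodings, answer these questions differently.

```ruby
bytes = "sp\xC3\xA4ter".b   # raw bytes, tagged ASCII-8BIT (binary)

# Relabel copies of the same bytes; nothing is converted here.
utf8  = bytes.dup.force_encoding("UTF-8")
ascii = bytes.dup.force_encoding("US-ASCII")

puts utf8.valid_encoding?             # true
puts utf8.length                      # 6
puts ascii.valid_encoding?            # false (bytes above 127 are not ASCII)
puts ascii.length                     # 7
puts utf8.ascii_only?                 # false
puts utf8.encoding.ascii_compatible?  # true
```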

    If your scraper library doesn't document how it chooses the encodings to
    tag each string it returns, and doesn't document how it handles invalid
    encodings if it comes across them, then you have to test its behaviour
    for all the various edge cases.

    You never have this issue with ruby 1.8, because a string is just a
    string of bytes. Of course, the "garbage in, garbage out" principle
    still applies; you just don't choke on the garbage.

    > 2)How do I flag this data to allow me to manipulate it properly. That is
    > load it into an array or write to a file.


    That's a short question with a long answer, and I'm afraid my own
    attempt to answer it is incomplete:
    https://github.com/candlerb/string19/blob/master/string19.rb

    If you're reading stuff from a file or a socket yourself, you can
    control the process. If you're trusting a third-party library to fetch
    data from somewhere, then you have to trust that library to do the right
    thing in the situations you're interested in.

    > Tried playing with the following but even if the code below is correct
    > the extended ascii characters are lost by the time it gets to IRB


    irb is not a good predictor of encoding behaviour for ruby 1.9, and
    you'd be better off writing standalone .rb scripts that you run.

    Note that it's one of the 1.9 language inconsistencies that transcoding
    is *not* done on output by default. So if you have read a string from
    a file, and carefully tagged it as, say, UTF-8, but your terminal is
    IBM437, then

    puts my_string

    will just squirt the UTF-8 bytes to the terminal and they'll display
    wrongly. You can try something like this:

    STDOUT.set_encoding "IBM437"
    or
    STDOUT.set_encoding "locale"
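
    The effect of giving an output stream an external encoding can be
    checked without a terminal. This is only a sketch using temp files in
    place of STDOUT (the "raw" and "dos" names are made up for the
    illustration):

```ruby
require "tempfile"

s = "später"   # UTF-8 string, assuming a UTF-8 source file

# No encoding on the stream: the UTF-8 bytes are written unchanged.
raw = Tempfile.new("raw")
raw.close
File.open(raw.path, "w") { |io| io.write(s) }

# External encoding IBM437: Ruby transcodes while writing.
dos = Tempfile.new("dos")
dos.close
File.open(dos.path, "w:IBM437") { |io| io.write(s) }

raw_bytes = File.binread(raw.path).bytes
dos_bytes = File.binread(dos.path).bytes
puts raw_bytes.inspect   # [115, 112, 195, 164, 116, 101, 114]
puts dos_bytes.inspect   # [115, 112, 132, 116, 101, 114]

raw.unlink
dos.unlink
```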

    Regards,

    Brian.

    Brian Candler, Nov 18, 2010
    #4
