Extended ASCII character handling

Don Norcott

"200 Millionen Jahre später # 17.39 \n",
"200 Millionen Jahre später # 9.87 3404211707 \n",
"A l'assaut de l'invisible 1977 # 4.91 \n",
"A l'assaut de l'invisible 1990 # 5.18 226603779 \n",

The above 4 lines are data I was attempting to load into an array to
test some code. I was getting what I thought were strange results until
I realized not all characters were being loaded into the element
resulting in column alignment problems.

The data above was cut from a file that had been manipulated a dozen
times in ruby arrays before being written to a file. So it appears the
default way ruby handles extended ASCII(?) is fine.

I have two questions:
1) Should I ever have to worry about data being scraped from web pages
not being handled correctly by ruby?

2) How do I flag this data to allow me to manipulate it properly, that is,
load it into an array or write it to a file?

I tried playing with the following, but even if the code below is correct,
the extended ASCII characters are lost by the time it gets to IRB.

str = String.new
str.encode("US-ASCII")
str = "Millionen Jahre später"

Any suggestions on where I might find some insight?

Thanks Don

 
Robert Klemme

I have two questions:
1) Should I ever have to worry about data being scraped from web pages
not being handled correctly by ruby?

That depends on how you read the data from the web pages.

2) How do I flag this data to allow me to manipulate it properly, that is,
load it into an array or write it to a file?

You need to set encodings properly. You can do that when opening the
file. Example:

irb(main):001:0> io = File.open "x","r"
=> #<File:x>
irb(main):002:0> io.external_encoding
=> #<Encoding:UTF-8>
irb(main):003:0> io.internal_encoding
=> nil
irb(main):004:0> io.read.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> io.close
=> nil

irb(main):006:0> io = File.open "x","r:ASCII"
=> #<File:x>
irb(main):007:0> io.external_encoding
=> #<Encoding:US-ASCII>
irb(main):008:0> io.internal_encoding
=> nil
irb(main):009:0> io.read.encoding
=> #<Encoding:US-ASCII>
irb(main):010:0> io.close
=> nil
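
A variation on the sessions above, as a minimal sketch: the open mode can also name an internal encoding, so Latin-1 bytes are transcoded to UTF-8 while they are read. The temp file, its contents and the helper name read_as_utf8 are made up for illustration; only the "r:EXTERNAL:INTERNAL" mode string is Ruby's own.

```ruby
require "tempfile"

# Hypothetical helper: read a whole file, declaring its on-disk (external)
# encoding and asking Ruby to hand back UTF-8 (internal) strings.
def read_as_utf8(path, external = "ISO-8859-1")
  File.open(path, "r:#{external}:UTF-8") { |io| io.read }
end

# Sample data for the sketch: in ISO-8859-1, "ä" is the single byte 0xE4.
sample = Tempfile.new("latin1_sample")
sample.binmode
sample.write("200 Millionen Jahre sp\xE4ter\n".b)
sample.close

text = read_as_utf8(sample.path)
sample.unlink

puts text.encoding         # the internal encoding requested in the mode string
puts text.valid_encoding?
puts text.bytesize         # one byte longer than on disk: "ä" is two bytes in UTF-8
```

If the declared external encoding does not match the real bytes, the result is mislabelled or mis-transcoded text, so the mode string only helps when you know what the file really contains.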

See http://blog.grayproductions.net/articles/understanding_m17n

Tried playing with the following but even if the code below is correct
the extended ascii characters are lost by the time it gets to IRB

str = String.new
str.encode("US-ASCII")
str = "Millionen Jahre später"

This won't work - ever. str.encode returns a new string and leaves str
unchanged, and you then reassign str to point to another instance, so all
your encoding settings are lost. Also, there is no "ä" or "ü" in ASCII,
which is 7-bit!

irb(main):011:0> s="a"
=> "a"
irb(main):012:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> t = s.encode "ASCII"
=> "a"
irb(main):014:0> t.encoding
=> #<Encoding:US-ASCII>

Now with "ü":

irb(main):015:0> s="ü"
=> "ü"
irb(main):016:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):017:0> t = s.encode "ASCII"
Encoding::UndefinedConversionError: "\xC3\xBC" from UTF-8 to US-ASCII
from (irb):17:in `encode'
from (irb):17
from /usr/local/bin/irb19:12:in `<main>'
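
If substituting or dropping such characters is acceptable, String#encode takes options for that. A small sketch, assuming the UTF-8 input from the thread; the substitution table is only an example and not part of the original data.

```ruby
utf8 = "Millionen Jahre sp\u00E4ter"   # "später", written with an escape

# Strict conversion raises, because "ä" has no US-ASCII equivalent.
begin
  utf8.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "strict conversion failed: #{e.class}"
end

# Replace every unrepresentable character with a placeholder.
lossy = utf8.encode("US-ASCII", undef: :replace, replace: "?")

# Or supply explicit substitutions through :fallback (illustrative mapping).
translit = utf8.encode("US-ASCII", fallback: { "\u00E4" => "ae", "\u00FC" => "ue" })

# ISO-8859-1 does contain "ä", so this conversion loses nothing.
latin1 = utf8.encode("ISO-8859-1")

puts lossy
puts translit
p latin1.bytes.last(4)   # the final four bytes, starting with the single byte for "ä"
```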

Kind regards

robert
 
Don Norcott

I am using nokogiri (with Mechanize) to scrape the data, and the data I
am concerned with is extracted only from displayable fields: <table
class="result"> .... </table>

The code set/language references I see are
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
which is, I believe, what I am calling Extended ASCII (8-bit, 0 - 255)

AND

//<![CDATA[ var awsDomain = 'xxxxxxxx.xxx';
var surveyLink = "sm=93_2fjk6BaUHEqrn2qpdbknQ_3d_d"
var twoLetterISOCode = 'en'; //]]>

The scraped data has never caused a problem within the ruby program
(it would have been very obvious). Can I safely assume that code sets will
never present a problem for this specific application as long as the
retrieval methods do not change?
=========================
=========================

That being said, when I open the file with io it reports
#<Encoding:IBM437>, which would contain the characters giving problems
(but not their correct representation). That is to say, the IBM437
character E4 is a graphic character, not the 'ä' (a with umlaut) in
"später". The graphic is also what is being displayed in the IRB
console.
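
One way to check what a given byte means under each code set is to tag the raw bytes with an encoding and transcode them to UTF-8. This is only a sketch: the helper name reinterpret is made up, and it assumes the file on disk holds either the UTF-8 bytes C3 A4 or the single Latin-1 byte E4 for "ä".

```ruby
# Hypothetical helper: treat raw bytes as text in `encoding`, then convert
# that text to UTF-8 so the results can be compared directly.
def reinterpret(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode("UTF-8")
end

utf8_bytes  = "\xC3\xA4".b   # "ä" as stored in a UTF-8 file
latin1_byte = "\xE4".b       # "ä" as stored in an ISO-8859-1 file

puts reinterpret(utf8_bytes, "UTF-8")        # ä
puts reinterpret(utf8_bytes, "IBM437")       # ├ñ  (a box-drawing character plus ñ)
puts reinterpret(latin1_byte, "ISO-8859-1")  # ä
puts reinterpret(latin1_byte, "IBM437")      # Σ
```

The bytes never change here, only the label attached to them, which is why the same file can look right in one tool and wrong in a console that uses another code page.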

I have gone through most of the Shades of Gray link, and the only thing that
I thought might have been of value is LC_CTYPE, but UTF-8 and
ISO-8859-1 both work identically in my situation. I have removed
LC_CTYPE since there is no problem with internal data and it might cause
a problem down the line when I have forgotten about it.

Also tried saving code & data to a file and running the file (ruby
xxx.rb), and it still reports a multibyte error.

Played with the ruby command line encoding settings (ruby -E XXX) and still
received errors regardless of the code set I picked. This may be related to
LC_CTYPE, as I did not reboot, so it may still be in effect?

The error is
CodeSet.rb:4: invalid multibyte char (US-ASCII)
and US-ASCII is 7 bit.

The extended ASCII code sets ISO-8859 & IBM437 are 8 bit, but I cannot seem to
set this.


=======================
=======================

I can edit the data file externally and read the data into an array
without problems.
So will assume no need to pursue the code set settings at this time.

Will not update unless I have a revelation.

By the way, the recommended link was excellent; I will save the URL as a resource.

 
Brian Candler

Don Norcott wrote in post #962171:
I have two questions:
1) Should I ever have to worry about data being scraped from web pages
not being handled correctly by ruby?

In ruby 1.9, you have to worry about this very much.

Strings in ruby 1.9 are two-dimensional: they have a sequence of bytes,
and they have an encoding. There are additional 'dimensions' based on
the string's content - empty?, ascii_only?, valid_encoding?.
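
A short sketch of those dimensions, using "später" from the thread as sample data. Nothing here is specific to any scraper library; it only shows how the same bytes behave under different tags.

```ruby
s = "sp\u00E4ter"                       # UTF-8: 6 characters, 7 bytes

puts s.encoding
puts s.length
puts s.bytesize

# Same bytes, different tag: force_encoding changes only the label.
relabelled = s.dup.force_encoding("ISO-8859-1")
puts relabelled.bytes == s.bytes        # true, the bytes are untouched
puts relabelled.length                  # 7, every byte now counts as a character
puts relabelled == s                    # false, the encodings are not comparable

# Content-based properties.
broken = "\xE4".b.force_encoding("UTF-8")
puts broken.valid_encoding?             # false, a lone 0xE4 is not valid UTF-8
puts "abc".ascii_only?                  # true
puts s.ascii_only?                      # false
puts "".empty?                          # true
```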

If your scraper library doesn't document how it chooses the encodings to
tag each string it returns, and doesn't document how it handles invalid
encodings if it comes across them, then you have to test its behaviour
for all the various edge cases.

You never have this issue with ruby 1.8, because a string is just a
string of bytes. Of course, the "garbage in, garbage out" principle
still applies; you just don't choke on the garbage.

2) How do I flag this data to allow me to manipulate it properly, that is,
load it into an array or write it to a file?

That's a short question with a long answer, and I'm afraid my own
attempt to answer it is incomplete:
https://github.com/candlerb/string19/blob/master/string19.rb

If you're reading stuff from a file or a socket yourself, you can
control the process. If you're trusting a third-party library to fetch
data from somewhere, then you have to trust that library to do the right
thing in the situations you're interested in.

Tried playing with the following but even if the code below is correct
the extended ascii characters are lost by the time it gets to IRB

irb is not a good predictor of encoding behaviour for ruby 1.9, and
you'd be better off writing standalone .rb scripts that you run.
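
One thing a standalone 1.9 script needs is a source-encoding declaration. The "invalid multibyte char (US-ASCII)" error reported earlier in the thread is what 1.9 prints when a source file contains non-ASCII bytes and has no magic comment, because 1.9 assumes US-ASCII source. A minimal sketch, assuming the script file itself is saved as UTF-8 (Ruby 2.0 and later default to UTF-8 source, so the comment is then redundant but harmless):

```ruby
# encoding: utf-8
# The magic comment above has to be the first line of the file
# (or the second, when the first line is a #! line).

title = "200 Millionen Jahre später"

puts __ENCODING__      # the source encoding the parser used for this file
puts title.encoding    # string literals inherit the source encoding
puts title.length      # characters
puts title.bytesize    # bytes: one more than length, since "ä" takes two
```

If the file is really saved as ISO-8859-1, the comment would have to say so instead; the declaration describes the file and does not convert it.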

Note that it's one of the 1.9 language inconsistencies that transcoding
is *not* done on output by default. So if you have read a string from
a file, and carefully tagged it as, say, UTF-8, but your terminal is IBM437,
then

puts my_string

will just squirt the UTF-8 bytes to the terminal and they'll display
wrongly. You can try something like this:

STDOUT.set_encoding "IBM437"
or
STDOUT.set_encoding "locale"
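
To see what set_encoding does on the way out, the same call can be made on a file handle whose bytes can be inspected afterwards. This is only a sketch: the temp directory, the file name out.txt and the choice of ISO-8859-1 are assumptions for the demonstration, and a real terminal would need whatever encoding it actually uses.

```ruby
require "tmpdir"

text = "sp\u00E4ter\n"   # UTF-8 in memory: "ä" is the two bytes C3 A4

written = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "out.txt")
  File.open(path, "w") do |io|
    io.set_encoding("ISO-8859-1")   # same call as STDOUT.set_encoding
    io.write(text)                  # transcoded to Latin-1 while writing
  end
  written = File.binread(path)      # raw bytes, no decoding
end

p text.bytes      # the UTF-8 bytes held in memory
p written.bytes   # the bytes that reached the file
```

Without the set_encoding call the file would receive the UTF-8 bytes unchanged, which matches the behaviour described above for puts.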

Regards,

Brian.
 
