Encoding issues when parsing HTML in 1.9

ctdev · Mar 30, 2011

Hi, I'm having some encoding problems while parsing HTML with Nokogiri
in 1.9.

I was first getting errors on non-breaking space characters (code
160), but managed to resolve this by setting the encoding at the top
of my script file ('# coding: utf-8').

However now I'm trying to do simple string substitution with gsub()
and am getting the error:

invalid byte sequence in UTF-8

An example of where this is bombing is the word "PROT\xC9GÉ" as parsed
by Nokogiri. Removing the encoding setting from my script causes the
original problems, so I seem to be stuck.
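
For anyone trying to reproduce this, here is a minimal sketch of the failure. The string is an assumption built from the word quoted above: a raw 0xC9 byte (É in the Western single-byte encodings) followed later by a correctly encoded UTF-8 É. It assumes the script's source encoding is UTF-8.

```ruby
# encoding: utf-8

# 0xC9 on its own is not a complete UTF-8 sequence, so the string is
# labelled UTF-8 but its bytes do not fit that label.
s = "PROT\xC9G\xC3\x89"   # \xC3\x89 is a valid UTF-8 "É"

puts s.encoding           # the label Ruby attached to the string
puts s.valid_encoding?    # whether the bytes actually fit that label

begin
  # Regexp matching has to walk the string character by character,
  # which is where the broken byte gets noticed.
  s.gsub(/G/, "g")
rescue ArgumentError => e
  puts "gsub failed: #{e.message}"
end
```

Checking valid_encoding? right after reading the data is a cheap way to find out which strings will blow up before gsub() does.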

Has anybody worked through these issues successfully? Google turns up
a number of discussions without many solutions.

brabuhr · Mar 30, 2011

However now I'm trying to do simple string substitution with gsub()
and am getting the error:

invalid byte sequence in UTF-8

An example of where this is bombing is the word "PROT\xC9GÉ" as parsed
by Nokogiri.

What is the encoding of your input HTML file?

ctdev · Mar 30, 2011

Hello,

What is the encoding of your input HTML file?

Opening one of the files in IRB and checking external_encoding.name
returns "UTF-8".

This is from a group of pages I scraped with Hpricot (before switching
to Nokogiri) and saved locally.

The site itself comes from a Microsoft environment and there seems to
be much weirdness in the files. I'll need to anticipate and
accommodate that in my code.

I wonder if I might have better luck building the scraping portion of
my app in a different language (though I'd rather stick with Ruby).

ctdev · Mar 30, 2011

I also tried the following on a test string:

s.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

But it doesn't seem to replace the invalid character(s), the very
one(s) it's complaining about!

So I'm stuck because I'm getting the "invalid byte sequence" error,
yet the above function won't replace the invalid bytes.

TFM says:

":invalid : If the value is :replace, encode replaces invalid byte
sequences in str with the replacement character"

That's exactly what I'm trying to do but it isn't working. It isn't
replacing the invalid byte sequence it's complaining about with the
replacement character.
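
A possible way around this, sketched under assumptions: String#scrub exists only from Ruby 2.1 onward, so it is not available on the 1.9 interpreter discussed here. On older Rubies, a round trip through a different encoding (UTF-16BE is used below purely as an intermediate) forces a real conversion, so the :invalid option gets a chance to act. The test string is the same assumed one as in the original report.

```ruby
# encoding: utf-8

s = "PROT\xC9G\xC3\x89"   # raw 0xC9, then a valid UTF-8 "É"

# Ruby 2.1+: replace each invalid byte sequence with "?".
scrubbed = s.scrub("?")

# Older Rubies: convert to another encoding and back. The conversion to
# UTF-16BE has to decode the UTF-8 input, so the invalid byte is replaced.
round_trip = s.encode("UTF-16BE", :invalid => :replace, :replace => "?").
               encode("UTF-8")

puts scrubbed
puts round_trip
puts scrubbed.valid_encoding?
```

Either way the bad byte is thrown away rather than recovered, so this only makes sense when the original character does not matter.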

Robert Klemme · Mar 30, 2011

Opening one of the files in IRB and checking external_encoding.name
returns "UTF-8".

That was not the question. He wanted to know the encoding of the
_file_. You should be able to identify this from the HTTP response.

This is from a group of pages I scraped with Hpricot (before switching
to Nokogiri) and saved locally.

The site itself comes from a Microsoft environment and there seems to
be much weirdness in the files. I'll need to anticipate and
accommodate that in my code.

Weirdness with regard to encodings or other weirdness?

I wonder if I might have better luck building the scraping portion of
my app in a different language (though I'd rather stick with Ruby).

IMHO it is usually simpler to stay in one ecosystem. If the server
sends the correct encoding I would expect Hpricot and Nokogiri to
treat the file properly. If you fetched the files with a pre 1.9
version then maybe you have to refetch them.

Cheers

robert

brabuhr · Mar 30, 2011

Opening one of the files in IRB and checking external_encoding.name
returns "UTF-8".

That doesn't detect the true file encoding (indeed, the file is either
in a different encoding or the file is corrupt, hence your invalid
byte sequence).

http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

ruby -v -e 'puts File.open("/etc/passwd").external_encoding'
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
US-ASCII

LC_CTYPE=ja_JP.sjis ruby -v -e 'puts File.open("/etc/passwd").external_encoding'
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
Shift_JIS
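
The same point can be shown from the other side with a small sketch: external_encoding reports whatever label the file was opened with, and only valid_encoding? on the data says whether the bytes fit. The helper name encoding_report and the sample bytes are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: open a file under a given label and report the
# label Ruby uses plus whether the bytes read are valid under it.
def encoding_report(path, label)
  File.open(path, "r:#{label}") do |f|
    data = f.read
    [f.external_encoding, data.valid_encoding?]
  end
end

Tempfile.create("page") do |tmp|
  tmp.binmode
  tmp.write("PROT\xC9G\xC9".b)   # Windows-1252 bytes for "PROTÉGÉ"
  tmp.flush

  p encoding_report(tmp.path, "UTF-8")         # label accepted, bytes invalid
  p encoding_report(tmp.path, "Windows-1252")  # label accepted, bytes valid
end
```

Note that a valid result for a single-byte encoding proves little, since nearly any byte is acceptable there. An invalid result for UTF-8 is the useful signal.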

I wonder if I might have better luck building the scraping portion of
my app in a different language (though I'd rather stick with Ruby).

Well, another language might ignore the invalid characters so it would
look like it worked fine, but your output could actually be invalid.

brabuhr · Mar 30, 2011

I also tried the following on a test string:

s.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

But it doesn't seem to replace the invalid character(s)

Could that be an optimization in encode: since the string is already
thought to be UTF-8, just return it?

s = "PROT\xC9GÉ"
=> "PROT\xC9G\u00C9"
s.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "PROT\xC9G\u00C9"

s.
  encode('ISO8859-9', :invalid => :replace, :undef => :replace, :replace => "#").
  encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "PROT#G\u00C9"
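
If the bytes are really in a Western single-byte encoding, another option is to relabel them instead of replacing them. The sketch below assumes the whole string is Windows-1252 (an assumption, not something the session above establishes), so both É characters are raw 0xC9 bytes.

```ruby
# encoding: utf-8

# Raw bytes as they might arrive from the page; .b gives a binary copy.
raw = "PROT\xC9G\xC9".b

# force_encoding only changes the label, it does not touch the bytes.
# encode then does the real conversion from Windows-1252 to UTF-8.
utf8 = raw.force_encoding("Windows-1252").encode("UTF-8")

puts utf8
puts utf8.valid_encoding?
puts utf8.gsub(/\u00C9/, "E")   # regexp operations work again
```

Unlike the replace approach, this keeps the accented characters, but it gives garbage if the guess about the source encoding is wrong.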

ctdev · Mar 30, 2011

Could that be an optimization in encode: since the string is already
thought to be UTF-8, just return it?

Not sure, it isn't obvious (to me) looking at encode()'s source.

There's no charset specified in the response headers from IIS. The
Content-Type meta tag specifies "text/html; charset=UTF-8" though I'm
not sure if Firefox respects that.

`file -I` on one of the downloaded files displays "text/html;
charset=unknown-8bit."

Firefox is choosing UTF-8 but the special characters aren't displayed
properly. Switching from within the browser to one of the Western
encodings displays the characters correctly (as mentioned this is all
MS stuff and I assume people just copy and paste from MS Office).

brabuhr · Mar 30, 2011

Firefox is choosing UTF-8 but the special characters aren't displayed
properly. Switching from within the browser to one of the Western
encodings displays the characters correctly (as mentioned this is all
MS stuff and I assume people just copy and paste from MS Office).

Okay, try specifying that encoding when you parse it with Nokogiri?

ctdev · Mar 30, 2011

Okay, try specifying that encoding when you parse it with Nokogiri?

I resolved this problem by opening and rewriting the original files
with a specified mode as described in Overbryd's answer:

http://stackoverflow.com/questions/...t-a-string-from-windows-1252-to-utf-8-in-ruby

So:

old = File.open("old", "r:windows-1252:utf-8")
new = File.open("new", "w+:utf-8") {|f| f.write(old.read)}

Everything works now. The characters were all converted and I was able
to remove the encoding directive and non-breaking space literals from
my script by using '\u00A0' in the regex I'm passing to the split
function.
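
For reference, the same conversion can be sketched with block forms so that both files get closed, followed by the split on a non-breaking space. The helper name rewrite_as_utf8, the temporary paths and the sample bytes are all invented for this example, and Windows-1252 as the source encoding is taken from the fix above.

```ruby
# encoding: utf-8
require "tmpdir"

# Hypothetical helper: read a Windows-1252 file, transcoding to UTF-8 on
# the way in, and write the result out as UTF-8.
def rewrite_as_utf8(old_path, new_path)
  text = File.open(old_path, "r:windows-1252:utf-8") { |f| f.read }
  File.open(new_path, "w:utf-8") { |f| f.write(text) }
  text
end

Dir.mktmpdir do |dir|
  old_path = File.join(dir, "old")
  new_path = File.join(dir, "new")

  # 0xC9 is "É" and 0xA0 is the non-breaking space in Windows-1252.
  File.binwrite(old_path, "PROT\xC9G\xC9\xA0ok".b)

  rewrite_as_utf8(old_path, new_path)
  converted = File.read(new_path, :encoding => "utf-8")

  p converted.valid_encoding?
  p converted.split(/\u00A0/)   # split on the non-breaking space
end
```

Writing \u00A0 inside the regexp keeps the script itself pure ASCII, which is why the encoding directive at the top of the original script was no longer needed.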

Thanks for the help.

ctdev · Mar 30, 2011

Okay, try specifying that encoding when you parse it with Nokogiri?

And you're right, I'll have to see if/how that translates to Nokogiri
for future downloads.

brabuhr · Mar 30, 2011

I resolved this problem by opening and rewriting the original files
with a specified mode as described in Overbryd's answer:

http://stackoverflow.com/questions/...t-a-string-from-windows-1252-to-utf-8-in-ruby

So:

old = File.open("old", "r:windows-1252:utf-8")
new = File.open("new", "w+:utf-8") {|f| f.write(old.read)}

Cool; thanks
