Setting encoding of pages in Capybara

James Coglan

[Note: parts of this message were removed to make it a legal post.]

Hi all,

Quick encoding question: say I'm trying to grab data from a Japanese page
using Capybara and Rack::Test, and I get badly encoded text in the response.
e.g. running this script:

require 'rubygems'
require 'capybara'
require 'rack/test'
require 'rack/proxy'

Capybara.default_selector = :css

# Proxy every request to the remote site by rewriting the Host header.
class Japan < Rack::Proxy
  def rewrite_env(env)
    env['HTTP_HOST'] = 'l-tike.com'
    env
  end
end

session = Capybara::Session.new(:rack_test, Japan.new)
session.visit '/pickup/concert_more.html'
puts session.body

You'll see weird characters in the output, and I can't find nodes that
should be there with css/xpath. How do I set the encoding so that Nokogiri
parses the page properly?
 
Mike Dalessio

[Note: parts of this message were removed to make it a legal post.]

Hi,

> Hi all,
>
> Quick encoding question: say I'm trying to grab data from a Japanese page
> using Capybara and Rack::Test, and I get badly encoded text in the
> response.
> e.g. running this script:

First, a quick note: this question is probably more appropriate for the
Capybara or Nokogiri mailing lists. You're likely to get a quicker response
from those groups.

> require 'rubygems'
> require 'capybara'
> require 'rack/test'
> require 'rack/proxy'
>
> Capybara.default_selector = :css
>
> class Japan < Rack::Proxy
>   def rewrite_env(env)
>     env['HTTP_HOST'] = 'l-tike.com'
>     env
>   end
> end
>
> session = Capybara::Session.new(:rack_test, Japan.new)
> session.visit '/pickup/concert_more.html'
> puts session.body

It looks like this page claims (in its header) to be encoded in Shift_JIS,
but the page is actually encoded in UTF-8. libxml's guesses at encoding are
not perfect, and in this case the misleading declaration causes it to trust
the header and use the wrong encoding.

If this page is edited to contain

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

instead of

<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

then all is well.
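
If the remote page can't be edited, the same correction could in principle be applied to the response body before it reaches the parser. What follows is only a minimal sketch in plain Ruby: `fix_charset_declaration` is a hypothetical helper (not part of Capybara, Rack::Test or Nokogiri), and the `page` string is a made-up stand-in for the real response. It assumes the declaration appears literally as `charset=Shift_JIS`, and it only rewrites the declaration when the bytes really are valid UTF-8.

```ruby
# Hypothetical helper: replace a misleading Shift_JIS declaration with UTF-8,
# but only when the document's bytes are actually valid UTF-8.
def fix_charset_declaration(html)
  utf8 = html.dup.force_encoding(Encoding::UTF_8)
  # A genuine Shift_JIS document is almost never valid UTF-8; leave it alone.
  return html unless utf8.valid_encoding?

  utf8.sub(/charset=Shift_JIS/i, 'charset=UTF-8')
end

# Stand-in page (an assumption for illustration): UTF-8 bytes, but a meta tag
# that claims Shift_JIS, as described above.
page = '<html><head>' \
       '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">' \
       '</head><body><p>コンサート</p></body></html>'

fixed = fix_charset_declaration(page)
puts fixed
```

The corrected string can then be handed to whatever does the parsing. Nokogiri also accepts an explicit encoding as the third argument to `Nokogiri::HTML`, which may be a simpler route when you control the parsing step yourself.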


Perhaps someone with more experience than me with non-Western character
sets will have deeper insight into libxml's behavior here?
 
