handling special characters

Discussion in 'Ruby' started by Sajal Kayan, Aug 4, 2008.

  1. Sajal Kayan

    Sajal Kayan Guest

    Hi all.

    I am very new to Ruby (5 days old), so my question might sound very
    noobish. I am posting it only because I couldn't find a solution.

    I am using Ruby to scrape the content of a site.

    To be precise I am having problems with the ’ character.

    Sample source page :
    http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023

    The encoding is tis-620 and I use Iconv to convert it to utf8;
    however, the special quote character makes iconv raise the following error:

    /home/....../main.rb:37:in `iconv': "\222s announcement "...
    (Iconv::IllegalSequence)

    The affected code:

    body = story.search("//font[@color=\"#333333\"]").inner_html
    body = body.gsub(/<(.|\n)+?>/, "")
    body = body.gsub(/’/, "\'")
    puts body
    body = Iconv.iconv("utf8", "tis-620", body) #<-- this is line 37
    puts body

    Or try the following in irb:

    require 'rubygems'
    require 'net/http'
    require 'open-uri'
    require 'iconv'
    story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023'))
    body = story.search("//font[@color=\"#333333\"]").inner_html
    body = body.gsub(/<(.|\n)+?>/, "")
    body = body.gsub(/’/, "\'")
    puts body

    No matter what I put in place of the "’", it doesn't replace anything and
    iconv still raises the error.

    I am looking for pointers on one of the following.

    1) How do I replace "’" with "'"?
    or 2) How can I make iconv ignore the "’"?

    At first I thought this was an I18n issue, but I guess getting rid of
    the special character should be a simple string manipulation that I don't
    get.
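A likely reason the gsub never matches: the error message shows the quote arriving as the single byte "\222" (0x92), while a ’ typed into a UTF-8 terminal or source file is the three bytes E2 80 99. A minimal sketch of matching the raw byte instead, assuming a Ruby with `String#b` (2.0 or later); the sample string is hypothetical and stands in for the scraped body:

```ruby
# Hypothetical sample bytes standing in for the scraped page body.
raw = "Today\x92s announcement".b

# The UTF-8 form of the right single quote is three bytes, none of them 0x92,
# so a pattern containing a typed ’ cannot match the raw byte.
utf8_quote_bytes = "\u2019".bytes

# Match the single byte 0x92 directly and swap in an ASCII apostrophe.
cleaned = raw.gsub("\x92".b, "'")

puts utf8_quote_bytes.inspect
puts cleaned
```

Replacing the byte before conversion sidesteps the iconv error for this one character, but any other byte outside the declared charset would still need the same treatment.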
    --
    Posted via http://www.ruby-forum.com/.
     
    Sajal Kayan, Aug 4, 2008
    #1

  2. Sajal Kayan

    Sajal Kayan Guest

    Oh, and you would also need to
    require 'mechanize'

    in irb to reproduce the issue:


    require 'rubygems'
    require 'net/http'
    require 'open-uri'
    require 'mechanize'
    require 'iconv'
    story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023'))
    body = story.search("//font[@color=\"#333333\"]").inner_html
    body = body.gsub(/<(.|\n)+?>/, "")
    body = body.gsub(/’/, "\'")
    puts body
    --
    Posted via http://www.ruby-forum.com/.
     
    Sajal Kayan, Aug 4, 2008
    #2

  3. Sajal Kayan

    Sajal Kayan Guest

    Heesob Park wrote:

    > The ’ character (0x92) is not in the tis-620 character set, but it is
    > in windows-874.
    >
    > Refer to
    > http://www.langbox.com/codeset/tis620.html
    > http://www.microsoft.com/globaldev/reference/sbcs/874.mspx
    >
    > Try
    > body = Iconv.iconv("utf-8", "windows-874", body).join
    >
    > Regards,
    >
    > Park Heesob



    Awesome, it works like a charm now. Thanks for the prompt response.

    It seems the source site was declaring the wrong encoding in its HTML headers.

    You saved me from going bald :D
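For readers on a newer Ruby where Iconv is no longer part of the standard library, the same fix can be sketched with `String#encode`. This is an illustration under assumptions, not the code from the thread: the sample string is hypothetical, and it treats the bytes as windows-874 just as the suggested Iconv call does.

```ruby
# Hypothetical sample bytes standing in for the scraped page body.
raw = "Today\x92s announcement".b

# Label the bytes as windows-874 (where 0x92 is the right single quote)
# and transcode them to UTF-8.
utf8 = raw.dup.force_encoding("Windows-874").encode("UTF-8")

# Once the text is UTF-8, replacing the curly quote is a plain gsub.
plain = utf8.gsub("\u2019", "'")

puts utf8
puts plain
```

After the conversion, the original question 1 (replacing ’ with ') becomes an ordinary string substitution, because the pattern and the text are now in the same encoding.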
    --
    Posted via http://www.ruby-forum.com/.
     
    Sajal Kayan, Aug 4, 2008
    #3
