handling special characters

Discussion in 'Ruby' started by Sajal Kayan, Aug 4, 2008.

  1. Sajal Kayan

    Sajal Kayan Guest

    Hi all.

    I am very new to Ruby (5 days in), so my question might sound very
    noobish. I am posting it only because I couldn't find a solution.

    I am using Ruby to scrape the content of a site.

    To be precise, I am having problems with the ’ character.

    Sample source page :
    http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023

    The page's encoding is tis-620 and I use Iconv to convert it to UTF-8.
    However, the special quote character makes Iconv raise the following error:
    /home/....../main.rb:37:in `iconv': "\222s announcement "...
    (Iconv::IllegalSequence)

    The affected code:

    body = story.search("//font[@color=\"#333333\"]").inner_html
    body = body.gsub(/<(.|\n)+?>/, "")
    body = body.gsub(/’/, "\'")
    puts body
    body = Iconv.iconv("utf8", "tis-620", body) #<-- this is line 37
    puts body

    Or try the following on irb

    require 'rubygems'
    require 'net/http'
    require 'open-uri'
    require 'iconv'
    story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023'))
    body = story.search("//font[@color=\"#333333\"]").inner_html
    body = body.gsub(/<(.|\n)+?>/, "")
    body = body.gsub(/’/, "\'")
    puts body

    No matter what I put in place of the "’", gsub doesn't replace anything
    and iconv still gives errors.

    I am looking for pointers on one of the following.

    1) How do I replace "’" with "'"?
    or 2) How can I make iconv ignore the "’"?

    At first I thought this was an I18n issue, but I guess getting rid of
    the special character should be a simple string manipulation that I
    don't get.
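
A possible reason the gsub never matches, sketched under the assumption of a current Ruby with UTF-8 source files: the error message shows "\222" (octal for the single byte 0x92), while a ’ typed into UTF-8 source is three different bytes. The sample string is taken from the error message above, and the variable names are illustrative only.

```ruby
# The Iconv error shows "\222", i.e. the single byte 0x92, so that is what
# the scraped string contains. Sample text copied from that error message.
raw = "\x92s announcement".b          # binary (ASCII-8BIT) string

# A "’" written in UTF-8 is three bytes, none of them 0x92, so a pattern
# built from it cannot match the scraped byte.
utf8_quote_bytes = "\u2019".bytes     # [0xE2, 0x80, 0x99]

# Matching the raw byte directly does work on the binary string.
cleaned = raw.gsub("\x92".b, "'")

puts utf8_quote_bytes.inspect
puts cleaned
```

Replacing the byte before conversion answers question 1, but it treats the symptom only; it assumes 0x92 is the only byte outside tis-620 on the page.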
    --
    Posted via http://www.ruby-forum.com/.
     
    Sajal Kayan, Aug 4, 2008
    #1

  2. Sajal Kayan

    Sajal Kayan Guest

    Oh, and you would also need to
    require 'mechanize'

    in irb to reproduce the issue:


    require 'rubygems'
    require 'net/http'
    require 'open-uri'
    require 'mechanize'
    require 'iconv'
    story = Hpricot(open('http://thainews.prd.go.th/newsenglish/previewnews.php?news_id=255108040023'))
    body = story.search("//font[@color=\"#333333\"]").inner_html
    body = body.gsub(/<(.|\n)+?>/, "")
    body = body.gsub(/’/, "\'")
    puts body
     
    Sajal Kayan, Aug 4, 2008
    #2

  3. Sajal Kayan

    Sajal Kayan Guest

    Heesob Park wrote:

    > The ’ character (0x92) is not in the tis-620 character set but in the
    > windows-874 character set.
    >
    > Refer to
    > http://www.langbox.com/codeset/tis620.html
    > http://www.microsoft.com/globaldev/reference/sbcs/874.mspx
    >
    > Try
    > body = Iconv.iconv("utf-8", "windows-874", body).join
    >
    > Regards,
    >
    > Park Heesob



    Awesome, it works like a charm now. Thanks for the prompt response.

    Seems like the source site was sending the wrong HTML headers.

    You saved me from going bald :D
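
For anyone trying this on a current Ruby, where Iconv is no longer part of the standard library (it was removed in 2.0), the suggested fix can be sketched with String#encode instead. This is a stand-in under stated assumptions, not the code used in the thread: `to_utf8` is a hypothetical helper, and the sample bytes come from the error message in the first post.

```ruby
# Assumption: String#encode (Ruby 1.9+) stands in for Iconv here.
# Sample bytes: 0x92 from the error message, followed by ASCII text.
raw = "\x92s announcement".b

# Hypothetical helper: label the bytes with a source encoding, then transcode.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Declared charset (tis-620): 0x92 has no mapping, so transcoding fails.
tis_result =
  begin
    to_utf8(raw, "TIS-620")
  rescue EncodingError => e
    "failed: #{e.class}"
  end

# Charset suggested in the reply (windows-874): 0x92 maps to U+2019.
win_result = to_utf8(raw, "Windows-874")

# Fallback if the charset cannot be corrected: substitute an apostrophe
# for every byte that cannot be converted.
lossy = raw.dup.force_encoding("TIS-620")
           .encode("UTF-8", invalid: :replace, undef: :replace, replace: "'")

puts tis_result
puts win_result
puts lossy
```

The fallback corresponds to question 2 in the first post, but it hides the mismatch by rewriting every unconvertible byte; relabelling the input as windows-874 keeps the original character.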
     
    Sajal Kayan, Aug 4, 2008
    #3
