How to convert MS Word special characters to HTML codes?

Discussion in 'Ruby' started by Paul, Mar 31, 2012.

  1. Paul

    Paul Guest

    Hi there, I have been poring over a character conversion problem for
    a day now and need some help. I created a Ruby script that scans an
    Excel spreadsheet and puts the content into a custom XML file, which
    works fine. I am using Ruby 1.9.2.

    When I tried importing the XML file into the destination program, it
    failed. It turns out the Excel spreadsheet data had some text copied
    from MS Word, so every now and then there is an embedded 'long
    dash' or ellipsis character that is outside the regular ASCII set, and
    the import function fails on these unexpected non-ASCII bytes.

    I can find these lines and specific characters when the script reads
    the data. What I'd *like* to do is convert these special (Unicode)
    characters to their HTML equivalents. After hours of searching blog
    posts and skimming through old posts here, I am still stuck.

    If I can't find a way to convert these characters, I'll just remove
    them. I'd really like to try and keep them somehow.

    Can someone please point me to some references or offer some
    advice on how I can convert them to HTML or ASCII equivalents?

    Here's an example. In Ruby 1.9.2, I see the following line in my
    output file:

    "* \x85 ellipsis\n"

    -> according to an HTML lookup table, I could replace \x85 with

    Is there an easy way to convert these characters? I've tried the CGI
    and Iconv libraries, and both skip over these characters. I would
    prefer a routine that finds and replaces every special character
    rather than writing a regex for each character myself. I have
    encountered 5 special characters so far, and there might be more as I
    go through the data.
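
    One possible approach, as a minimal sketch rather than a definitive
    answer: the byte \x85 is the ellipsis in Windows-1252, the code page
    MS Word text commonly arrives in, so this assumes the data really is
    Windows-1252. The helper name word_chars_to_html is made up for
    illustration. It reinterprets the bytes, transcodes them to UTF-8, and
    replaces every non-ASCII character with a decimal numeric character
    reference, so no per-character regex is needed.

```ruby
# Hypothetical helper (name is illustrative only).
# Assumption: the raw bytes are Windows-1252, as is typical for text pasted from MS Word.
def word_chars_to_html(raw)
  # Reinterpret the bytes as Windows-1252 and transcode to UTF-8.
  # The few byte values Windows-1252 leaves undefined become "?".
  utf8 = raw.dup.force_encoding("Windows-1252")
            .encode("UTF-8", undef: :replace, replace: "?")

  # Replace each non-ASCII character with a decimal numeric character reference.
  utf8.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

puts word_chars_to_html("* \x85 ellipsis\n")
# prints: * &#8230; ellipsis
```

    Numeric references such as &#8230; are used here instead of named
    entities because they are also valid in XML, where most HTML entity
    names are not defined. If the source bytes turn out to be in another
    encoding, the "Windows-1252" label would need to change accordingly.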



    Paul, Mar 31, 2012

  2. Paul

    Paul Guest

    Never mind. I think I figured out the problem.

    While creating my script, I ran it from SciTE (version 3.0.4 on
    Windows 7). The script behaves differently when run from that
    environment than when run from the command line.

    When I run the script from the command line, everything works
    correctly and the CGI library converts the characters well enough.
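
    The thread does not say what actually differed between the two
    environments. If encoding settings are the suspect, a small diagnostic
    along these lines could be run in both places and the output compared.
    This is only a sketch; encoding_report is a hypothetical helper name.

```ruby
# Hypothetical diagnostic helper (name is illustrative only).
# Reports how Ruby currently sees a string and the process-wide default encoding.
def encoding_report(str)
  {
    default_external: Encoding.default_external.name,
    string_encoding:  str.encoding.name,
    valid:            str.valid_encoding?,
    bytes:            str.bytes.map { |b| format("%02X", b) }.join(" ")
  }
end

# The same byte is invalid when labelled UTF-8 but valid when labelled Windows-1252.
sample = "\x85".dup.force_encoding("UTF-8")
p encoding_report(sample)
p encoding_report(sample.dup.force_encoding("Windows-1252"))
```

    A different default_external value between the editor and the command
    line would be one plausible explanation for the behaviour described,
    though that is an assumption and not something the post confirms.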

    No worries.
    Paul, Apr 2, 2012