Difficulty cleaning oddly encoded whitespace (from MS HTML)

Discussion in 'Perl Misc' started by David R. Throop, Feb 4, 2004.

  1. I'm perplexed. I'm writing a PERL script that reads a single large
    many-sectioned HTML document, breaks it into smaller files and
    extracts some information for another text-manipulation tool to read.
    The first HTML file comes from saving a 150+ page MS-Word file as HTML.

    I'm having fits with some nonstandard whitespace in the HTML file. It
    appears like a long whitespace and acts as a single character, but it
    doesn't patternmatch a \s. When I view it in Emacs, it appears as
    %/1\200\216iso8859-15^B\201 \201 \201

    where \200 \216 ^B and \201 are all single characters. But text
    containing the odd whitespace fails to patternmatch those characters.
    I Googled on iso8859 and found enough to get some idea that I'm
    dealing with some specially encoded character, but everything I found
    assumed I already knew about the encoding.

    All I want to do is to turn this oddspace into regular whitespace.
    Anybody?

    Thanks

    David Throop
     
    David R. Throop, Feb 4, 2004
    #1

  2. Ben Morrow

    (David R. Throop) wrote:
    > I'm perplexed. I'm writing a PERL script that reads a single large
    > many-sectioned HTML document, breaks it into smaller files and
    > extracts some information for another text-manipulation tool to read.
    > The first HTML file comes from saving a 150+ page MS-Word file as HTML.
    >
    > I'm having fits with some nonstandard whitespace in the HTML file. It
    > appears like a long whitespace and acts as a single character, but it
    > doesn't patternmatch a \s. When I view it in Emacs, it appears as
    > %/1\200\216iso8859-15^B\201 \201 \201


    Hmmmm... I bet that's mangled UTF8. What 'iso8859-15' is doing in
    there I'm not sure, but anyhow... Which perl are you using? If you're
    using 5.8, try pushing :utf8 or (better) :encoding(utf8) onto your
    input filehandle. If you're stuck with 5.6, you can try one of the
    Unicode:: modules, but if you're doing character encoding stuff you'd
    be much better off with 5.8.
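
A minimal sketch of what pushing a layer onto an already-open input filehandle might look like, assuming Perl 5.8 or later. The helper name read_utf8_lines and the in-memory sample data are made up for illustration; only binmode with a layer argument comes from Perl itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode an open handle as UTF-8 and return its lines.
sub read_utf8_lines {
    my ($fh) = @_;
    # Push the decoding layer onto the existing handle before reading.
    binmode($fh, ':encoding(UTF-8)') or die "binmode failed: $!";
    my @lines = <$fh>;
    return @lines;
}

# Sample data (an assumption): the UTF-8 bytes C2 A0 encode U+00A0.
my $bytes = "a\xC2\xA0b\n";
open(my $in, '<', \$bytes) or die "Cannot open in-memory handle: $!";
my @lines = read_utf8_lines($in);
close($in);

# Once decoded, the two-byte sequence counts as a single character.
printf "%d characters\n", length $lines[0];
```

Without the layer, Perl would treat the same two bytes as two separate characters, which is one way a pattern can fail to match what looks like a single space.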

    Ben

    --
    It will be seen that the Erewhonians are a meek and long-suffering people,
    easily led by the nose, and quick to offer up common sense at the shrine of
    logic, when a philosopher convinces them that their institutions are not based
    on the strictest morality. [Samuel Butler, paraphrased]
     
    Ben Morrow, Feb 4, 2004
    #2

  3. (David R. Throop) wrote:
    > I'm perplexed. I'm writing a PERL script that reads a single large
    > many-sectioned HTML document, breaks it into smaller files and
    > extracts some information for another text-manipulation tool to read.
    > The first HTML file comes from saving a 150+ page MS-Word file as HTML.
    >
    > I'm having fits with some nonstandard whitespace in the HTML file. It
    > appears like a long whitespace and acts as a single character, but it
    > doesn't patternmatch a \s. When I view it in Emacs, it appears as
    > %/1\200\216iso8859-15^B\201 \201 \201


    I also had some strange behaviour when handling non-English text. Try
    setting the locale to POSIX (on GNU/Linux, run export LC_ALL=POSIX).

    In Perl 5.8.1 or later, you can parse UTF-8 text and output it
    correctly in UTF-8 without using binmode, so the above is not
    necessary then.

    ++imanshu.
     
    Himanshu Garg, Feb 5, 2004
    #3
  4. In article <bvs0se$cm0$>,
    Ben Morrow <> wrote:

    > Which perl are you using? If you're using 5.8, try pushing :utf8 or
    > (better) :encoding(utf8) onto your input filehandle.


    Thanks. I took your suggestion and upgraded to 5.8; needed to, anyways.

    Let me beg one more question (cuz my PERL 5 Camel Book won't tell me.)
    What's the syntax for opening with :encoding(utf8) ? I've tried
    open($FILEname, :encoding(utf8))
    open("$FILEname :encoding(utf8)")
    and a few other variations and I keep losing.

    David Throop

    ------
     
    David R. Throop, Feb 5, 2004
    #4
  5. Petri

    In article <bvufkq$ah$>, David R. Throop says...
    >> Which perl are you using? If you're using 5.8, try pushing :utf8
    >> or (better) :encoding(utf8) onto your input filehandle.


    > Let me beg one more question (cuz my PERL 5 Camel Book won't tell
    > me.)
    > What's the syntax for opening with :encoding(utf8) ? I've tried
    > open($FILEname, :encoding(utf8))
    > open("$FILEname :encoding(utf8)")
    > and a few other variations and I keep losing.


    Try:
    perldoc -f open

    ---8<---
    You may use the three-argument form of open to specify IO
    "layers" (sometimes also referred to as "disciplines") to be
    applied to the handle that affect how the input and output are
    processed (see open and PerlIO for more details). For example

    open(FH, "<:utf8", "file")

    will open the UTF-8 encoded file containing Unicode characters,
    see perluniintro. (Note that if layers are specified in the
    three-arg form then default layers set by the "open" pragma are
    ignored.)
    ---8<---
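
Putting the three-argument form to work on the original problem might look roughly like the sketch below. It assumes the odd whitespace is the no-break space U+00A0, which the thread never confirms, and that the file really is UTF-8. The helper normalize_spaces and the temporary sample file are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: turn no-break spaces (U+00A0) into ordinary spaces.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/\x{00A0}/ /g;
    return $text;
}

# Write a small sample file holding the UTF-8 bytes for U+00A0 (C2 A0).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "Section\xC2\xA01\n";
close($tmp) or die "Cannot close $path: $!";

# Three-argument open: the mode and the layer share the second argument.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
my $line = <$in>;
close($in);

my $clean = normalize_spaces($line);
print $clean;
```

If the character turns out to be something other than U+00A0, only the code point inside normalize_spaces would need to change.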

    Hope this helps!

    Petri
     
    Petri, Feb 8, 2004
    #5