Difficulty cleaning oddly encoded whitespace (from MS HTML)

David R. Throop · Feb 4, 2004

I'm perplexed. I'm writing a Perl script that reads a single large
many-sectioned HTML document, breaks it into smaller files and
extracts some information for another text-manipulation tool to read.
The first HTML file comes from saving a 150+ page MS-Word file as HTML.

I'm having fits with some nonstandard whitespace in the HTML file. It
looks like one long space and acts as a single character, but it
doesn't match \s in a pattern. When I view it in Emacs, it appears as
%/1\200\216iso8859-15^B\201 \201 \201

where \200, \216, ^B and \201 are all single characters. But text
containing the odd whitespace fails to match a pattern written with
those characters.
I Googled on iso8859 and found enough to get some idea that I'm
dealing with some specially encoded character, but everything I found
assumed I already knew about the encoding.

All I want to do is to turn this oddspace into regular whitespace.
Anybody?

Thanks

David Throop

Ben Morrow · Feb 4, 2004

David R. Throop said:
I'm perplexed. I'm writing a Perl script that reads a single large
many-sectioned HTML document, breaks it into smaller files and
extracts some information for another text-manipulation tool to read.
The first HTML file comes from saving a 150+ page MS-Word file as HTML.

I'm having fits with some nonstandard whitespace in the HTML file. It
looks like one long space and acts as a single character, but it
doesn't match \s in a pattern. When I view it in Emacs, it appears as
%/1\200\216iso8859-15^B\201 \201 \201

Hmmmm... I bet that's mangled UTF8. What 'iso8859-15' is doing in
there I'm not sure, but anyhow... Which perl are you using? If you're
using 5.8, try pushing :utf8 or (better) :encoding(utf8) onto your
input filehandle. If you're stuck with 5.6, you can try one of the
Unicode:: modules, but if you're doing character encoding stuff you'd
be much better off with 5.8.
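
A minimal sketch of what "pushing" a layer onto an existing filehandle can look like, assuming Perl 5.8 or later and assuming the odd whitespace is a Unicode space separator such as U+00A0 (no-break space), which the thread does not confirm. The helper name read_with_plain_spaces and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line from an already-open handle as UTF-8
# and turn any Unicode space separator (including U+00A0) into a plain space.
sub read_with_plain_spaces {
    my ($fh) = @_;
    binmode($fh, ':encoding(utf8)');    # push the decoding layer onto the handle
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\p{Zs}/ /g;          # Zs = Unicode "space separator" category
        push @lines, $line;
    }
    return @lines;
}

# Demo on an in-memory file holding the UTF-8 bytes for "a", U+00A0, "b".
my $bytes = "a\xC2\xA0b\n";
open(my $in, '<', \$bytes) or die "cannot open in-memory file: $!";
my @lines = read_with_plain_spaces($in);
close($in);
print @lines;
```

If the file turns out to be in some other encoding (Windows-1252 is a plausible guess for Word output, but only a guess), the same approach should work with that encoding name in place of utf8.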

Ben

Himanshu Garg · Feb 5, 2004

David R. Throop said:
I'm perplexed. I'm writing a Perl script that reads a single large
many-sectioned HTML document, breaks it into smaller files and
extracts some information for another text-manipulation tool to read.
The first HTML file comes from saving a 150+ page MS-Word file as HTML.

I'm having fits with some nonstandard whitespace in the HTML file. It
looks like one long space and acts as a single character, but it
doesn't match \s in a pattern. When I view it in Emacs, it appears as
%/1\200\216iso8859-15^B\201 \201 \201

I also had some strange behaviour when handling non-English text. Try
setting the locale to POSIX (on GNU/Linux, do export LC_ALL=POSIX).

In Perl 5.8.1 or later, you can read UTF-8 text and write it back out
correctly as UTF-8 without using binmode, so the locale workaround
above is not necessary then.

++imanshu.

David R. Throop · Feb 5, 2004

Ben Morrow said:
Which perl are you using? If you're using 5.8, try pushing :utf8 or
(better) :encoding(utf8) onto your input filehandle.

Thanks. I took your suggestion and upgraded to 5.8; needed to, anyways.

Let me beg one more question (cuz my Perl 5 Camel Book won't tell me.)
What's the syntax for opening with :encoding(utf8)? I've tried
open($FILEname, :encoding(utf8))
open("$FILEname :encoding(utf8)")
and a few other variations and I keep losing.

David Throop


Petri · Feb 8, 2004

David R. Throop said:
Let me beg one more question (cuz my Perl 5 Camel Book won't tell me.)
What's the syntax for opening with :encoding(utf8)? I've tried
open($FILEname, :encoding(utf8))
open("$FILEname :encoding(utf8)")
and a few other variations and I keep losing.

Try:
perldoc -f open

---8<---
You may use the three-argument form of open to specify IO
"layers" (sometimes also referred to as "disciplines") to be
applied to the handle that affect how the input and output are
processed (see open and PerlIO for more details). For example

open(FH, "<:utf8", "file")

will open the UTF-8 encoded file containing Unicode characters,
see perluniintro. (Note that if layers are specified in the
three-arg form then default layers set by the "open" pragma are
ignored.)
---8<---
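
Applied to the question above, the three-argument form puts the layer in the mode string, between the filehandle and the filename. A rough sketch, assuming Perl 5.8 or later; slurp_utf8 and the temporary demo file are hypothetical names used only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open $filename for reading, decoding UTF-8 as it is read.
sub slurp_utf8 {
    my ($filename) = @_;
    open(my $fh, '<:encoding(utf8)', $filename)
        or die "cannot open $filename: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Demo: write the UTF-8 bytes for "x", U+00A0, "y" to a temp file, read it back.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh);                  # write raw bytes, no translation
print {$tmp_fh} "x\xC2\xA0y";
close($tmp_fh);

my $text = slurp_utf8($tmp_name);
printf "%d characters, second is U+%04X\n",
    length($text), ord(substr($text, 1, 1));
```

Once the layer is in place, the two bytes 0xC2 0xA0 arrive as a single character, so a substitution on the decoded text sees the same characters that are displayed.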

Hope this helps!

Petri
