Difficulty cleaning oddly encoded whitespace (from MS HTML)

  • Thread starter David R. Throop

David R. Throop

I'm perplexed. I'm writing a Perl script that reads a single large
many-sectioned HTML document, breaks it into smaller files and
extracts some information for another text-manipulation tool to read.
The first HTML file comes from saving a 150+ page MS-Word file as HTML.

I'm having fits with some nonstandard whitespace in the HTML file. It
appears like a long whitespace and acts as a single character, but it
doesn't match \s in a pattern. When I view it in Emacs, it appears as
%/1\200\216iso8859-15^B\201 \201 \201

where \200 \216 ^B and \201 are all single characters. But text
containing the odd whitespace fails to patternmatch those characters.
I Googled on iso8859 and found enough to get some idea that I'm
dealing with some specially encoded character, but everything I found
assumed I already knew about the encoding.

All I want to do is to turn this oddspace into regular whitespace.
Anybody?

Thanks

David Throop
 

Ben Morrow

David R. Throop said:
I'm having fits with some nonstandard whitespace in the HTML file. It
appears like a long whitespace and acts as a single character, but it
doesn't match \s in a pattern. When I view it in Emacs, it appears as
%/1\200\216iso8859-15^B\201 \201 \201

Hmmmm... I bet that's mangled UTF8. What 'iso8859-15' is doing in
there I'm not sure, but anyhow... Which perl are you using? If you're
using 5.8, try pushing :utf8 or (better) :encoding(utf8) onto your
input filehandle. If you're stuck with 5.6, you can try one of the
Unicode:: modules, but if you're doing character encoding stuff you'd
be much better off with 5.8.
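
A minimal sketch of what pushing the layer onto an already-open handle could look like under 5.8, using binmode. The sample file is made up for the illustration, and it assumes the odd whitespace is a UTF-8 encoded no-break space (U+00A0), which is a guess and has not been confirmed from the actual document:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the Word-generated HTML: a temp file holding the two
# UTF-8 bytes C2 A0 (assumed here to be the "odd whitespace").
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "a\xC2\xA0b\n";
close($tmp) or die "close: $!";

# Open the file as usual, then push the decoding layer onto the handle.
open(my $in, '<', $path) or die "Cannot open $path: $!";
binmode($in, ':encoding(utf8)') or die "binmode: $!";

my $line = <$in>;
close($in);
chomp $line;

# The bytes C2 A0 now arrive as one character rather than two.
printf "%d characters, middle one is U+%04X\n",
    length($line), ord(substr($line, 1, 1));
# prints "3 characters, middle one is U+00A0"
```

Without the binmode call the same line would come back as four separate bytes, which would explain why a pattern written against a single character fails to match.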

Ben
 

Himanshu Garg

David R. Throop said:
I'm having fits with some nonstandard whitespace in the HTML file. It
appears like a long whitespace and acts as a single character, but it
doesn't match \s in a pattern. When I view it in Emacs, it appears as
%/1\200\216iso8859-15^B\201 \201 \201

I also had some strange behaviour when handling non-English text. Try
setting the locale to POSIX (on GNU/Linux, run export LC_ALL=POSIX).

In Perl 5.8.1 or later, you can parse a UTF-8 text and output it
correctly in UTF-8 without using binmode. The above is not necessary
then.

++imanshu.
 

David R. Throop

Ben Morrow said:
Which perl are you using? If you're using 5.8, try pushing :utf8 or
(better) :encoding(utf8) onto your input filehandle.

Thanks. I took your suggestion and upgraded to 5.8; needed to, anyways.

Let me beg one more question (cuz my Perl 5 Camel Book won't tell me.)
What's the syntax for opening with :encoding(utf8) ? I've tried
open($FILEname, :encoding(utf8))
open("$FILEname :encoding(utf8)")
and a few other variations and I keep losing.

David Throop

------
 

Petri

David R. Throop said:
Let me beg one more question (cuz my Perl 5 Camel Book won't tell me.)
What's the syntax for opening with :encoding(utf8) ? I've tried
open($FILEname, :encoding(utf8))
open("$FILEname :encoding(utf8)")
and a few other variations and I keep losing.

Try:
perldoc -f open

---8<---
You may use the three-argument form of open to specify IO
"layers" (sometimes also referred to as "disciplines") to be
applied to the handle that affect how the input and output are
processed (see open and PerlIO for more details). For example

open(FH, "<:utf8", "file")

will open the UTF-8 encoded file containing Unicode characters,
see perluniintro. (Note that if layers are specified in the
three-arg form then default layers set by the "open" pragma are
ignored.)
---8<---
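
Applied to the original problem, the three-argument form might be used roughly like this. It is only a sketch: normalize_spaces is a hypothetical helper, the sample file is invented, and the whole thing assumes the odd whitespace is U+00A0 (no-break space), which nobody in this thread has confirmed:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: turn each no-break space (U+00A0) into a
# plain space. Other odd spaces would need adding to the pattern.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/\x{00A0}/ /g;
    return $text;
}

# Invented sample standing in for the Word-generated HTML file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "<p>one\xC2\xA0two</p>\n";
close($tmp) or die "close: $!";

# Three-argument open: mode and layer together in the second argument.
open(my $fh, '<:encoding(utf8)', $path) or die "Cannot open $path: $!";
my @lines = map { normalize_spaces($_) } <$fh>;
close($fh);

binmode(STDOUT, ':encoding(utf8)');
print @lines;    # prints "<p>one two</p>"
```

The layer goes in the same string as the mode, so neither of the two attempts quoted above could work: the first leaves the layer unquoted and the second folds it into the file name.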

Hope this helps!

Petri
 
