Meaning of "Malformed UTF-8 character"?

DM

I'm using Perl 5.8.0 on RH Enterprise Linux. I'm trying to match this pattern:

$pattern = "href=.*?\\.pdf\[^>]*?>";


I'm seeing many of these errors:

Malformed UTF-8 character (unexpected continuation byte 0x96, with no preceding
start byte) in pattern match (m//) at /home/emicha/bin/moveFileType.pl line 79,
<INFILE> line 149.


And a few of these:

Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately
after start byte 0xe9) in pattern match (m//) at
/home/emicha/bin/moveFileType.pl line 79, <INFILE> line 51.


A Google search found a little bit of information on the second, but nothing
useful, and virtually nothing on the first.


Any help in interpreting/resolving these would be greatly appreciated.

Thanks,

dm
 
DM


Solution found. Sorry, my Google search wasn't thorough enough the first time.

It has to do with the "LANG" and "LC_CTYPE" environment variables. When I run
the script like this...

# LANG=en_US LC_CTYPE=en_US perl -w /home/emicha/bin/moveFileType.pl

...there are no errors.

Thanks,

dm
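
To see whether a UTF-8 locale is what triggers the warnings, one option is to have the script report which locale variables mention UTF-8 before it touches any data. This is only a minimal sketch: locale_vars_with_utf8() is a hypothetical helper, not part of Perl, and it assumes the usual LC_ALL, LC_CTYPE and LANG variables are the ones that matter.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the locale variables
# (out of LC_ALL, LC_CTYPE, LANG) whose value mentions UTF-8.
sub locale_vars_with_utf8 {
    my (%env) = @_;
    return grep { defined $env{$_} && $env{$_} =~ /utf-?8/i }
           qw(LC_ALL LC_CTYPE LANG);
}

my @hits = locale_vars_with_utf8(%ENV);
if (@hits) {
    print "UTF-8 locale in effect via: @hits\n";
} else {
    print "No UTF-8 locale setting found\n";
}
```

If this reports a UTF-8 setting, running the script with LANG and LC_CTYPE overridden as shown above is the quick workaround.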
 
Alan J. Flavell

DM said:
I'm using Perl 5.8.0 on RH Enterprise Linux. I'm trying to match this pattern:

$pattern = "href=.*?\\.pdf\[^>]*?>";

I'm seeing many of these errors:

Malformed UTF-8 character (unexpected continuation byte 0x96, with no
preceding start byte) in pattern match (m//) at
/home/emicha/bin/moveFileType.pl line 79, <INFILE> line 149.

The error is self-explanatory, in its own terms. Sounds as if you're
not familiar with those terms yet...

Hmmm, 5.8.0. My hunch is that you've got utf8 in your locale, but
your data isn't really in utf8. See earlier discussions of this issue
in (e.g.) Red Hat 9, where the problem frequently arose.

DM said:
And a few of these:

Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately
after start byte 0xe9) in pattern match (m//) at
/home/emicha/bin/moveFileType.pl line 79, <INFILE> line 51.

Again, self-explanatory in its own terms, but if the data is defective
like this, we need to see where the data came from.

DM said:
Any help in interpreting/resolving these would be greatly appreciated.

The quick fix is to try taking the utf8 out of your locale.

Beyond that, I'd say we'd need a minimal but complete example which
reproduces that problem and which we can run for ourselves, as a
starting point to explain to you what's going wrong and how to fix it.
 
DM

Alan said:
I'm using Perl 5.8.0 on RH Enterprise Linux. I'm trying to match this pattern:

$pattern = "href=.*?\\.pdf\[^>]*?>";

I'm seeing many of these errors:

Malformed UTF-8 character (unexpected continuation byte 0x96, with no
preceding start byte) in pattern match (m//) at
/home/emicha/bin/moveFileType.pl line 79, <INFILE> line 149.


The error is self-explanatory, in its own terms. Sounds as if you're
not familiar with those terms yet...

True. I don't know what a "start byte" or a "continuation byte" is.

Alan said:
Hmmm, 5.8.0. My hunch is that you've got utf8 in your locale, but
your data isn't really in utf8. See earlier discussions of this issue
in (e.g) redhat 9, where the problem frequently arose.

Since this is RH Enterprise Linux, based on RH 9, that is likely. I'll check the
earlier discussions.

Alan said:
Again, self-explanatory in its own terms, but if the data is defective
like this, we need to see where the data came from.

The data is a bunch of HTML files that were produced at various times using a
variety of software on a variety of platforms. However, the majority were
produced using Dreamweaver for the Mac. I believe the files mostly use the
"Latin-1" encoding, but I'm not 100% sure.
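
One way to test that belief is to check whether a file's bytes decode cleanly as strict UTF-8. The sketch below uses the core Encode module; looks_like_utf8() is a hypothetical helper name. A failure only shows that the data is not UTF-8. It does not tell you which single-byte encoding (Latin-1, Windows-1252, MacRoman, ...) was actually used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# $bytes is a private copy, so decode() may modify it freely.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Byte sequences of the kind named in the two warnings, plus a valid one.
printf "%d\n", looks_like_utf8("a \x96 b");         # lone 0x96
printf "%d\n", looks_like_utf8("caf\xE9 au lait");  # 0xE9 then 0x20
printf "%d\n", looks_like_utf8("caf\xC3\xA9");      # e-acute encoded as UTF-8
```

Pure ASCII text passes this check too, so it cannot separate ASCII-only files from genuine UTF-8 ones.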
 
Alan J. Flavell

DM said:
True. I don't know what a "start byte" or a "continuation byte" is.

That's soon remedied. Take for example this tutorial here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

and look at the second and third bullets. The third bullet says:

The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD and it indicates how
many bytes follow for this character. All further bytes in a
multibyte sequence are in the range 0x80 to 0xBF. This allows easy
resynchronization and makes the encoding stateless and robust against
missing bytes.

In that sense, the "start byte" of a properly formed sequence
would be one in the range 0xC0 to 0xFD, and its value would indicate
how many continuation bytes are expected to follow.

If your data contains byte sequences which fail these tests, then it
cannot possibly be utf-8 encoding, and the rules of the game say that
(if it was supposed to be utf-8) it must be declared invalid (in order
to eliminate possible security compromises by presenting spoof data).

I think you should now be able to see what the error reports were
complaining about in your data.
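
As an illustration, the byte ranges from the quoted bullet can be written out and applied to the three bytes named in the warnings. classify_byte() is a hypothetical helper, and the ranges are the ones given in the tutorial text above, not a full UTF-8 validator.

```perl
use strict;
use warnings;

# Hypothetical helper: label a byte value using the ranges quoted above.
sub classify_byte {
    my ($byte) = @_;
    return 'ASCII'        if $byte <= 0x7F;   # single-byte character
    return 'continuation' if $byte <= 0xBF;   # 0x80..0xBF
    return 'start'        if $byte <= 0xFD;   # 0xC0..0xFD
    return 'invalid';                         # outside the quoted ranges
}

for my $byte (0x96, 0xE9, 0x20) {
    printf "0x%02X: %s\n", $byte, classify_byte($byte);
}
# Prints:
# 0x96: continuation
# 0xE9: start
# 0x20: ASCII
```

So the first warning means a 0x96 turned up with no start byte before it, and the second means a 0xE9 promised continuation bytes but was followed by a plain space. Both are what you would expect from single-byte text: 0xE9 is e-acute in Latin-1, and 0x96 is an en dash in Windows-1252.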

Of course, in your case it proves nothing more than that the original
assumption, that this was utf-8 encoding, was wrong. Perl (5.8.0)
has made that assumption based on what it found in the locale, which,
as we discussed before, for RH8 and 9 contains utf8.

You can find more about this in the Perl context by reading perldoc
perluniintro and perldoc perlunicode, or their corresponding webified
versions at e.g. http://www.perldoc.com/perl5.8.0/pod.html

It turned out to cause so much confusion that later versions of Perl
(5.8.1 onwards) took out this default assumption.

DM said:
The data is a bunch of HTML files that were produced at various
times using a variety of software on a variety of platforms.
However, the majority were produced using Dreamweaver for the Mac. I
believe the files mostly use the "Latin-1" encoding, but I'm not
100% sure.

Right. You're probably expecting to process it as a bunch of bytes,
rather than having Perl try to do its clever unicode-ish stuff on it.

I think that still stands.

Good luck.
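
A minimal sketch of the "bunch of bytes" approach: open each file with the :raw layer so that no UTF-8 decoding is attempted, whatever the locale says. find_pdf_links() and the command-line usage are assumptions for illustration; the pattern is the one from the original post, written as a compiled regex.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first href=...pdf...> match on each
# line of the file, reading the file as raw bytes.
sub find_pdf_links {
    my ($path) = @_;
    my $pattern = qr/href=.*?\.pdf[^>]*?>/;

    # :raw hands back the bytes untouched, so Latin-1 data is never
    # interpreted as UTF-8.
    open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
    my @matches;
    while (my $line = <$in>) {
        push @matches, $1 if $line =~ /($pattern)/;
    }
    close($in);
    return @matches;
}

# Usage: perl script.pl some-file.html
if (@ARGV) {
    print "$_\n" for find_pdf_links($ARGV[0]);
}
```

If the non-ASCII characters themselves matter, opening with '<:encoding(iso-8859-1)' instead should decode the bytes as Latin-1 characters, assuming that really is the encoding of the files.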
 
