How do I parse this character? 2-byte vs 1-byte

nntp

I found the bug but could not fix it. It seems to be related to how each OS interprets the bytes.

Under DOS, it looks like this:
valign="top">¨¢</td>
In Unix, it looks like this:
valign="top">| </td>
In my TextPad (on a Chinese OS), it looks like this:
valign="top">?/td>

I asked experts about this. They told me that the character is being wrongly
combined with the next character to form a single character.

How do I make sure my Perl script parses this correctly? Can Perl tell a
2-byte character from a 1-byte character?
 
Karel Kubat

Hi,

This is not a Perl issue per se, nor is it an OS-related issue. You're
dealing with multibyte encodings of characters.

You need to look at the encoding of the original document first. Off the top
of my head, in an XML document, it would say something like <?xml
version="1.0" encoding="....."?>. When the encoding specifier is missing,
then UTF-8 is the default I think.

Your problem however probably refers to an HTML page, not to an XML
document. In that case the encoding might be in one of the HTTP headers
that are sent when a server outputs a page -- that will depend on the
server configuration.
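As a rough illustration of where the encoding name can be found, the sketch below pulls the charset parameter out of a Content-Type value, whether it came from an HTTP header or from a META tag. The helper name charset_of and the sample strings are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a Content-Type value.
# Returns undef when no charset parameter is present.
sub charset_of {
    my ($content_type) = @_;
    return $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i ? $1 : undef;
}

# Sample values (assumed for illustration).
my $from_header = charset_of('text/html; charset=ISO-8859-1');
my $missing     = charset_of('text/html');

print defined $from_header ? "charset: $from_header\n" : "no charset given\n";
print defined $missing     ? "charset: $missing\n"     : "no charset given\n";
```

Note that a declared charset is only a claim made by the server or the page author; the bytes in the body may still be in a different encoding.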

And regarding encodings or character sets: _yes_, Perl can be told to
regard 2-byte sequences as 1 character (or even more than 2 bytes,
actually). Try "perldoc -f multibyte" and then play around with the Unicode
modules.
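A minimal sketch of that idea, using the core Encode module: the same two bytes count as two characters under a single-byte encoding and as one character under a multibyte one. The byte values and the choice of cp936 (GBK) are assumptions picked for illustration, not something taken from the page in question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample: two raw bytes, as they might arrive from a file or socket.
my $bytes = "\xA8\xA2";

# ISO-8859-1 is a single-byte encoding: every byte becomes one character.
my $as_latin1 = decode('iso-8859-1', $bytes);

# cp936 (GBK) is a multibyte encoding: here the two bytes form one character.
my $as_gbk = decode('cp936', $bytes);

printf "bytes: %d\n", length $bytes;
printf "characters as ISO-8859-1: %d\n", length $as_latin1;
printf "characters as cp936: %d\n", length $as_gbk;
```

Perl cannot tell from the bytes alone which reading is right; the script has to be told the encoding, and only then does length() count characters rather than bytes.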

What _is_ the problem you're describing anyway? It might be helpful to
know..

Cheers,
 
Ben Morrow

Quoth Karel Kubat:
> And regarding encodings or character sets: _yes_, Perl can be told to
> regard 2-byte sequences as 1 character (or even more than 2 bytes,
> actually). Try "perldoc -f multibyte" and then play around with the Unicode
> modules.

You mean '-q'. :)

Ben
 
nntp

> Well, you'll need to:
>
> 1. Get a recent version of Perl if you don't already have it (5.8.4 is
> fine).
>
> 2. Check out the docs for the binmode() function: perldoc -f binmode
>
> 3. Determine what sort of encoding is used to represent your character.
> If you don't know, you can guess by trying the options available in
> the binmode() function. Chances are good it is UTF-8 encoding.

I only need English characters. Is it possible to use s///gs to remove
those suckers? They totally mess up my program. When I parse, I get
Chinese, French, Spanish, everything, but I only need English.
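For reference, the binmode() advice quoted above might look roughly like this. It is only a sketch: the sample bytes are assumed to be UTF-8, and an in-memory filehandle stands in for the real file or socket.

```perl
use strict;
use warnings;

# Assumed sample data: "cafe" with an acute accent on the e, as UTF-8 bytes,
# followed by a newline.
my $raw = "caf\xC3\xA9\n";

# Open an in-memory filehandle on the raw bytes; a real script would open a file.
open my $fh, '<', \$raw or die "cannot open in-memory handle: $!";

# Tell Perl how to decode what it reads from this handle.
binmode($fh, ':encoding(UTF-8)');

my $line = <$fh>;
close $fh;
chomp $line;

printf "raw bytes: %d, decoded characters: %d\n", length($raw) - 1, length $line;
```

Once the handle has the right encoding layer, regexes and length() operate on characters, which makes it easier to decide afterwards what to keep and what to drop.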
 
nntp

Karel Kubat said:
> What _is_ the problem you're describing anyway? It might be helpful to
> know..

The first several lines:
<HTML XMLNS:IE>
<head>
<mainD5>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
The URL of the page is www.ebay.com
I see charset=ISO-8859-1. Isn't that a regular 1-byte encoding?

Can I do
s/\W//gs or s/[^\w]//gs
to remove everything that is not an English character or number or < _ . -!
/\?

I read perldoc -q multibyte.
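
A small sketch of how the two substitutions differ. The sample string is an assumption: the table cell from the first post with two high bytes standing in for the problem character. Note that \W also matches spaces, quotes, angle brackets and slashes, so it strips the markup as well; a character class limited to the 7-bit ASCII range only drops the high bytes.

```perl
use strict;
use warnings;

# Assumed sample: the table cell from the question with two high bytes in it.
my $html = qq{<td valign="top">\xA8\xA2</td>};

# Delete every byte outside the 7-bit ASCII range; the markup is untouched.
(my $ascii_only = $html) =~ s/[^\x00-\x7F]//g;

# For comparison, \W also deletes spaces, quotes, angle brackets and slashes.
(my $word_only = $html) =~ s/\W//g;

print "$ascii_only\n";
print "$word_only\n";
```

If the data is really in a multibyte encoding, decoding it first and stripping afterwards is safer, because in some encodings the second byte of a character can fall inside the ASCII range.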
 
