How do I parse this character? 2-byte vs 1-byte

nntp · Oct 27, 2004

I found the bug but could not fix it. It seems to be related to OS-specific byte handling.

Under DOS, it looks like this
valign="top">¨¢</td>
In Unix, it looks like this
valign="top">| </td>
In my Textpad (in Chinese OS), it looks like this
valign="top">?/td>

I asked experts about this. They told me that the character is mistakenly
combined with the next character to become one character.

How do I make sure my Perl can parse this correctly? Can Perl tell a 2-byte
character from a 1-byte character?

Karel Kubat · Oct 27, 2004

Hi,

This is not a Perl issue per se, nor is it an OS-related issue. You're
dealing with multibyte encodings of characters.

You need to look at the encoding of the original document first. Off the top
of my head, an XML document would say something like <?xml
version="1.0" encoding="....."?>. When the encoding specifier is missing,
UTF-8 is the default, I think.

Your problem, however, probably concerns an HTML page, not an XML
document. In that case the encoding might be in one of the HTTP headers
that are sent when a server outputs a page -- that will depend on the
server configuration.

And regarding encodings or character sets: _yes_, Perl can be told to
regard 2-byte sequences as 1 character (or even more than 2 bytes,
actually). Try "perldoc -f multibyte" and then play around with the Unicode
modules.
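
To make the bytes-versus-characters point concrete, here is a minimal sketch using the core Encode module. The sample bytes and the choice of UTF-8 are assumptions for illustration only; the thread has not established what encoding the page actually uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three raw bytes: the UTF-8 encoding of one CJK character (U+4E2D).
# This sample is hypothetical and not taken from the page in the thread.
my $bytes = "\xE4\xB8\xAD";

# Before decoding, Perl sees three one-byte characters.
my $byte_count = length $bytes;

# After decoding, Perl sees a single character.
my $text       = decode('UTF-8', $bytes);
my $char_count = length $text;

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    $byte_count, $char_count, ord $text;
```

Once the data is decoded, length, substr and regular expressions work on characters rather than on individual bytes.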

What _is_ the problem you're describing anyway? It might be helpful to
know..

Cheers,

Ben Morrow · Oct 27, 2004

Quoth Karel Kubat:

And regarding encodings or character sets: _yes_, Perl can be told to
regard 2-byte sequences as 1 character (or even more than 2 bytes,
actually). Try "perldoc -f multibyte" and then play around with the Unicode
modules.

You mean '-q'.

Ben

nntp · Oct 27, 2004

Well, you'll need to:

1. Get a recent version of Perl if you don't already have it (5.8.4 is
fine).

2. Check out the docs for the binmode() function: perldoc -f binmode

3. Determine what sort of encoding is used to represent your character.
If you don't know, you can guess by trying the options available in
the binmode() function. Chances are good it is UTF-8 encoding.
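
A rough sketch of steps 2 and 3, assuming the data really is UTF-8 (that is only a guess in this thread). The sample string and the helper name read_line are made up for the example, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical input: "caf\xC3\xA9" is the word "cafe" with an accented e,
# encoded as UTF-8 (five bytes before the newline).
my $raw = "caf\xC3\xA9\n";

# Read one line from an in-memory filehandle, optionally through an I/O layer.
sub read_line {
    my ($data, $layer) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    if (defined $layer) {
        binmode($fh, $layer) or die "binmode failed: $!";
    }
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $as_bytes = read_line($raw);                      # no decoding
my $as_utf8  = read_line($raw, ':encoding(UTF-8)');  # decoded on read

printf "undecoded: %d characters, decoded: %d characters\n",
    length $as_bytes, length $as_utf8;
```

If the guess is wrong, swapping the name inside :encoding(...) for another encoding known to the Encode module is the only change needed.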

I only need English characters. Is it possible to use s///gs to remove
those suckers? They have totally messed up my program. When I parse, I get
Chinese, French, Spanish, everything, but I only need English.

nntp · Oct 28, 2004

Karel Kubat said:
What _is_ the problem you're describing anyway? It might be helpful to
know..

The first several lines:
<HTML XMLNS:IE>
<head>
<mainD5>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
The URL of the page is www.ebay.com
I see charset=ISO-8859-1. Isn't that a regular 1-byte encoding?
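
A small check of that, as a sketch: ISO-8859-1 maps every byte to exactly one character, so a decoder for it never merges a high byte with the byte after it. The sample fragment is hypothetical; the byte 0xE1 is only a guess at the kind of byte involved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical fragment: one high byte (0xE1) directly followed by "</td>".
my $bytes = "\xE1</td>";

# Decode as ISO-8859-1: one byte in, one character out.
my $text = decode('ISO-8859-1', $bytes);

printf "bytes: %d, characters: %d, first: U+%04X, second: %s\n",
    length $bytes, length $text, ord $text, substr($text, 1, 1);
```

A viewer that assumes a double-byte encoding could instead treat the high byte as the first half of a pair and swallow the "<", which is one possible reading of the Textpad output shown earlier in the thread.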

Can I do
s/\W//gs or s/[^\w]//gs
to remove everything that is not an English character or number or < _ . -!
/\?

I read perldoc -q multibyte.
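
For comparison, a sketch of what those substitutions would do to a line of markup, next to a narrower alternative that deletes only what falls outside printable ASCII. The sample line and the stray byte 0xE1 are assumptions, and the alternative pattern is a suggestion, not something proposed in the thread.

```perl
use strict;
use warnings;

# Hypothetical line: markup like that in the thread, with a stray high byte.
my $line = qq{valign="top">\xE1</td>};

# \W also matches <, >, ", = and /, so this strips the markup itself.
(my $too_aggressive = $line) =~ s/\W//g;

# Delete only bytes outside printable ASCII (tab and newline are kept).
(my $ascii_only = $line) =~ s/[^\x20-\x7E\t\n]//g;

print "$too_aggressive\n";
print "$ascii_only\n";
```

Whether \w treats an accented letter as a word character depends on how the string was decoded and on locale settings, so the explicit byte range is the more predictable of the two.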
