How do I parse this character? 2-byte vs 1-byte

Discussion in 'Perl Misc' started by nntp, Oct 27, 2004.

  1. nntp

    nntp Guest

    I found the bug but could not fix it. It seems to be related to how each OS
    interprets the bytes.

    Under DOS, it looks like this
    valign="top">¨¢</td>
    In Unix, it looks like this
    valign="top">| </td>
    In my Textpad (in Chinese OS), it looks like this
    valign="top">?/td>

    I asked experts about this. They told me that the character is mistakenly
    combined with the next character to become one character.

    How do I make sure my Perl script parses this correctly? Can Perl tell a
    2-byte character from a 1-byte character?
    nntp, Oct 27, 2004
    #1

  2. nntp

    Karel Kubat Guest

    Hi,

    > I found the bug, and could not fix it. It is related to OS related bytes.
    >
    > Under dos, it looks like this
    > valign="top">¨¢</td>
    > In Unix, it looks like this
    > valign="top">| </td>
    > In my Textpad (in Chinese OS), it looks like this
    > valign="top">?/td>
    >
    > I asked experts about this. They told me that the character is mistakenly
    > combined with the next character to become one character.
    > How do I make sure my perl can parse this correctly? Can perl tell 2 byte
    > word and 1 byte word?


    This is not a Perl issue per se, nor an OS-related one. You're dealing
    with multibyte character encodings.

    You need to look at the encoding of the original document first. Off the top
    of my head, an XML document would say something like <?xml
    version="1.0" encoding="....."?>. When the encoding declaration is missing,
    UTF-8 is the default, I think.

    Your problem, however, probably concerns an HTML page, not an XML
    document. In that case the encoding might be given in one of the HTTP
    headers that are sent when a server outputs a page; that will depend on
    the server configuration.

    And regarding encodings or character sets: _yes_, Perl can be told to
    regard 2-byte sequences as 1 character (or even more than 2 bytes,
    actually). Try "perldoc -f multibyte" and then play around with the Unicode
    modules.
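
    A minimal sketch of what "regarding 2-byte sequences as 1 character" can
    look like in practice, assuming the core Encode module (shipped with Perl
    5.8) and assuming the input is UTF-8. The sample bytes are made up for
    illustration and are not taken from the page in question.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Raw bytes as they might arrive from a file or socket: "a", then U+00E1
    # (a with acute) encoded as the two UTF-8 bytes 0xC3 0xA1, then "b".
    my $bytes = "a\xC3\xA1b";

    # Before decoding, Perl sees four separate one-byte characters.
    my $byte_count = length $bytes;

    # After decoding, the two-byte sequence has become a single character.
    my $chars      = decode('UTF-8', $bytes);
    my $char_count = length $chars;

    print "bytes: $byte_count, characters: $char_count\n";
    ```

    Once the string is decoded, regexes, length() and substr() work on whole
    characters rather than on individual bytes.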

    What _is_ the problem you're describing anyway? It might be helpful to
    know..

    Cheers,
    --
    Karel Kubat
    Phone: mobile (+31) 6 2956 4861, office (+31) (0)38 46 06 125
    PGP fingerprint: D76E 86EC B457 627A 0A87 0B8D DB71 6BCD 1CF2 6CD5

    From the Science Exam Papers:
    Vegetative propagation is the process by which
    one individual manufactures another individual
    by accident.
    Karel Kubat, Oct 27, 2004
    #2

  3. nntp

    Ben Morrow Guest

    Quoth Karel Kubat:
    > And regarding encodings or character sets: _yes_, Perl can be told to
    > regard 2-byte sequences as 1 character (or even more than 2 bytes,
    > actually). Try "perldoc -f multibyte" and then play around with the Unicode
    > modules.


    You mean '-q'. :)

    Ben

    --
    Although few may originate a policy, we are all able to judge it.
    - Pericles of Athens, c.430 B.C.
    Ben Morrow, Oct 27, 2004
    #3
  4. nntp

    nntp Guest


    >
    > Well, you'll need to:
    >
    > 1. Get a recent version of Perl if you don't already have it (5.8.4 is
    > fine).
    >
    > 2. Check out the docs for the binmode() function: perldoc -f binmode
    >
    > 3. Determine what sort of encoding is used to represent your character.
    > If you don't know, you can guess by trying the options available in
    > the binmode() function. Chances are good it is UTF-8 encoding.
    >


    I only need English characters. Is it possible to use s///gs to remove
    those suckers? They have totally messed up my program. When I parse, I get
    Chinese, French, Spanish, everything, but I only need English.
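
    One possible way to combine the quoted advice with the s/// idea is to
    decode first and strip afterwards. The sketch below assumes the input is
    UTF-8 and reads from an in-memory string ($raw is a hypothetical stand-in
    for the real file) so that it is self-contained.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: the UTF-8 bytes for "caf", U+00E9, and a newline.
    my $raw = "caf\xC3\xA9\n";

    # Read through an encoding layer so that each multibyte sequence
    # arrives as one character instead of several bytes.
    open my $fh, '<:encoding(UTF-8)', \$raw or die "open failed: $!";
    my $line = <$fh>;
    close $fh;

    # Now a single substitution removes each non-ASCII character whole.
    (my $ascii_only = $line) =~ s/[^\x00-\x7F]//g;
    chomp $ascii_only;

    print "$ascii_only\n";
    ```

    With a real file, the reference \$raw would be replaced by the file name.
    Calling binmode($fh, ':encoding(UTF-8)') on an already open handle has
    the same effect as the layer given to open().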
    nntp, Oct 27, 2004
    #4
  5. nntp

    nntp Guest

    "Karel Kubat" <> wrote in message
    news:417fdccf$0$142$4all.nl...
    > Hi,
    >
    > > I found the bug, and could not fix it. It is related to OS related bytes.
    > >
    > > Under dos, it looks like this
    > > valign="top">¨¢</td>
    > > In Unix, it looks like this
    > > valign="top">| </td>
    > > In my Textpad (in Chinese OS), it looks like this
    > > valign="top">?/td>
    > >
    > > I asked experts about this. They told me that the character is mistakenly
    > > combined with the next character to become one character.
    > > How do I make sure my perl can parse this correctly? Can perl tell 2 byte
    > > word and 1 byte word?

    >
    > This is not a Perl issue per se, and neither an OS-related issue. You're
    > dealing with multibyte encodings of characters.
    >
    > You need to look at the encoding of the original document first. Off the top
    > of my head, in an XML document, it would say something like <?xml
    > version="1.0" encoding="....."?>. When the encoding specifier is missing,
    > then UTF-8 is the default I think.
    >
    > Your problem however probably refers to an HTML page, not to an XML
    > document. In that case the encoding might be in one of the HTTP headers
    > that are sent when a server outputs a page -- that will depend on the
    > server configuration.
    >
    > And regarding encodings or character sets: _yes_, Perl can be told to
    > regard 2-byte sequences as 1 character (or even more than 2 bytes,
    > actually). Try "perldoc -f multibyte" and then play around with the Unicode
    > modules.
    >
    > What _is_ the problem you're describing anyway? It might be helpful to
    > know..
    >
    > Cheers,


    The first several lines:
    <HTML XMLNS:IE>
    <head>
    <mainD5>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
    URL of the page is www.ebay.com
    I see charset=ISO-8859-1. Isn't that regular 1 byte encoding?
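
    ISO-8859-1 is indeed a single-byte encoding: every byte maps to exactly
    one character. As a rough sketch, the declared charset can be pulled out
    of the META tag and handed to Encode. The regex and the sample cell below
    are illustrative assumptions, and a page's declared charset does not
    always match the bytes it actually contains.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode find_encoding);

    # Hypothetical page fragment: the META line from the post plus one table
    # cell holding the single byte 0xE1 (a with acute in ISO-8859-1).
    my $page = qq{<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">\n}
             . qq{<td valign="top">\xE1</td>\n};

    # Pull the declared charset out of the markup (a simple regex, not a parser).
    my ($charset) = $page =~ /charset\s*=\s*["']?([\w.:-]+)/i;

    # Fall back to ISO-8859-1 if nothing usable was declared.
    $charset = 'ISO-8859-1' unless defined $charset && find_encoding($charset);

    # Decode the bytes into characters according to that charset.
    my $text = decode($charset, $page);

    printf "charset: %s, bytes: %d, characters: %d\n",
        $charset, length $page, length $text;
    ```

    For a single-byte charset the two counts are equal. If they differ after
    decoding with another charset, the data contains multibyte sequences.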

    Can I do
    s/\W//gs or s/[^\w]//gs
    to remove everything that is not an English character or number or < _ . -!
    /\?
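
    A small sketch of what those substitutions would do, using a made-up
    table cell with two stray high-bit bytes. Note that \W also matches
    markup characters such as < > = and ", so it removes far more than the
    foreign characters. A whitelist of printable ASCII is one alternative;
    the range chosen here is an assumption about what should be kept.

    ```perl
    use strict;
    use warnings;

    # Hypothetical cell: two stray high-bit bytes followed by plain text.
    my $cell = qq{<td valign="top">\xA8\xA2 item-1</td>};

    # \W deletes every non-word character, including the markup itself.
    (my $stripped = $cell) =~ s/\W//g;

    # Whitelist instead: keep printable ASCII (space through tilde) only.
    (my $kept = $cell) =~ s/[^\x20-\x7E]//g;

    print "$stripped\n";
    print "$kept\n";
    ```

    The /s modifier only changes what "." matches, so it makes no difference
    in either pattern. The whitelist also drops tabs and newlines; adding
    \t\r\n inside the brackets keeps them.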

    I read perldoc -q multibyte.
    nntp, Oct 28, 2004
    #5
