help - how to find what is the code for "+/-" symbol copied from Windows app

Discussion in 'Perl Misc' started by Joe, Dec 15, 2012.

  1. Joe

    Joe Guest

    I received an Excel data file that contains a "+/-" symbol (html code \&plusmn; &plusmn;) that can be copied and displayed in Word, Notepad, the "Kompozer" html editor, and the unix vi and pico editors, and loaded to and retrieved from MySQL running on linux. But when I need to manipulate the data in perl, I am lost as to how to recognize the symbol with an RE. Could anyone help?

    Thanks in advance!

    joe
    Joe, Dec 15, 2012
    #1

  2. Dr.Ruud

    Dr.Ruud Guest

    On 2012-12-15 07:19, Joe wrote:

    > I received an Excel data file that contains a "+/-" symbol (html code \&plusmn; &plusmn;) that can be copied and displayed in Word, Notepad, the "Kompozer" html editor, and the unix vi and pico editors, and loaded to and retrieved from MySQL running on linux. But when I need to manipulate the data in perl, I am lost as to how to recognize the symbol with an RE. Could anyone help?


    First, look up its Unicode code point.

    Google for: unicode plus minus, which will lead you to (for example)
    http://www.fileformat.info/info/unicode/char/b1/index.htm

    From there it is easy to deduce: \x{B1}.

    The page also gives you the exact Unicode character name
    'PLUS-MINUS SIGN', which you can use in regular expressions.
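A minimal sketch of both spellings, using only core Perl. The sample string in `$text` is made up for illustration, and `charnames::viacode` is used here merely to confirm the name that belongs to the code point:

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} on older perls; harmless on newer ones

# Hypothetical sample data containing U+00B1.
my $text = "the result is 8\x{B1}2";

# Match by code point and by Unicode character name.
my $by_code = $text =~ /\x{B1}/              ? 1 : 0;
my $by_name = $text =~ /\N{PLUS-MINUS SIGN}/ ? 1 : 0;

# Look up the official name for the code point.
my $name = charnames::viacode(0xB1);

print "by code point: $by_code, by name: $by_name, name: $name\n";
```

Both patterns compile to the same single-character match, so the choice between them is a matter of readability.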

    --
    Ruud
    Dr.Ruud, Dec 15, 2012
    #2

  3. Re: help - how to find what is the code for "+/-" symbol copied from Windows app

    On 2012-12-15 12:34, Ben Morrow wrote:
    > Quoth "Dr.Ruud":
    >> On 2012-12-15 07:19, Joe wrote:
    >>> I received an Excel data file that contains a "+/-" symbol (html code
    >>> \&plusmn; &plusmn;), that can be copied and displayed in Word, Notepad,
    >>> "Kompozer" html editor, unix vi, pico editors, and load to/retrieve from
    >>> MySQL operated on linux. But when I need to manipulate the data in
    >>> perl, I am lost as how to recognize the symbol with RE. Could anyone
    >>> help?
    >>
    >> First, look up its Unicode code point.
    >>
    >> Google for: unicode plus minus, which will lead you to (for example)
    >> http://www.fileformat.info/info/unicode/char/b1/index.htm
    >>
    >> From there it is easy to deduce: \x{B1}.

    >
    > Or just use what you already know:
    >
    > use HTML::Entities "decode_entities";
    > my $plusmn = decode_entities "&plusmn;";


    Or just copy/paste the sign into your source code:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use utf8;

    my $text = "the result is 8±2";

    if ($text =~ m/±/) {
        print "The text contains a plus/minus sign\n";
    }
    __END__


    (Of course you need to make sure that the module you use to read the
    Excel sheet really returns the ± as a single character U+00B1, but this is true
    for all methods. If it doesn't, use Encode::decode to convert whatever
    your Excel module returns into something sane.)
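As a rough sketch of that last step: the two byte strings below are hypothetical stand-ins for what a reader module might hand back, one UTF-8 encoded and one Windows-1252 encoded. Which encoding (if any) a given Excel module actually returns is an assumption that has to be checked against its documentation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw byte strings for "8±2".
my $utf8_bytes   = "8\x{C2}\x{B1}2";   # UTF-8: the sign is two bytes
my $cp1252_bytes = "8\x{B1}2";         # Windows-1252: the sign is one byte

# Decode each into a character string.
my $from_utf8   = decode('UTF-8',  $utf8_bytes);
my $from_cp1252 = decode('cp1252', $cp1252_bytes);

# After decoding, both contain the single character U+00B1.
for my $text ($from_utf8, $from_cp1252) {
    printf "%d characters, plus-minus %s\n",
        length $text, ($text =~ /\x{B1}/ ? 'found' : 'not found');
}
```

Once the data is a proper character string, any of the matching approaches in this thread behaves the same way.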

    hp


    --
    Peter J. Holzer | Sysadmin WSR | http://www.hjp.at/
    "The curse of electronic word processing: you keep polishing your text
    until the parts of the sentence no longer fits together." -- Ralph Babel
    Peter J. Holzer, Dec 15, 2012
    #3
  4. Re: help - how to find what is the code for "+/-" symbol copied from Windows app

    On 2012-12-15 14:04, Ben Morrow wrote:
    > Quoth "Peter J. Holzer":
    >> Or just copy/paste the sign into your source code:
    >>
    >> #!/usr/bin/perl
    >> use warnings;
    >> use strict;
    >> use utf8;
    >>
    >> my $text = "the result is 8±2";

    >
    > Should I comment on the irony of your newsreader having converted that
    > to ISO8859-1? :)


    That's a feature, not a bug. Usenet is (except for the binaries groups)
    a text medium: The content of a usenet posting consists of characters,
    not bytes. Of course for transport it has to be encoded into some
    sequence of bytes, but as long as the encoding/decoding process is
    lossless, the NUA is free to employ any encoding it likes.

    In my case I have configured the following outgoing charsets:

    us-ascii,iso-8859-1,iso-8859-15,utf-8

    The order is significant, so since my posting contained characters
    which could not be represented in us-ascii, but could be represented in
    iso-8859-1, the latter was used. If I had also used a euro sign, it
    would have used iso-8859-15; and if I had used typographical quotes, it
    would have used utf-8.
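The selection rule described above (take the first charset in the list that can represent every character) can be sketched in Perl with the core Encode module. This is only an illustration of the idea, not how any particular newsreader implements it; `pick_charset` and the sample strings are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Preference order, as in the configuration quoted above.
my @charsets = qw(us-ascii iso-8859-1 iso-8859-15 utf-8);

# Return the first charset that can encode $text without loss.
sub pick_charset {
    my ($text) = @_;
    for my $charset (@charsets) {
        # FB_CROAK makes encode() die on an unmappable character.
        my $ok = eval { encode($charset, $text, FB_CROAK | LEAVE_SRC); 1 };
        return $charset if $ok;
    }
    return;   # nothing fits (cannot happen while utf-8 is in the list)
}

my @samples = (
    [ 'plain ASCII'         => "plain text" ],
    [ 'plus-minus sign'     => "8\x{B1}2" ],
    [ 'euro sign'           => "5 \x{20AC}" ],
    [ 'typographic quotes'  => "\x{201C}hi\x{201D}" ],
);
printf "%-20s %s\n", $_->[0], pick_charset($_->[1]) for @samples;
```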


    > (This is why I'm slightly suspicious of the whole idea of non-ASCII
    > source code. It's fine as long as it's just in a file, but tends to be
    > much less likely to survive diffs/mailing-list posts/&c. without being
    > mangled.)


    That can usually be avoided by attaching the diffs or code instead of
    including them in the main text part. It also makes them easier to handle
    for the receiver.

    Also, non-ASCII characters aren't the only ones mangled by common
    NUAs/MUAs. Many fold long lines, some remove leading whitespace, some
    change tabs into spaces, ...

    At least an unintended charset conversion can be easily undone with
    iconv or similar tools - other changes which MUAs are likely to inflict
    on a text are generally not reversible.
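For those who prefer to stay in Perl rather than reach for iconv, here is a small sketch of such a repair using the core Encode module. It assumes the text was meant to be UTF-8 but arrived as ISO-8859-1 bytes; `$received` is a made-up sample.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "8±2" as ISO-8859-1 bytes (the sign is the single byte 0xB1).
my $received = "8\x{B1}2";

# Decode from the charset it arrived in, then encode to the one it should have.
my $repaired = encode('UTF-8', decode('iso-8859-1', $received));

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $repaired), "\n";
# prints: 38 C2 B1 32
```

The conversion is lossless in this direction because every ISO-8859-1 byte maps to exactly one Unicode character.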

    hp


    Peter J. Holzer, Dec 16, 2012
    #4
  5. Dr.Ruud

    Dr.Ruud Guest

    On 2012-12-15 13:34, Ben Morrow wrote:

    > [recognizing &plusmn; with RE]
    >
    > use HTML::Entities "decode_entities";
    > my $plusmn = decode_entities "&plusmn;";


    Realize that HTML::Entities is not in core.

    And don't forget quotemeta:

    perl -wle '
    my $dot = "\x{2E}";
    print "1:", "a" =~ /$dot/;
    print "2:", "a" =~ /\x{2E}/;
    '
    1:1
    2:
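To spell out the fix, here is a short sketch of the same test with the interpolated variable quoted, once with `\Q...\E` and once with an explicit `quotemeta` call. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $dot = "\x{2E}";   # a literal "."

# Unquoted: the interpolated "." is a wildcard and matches "a".
my $unquoted = "a" =~ /$dot/ ? 1 : 0;

# Quoted with \Q...\E: only a literal "." matches.
my $quoted = "a" =~ /\Q$dot\E/ ? 1 : 0;

# Same effect with an explicit quotemeta call.
my $pattern  = quotemeta $dot;
my $via_func = "a" =~ /$pattern/ ? 1 : 0;

# A string that really contains a dot still matches.
my $literal = "a.b" =~ /\Q$dot\E/ ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted, quotemeta: $via_func, literal dot: $literal\n";
```

The plus-minus sign itself is not a regex metacharacter, so quoting it is harmless; the habit matters as soon as the interpolated value comes from data.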

    --
    Ruud
    Dr.Ruud, Dec 17, 2012
    #5
  6. Re: help - how to find what is the code for "+/-" symbol copied from Windows app

    On 2012-12-16 14:31, Shmuel Metz wrote:
    > On 12/16/2012 at 12:43 PM, "Peter J. Holzer" said:
    >>At least an unintended charset conversion can be easily undone
    >>with iconv or similar tools

    >
    > Not when there are no MIME headers.


    In the scenario mentioned by Ben (diffs sent by E-Mail) there are.

    > Absent charset, there's no way to know what the non-ASCII code points
    > are.


    Not really a problem in this scenario: There aren't that many plausible
    candidates and you have a few good tests to distinguish between right
    and wrong: 1: The patch has to apply; 2: The result has to make sense.

    hp


    Peter J. Holzer, Dec 18, 2012
    #6
  7. Dr.Ruud

    Dr.Ruud Guest

    On 2012-12-19 15:04, Shmuel (Seymour J.) Metz wrote:

    > What's wrong with use charnames? While \N{foo} may be more verbose
    > than a hex code, it's also more legible.


    Nothing wrong with 'use charnames'. In fact, the more it is used,
    the better it will be implemented.

    Does it already know the HTML-entities?
    \N{&plusmn;} would be handy to have.
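Something close to this can be approximated with the `:alias` option of the charnames pragma. The sketch below defines a custom alias named `plusmn`; the alias name is my own choice, and as far as I know the literal spelling `&plusmn;` would not be accepted as an alias, since alias names have to start with a letter.

```perl
use strict;
use warnings;
# Define a lexically scoped alias "plusmn" for the official character name.
use charnames ':full', ':alias' => { plusmn => 'PLUS-MINUS SIGN' };

# Hypothetical sample data containing U+00B1.
my $text = "the result is 8\x{B1}2";

# The alias can be used wherever \N{...} is allowed, including in a regex.
my $found = $text =~ /\N{plusmn}/ ? 1 : 0;

print "matched via alias: $found\n";
```

A table of such aliases for all HTML entity names would have to be maintained by hand or generated from another source.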

    --
    Ruud
    Dr.Ruud, Dec 20, 2012
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    On 2012-12-20 16:56, Dr.Ruud wrote:
    > On 2012-12-19 15:04, Shmuel (Seymour J.) Metz wrote:


    >> What's wrong with use charnames? While \N{foo} may be more verbose
    >> than a hex code, it's also more legible.

    >
    > Nothing wrong with 'use charnames'. In fact, the more it is used,
    > the better it will be implemented.
    >
    > Does it already know the HTML-entities?
    > \N{&plusmn;} would be handy to have.


    See also:

    http://98.245.80.27/tcpc/scripts/unicore/html_alias.pl

    (seems to be missing a closing ')' though)

    --
    Ruud
    Dr.Ruud, Dec 20, 2012
    #8