help - how to find what is the code for "+/-" symbol copied from Windows app

Joe · Dec 15, 2012

I received an Excel data file that contains a "+/-" symbol (HTML code &#177; or &plusmn;) that can be copied and displayed in Word, Notepad, the "Kompozer" HTML editor, and the Unix vi and pico editors, and loaded into and retrieved from MySQL running on Linux. But when I need to manipulate the data in Perl, I am lost as to how to recognize the symbol with a regular expression. Could anyone help?

Thanks in advance!

joe

Dr.Ruud · Dec 15, 2012

I received an Excel data file that contains a "+/-" symbol (HTML code &#177; or &plusmn;) that can be copied and displayed in Word, Notepad, the "Kompozer" HTML editor, and the Unix vi and pico editors, and loaded into and retrieved from MySQL running on Linux. But when I need to manipulate the data in Perl, I am lost as to how to recognize the symbol with a regular expression. Could anyone help?

First, look up its Unicode code point.

Google for: unicode plus minus, which will lead you to (for example)
http://www.fileformat.info/info/unicode/char/b1/index.htm

From there it is easy to deduce: \x{B1}.

The page also gives you the exact Unicode character name
'PLUS-MINUS SIGN', which you can use in regular expressions.
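
A minimal sketch of both spellings, assuming the data has already been decoded into Perl characters. The sample string and variable names are made up for illustration; the explicit `use charnames` line is only there for Perls older than 5.16, where `\N{...}` does not load it automatically.

```perl
use strict;
use warnings;
use charnames ':full';   # explicit load; newer Perls do this on demand

# Hypothetical sample value, written with an escape so the source stays ASCII.
my $text = "the result is 8\x{B1}2";

# Match by code point ...
my $by_code = $text =~ /\x{B1}/ ? 1 : 0;

# ... or by the exact Unicode character name.
my $by_name = $text =~ /\N{PLUS-MINUS SIGN}/ ? 1 : 0;

# Capture the numbers on either side of the sign.
my ($value, $tolerance) = $text =~ /(\d+)\x{B1}(\d+)/;

print "by code: $by_code, by name: $by_name, ",
      "value=$value, tolerance=$tolerance\n";
```

The named form is longer but says what the character is without a lookup table at hand.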

Peter J. Holzer · Dec 15, 2012

Or just use what you already know:

use HTML::Entities "decode_entities";
my $plusmn = decode_entities "&plusmn;";

Or just copy/paste the sign into your source code:

#!/usr/bin/perl
use warnings;
use strict;
use utf8;

my $text = "the result is 8±2";

if ($text =~ m/±/) {
    print "The text contains a plus/minus sign\n";
}
__END__

(Of course you need to make sure that the module you use to read the
Excel sheet really returns the ± as a single character U+00B1, but this is true
for all methods. If it doesn't, use Encode::decode to convert whatever
your Excel module returns into something sane.)
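
A sketch of that decoding step, assuming the Excel module hands back raw bytes rather than characters. Which encoding those bytes use is an assumption here: cp1252 and UTF-8 are two plausible candidates for data that originated in a Windows application, and the byte strings below are invented samples.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw values: the text "8", plus-minus, "2" as cp1252 bytes
# and as UTF-8 bytes.
my $cp1252_bytes = "8\xB12";
my $utf8_bytes   = "8\xC2\xB12";

# Before decoding, the UTF-8 form is four bytes long.
my $raw_length = length $utf8_bytes;

# Turn the bytes into Perl character strings.
my $from_cp1252 = decode('cp1252', $cp1252_bytes);
my $from_utf8   = decode('UTF-8',  $utf8_bytes);

# After decoding, both are the same three-character string,
# and the sign is the single character U+00B1.
my $same   = $from_cp1252 eq $from_utf8 ? 1 : 0;
my $length = length $from_utf8;
my $found  = $from_utf8 =~ /\x{B1}/ ? 1 : 0;

print "same=$same length=$length found=$found raw_length=$raw_length\n";
```

Once the string holds characters rather than bytes, any of the matching methods discussed in this thread behaves the same way.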

hp

Peter J. Holzer · Dec 16, 2012

Should I comment on the irony of your newsreader having converted that
to ISO8859-1?

That's a feature, not a bug. Usenet is (except for the binaries groups)
a text medium: The content of a usenet posting consists of characters,
not bytes. Of course for transport it has to be encoded into some
sequence of bytes, but as long as the encoding/decoding process is
lossless, the NUA is free to employ any encoding it likes.

In my case I have configured the following outgoing charsets:

us-ascii,iso-8859-1,iso-8859-15,utf-8

The order is significant, so since my posting contained characters
which could not be represented in us-ascii, but could be represented in
iso-8859-1, the latter was used. If I had also used a euro sign, it
would have used iso-8859-15; and if I had used typographical quotes, it
would have used utf-8.

(This is why I'm slightly suspicious of the whole idea of non-ASCII
source code. It's fine as long as it's just in a file, but tends to be
much less likely to survive diffs/mailing-list posts/&c. without being
mangled.)

That can usually be avoided by attaching the diffs or code instead of
including them in the main text part. It also makes them easier to handle
for the receiver.

Also, non-ASCII characters aren't the only ones mangled by common
NUAs/MUAs. Many fold long lines, some remove leading whitespace, some
change tabs into spaces, ...

At least an unintended charset conversion can be easily undone with
iconv or similar tools - other changes which MUAs are likely to inflict
on a text are generally not reversible.
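
For comparison, a rough Perl counterpart to running `iconv -f ISO-8859-1 -t UTF-8` on a file, using the core Encode module. This is only a sketch: the diff line is invented, and it assumes you already know that the text arrived as ISO-8859-1 and should be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical line from a diff that arrived as ISO-8859-1 bytes;
# \xB1 is the plus-minus sign in that charset.
my $line   = "+my \$tolerance = '\xB1 2';";
my $before = length $line;

# Convert the bytes in place from ISO-8859-1 to UTF-8.
from_to($line, 'iso-8859-1', 'UTF-8');
my $after = length $line;

# The sign now occupies two bytes (C2 B1), so the line grew by one byte.
print "before=$before after=$after\n";
```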

hp

Dr.Ruud · Dec 17, 2012

[recognizing ± with RE]

use HTML::Entities "decode_entities";
my $plusmn = decode_entities "&plusmn;";

Realize that HTML::Entities is not in core.

And don't forget quotemeta:

perl -wle '
my $dot = "\x{2E}";
print "1:", "a" =~ /$dot/;
print "2:", "a" =~ /\x{2E}/;
'
1:1
2:
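
In the one-liner above, the interpolated "." acts as a metacharacter and matches "a", while the literal \x{2E} in the pattern does not. A small sketch of the remedy, using \Q...\E (the in-pattern form of quotemeta); the variable names are made up for illustration. The plus-minus sign itself is not a regex metacharacter, so this matters mainly when the interpolated string can contain arbitrary data.

```perl
use strict;
use warnings;

my $dot = "\x{2E}";   # a literal "."

# Interpolated as-is, the dot is a metacharacter and matches any character.
my $unquoted = "a" =~ /$dot/ ? 1 : 0;

# Quoted with \Q...\E, it only matches a literal dot.
my $quoted_miss = "a"   =~ /\Q$dot\E/ ? 1 : 0;
my $quoted_hit  = "a.b" =~ /\Q$dot\E/ ? 1 : 0;

# quotemeta() does the same thing outside a pattern.
my $escaped = quotemeta $dot;

print "unquoted=$unquoted quoted_miss=$quoted_miss ",
      "quoted_hit=$quoted_hit escaped=$escaped\n";
```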

Peter J. Holzer · Dec 18, 2012

Not when there are no MIME headers.

In the scenario mentioned by Ben (diffs sent by E-Mail) there are.

Absent charset, there's no way to know what the non-ASCII code points
are.

Not really a problem in this scenario: There aren't that many plausible
candidates and you have a few good tests to distinguish between right
and wrong: 1: The patch has to apply; 2: The result has to make sense.

hp

Dr.Ruud · Dec 20, 2012

What's wrong with use charnames? While \N{foo} may be more verbose
than a hex code, it's also more legible.

Nothing wrong with 'use charnames'. In fact, the more it is used,
the better it will be implemented.

Does it already know the HTML-entities?
\N{&plusmn;} would be handy to have.
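
Something close to this can be had with a custom alias, which charnames lets you define yourself. A minimal sketch, assuming you are happy to spell the name without the "&" and ";"; the alias name plusmn is chosen here for illustration and is not something the module is claimed to provide on its own.

```perl
use strict;
use warnings;

# Hypothetical alias: map the HTML entity name to the official Unicode name.
use charnames ':full', ':alias' => { plusmn => 'PLUS-MINUS SIGN' };

my $sign = "\N{plusmn}";
my $text = "8\x{B1}2";   # invented sample value

# The alias resolves to U+00B1 in strings and in patterns alike.
my $same  = $sign eq "\x{B1}"        ? 1 : 0;
my $found = $text =~ /\N{plusmn}/    ? 1 : 0;

print "same=$same found=$found\n";
```

The alias is lexically scoped, so it has to be declared in each file or block that uses it.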

Dr.Ruud · Dec 20, 2012

On 2012-12-19 15:04, Shmuel (Seymour J.) Metz wrote:

Nothing wrong with 'use charnames'. In fact, the more it is used,
the better it will be implemented.

Does it already know the HTML-entities?
\N{&plusmn;} would be handy to have.

See also:

http://98.245.80.27/tcpc/scripts/unicore/html_alias.pl

(seems to be missing a closing ')' though)
