Help: how to find the code for the "+/-" symbol copied from a Windows app

Joe

I received an Excel data file that contains a "+/-" symbol (HTML entity &plusmn; or &#177;). It can be copied and displayed in Word, Notepad, the "Kompozer" HTML editor, and the Unix vi and pico editors, and it can be loaded into and retrieved from MySQL running on Linux. But when I need to manipulate the data in Perl, I am lost as to how to recognize the symbol with a regular expression. Could anyone help?

Thanks in advance!

joe
 
Dr.Ruud

First, look up its Unicode code point.

Google for: unicode plus minus, which will lead you to (for example)
http://www.fileformat.info/info/unicode/char/b1/index.htm

From there it is easy to deduce: \x{B1}.

The page also gives you the exact Unicode character name
'PLUS-MINUS SIGN', which you can use in regular expressions.
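
A minimal sketch of both spellings inside a regular expression, assuming the data has already been decoded into Perl characters. The sample string and variable names are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} on older perls, harmless on newer ones

# Hypothetical sample; \x{B1} is U+00B1 written with ASCII-only source code.
my $text = "the result is 8\x{B1}2";

# Match by code point and by Unicode character name.
my $by_code = $text =~ /\x{B1}/              ? 1 : 0;
my $by_name = $text =~ /\N{PLUS-MINUS SIGN}/ ? 1 : 0;

# Pull out the number on either side of the sign.
my ($value, $tolerance) = $text =~ /(\d+)\x{B1}(\d+)/;

print "by code: $by_code, by name: $by_name\n";
print "value $value, tolerance $tolerance\n";
```

Both forms keep the source file pure ASCII, which sidesteps the question of how the script itself is encoded.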
 
Peter J. Holzer

Or just use what you already know:

use HTML::Entities "decode_entities";
my $plusmn = decode_entities "&plusmn;";

Or just copy/paste the sign into your source code:

#!/usr/bin/perl
use warnings;
use strict;
use utf8;

my $text = "the result is 8±2";

if ($text =~ m/±/) {
    print "The text contains a plus/minus sign\n";
}
__END__


(Of course you need to make sure that the module you use to read the
Excel sheet really returns the ± as a single character U+00B1, but this is true
for all methods. If it doesn't, use Encode::decode to convert whatever
your Excel module returns into something sane.)
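
A rough sketch of that decoding step. It assumes, purely for illustration, that the Excel module handed back UTF-8 encoded bytes; the actual encoding depends on the module and the file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the reader returned UTF-8 bytes, so the sign arrives as 0xC2 0xB1.
my $bytes = "8 \xC2\xB1 2";

# Turn the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

# The anchored pattern only fits once the two bytes have become one character.
my $raw_ok     = $bytes =~ /^8 \x{B1} 2$/ ? 1 : 0;
my $decoded_ok = $chars =~ /^8 \x{B1} 2$/ ? 1 : 0;

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
print "raw match: $raw_ok, decoded match: $decoded_ok\n";
```

If the module returns some other encoding (cp1252 is a plausible candidate for Windows data), only the first argument to decode changes.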

hp
 
Peter J. Holzer

> Should I comment on the irony of your newsreader having converted that
> to ISO8859-1? :)

That's a feature, not a bug. Usenet is (except for the binaries groups)
a text medium: The content of a usenet posting consists of characters,
not bytes. Of course for transport it has to be encoded into some
sequence of bytes, but as long as the encoding/decoding process is
lossless, the NUA is free to employ any encoding it likes.

In my case I have configured the following outgoing charsets:

us-ascii,iso-8859-1,iso-8859-15,utf-8

The order is significant, so since my posting contained characters
which could not be represented in us-ascii, but could be represented in
iso-8859-1, the latter was used. If I had also used a euro sign, it
would have used iso-8859-15; and if I had used typographical quotes, it
would have used utf-8.
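
For the curious, the selection rule can be sketched in a few lines of Perl. This is only an illustration of the idea using the core Encode module, not the newsreader's actual code, and pick_charset is a made-up name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The preference list from above; order matters.
my @charsets = qw(us-ascii iso-8859-1 iso-8859-15 utf-8);

# Return the first charset that can represent every character of $text.
sub pick_charset {
    my ($text) = @_;
    for my $charset (@charsets) {
        my $copy = $text;   # encode() with a CHECK value may modify its argument
        # FB_CROAK makes encode() die on the first unmappable character.
        return $charset if eval { encode($charset, $copy, Encode::FB_CROAK()); 1 };
    }
    return;
}

print pick_charset("8 +/- 2"), "\n";              # plain ASCII
print pick_charset("8\x{B1}2"), "\n";             # plus-minus sign
print pick_charset("\x{20AC}5"), "\n";            # euro sign
print pick_charset("\x{201C}hi\x{201D}"), "\n";   # typographical quotes
```

The four calls should print us-ascii, iso-8859-1, iso-8859-15 and utf-8 respectively, matching the cases described above.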

> (This is why I'm slightly suspicious of the whole idea of non-ASCII
> source code. It's fine as long as it's just in a file, but tends to be
> much less likely to survive diffs/mailing-list posts/&c. without being
> mangled.)

That can usually be avoided by attaching the diffs or code instead of
including them in the main text part. It also makes them easier to handle
for the receiver.

Also, non-ASCII characters aren't the only ones mangled by common
NUAs/MUAs. Many fold long lines, some remove leading whitespace, some
change tabs into spaces, ...

At least an unintended charset conversion can be easily undone with
iconv or similar tools - other changes which MUAs are likely to inflict
on a text are generally not reversible.
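
If iconv isn't at hand, the same kind of repair can be sketched in Perl with Encode::from_to. The example assumes the text arrived as ISO-8859-1 bytes and is wanted as UTF-8; the sample data is invented:

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Assumed input: ISO-8859-1 bytes, where the plus-minus sign is the single byte 0xB1.
my $line = "8 \xB1 2";

# Convert in place, roughly what `iconv -f ISO-8859-1 -t UTF-8` does for a file.
from_to($line, 'iso-8859-1', 'utf-8');

# Show the resulting bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $line;
print "$hex\n";   # 38 20 C2 B1 20 32
```

Swapping the two charset arguments converts in the other direction.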

hp
 
Dr.Ruud

[recognizing ± with RE]

> use HTML::Entities "decode_entities";
> my $plusmn = decode_entities "&plusmn;";

Realize that HTML::Entities is not in core.

And don't forget quotemeta:

perl -wle '
my $dot = "\x{2E}";
print "1:", "a" =~ /$dot/;
print "2:", "a" =~ /\x{2E}/;
'
1:1
2:
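
The same point as a small script, with the \Q...\E form added. It is only a sketch; the variable names are arbitrary:

```perl
use strict;
use warnings;

my $dot = "\x{2E}";   # a literal "." stored in a variable

# Interpolated as-is, the dot is a regex metacharacter and matches any character.
my $unquoted = "a" =~ /$dot/ ? 1 : 0;

# \Q...\E applies quotemeta, so only a real dot matches.
my $quoted_miss = "a"   =~ /\Q$dot\E/ ? 1 : 0;
my $quoted_hit  = "a.b" =~ /\Q$dot\E/ ? 1 : 0;

print "unquoted: $unquoted\n";        # 1
print "quoted, no dot: $quoted_miss\n";   # 0
print "quoted, dot: $quoted_hit\n";       # 1
```

The plus-minus sign is not a metacharacter itself, but quoting any interpolated variable is a cheap habit that avoids surprises like the first match.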
 
Peter J. Holzer

> Not when there are no MIME headers.

In the scenario mentioned by Ben (diffs sent by E-Mail) there are.

> Absent charset, there's no way to know what the non-ASCII code points
> are.

Not really a problem in this scenario: There aren't that many plausible
candidates and you have a few good tests to distinguish between right
and wrong: 1: The patch has to apply; 2: The result has to make sense.

hp
 
Dr.Ruud

> What's wrong with use charnames? While \N{foo} may be more verbose
> than a hex code, it's also more legible.

Nothing wrong with 'use charnames'. In fact, the more it is used,
the better it will be implemented.

Does it already know the HTML entities?
\N{&plusmn;} would be handy to have.
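
As far as I know charnames does not ship the HTML entity names, and "&" and ";" would not be valid in a name anyway. One way to get close is a custom alias, assuming the ":alias" feature of charnames is available in your perl. The alias name plusmn below is my own choice, not something the module predefines:

```perl
use strict;
use warnings;
# ":alias" registers a user-defined name for use inside \N{...}.
use charnames ':full', ':alias' => { plusmn => 'PLUS-MINUS SIGN' };

my $sign = "\N{plusmn}";               # resolved through the alias
my $name = charnames::viacode(0xB1);   # official name for U+00B1

my $found = "8\x{B1}2" =~ /\N{plusmn}/ ? 1 : 0;

printf "U+%04X is %s\n", ord($sign), $name;
print "alias matched: $found\n";
```

A hash mapping all entity names to their Unicode names could be passed in the same way, at the cost of maintaining that table yourself.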
 
