How to get all digits, letters and punctuation characters in Perl?

Peng Yu

Hi,

In python, I can do the following to get a category of characters. But
I don't find a corresponding thing in perl. Could anybody let me know
if there is one? Thanks!

~/linux/test/python/man/library/string/printable$ cat main.py
#!/usr/bin/env python

import string
print string.digits + string.letters + string.punctuation
print string.printable



Regards,
Peng
 
Jürgen Exner

Peng Yu said:
Hi,

In python, I can do the following to get a category of characters. But
I don't find a corresponding thing in perl. Could anybody let me know
if there is one? Thanks!

~/linux/test/python/man/library/string/printable$ cat main.py
#!/usr/bin/env python

import string
print string.digits + string.letters + string.punctuation
print string.printable

I am smelling an X-Y problem here. Why do you think you need this set of
characters?

If you just want to test if a certain character belongs to a specific
class then you can use POSIX character classes in an RE, e.g.
m/[[:alpha:]]/;
I suppose you could also use this test to enumerate all characters of
this class although this does seem to be somewhat backwards indeed.
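A minimal sketch of that enumeration, restricted to ASCII (code points 0 to 127) as an assumption. The helper name `ascii_matching` is made up for illustration and is not part of any module:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return every ASCII character (code points 0..127)
# that matches the given regex, in code point order.
sub ascii_matching {
    my ($re) = @_;
    return join '', grep { $_ =~ $re } map { chr } 0 .. 127;
}

my $digits  = ascii_matching(qr/[[:digit:]]/);
my $letters = ascii_matching(qr/[[:alpha:]]/);
my $punct   = ascii_matching(qr/[[:punct:]]/);

print "$digits\n$letters\n$punct\n";
```

Because the characters come out in code point order, the letters are listed as A-Z followed by a-z. Widening the range beyond 127 brings Unicode and locale rules into play, so the result then depends on the definitions in use.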

jue
 
Peter Makholm

Peng Yu said:
In python, I can do the following to get a category of characters. But
I don't find a corresponding thing in perl. Could anybody let me know
if there is one? Thanks!

It depends on your definitions.

The Unicode standard defines 9293 letters, 350 digits, and 582
punctuation characters - and this is just the Basic Multilingual Plane.

Tom Christiansen has made a tool called `unichars` to list characters
matching a number of conditions (available in the Unicode::Tussle
distribution). His code basically just iterates over all relevant
codepoints excluding a number of special cases:

for my $codepoint ( $first_codepoint .. $last_codepoint ) {

    # gaggy UTF-16 surrogates are invalid UTF-8 code points
    next if $codepoint >= 0xD800 && $codepoint <= 0xDFFF;

    # from utf8.c in perl src; must avoid fatals in 5.10
    next if $codepoint >= 0xFDD0 && $codepoint <= 0xFDEF;

    next if 0xFFFE == ($codepoint & 0xFFFE); # both FFFE and FFFF

    # debug("testing codepoint $codepoint");

    # see "Unicode non-character %s is illegal for interchange" in perldiag(1)
    $_ = do { no warnings "utf8"; chr($codepoint) };

    # fixes "the Unicode bug"; decode() comes from the Encode module
    unless (utf8::is_utf8($_)) {
        $_ = decode("iso-8859-1", $_);
    }

    # Test the given conditions, e.g. /\p{Digit}/
}

But given your python example, this is probably way overkill for what
you are trying.
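For the plain ASCII case, a rough Perl counterpart of the Python script could look like the sketch below. It assumes the ASCII values that Python 2 gives these constants in the C locale; the variable names simply mirror the Python ones and are not a Perl API:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# ASCII-only stand-ins for Python's string.digits, string.letters,
# string.punctuation and string.printable (assumed C locale values).
my $digits  = join '', '0' .. '9';
my $letters = join '', 'a' .. 'z', 'A' .. 'Z';

# Punctuation: printable, non-space ASCII that is not a digit or letter.
my $punctuation = join '', grep { !/[0-9A-Za-z]/ } map { chr } 0x21 .. 0x7E;

my $whitespace = " \t\n\r\x0b\x0c";
my $printable  = $digits . $letters . $punctuation . $whitespace;

print $digits . $letters . $punctuation, "\n";
print $printable, "\n";
```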

//Makholm
 
