Control characters - regex to match/lose these?

Justin C

Some of the text files I have to parse with perl have control characters
within them. They are controls to turn on and off things when printing
(stuff like "make this bold", "make this bigger", "draw a box around
this", that kind of thing). If I view the text file with /usr/bin/less I
see them as: ESC(s12H ESC&16D ESC(s16H ESC&18D

Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

Those are the four that jump out at me while scrolling this document,
there are probably more.

I've tried perldoc -q control and perldoc -q escape but neither of those
mentions control characters or escape sequences. Can someone point me at
the documentation I need to help me catch these and strip them out? Will
a regex work? What does perl see these as?

Thank you for your help.


Justin.
 
Paul Lalli

Justin said:
Some of the text files I have to parse with perl have control characters
within them. They are controls to turn on and off things when printing
(stuff like "make this bold", "make this bigger", "draw a box around
this", that kind of thing). If I view the text file with /usr/bin/less I
see them as: ESC(s12H ESC&16D ESC(s16H ESC&18D

Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

Those are the four that jump out at me while scrolling this document,
there are probably more.

I've tried perldoc -q control and perldoc -q escape but neither of those
mentions control characters or escape sequences. Can someone point me at
the documentation I need to help me catch these and strip them out? Will
a regex work? What does perl see these as?

Just a guess, but I'd try to use the [:cntrl:] character class. See
perldoc perlre

s/[[:cntrl:]]//g;
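
A minimal sketch of what that substitution does to text like the samples in the question. The helper name and the sample string are made up for illustration. Note that [:cntrl:] covers tab and newline as well as ESC, and that the printable part of each sequence is left behind.

```perl
use strict;
use warnings;

# Hypothetical helper: strip every control character, as in the one-liner above.
sub strip_cntrl {
    my ($text) = @_;
    $text =~ s/[[:cntrl:]]//g;
    return $text;
}

# Sample built from the sequences quoted in the question; "\e" is ESC (0x1B).
my $sample = "\e(s12HInvoice\e&16D\tTotal\n";

# ESC, tab and newline are removed; "(s12H" and "&16D" stay in the text.
print strip_cntrl($sample), "\n";
```

If line structure matters, it may be safer to remove only the characters that are known to be unwanted.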

Paul Lalli
 
Dr.Ruud

Justin C schreef:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

(no, [@-Z] isn't always the same as [@A-Z])


(no, [@[:upper:]] isn't a proper alternative either, in this context)

You can be more specific: \e(?:[&#*()](?:[a-z](?:-?\d+)?)?[@A-Z])|[=9]
 
Ted Zlatanov

Some of the text files I have to parse with perl have control characters
within them. They are controls to turn on and off things when printing
(stuff like "make this bold", "make this bigger", "draw a box around
this", that kind of thing). If I view the text file with /usr/bin/less I
see them as: ESC(s12H ESC&16D ESC(s16H ESC&18D

Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

Those are the four that jump out at me while scrolling this document,
there are probably more.

I've tried perldoc -q control and perldoc -q escape but neither of those
mentions control characters or escape sequences. Can someone point me at
the documentation I need to help me catch these and strip them out? Will
a regex work? What does perl see these as?

If these are ANSI or similar escape codes, they may consist of more
than one character.

You can write a Perl program or use "od -a -h FILENAME" to look at the
contents of the file (assuming a Unix environment with the GNU
coreutils installed).

In Perl it's pretty easy:

perl -p -e '@a = split //; foreach my $k (@a) { print "$k = ", ord($k), "\n"}' FILENAME

The above, given a file, will print each character, followed by the
decimal numeric code for that character, per line. Then it will print
the line itself. It's primitive but it will show you the exact
characters that are in your input. Then look at the Perl regular
expression syntax (especially the POSIX extension) and an ASCII
character reference table to see what characters exactly you want to
filter out.
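
The same idea as a small script, in case the raw control characters make the one-liner's output hard to read on a terminal. This is only a sketch: the helper name is hypothetical, and it shows non-printable characters as a hex escape instead of printing them raw.

```perl
use strict;
use warnings;

# Hypothetical helper: describe each character of a string as "char = code",
# showing non-printable characters as a hex escape instead of the raw byte.
sub describe_chars {
    my ($line) = @_;
    return map {
        my $code  = ord;
        my $shown = /[[:print:]]/ ? $_ : sprintf('\\x%02X', $code);
        "$shown = $code";
    } split //, $line;
}

# One of the sequences from the question; "\e" is ESC (decimal 27).
print "$_\n" for describe_chars("\e(s12H");
```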

Ted
 
Justin C

Justin said:
Some of the text files I have to parse with perl have control characters
within them. They are controls to turn on and off things when printing
(stuff like "make this bold", "make this bigger", "draw a box around
this", that kind of thing). If I view the text file with /usr/bin/less I
see them as: ESC(s12H ESC&16D ESC(s16H ESC&18D

Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

Those are the four that jump out at me while scrolling this document,
there are probably more.

Just a guess, but I'd try to use the [:cntrl:] character class. See
perldoc perlre

s/[[:cntrl:]]//g;

Thanks Paul,

Quick reply and I've been able to move along. This does leave me with
the bits after the ESC, I'm going to look further into what Dr Ruud has
suggested further down the thread and try to catch whatever may come
along in future rather than just these specifics now.

Thanks for your help - I really should have looked at perlre to start
with, it's obvious (with hindsight at least) that I was going to need an
RE to deal with this.

Justin.
 
Justin C

Justin C schreef:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

Gonna have to learn more about REs to understand this one!

I'm reading through perlre, and the ?: is doing my head in a bit; maybe
it's been a long day: ``it groups subexpressions like "()" but doesn't
make backreferences'' ... ``This is for clustering, not capturing''



Justin.
 
Dr.Ruud

Justin C schreef:
Dr.Ruud:
Justin C:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

Gonna have to learn more about REs to understand this one!

I'm reading through perlre, the ?: is doing my head in a bit, may be
it's been a long day: ``it groups subexpressions like "()" but doesn't
make backreferences'' ... ``This is for clustering, not capturing''

Concentrate on the "clustering, not capturing".

You can just as well do it step by step:

s/\e[=9]//g ;
s/\e[^@A-Z]*[@A-Z]//g ;

google: HP escape

An extensive list:
http://printers.necsam.com/public/printers/pclcodes/pcl5hp.htm
That also mentions some "@PJL.*\n" sequences you might encounter.
Look for "(Data)" on that page for some other special sequences.
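
A rough sketch of removing such "@PJL" runs with the pattern quoted above. The helper name and the sample job text are invented for illustration; a real job file may look different, so check the pattern against actual input first.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every "@PJL ... newline" run.
# The @ is escaped so Perl does not try to interpolate an array.
sub strip_pjl {
    my ($text) = @_;
    $text =~ s/\@PJL.*\n//g;
    return $text;
}

# Made-up sample, not taken from a real printer job.
my $job = "\@PJL SET COPIES=1\nReport body\n";
print strip_pjl($job);
```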
 
Ben Morrow

Quoth (e-mail address removed):
Justin C schreef:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

Without really knowing what you're matching, isn't there a bug here? Do
you really want to strip all = and all 9 from the input? Also, @ needs
escaping in qq strings:

s/\e (?: [^\@A-Z]* [\@A-Z] | [=9] )//gx;

Gonna have to learn more about REs to understand this one!

I'll walk you through it with the /x flag:

s/                  # find
    \e              # an escape character (^[ in vim)
    (?:             # followed by either
        [^\@A-Z]    # anything *but* @ or A-Z
        *           # 0-or-more times
        [\@A-Z]     # terminated by @ or A-Z
    |               # or
        [=9]        # = or 9
    )
//gx;               # and replace with nothing

The /x flag is your friend. I would have said that even my first version
above, without the comments, is *much* more readable than Dr.Ruud's
original.

I'm reading through perlre, the ?: is doing my head in a bit, may be
it's been a long day: ``it groups subexpressions like "()" but doesn't
make backreferences'' ... ``This is for clustering, not capturing''

Basically, /(?: ... )/ is the same as /( ... )/ except it doesn't create
an $N variable. This makes it somewhat faster, and lets you capture what
you want to capture, rather than creating a whole lot of captures you
didn't really want; which helps when putting regexen together out of
separate pieces.
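
To see the difference in practice, here is a small sketch. The patterns and helper names are made up for this illustration and are not the ones from the thread. It matches one of the sample sequences twice, once with plain parens and once with (?: ), and prints what ends up captured.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this sketch.

# Plain parentheses: the character after ESC lands in $1, the digits in $2.
sub captures_with_parens {
    my ($text) = @_;
    my @got = $text =~ /\e([(&])s?(\d+)/;
    return @got;
}

# (?: ... ) still groups, but only the digits are captured (as $1).
sub captures_with_cluster {
    my ($text) = @_;
    my @got = $text =~ /\e(?:[(&])s?(\d+)/;
    return @got;
}

my $sample = "\e(s12H";    # one of the sequences from the question
print join(',', captures_with_parens($sample)),  "\n";    # two captures
print join(',', captures_with_cluster($sample)), "\n";    # one capture
```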

Ben
 
Ben Morrow

Quoth "Dr.Ruud":
Justin C schreef:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

(you need to escape @)
(no, [@-Z] isn't always the same as [@A-Z])

(no, [@[:upper:]] isn't a proper alternative either, in this context)

I seem to be missing something, would you mind explaining? Surely either
you're assuming ASCII, in which case /[\@-Z]/ *is* the same as
/[\@A-Z]/, or you're not, in which case you need either /[\@[:upper:]]/
or perhaps /[\x40-\x5a]/? If '@' ne "\x40" then [A-Z] probably isn't
just the upper-case letters either.

Ben
 
Dr.Ruud

Ben Morrow schreef:
justin.news:
Dr.Ruud:
Justin C:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

Without really knowing what you're matching, isn't there a bug here?

Yes, there is. And I even ruined it further just before posting...

I'll walk you through it with the /x flag:

s/ # find
\e # an escape character (^[ in vim)
(?: # followed by either
[^\@A-Z] # anything *but* @ or A-Z
* # 0-or-more times
[\@A-Z] # terminated by @ or A-Z
| # or
[=9] # = or 9
)
//gx; # and replace with nothing

The /x flag is your friend. I would have said that even my first
version above, without the comments, is *much* more readable than
Dr.Ruud's original.

Thanks for the correction.
 
Dr.Ruud

Ben Morrow schreef:
Dr.Ruud:
(no, [@-Z] isn't always the same as [@A-Z])
(no, [@[:upper:]] isn't a proper alternative either, in this context)

I seem to be missing something, would you mind explaining? Surely
either you're assuming ASCII, in which case /[\@-Z]/ *is* the same as
/[\@A-Z]/, or you're not, in which case you need either
/[\@[:upper:]]/ or perhaps /[\x40-\x5a]/? If '@' ne "\x40" then [A-Z]
probably isn't just the upper-case letters either.

See perlebcdic:

REGULAR EXPRESSION DIFFERENCES
As of perl 5.005_03 the letter range regular expression such as [A-Z]
and [a-z] have been especially coded to not pick up gap characters.

In (for example) Latin1 and EBCDIC, [[:upper:]] contains more than
[A-Z], so [:upper:] is too much.
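
A quick way to check both points on an ASCII platform is sketched below. The helper name is hypothetical, and U+0410 (Cyrillic capital A) is just one convenient example of an upper-case letter outside A-Z. A code point above 255 is used so that Perl applies Unicode rules to the match.

```perl
use strict;
use warnings;

# Hypothetical helper: ASCII code points on which [\@-Z] and [\@A-Z] disagree.
sub ascii_differences {
    return grep {
        my $c = chr $_;
        ($c =~ /[\@-Z]/ ? 1 : 0) != ($c =~ /[\@A-Z]/ ? 1 : 0);
    } 0 .. 127;
}

# Cyrillic capital A (U+0410), chosen here as an upper-case letter outside A-Z.
my $cyrillic_a = "\x{0410}";

my @diff = ascii_differences();
print "ASCII code points where the classes differ: ", scalar(@diff), "\n";
print "[[:upper:]] matches U+0410: ", ($cyrillic_a =~ /[[:upper:]]/ ? "yes" : "no"), "\n";
print "[A-Z] matches U+0410: ",       ($cyrillic_a =~ /[A-Z]/       ? "yes" : "no"), "\n";
```

On an EBCDIC platform the first result would presumably differ, which is the case the perlebcdic passage is about.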
 
Justin C

Justin C schreef:
Dr.Ruud:
Justin C:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

Gonna have to learn more about REs to understand this one!

I'm reading through perlre, the ?: is doing my head in a bit, may be
it's been a long day: ``it groups subexpressions like "()" but doesn't
make backreferences'' ... ``This is for clustering, not capturing''

Concentrate on the "clustering, not capturing".

You can just as well do it step by step:

s/\e[=9]//g ;

Why [=9]? I've read, and re-read, this thread and tried to understand
what you and Ben are trying to explain to me (I'm really not stupid,
honest), but I can't see, given the examples I gave in the OP, why the
[=9].

s/\e[^@A-Z]*[@A-Z]//g ;

I also don't understand how the "clustering" use of ?: is enabling these
two lines to be made into one (it may be that if I understand one part of
this the other falls into place). I'm not seeing what effect the ?: is
having; the one line without the ?: looks, to me, like it'd do the same
as the two.

I'm very grateful for the time you've both spent on this and don't
intend to make this difficult but, if you could try one last time, I really
would like to get this.

google: HP escape

That's very interesting. I can see that's something that needs
bookmarking for when I want to take some of the more plain reports and
"tart" them up for printing.

And that's even more so!

I'm sure I could be quite good at this coding thing if only my job
allowed me to do more of it.


Justin.
 
Ben Morrow

Quoth (e-mail address removed):
Justin C schreef:
Dr.Ruud:
Justin C:
Viewed in vim they look like: ^[(s12H ^[&16D ^[(16H ^[&18D

s/\e(?:[^@A-Z]*[@A-Z])|[=9]//g

Note there are two bugs here (as I mentioned earlier). This should read

s/\e(?:[^\@A-Z]*[\@A-Z]|[=9])//g;

that is, the [=9] should be *inside* the parens, and the @s should be
backwhacked.

Gonna have to learn more about REs to understand this one!

I'm reading through perlre, the ?: is doing my head in a bit, may be
it's been a long day: ``it groups subexpressions like "()" but doesn't
make backreferences'' ... ``This is for clustering, not capturing''

Concentrate on the "clustering, not capturing".

You can just as well do it step by step:

s/\e[=9]//g ;

Why [=9]? I've read, and re-read, this thread and tried to understand
what you and Ben are trying to explain to me (I'm really not stupid,
honest), but I can't see, given the examples I gave in the OP, why the
[=9].

Dr.Ruud is assuming that what you are in fact trying to do is remove HP
PCL escape sequences. 'ESC 9' and 'ESC =' are valid sequences, in
addition to the 'ESC ( s 1 S'-type sequence you mentioned.

s/\e[^@A-Z]*[@A-Z]//g ;

I also don't understand how the "clustering" use of ?: is enabling these
two lines to be made into one (it maybe that if I understand one part of
this the other falls into place). I'm not seeing what effect the ?: is
having, the one line without the ?: looks, to me, like it'd do the same
as the two.

The (?:) (when corrected) is causing the RE to match 'escape, followed
by (either a multi-char escape sequence or one of the single-char
sequences)'. It is necessary because without the grouping the | alternation
would apply to the whole regex, and any '=' or '9' would be stripped. Compare

s/                      # either
    \e                  # escape
    [^\@A-Z]*[\@A-Z]    # multi-char sequence
  |                     # or
    [=9]                # '=' or '9'
//gx;

with

s/
    \e                      # escape
    (?:                     # followed by either
        [^\@A-Z]*[\@A-Z]    # multi-char sequence
    |                       # or
        [=9]                # '=' or '9'
    )
//gx;

The parens are what make the escape always required for a match. (You
are right that in the example as given, the parens are useless. This is
what made me sure there was a bug :).)
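
The two versions can be run side by side. This is a small sketch with hypothetical helper names and an invented sample line that mixes the escape sequences with ordinary text containing '=' and '9'.

```perl
use strict;
use warnings;

# Ungrouped: the | splits the whole pattern, so a bare '=' or '9' also matches.
sub strip_ungrouped {
    my ($text) = @_;
    $text =~ s/\e[^\@A-Z]*[\@A-Z]|[=9]//g;
    return $text;
}

# Grouped: ESC is required in front of either alternative.
sub strip_grouped {
    my ($text) = @_;
    $text =~ s/\e(?:[^\@A-Z]*[\@A-Z]|[=9])//g;
    return $text;
}

# Made-up sample: two multi-char sequences, some plain text, and a final ESC =.
my $sample = "\e(s12Hqty=9\e&16D\e=";

print strip_grouped($sample), "\n";      # plain text survives intact: qty=9
print strip_ungrouped($sample), "\n";    # '=' and '9' are lost, and a stray ESC is left
```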

Ben
 
Justin C

Quoth (e-mail address removed):
[snip] (apologies if the quoting goes awry)
Concentrate on the "clustering, not capturing".

You can just as well do it step by step:

s/\e[=9]//g ;

Why [=9]? I've read, and re-read, this thread and tried to understand
what you and Ben are trying to explain to me (I'm really not stupid,
honest), but I can't see, given the examples I gave in the OP, why the
[=9].

Dr.Ruud is assuming that what you are in fact trying to do is remove HP
PCL escape sequences. 'ESC 9' and 'ESC =' are valid sequences, in
addition to the 'ESC ( s 1 S'-type sequence you mentioned.

Ah, PCL escape sequences. That was a correct assumption, it's just that
having none in any of the output I have here I wasn't expecting to see
someone trying to match sequences I hadn't mentioned (but I should have
been, Dr Ruud having pointed me at some good sources of info for those,
and also mentioning I might want to catch others too).

s/\e[^@A-Z]*[@A-Z]//g ;

I also don't understand how the "clustering" use of ?: is enabling these
two lines to be made into one (it maybe that if I understand one part of
this the other falls into place). I'm not seeing what effect the ?: is
having, the one line without the ?: looks, to me, like it'd do the same
as the two.

The (?:) (when corrected) is causing the RE to match 'escape, followed
by (either a multi-char escape sequence or one of the single-char
sequences'. It is necessary as without the grouping the | alternation
would apply to the whole regex, and any '=' would be stripped. Compare

Thank you for clearing that up for me, I am now able to read, and
understand, that regex. I wasn't aware of the '=' being ignored in a
pattern match (though I'd probably escape it in most cases myself
anyway).

"Thank you" too to Dr Ruud for the solution - even though it's taken me
three days to understand it! :)

Justin.
 
Ben Morrow

Quoth (e-mail address removed):
[re: s/\e (?: [^\@A-Z]* [\@A-Z] | [=9] )//gx;]

The (?:) (when corrected) is causing the RE to match 'escape, followed
by (either a multi-char escape sequence or one of the single-char
sequences'. It is necessary as without the grouping the | alternation
would apply to the whole regex, and any '=' would be stripped. Compare

Thank you for clearing that up for me, I am now able to read, and
understand, that regex. I wasn't aware of the '=' being ignored in a
pattern match (though I'd probably escape it in most cases myself
anyway).

No, you're misunderstanding still (I can't have been clear...). It's not
that = is ignored in a regex (it isn't), it's that without the parens
the pattern will match just '=', without a preceding escape, so all =
signs will be removed from your input. This (I presume :) ) isn't what
you want.

Ben
 
Justin C

Quoth (e-mail address removed):
[re: s/\e (?: [^@A-Z]* [@A-Z] | [=9] )//gx;]

The (?:) (when corrected) is causing the RE to match 'escape, followed
by (either a multi-char escape sequence or one of the single-char
sequences'. It is necessary as without the grouping the | alternation
would apply to the whole regex, and any '=' would be stripped. Compare

Thank you for clearing that up for me, I am now able to read, and
understand, that regex. I wasn't aware of the '=' being ignored in a
pattern match (though I'd probably escape it in most cases myself
anyway).

No, you're misunderstanding still (I can't have been clear...). It's not
that = is ignored in a regex (it isn't), it's that without the parens
the pattern will match just '=', without a preceding escape, so all =
signs will be removed from your input. This (I presume :) ) isn't what
you want.

Yes, I think I mustn't have been paying attention when I replied (I just
had a two-hour afternoon nap (don't ask), so my brain is more alert just
now). The parens are the grouping for the ?:, the pipe gives us an 'or',
so the parens match "a string starting with an @ or any upper case A-Z
any number of times (though not necessarily the same char) but only if
it's followed by exactly one more char from the same class" or "an '=' or
9" and, prior to the parens, is the \e, so it's the \e=, \e9 or \e[a
string starting with an @ etc...].

Thank you for your patience.

If I've still not got it, I'm just gonna try it out anyway. I'm sure my
understanding of regexs will improve with time/use/practice.

Justin.
 
