matching string literals

Discussion in 'Perl Misc' started by Morfys, Feb 1, 2011.

  1. Morfys

    Morfys Guest

    Hello,

    I would like to be able to match strings with several uninterpreted
    characters (for instance, "/", "-", "(", and "=").

    Ideally, I would like to not have to escape each of these characters
    with a "\", as I have many different strings with many special
    characters to try to match.

    Is there any way to force the search of a string literally?

    I have tried m/\Q $string \E/, but the issue is that $string can
    contain "/", which perl interprets rather than taking it literally.

    Thank you in advance.
     
    Morfys, Feb 1, 2011
    #1

  2. Morfys

    Guest

    On Tue, 1 Feb 2011 15:45:55 -0800 (PST), Morfys <> wrote:

    >Hello,
    >
    >I would like to be able to match strings with several uninterpreted
    >characters (for instance, "/", "-", "(", and "=").
    >
    >Ideally, I would like to not have to escape each of these characters
    >with a "\", as I have many different strings with many special
    >characters to try to match.
    >
    >Is there any way to force the search of a string literally?
    >
    >I have tried m/\Q $string \E/, but the issue is that $string can
    >contain "/", which perl interprets rather than taking it literally.
    >
    >Thank you in advance.


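    Try m{\Q$string\E}.
    As a side note, the spaces you wrote around $string are part of the
    pattern and are matched literally. The /x modifier normally ignores
    pattern whitespace, but \Q escapes these spaces as well, so leave
    them out unless you really want to match them.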

    -sln
     
    , Feb 1, 2011
    #2

  3. Morfys <> wrote:
    >I would like to be able to match strings with several uninterpreted
    >characters (for instance, "/", "-", "(", and "=").
    >
    >Ideally, I would like to not have to escape each of these characters
    >with a "\", as I have many different strings with many special
    >characters to try to match.
    >
    >Is there any way to force the search of a string literally?


    There is a standard function that probably does exactly what you are
    asking for; see
    perldoc -f index
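
    A minimal sketch of the index approach, assuming the goal is only to
    test whether one string occurs inside another. The variable names and
    sample strings are made up for illustration. index does a plain
    substring search, so no character is special and nothing needs
    escaping:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample data, not taken from the original post.
    my $haystack = 'path=/usr/local/(bin)-old';
    my $needle   = '/(bin)-';

    # index returns the zero-based position of the first occurrence,
    # or -1 when the substring is not found.
    my $pos = index($haystack, $needle);

    if ($pos >= 0) {
        print "found at offset $pos\n";
    } else {
        print "not found\n";
    }
    ```

    If you need the position of later occurrences as well, index takes an
    optional third argument giving the offset to start searching from.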

    jue
     
    Jürgen Exner, Feb 2, 2011
    #3
  4. Morfys

    Guest

    On Tue, 01 Feb 2011 15:57:04 -0800, wrote:

    >On Tue, 1 Feb 2011 15:45:55 -0800 (PST), Morfys <> wrote:
    >
    >>Hello,
    >>
    >>I would like to be able to match strings with several uninterpreted
    >>characters (for instance, "/", "-", "(", and "=").
    >>
    >>Ideally, I would like to not have to escape each of these characters
    >>with a "\", as I have many different strings with many special
    >>characters to try to match.
    >>
    >>Is there any way to force the search of a string literally?
    >>
    >>I have tried m/\Q $string \E/, but the issue is that $string can
    >>contain "/", which perl interprets rather than taking it literally.
    >>
    >>Thank you in advance.

    >
    >Try m{\Q $string \E}.
    >As a side note, the spaces around $string are literal in the
    >regex unless m//x
    >


    To correct myself, '/' has no special meaning inside a regex;
    it's taken as a literal.

    The problem, as you see it, is that '/' is being used as the delimiter
    that Perl uses to find the end of the regex when you write m//.
    This applies to both s/// and m//. Almost any non-whitespace character
    can be used as the delimiter.

    Example:

    m/ $string /     parses out ' $string '
    m# $string /#    parses out ' $string /'
    m/ $string/ /    syntax error, one too many '/' delimiters

    The pattern is parsed out first (Perl scans for the closing delimiter),
    then variable interpolation is done, and only then is the regex
    compiled and checked for proper syntax. So a delimiter character
    inside the value of $string is never treated as a delimiter: by the
    time the value is interpolated, the delimiter scan has already finished.
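
    A short sketch of that parse order, using a made-up value for $string.
    The slash inside the variable's value is never seen by the delimiter
    scan, so m/\Q$string\E/ works as written; only a slash typed directly
    into the pattern has to be escaped or avoided by picking another
    delimiter:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample data for illustration.
    my $string = 'a/b-(c)=d';
    my $text   = 'x = a/b-(c)=d;';

    # '/' delimiters are fine here: the pattern source contains no '/',
    # the one in $string only arrives later via interpolation.
    my $hit_slash = $text =~ m/\Q$string\E/ ? 1 : 0;

    # The same match written with brace delimiters.
    my $hit_brace = $text =~ m{\Q$string\E} ? 1 : 0;

    # Without \Q the parentheses form a capture group, so the literal
    # "(c)" in $text is no longer matched.
    my $hit_raw = $text =~ m/$string/ ? 1 : 0;

    print "slash=$hit_slash brace=$hit_brace raw=$hit_raw\n";
    ```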

    You can use different characters for the delimiter, but some
    delimiter characters have special meaning to the parser (for example,
    m'...' suppresses variable interpolation).
    See the perlop manpage for m// and the other quote-like operators.

    quotemeta EXPR:
    -----------------

    The documentation (perlfunc manpage) says:
    "Returns the value of EXPR with all non-"word" characters backslashed."
    That is, every character that does NOT match /[A-Za-z_0-9]/.

    Why do they do this? To neutralize the escape character and any
    punctuation that could be read as a metacharacter or quantifier.

    Unfortunately, this ruins the string if you actually want any of
    those characters (especially the escape character) to keep their
    special meaning. The net result is that everything in the string is
    inoculated, which throws out the baby with the bathwater.

    The best thing to do is to learn what Perl's special characters
    (its metacharacters) are, then do your own escaping where needed.

    ----------

    Quotemeta does this:
    (my $tmp_str = $string) =~ s/(^\W)/\\$1/g;

    Since quotemeta() escapes all non-word characters, you could exclude
    some characters from being escaped with a negated class.
    s/([^\w\\])/\\$1/g;

    For instance, [^\w\\] will escape all non-word characters but won't
    escape '\' itself. So, you could run your $string through this:
    (my $tmp_str = $string) =~ s/([^\w\\])/\\$1/g;
    Add any other characters you don't want escaped to the negated class.
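
    For illustration, here is a small sketch of that selective escaping
    next to quotemeta. The sample string is invented, and the class
    [^\w\\] (anything that is neither a word character nor a backslash)
    is one possible choice, not the only one:

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: punctuation plus a literal backslash followed by 'n'.
    my $string = 'a.b \n (c)';

    # quotemeta escapes every non-word character, the backslash included.
    my $all = quotemeta $string;

    # Escape non-word characters but leave existing backslashes alone.
    (my $keep_bs = $string) =~ s/([^\w\\])/\\$1/g;

    print "$all\n$keep_bs\n";
    ```

    The only difference between the two results is the backslash in front
    of the 'n': quotemeta doubles it, the substitution leaves it single.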

    You could also create a custom character property class using the
    \p{} or \P{} construct. In that class you could define just the
    metacharacters, then use it to escape only those characters.

    Like,
    (my $tmp_str = $string) =~ s/(\p{InMyclass})/\\$1/g;

    or possibly without the '\' escape character itself,
    (my $tmp_str = $string) =~ s/([^\P{InMyclass}\\])/\\$1/g;

    More often than not, if you are searching for a literal '\n' or other
    literal control characters, you don't want the '\' to be escaped.

    ---------

    Here is some code you could test out that illustrates these options...
    Whether or not the \p{} construct pulls in some mega Unicode database
    I'm not sure, nor whether there is a performance hit. If there is, it's
    just a one-time cost.

    Cheers,
    -sln
    ---------

    use warnings;
    use strict;

    sub InMeta {
    # {}[]()^$.|*+?\
    return <<END;
    7b
    7d
    5b
    5d
    28
    29
    5e
    24
    2e
    7c
    2a
    2b
    3f
    5c
    END
    }

    my $rx = q!{}[]()^$.|*+?\<--meta, word-->ABCabc123_, newline \n!;

    # Compressed:
    # (my $rx_quoted_meta1 = $rx) =~ s/(\p{InMeta})/\\$1/g;
    # (my $rx_quoted_meta2 = $rx) =~ s/([^\P{InMeta}\\])/\\$1/g;
    # (my $rx_quoted_all1 = $rx) =~ s/(\W)/\\$1/g;
    # (my $rx_quoted_all2 = $rx) =~ s/([^\w\\])/\\$1/g;

    # Note: the bracketed classes below are written without internal
    # whitespace or comments, because the x modifier does not apply
    # inside [...]; anything placed there becomes part of the class.

    (my $rx_quoted_meta1 = $rx) =~ s/
        (
            \p{InMeta}          # Custom meta class
        )
    /\\$1/xg;

    (my $rx_quoted_meta2 = $rx) =~ s/
        (
            [^\P{InMeta}\\]     # Meta class minus the backslash escape
        )
    /\\$1/xg;

    (my $rx_quoted_all1 = $rx) =~ s/(\W)/\\$1/xg;

    (my $rx_quoted_all2 = $rx) =~ s/
        (
            [^\w\\]             # Non-word characters minus the backslash escape
        )
    /\\$1/xg;

    my $rx_Q = quotemeta $rx;

    print "Original:\n$rx\n\n";

    print "Quoted meta:\n$rx_quoted_meta1\n\n";
    print "** Quoted meta, not \\:\n$rx_quoted_meta2\n\n";
    print "Quoted all, not words:\n$rx_quoted_all1\n\n";
    print "Quotemeta function:\n$rx_Q\n\n";
    print "Quoted all, not words, not \\:\n$rx_quoted_all2\n\n";

    __END__

    Original:
    {}[]()^$.|*+?\<--meta, word-->ABCabc123_, newline \n

    Quoted meta:
    \{\}\[\]\(\)\^\$\.\|\*\+\?\\<--meta, word-->ABCabc123_, newline \\n

    ** Quoted meta, not \:
    \{\}\[\]\(\)\^\$\.\|\*\+\?\<--meta, word-->ABCabc123_, newline \n

    Quoted all, not words:
    \{\}\[\]\(\)\^\$\.\|\*\+\?\\\<\-\-meta\,\ word\-\-\>ABCabc123_\,\ newline\ \\n

    Quotemeta function:
    \{\}\[\]\(\)\^\$\.\|\*\+\?\\\<\-\-meta\,\ word\-\-\>ABCabc123_\,\ newline\ \\n

    Quoted all, not words, not \:
    \{\}\[\]\(\)\^\$\.\|\*\+\?\\<\-\-meta\,\ word\-\-\>ABCabc123_\,\ newline\ \n
     
    , Feb 2, 2011
    #4
  5. Morfys

    Guest

    On Wed, 02 Feb 2011 12:50:33 -0800, wrote:

    >On Tue, 01 Feb 2011 15:57:04 -0800, wrote:
    >
    >----------
    >
    >Quotemeta does this:
    > (my $tmp_str = $string) =~ s/(^\W)/\\$1/g;

    ^^^^^^
    (my $tmp_str = $string) =~ s/(\W)/\\$1/g;
    or
    (my $tmp_str = $string) =~ s/([^\w])/\\$1/g;

    Typos.

    -sln
     
    , Feb 2, 2011
    #5
