convert list of strings to set of regexes; convert list of strings to trie

Discussion in 'Python' started by Klaus Neuner, Jul 19, 2004.

  1. Klaus Neuner

    Klaus Neuner Guest

    Hello,

    I need a function that converts a list into a set of regexes. Like so:

    string_list = ["blaa", "blab", "raaa", "rabb"]

    print string_list2regexes(string_list)


    This should return something like:

    ["bla(a|b)", "ra(aa|bb)"]

    I am aware of the fact that converting the list to a *trie* would almost do
    the job. But I couldn't find anything about Python modules that produce tries.

    Are there any modules that I could use? Or do I have to implement it myself?

    Klaus
    Klaus Neuner, Jul 19, 2004
    #1
    1. Advertising

  2. Re: convert list of strings to set of regexes;convert list of strings to trie

    [Klaus Neuner]

    > I need a function that converts a list into a set of regexes. Like so:
    > string_list = ["blaa", "blab", "raaa", "rabb"]
    > print string_list2regexes(string_list)
    > This should return something like:
    > ["bla(a|b)", "ra(aa|bb)"]


    I once wrote code for doing something very close, but not finding it,
    I guess I deleted it; so presumably, it would not be that difficult
    redoing from scratch. By the way, the regexp compiler in Python library
    already turns, at least logically, `(blaa)|(blab)|(raaa)|(rabb)' into
    `(bla(a|b))|(ra(aa|bb))' -- disregarding of course how groups are being
    saved by parentheses. So, you may also peek in this direction.

    --
    François Pinard http://www.iro.umontreal.ca/~pinard
    =?iso-8859-1?Q?Fran=E7ois?= Pinard, Jul 19, 2004
    #2
    1. Advertising

  3. Klaus Neuner

    Klaus Neuner Guest

    François Pinard <> wrote in message news:<>...

    > By the way, the regexp compiler in Python library
    > already turns, at least logically, `(blaa)|(blab)|(raaa)|(rabb)' into
    > `(bla(a|b))|(ra(aa|bb))' -- disregarding of course how groups are being
    > saved by parentheses. So, you may also peek in this direction.


    What do you mean exactly by "logically"? That Python translates

    (1) (blaa)|(blab)|(raaa)|(rabb)

    to something (logically equivalent to)

    (2) (bla(a|b))|(ra(aa|bb))

    before matching it? Does this mean that (1) (taken as normal regex) is always
    exactly as fast as (2) (taken as normal regex)?

    Klaus
    Klaus Neuner, Jul 20, 2004
    #3
  4. Re: convert list of strings to set of regexes; convert list of stringsto trie

    Klaus Neuner wrote:
    > Hello,
    >
    > I need a function that converts a list into a set of regexes. Like so:
    >
    > string_list = ["blaa", "blab", "raaa", "rabb"]
    >
    > print string_list2regexes(string_list)
    >
    >
    > This should return something like:
    >
    > ["bla(a|b)", "ra(aa|bb)"]


    So, I guess you just want to isolate common prefixes.

    > I am aware of the fact that converting the list to a *trie* would almost do
    > the job. But I couldn't find anything about Python modules that produce tries.


    Maybe
    http://cvs.bioperl.org/cgi-bin/view...io/triefind.py?annotate=1.1&cvsroot=biopython
    can be of some use for you, even if I didn't have a look at it.

    --
    Ciao,
    Matteo
    Matteo Dell'Amico, Jul 20, 2004
    #4
  5. Re: convert list of strings to set of regexes; convert list of stringsto trie

    Klaus Neuner wrote:

    > What do you mean exactly by "logically"? That Python translates
    >
    > (1) (blaa)|(blab)|(raaa)|(rabb)
    >
    > to something (logically equivalent to)
    >
    > (2) (bla(a|b))|(ra(aa|bb))
    >
    > before matching it? Does this mean that (1) (taken as normal regex) is always
    > exactly as fast as (2) (taken as normal regex)?


    Looks like (2) is a little bit more efficient in case of not matching
    strings: edited from an interactive session,

    In [22]: timer = timeit.Timer("re1.match('blorb')", "import re;re1 =
    re.compile(r'(blaa)|(blab)|(raaa)|(rabb)')")

    In [23]: timer.timeit()
    Out[23]: 1.4223082065582275

    In [24]: timer2 = timeit.Timer("re2.match('blorb')", "import re;re2 =
    re.compile(r'(bla(a|b))|(ra(aa|bb))')")

    In [25]: timer2.timeit()
    Out[25]: 1.2273669242858887

    In [28]: timer3 = timeit.Timer("re3.match('blab')", "import re;re3 =
    re.compile(r'(bla(a|b))|(ra(aa|bb))')")

    In [29]: timer3.timeit()
    Out[29]: 1.4784541130065918

    In [30]: timer4 = timeit.Timer("re4.match('blab')", "import re;re4 =
    re.compile(r'(blaa)|(blab)|(raaa)|(rabb)')")

    In [31]: timer4.timeit()
    Out[31]: 1.4766991138458252

    --
    Ciao,
    Matteo
    Matteo Dell'Amico, Jul 20, 2004
    #5
  6. Re: convert list of strings to set of regexes;convert list of strings to trie

    [Klaus Neuner]
    > [François Pinard]


    > > By the way, the regexp compiler in Python library already
    > > turns, at least logically, `(blaa)|(blab)|(raaa)|(rabb)' into
    > > `(bla(a|b))|(ra(aa|bb))' -- disregarding of course how groups
    > > are being saved by parentheses. So, you may also peek in this
    > > direction.


    > What do you mean exactly by "logically"? That Python translates (1) to
    > something (logically equivalent to) (2) before matching it? Does this
    > mean that (1) (taken as normal regex) is always exactly as fast as (2)
    > (taken as normal regex)?


    This is what I understand, yes: it is not worth optimising regexps the
    above way ourselves (as we sometimes do for Emacs Lisp regexps, say),
    before giving them to the Python regexp compiler.

    --
    François Pinard http://www.iro.umontreal.ca/~pinard
    =?iso-8859-1?Q?Fran=E7ois?= Pinard, Jul 20, 2004
    #6
  7. Klaus Neuner <> wrote:
    > I need a function that converts a list into a set of regexes. Like so:
    >
    > string_list = ["blaa", "blab", "raaa", "rabb"]
    >
    > print string_list2regexes(string_list)
    >
    >
    > This should return something like:
    >
    > ["bla(a|b)", "ra(aa|bb)"]


    The program below does exactly this... Run it like this

    $ ./words-to-regexp.pl 5
    Extracting 5 letter words
    blaa
    blab
    raaa
    rabb

    (Giving the list of words to stdin)

    It produced this regexp which matches just the above strings.

    Re: bla[ab]|ra(?:aa|bb)

    For your input. Eg find a regexp to match all the 1&2 letter words
    in the dictionary...

    ../words-to-regexp.pl 2 < /usr/share/dict/words

    Re: [aa]|a[cdghlmmnrssttuy]|[bb]|b[aeeikry]|[cc]|c[adfilmorssu]|[dd]|d[bor]|[ee]|e[dhmrssux]|[ff]|f[aemr]|[gg]|g[adeeos]|[hh]|h[aeefgiooz]|[ii]|i[dfnnorstt]|[jj]|j[or]|[kk]|k[crsw]|[ll]|l[aaeiorstu]|[mm]|m[abdeginorsstuy]|[nn]|n[abdeiopu]|[oo]|o[bfhknrswxz]|[pp]|p[aabdhimotu]|[qqrr]|r[abdeehnsux]|[ss]|s[bcehimnort]|[tt]|t[abchiilmosy]|[uu]|u[hprs]|[vv]|v[as]|[ww]|w[emu]|[xx]|xe|[yy]|y[abe]|[zz]|z[nr]

    Yes its in perl, but its almost entirely regexps! I haven't got the
    time to pythonise it at the moment - hope you enjoy a challenge ;-)

    #!/usr/bin/perl -w
    #
    # The challenge - to write a function which given a list of words
    # returns a regexp which will match those and only those words.

    use strict;
    my $LIMIT = shift || 3;
    $|=1;

    print "Extracting $LIMIT letter words\n";

    my @list = ();
    while (<>)
    {
    chomp;
    next if $_ eq "";
    push @list, lc($_) if length($_) <= $LIMIT;
    }

    print "Extracted ", scalar(@list), " words\n";

    my ($re, $old_re) = ("", "1");

    while ($re ne $old_re)
    {
    $old_re = $re;
    print "-" x 60, "\n";
    $re = word_list_to_regexp( @list );
    check_word_list_to_regexp($re, \@list);
    print "Length: ", length($re), "\n";
    }
    exit;

    ############################################################
    # Converts a list of words into a regexp which will
    # match those words and those numbers only.
    #
    # It does this by constructing a regexp and then progressively
    # simplifying it - recursively if necessary. It uses regexp's to
    # transform the regexp of course! This is almost a general purpose
    # regexp optimiser.
    #
    # We assume that the caller will bound the regexp with ^( and )$ or
    # \W(?: and )\W or whatever takes their fancy
    #
    # Set $DEBUG to 1 if you want to print lots of info and check the
    # regexp works after each transformation.
    #
    # Warning: code contains heavy regexps - lift with care ;-)
    # Caution: Code may use exponential time and space ;-(
    ############################################################

    sub word_list_to_regexp
    {
    my (@list) = @_;
    my $DEBUG = 1;

    # The basic regexp with |'s on the start and end to make our life
    # easier
    # Should uniq here too...
    $re = join("|", sort {
    #length($a) <=> length($b) ||
    $a cmp $b
    } @list);

    $re = "|$re|";

    # Transform the regexp in stages, making sure at all time the
    # regexp is correct if $DEBUG is set

    check_word_list_to_regexp($re, \@list) if $DEBUG;

    # 1) Concatenate all the single characters a|b|c into [abc]'s
    $re =~ s{ \| ( \w (?: \| \w )+ ) (?= \| ) }
    {
    my ( $string ) = ( $1 );
    print "string = '$string'\n" if $DEBUG;
    "|[" . join("", split m{\|}, $string) . "]"
    }gex;

    check_word_list_to_regexp($re, \@list) if $DEBUG;

    # 2) Find all the Xa|Xb|Xc and change to X(?:a|b|c)]
    $re =~ s{ \| ( (\w+)(\w+) (?: \| \2\w+ )+ ) (?= \| ) }
    {
    my ( $string, $prefix ) = ( $1, $2 );
    print "prefix = '$prefix', string = '$string'\n" if $DEBUG;
    "|$prefix\(?:" . join("|", map { substr($_, length $prefix) } split m{\|}, $string) . ")"
    }gex;

    check_word_list_to_regexp($re, \@list) if $DEBUG;

    # 3) Find all the aX|bX|cX and change to (a|b|c)X]
    $re =~ s{ \| ( (\w+?)(.+) (?: \| \w+\3 )+ ) (?= \| ) }
    {
    my ( $string, $postfix ) = ( $1, $3 );
    print "postfix = '$postfix', string = '$string'\n" if $DEBUG;
    $string =~ s{ \Q$postfix\E (?= \| | $ ) }{}gx;
    print "...string = '$string'\n" if $DEBUG;
    "|(?:$string)$postfix"
    }gex;

    check_word_list_to_regexp($re, \@list) if $DEBUG;

    # 4) Change (?:a|b|c) into [abc]
    $re =~ s{ \(\?\: ( \w (?: \| \w )+ ) \) }
    {
    my ( $string ) = ( $1 );
    print "string = '$string'\n" if $DEBUG;
    "[" . join("", split m{\|}, $string) . "]"
    }gex;

    check_word_list_to_regexp($re, \@list) if $DEBUG;

    # 5) Optimise [abc] into [a-c] or \d
    # This doesn't optimise all the cases only the complete continuous
    # range in the [ ... ]
    # $re =~ s{ \[ ( \w{3,} ) \] }
    # {
    # my ( $string, $start, $end ) = ( $1, substr($1, 0, 1), substr($1, -1, 1) );
    # print "match ['$string']...range [$start-$end]\n" if $DEBUG;
    # if ($end - $start + 1 == length $string)
    # {
    # $start == 0 && $end == 9 ? '\d' : "[$start-$end]";
    # }
    # else
    # {
    # "[$string]";
    # }
    # }gex;

    check_word_list_to_regexp($re, \@list) if $DEBUG;

    my $re_length;

    do
    {
    $re_length = length($re);

    # 6) recurse on any sequences left (?:ab|cd|ef)
    $re =~ s{ \(\?\: ( \w+ (?: \| \w+ )+ ) \) }
    {
    my ( $string ) = ( $1 );
    if (length($string) < length($re) - 4)
    {
    print "**** Recursing on '$string'\n" if $DEBUG;
    "(?:" . word_list_to_regexp(split m{\|}, $string) . ")";
    }
    else
    {
    "(?:$string)";
    }
    }gex;

    # 6a) recurse on any sequences left |ab|cd|ef|
    $re =~ s{ \| ( \w+ (?: \| \w+ )+ ) \| }
    {
    my ( $string ) = ( $1 );
    if (length($string) < length($re) - 2)
    {
    print "**** Recursing on '$string'\n" if $DEBUG;
    "|" . word_list_to_regexp(split m{\|}, $string) . "|";
    }
    else
    {
    "|$string|";
    }
    }gex;
    }
    until (length($re) == $re_length);

    check_word_list_to_regexp($re, \@list) if $DEBUG;

    # 7) fix the | on each end
    $re =~ s{^\|}{};
    $re =~ s{\|$}{};

    print "**** Returning '$re'\n" if $DEBUG;

    return $re;
    }

    ############################################################
    # Test subroutine to check the regexp performs as advertised
    #
    # Call with a regexp and a reference to a list of numbers
    # it will check that the regexp matches all the list and
    # doesn't match some others (obviously it can't check them
    # all can it!) die-ing on any failures.
    ############################################################

    sub check_word_list_to_regexp
    {
    my ($re, $list) = @_;
    my %list = map { $_ => 1 } @$list;
    print "Re: $re\n";

    # Put some other test cases in
    $list{$_} += 0 for (0..999);
    $list{int(rand()*1000)} += 0 for (0..99);
    $list{int(rand()*10000)} += 0 for (0..99);
    $list{int(rand()*100000)} += 0 for (0..99);

    # print join(", ", map {"$_ => $list{$_}"} keys %list), "\n";
    $re =~ s{^\|}{}; # fix | on start and end
    $re =~ s{\|$}{};
    $re = "^(?:$re)\$"; # put in ^(?: ... )$
    $re = qr{$re}; # compile the regexp for speed

    # Check all the keys in list against the regexp - some should pass
    # and some should fail
    for my $item (keys %list)
    {
    if ($list{$item} xor ($item =~ /$re/))
    {
    die "*** FAILED '$re' for '$item' ShouldMatch: $list{$item}\n";
    }
    else
    {
    # print "OK '$re' for '$item'\n";
    }
    }
    }

    --
    Nick Craig-Wood <> -- http://www.craig-wood.com/nick
    Nick Craig-Wood, Jul 21, 2004
    #7
  8. Klaus Neuner

    Klaus Neuner Guest

    Thanks to all who participated in this thread.

    Klaus
    Klaus Neuner, Jul 26, 2004
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Joseph

    compressed suffix trie

    Joseph, Sep 22, 2004, in forum: Java
    Replies:
    1
    Views:
    521
    =?ISO-8859-1?Q?Daniel_Sj=F6blom?=
    Sep 22, 2004
  2. Joseph

    compressed suffix trie

    Joseph, Sep 22, 2004, in forum: C++
    Replies:
    3
    Views:
    1,794
    Victor Bazarov
    Sep 22, 2004
  3. Miki Tebeka

    [ANN] Trie for Python

    Miki Tebeka, Oct 1, 2003, in forum: Python
    Replies:
    0
    Views:
    416
    Miki Tebeka
    Oct 1, 2003
  4. Polar

    Trie

    Polar, Aug 3, 2004, in forum: C Programming
    Replies:
    0
    Views:
    574
    Polar
    Aug 3, 2004
  5. chuck
    Replies:
    8
    Views:
    1,043
    chuck
    Oct 1, 2007
Loading...

Share This Page