need help of regular expression genius

Discussion in 'Python' started by GHUM, Aug 2, 2006.

  1. GHUM

    GHUM Guest

    I need to split a text at every ; (semicolon), but not at semicolons
    which are "escaped" within a pair of $$ or $_$ signs.

    My guess was that something along these lines should happen within csv.py;
    but ... it is done within _csv.c :(

    Example: the SQL text should be split at "<split here>" (of course,
    those "split heres" are not there yet :)

    set interval 2;
    <split here>
    CREATE FUNCTION uoibcachebetrkd(bigint, text, text, text, text, text,
    timestamp without time zone, text, text) RETURNS integer
    AS $_$
    DECLARE
    result int4;
    BEGIN
    update bcachebetrkd set
    name=$2, wieoftjds=$3, letztejds=$4, njds=$5,
    konzern=$6, letztespeicherung=$7, betreuera=$8, jdsueberkonzern=$9
    where id_p=$1;
    IF FOUND THEN
    result:=-1;
    else
    insert into bcachebetrkd (
    id_p, name, wieoftjds, letztejds, njds, konzern,
    letztespeicherung, betreuera, jdsueberkonzern
    )
    values ($1, $2, $3, $4, $5, $6, $7, $8, $9);
    result:=$1;
    END IF;
    RETURN result;
    END;
    $_$
    LANGUAGE plpgsql;
    <split here>
    CREATE FUNCTION set_quarant(mylvlquarant integer) RETURNS integer
    AS $$
    BEGIN
    perform relname from pg_class
    where relname = 'quara_tmp'
    and case when has_schema_privilege(relnamespace, 'USAGE')
    then pg_table_is_visible(oid) else false end;
    if not found then
    create temporary table quara_tmp (
    lvlquara integer
    );
    else
    delete from quara_tmp;
    end if;

    insert into quara_tmp values (mylvlquarant);
    return 0;
    END;
    $$
    LANGUAGE plpgsql;
    <split here>

    Can anybody point me in the right direction: what does an RE look like
    for "all ; but not those ; within $$"?

    Harald
    GHUM, Aug 2, 2006
    #1

  2. GHUM

    Ant Guest

    GHUM wrote:
    > I need to split a text at every ; (semicolon), but not at semicolons
    > which are "escaped" within a pair of $$ or $_$ signs.


    Looking at your example SQL code, it probably isn't possible with
    regexes. Consider the code:

    $$
    blah blah
    ....
    $$
    blah;
    <split here>
    xxx
    $$
    blah
    blah
    $$

    Regexes aren't clever enough to count the number of backreferences, and
    so won't help in the above case. You'd be better off creating a custom
    parser using a stack or counter of some sort to decide whether or not
    to split the text.
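
    A minimal sketch of the kind of hand-written scanner suggested here. It
    assumes only the two delimiters from the question ($$ and $_$) and that
    blocks do not nest; the function name split_sql is made up for
    illustration. Instead of a stack it remembers which delimiter, if any,
    is currently open.

```python
def split_sql(text, delimiters=("$_$", "$$")):
    """Split text after each ';' that is not inside a $$ or $_$ block."""
    statements = []
    start = 0          # index where the current statement begins
    open_delim = None  # delimiter of the block we are inside, or None
    i = 0
    while i < len(text):
        if open_delim is not None:
            # Inside a block: only the matching delimiter can close it.
            if text.startswith(open_delim, i):
                i += len(open_delim)
                open_delim = None
            else:
                i += 1
            continue
        for delim in delimiters:
            if text.startswith(delim, i):
                # A block opens here; semicolons are ignored until it closes.
                open_delim = delim
                i += len(delim)
                break
        else:
            if text[i] == ";":
                statements.append(text[start:i + 1])
                start = i + 1
            i += 1
    # Keep any trailing text that is not just whitespace.
    tail = text[start:]
    if tail.strip():
        statements.append(tail)
    return statements
```

    Each returned piece keeps its terminating semicolon, so joining the
    pieces gives back the input apart from trailing whitespace.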
    Ant, Aug 2, 2006
    #2

  3. Harald,

    This works. 's' is your SQL sample.

    >>> import SE # From the Cheese Shop with a good manual
    >>> Split_Marker = SE.SE (' ";=\<split here>" "~\$_?\$(.|\n)*?\$_?\$~==" ')
    >>> s_with_split_marks = Split_Marker (s)
    >>> s_split = s_with_split_marks.split ('<split here>')


    That's it! And it isn't as complicated as it looks. The first expression
    says: translate the semicolon to your split mark. The second expression
    finds the $-blocks and says: translate them to themselves, so they don't
    change. You can add as many expressions as you want. You'd probably want
    to choose a more convenient split mark.
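
    The same two-rule idea can be sketched with the standard library's re
    module instead of SE. This is an illustration under assumptions, not the
    SE API: the names TOKEN, SPLIT_MARK, mark and split_statements are made
    up, and only non-nested $$ and $_$ blocks are handled.

```python
import re

# One alternation: a whole dollar-quoted block, or a bare semicolon.
# The block alternatives come first, so semicolons inside a block are
# consumed as part of the block and never reach the ';' alternative.
TOKEN = re.compile(r"(\$_\$.*?\$_\$|\$\$.*?\$\$)|;", re.DOTALL)

# Hypothetical marker; any string that cannot occur in the input works.
SPLIT_MARK = "\x00"

def mark(match):
    # Group 1 is set only for dollar-quoted blocks: copy them unchanged.
    if match.group(1) is not None:
        return match.group(1)
    # Otherwise the match is a semicolon: keep it and add the marker.
    return ";" + SPLIT_MARK

def split_statements(sql):
    marked = TOKEN.sub(mark, sql)
    # Drop pieces that contain nothing but whitespace.
    return [part for part in marked.split(SPLIT_MARK) if part.strip()]
```

    Unlike the SE pattern above, this version requires the closing
    delimiter to be the same as the opening one.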

    Frederic

    Anthra Norell, Aug 2, 2006
    #3
  4. GHUM

    Paul McGuire Guest

    "GHUM" <> wrote in message
    news:...
    > I need to split a text at every ; (semicolon), but not at semicolons
    > which are "escaped" within a pair of $$ or $_$ signs.
    >


    The pyparsing rendition to this looks very similar to the SE solution,
    except for the regexp's:

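    text = """ ... input source text ... """

    from pyparsing import SkipTo, Literal, replaceWith
    # Dollar-quoted blocks are tried first, so semicolons inside them are skipped.
    ign1 = "$$" + SkipTo("$$") + "$$"
    ign2 = "$_$" + SkipTo("$_$") + "$_$"
    # Any other semicolon is replaced by itself plus a split marker.
    semi = Literal(";").setParseAction( replaceWith("; <***>") )
    print((ign1 | ign2 | semi).transformString(text))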

    In concept, this works just like the SE program: as the scanner/parser scans
    through the input text, the ignoreable expressions are looked for first, and
    if found, just skipped over. If the semicolon expression is found, then its
    parse action is executed, which replaces the ';' with "; <***>", or whatever
    you choose.

    The pyparsing wiki is at pyparsing.wikispaces.com.

    -- Paul
    Paul McGuire, Aug 3, 2006
    #4
  5. GHUM

    GHUM Guest

    Paul,

    > text = """ ... input source text ... ""
    > from pyparsing import SkipTo,Literal,replaceWith
    > ign1 = "$$" + SkipTo("$$") + "$$"
    > ign2 = "$_$" + SkipTo("$_$") + "$_$"
    > semi = Literal(";").setParseAction( replaceWith("; <***>") )
    > print (ign1 | ign2 | semi).transformString(text)


    Thank you very much! This really looks beautiful and short! How could
    I forget about pyparsing? Old loves are often better than adventures
    with RE. :)

    Two questions remain:
    1) I did not succeed in finding documentation for pyparsing. Is there
    something like a "full list of classes and their methods"?

    2) Because I am missing 1) :)): is there something like
    "setParseAction(splithereandreturnalistofelementssplittedhere)"?

    Thanks again!

    Harald

    (of course, I can .split("<***>") the transformedString :)
    GHUM, Aug 3, 2006
    #5
  6. GHUM

    Paul McGuire Guest

    "GHUM" <> wrote in message
    news:...
    > Paul,
    >
    > > text = """ ... input source text ... ""
    > > from pyparsing import SkipTo,Literal,replaceWith
    > > ign1 = "$$" + SkipTo("$$") + "$$"
    > > ign2 = "$_$" + SkipTo("$_$") + "$_$"
    > > semi = Literal(";").setParseAction( replaceWith("; <***>") )
    > > print (ign1 | ign2 | semi).transformString(text)

    >
    > Thank you very much! this really looks beautifull and short! How could
    > I forget about pyparsing? Old loves are often better then adventures
    > with RE. :)


    Good to hear from you again, Harald! I didn't recognize your "From"
    address, but when I looked into the details, I recognized your name from
    when we talked about some very early incarnations of pyparsing.

    >
    > Two questions remain:
    > 1) I did not succeed in finding a documentation for pyparsing. Is there
    > something like a "full list of Classes and their methods" ?
    >


    Pyparsing ships with JPG and PNG files containing class diagrams, plus an
    htmldoc directory containing epydoc-generated help files.
    There are also about 20 example programs included (also accessible in the
    wiki).

    > 2) as of missing 1) :)): something like
    > "setParseAction(splithereandreturnalistofelementssplittedhere) ?
    >


    I briefly considered what this grammar might look like, and rejected it as
    much too complicated compared to .split("<***>"). You could also look into
    using scanString instead of transformString (scanString reports the location
    within the string of the matched text). Then when matching on a ";", use
    the match location to help slice up the string and append to a list. But
    again, this is so much more complicated than just .split("<***>"), I
    wouldn't bother other than as an exercise in learning scanString.
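
    The location-based slicing described here can be illustrated without
    pyparsing. The sketch below uses re.finditer from the standard library
    as a stand-in for scanString (an assumption made for illustration, not
    pyparsing's API); the names TOKEN and slice_at_semicolons are
    hypothetical.

```python
import re

# Matches a whole $_$ or $$ block, or a single semicolon outside of one.
TOKEN = re.compile(r"\$_\$.*?\$_\$|\$\$.*?\$\$|;", re.DOTALL)

def slice_at_semicolons(sql):
    pieces = []
    start = 0  # where the current statement begins
    for match in TOKEN.finditer(sql):
        # Block matches are only there to be skipped over.
        if match.group() == ";":
            # Use the match location to cut the statement out of the source.
            pieces.append(sql[start:match.end()])
            start = match.end()
    return pieces
```

    Text after the last top-level semicolon is not returned, which matches
    the behaviour of slicing only at match locations.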

    Good luck!
    -- Paul
    Paul McGuire, Aug 3, 2006
    #6
  7. GHUM

    GHUM Guest

    Paul,

    > Pyparsing ships with JPG and PNG files containing class diagrams, plus an
    > htmldoc directory containing epydoc-generated help files.
    > There are also about 20 example programs included (also accessible in the
    > wiki).


    Yes. That's what I have been missing. Maybe you could add: "please also
    download the .zip file if you use the windows installer to find the
    documentation" :)))

    >You could also look into using scanString instead of transformString

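    That's what I came up with:
    from pyparsing import SkipTo,Literal,replaceWith
    ign1 = "$$" + SkipTo("$$") + "$$"
    ign2 = "$_$" + SkipTo("$_$") + "$_$"
    semi = Literal(";")

    von=0        # start of the current statement ("von" = "from")
    befehle=[]   # the collected statements ("befehle" = "commands")
    # txt holds the SQL source text
    for row in (ign1 | ign2 | semi).scanString(txt):
        if row[0][0]==";":
            # scanString yields (tokens, start, end) for each match
            token, bis, von2=row
            befehle.append(txt[von: von2])
            von=von2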

    I knew that for this common kind of problem there MUST be a better
    solution than my homebrewed tokenizer (skimming through the text char by
    char and remembering the switch to escape mode ... brrrrrr, looked like
    Perl)

    Thanks for the reminder of pyparsing, maybe I should put a reminder
    in my calendar ... something along the lines of "if you think of using
    an RE, you have probably forgotten pyparsing" every 3 months :)))))

    Best wishes and thank you very much for pyparsing and the hint

    Harald
    GHUM, Aug 3, 2006
    #7