Parsing blocks of text in Perl

Discussion in 'Perl Misc' started by mxyzplk, Mar 5, 2008.

  1. mxyzplk

    mxyzplk Guest

    OK, so every way I've thought of doing this is really ugly. I'm using
    Perl 5.8.4 and only have access to the stock libraries, mostly.

    What I need to do is parse through a text file and perform some
    transformations on embedded link structures for a wiki content
    conversion. A "link" is defined as anything wrapped in double
    brackets - [[<string>]], which can appear anywhere in a line of text
    and multiple links can appear in a line of text.

    1) If the link has a colon (":") in it, I need to strip out all
    special characters and spaces (everything except [a-zA-Z0-9]) from the
    portion before the colon but leave the part after the colon intact.
    Examples:
    [[Operation Intranet 2.0!:EvalHome|Eval Home]] -->
    [[OperationIntranet20:EvalHome|Eval Home]]
    [[UP Platform:Home|UP Platform]] --> [[UPPlatform:Home|UP Platform]]

    2) If the link does not have a ":" in it, I need to insert the string
    General: before the name of the page.
    Examples:
    [[Technical FAQs|Technical FAQs]] -->
    [[General:Technical FAQs|Technical FAQs]]
    [[Embedded - Top 5 content|Top 5 content]] -->
    [[General:Embedded - Top 5 content|Top 5 content]]

    3) Special case - don't change if it is an image link or if it is an
    external link (only single [] enclosure).
    Examples:
    [[Image:BIhouse.jpg]] --> [[Image:BIhouse.jpg]]
    [http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki] -->
    [http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki]

    I expect this is similar to some HTML parsing requirements, but I've
    been hunting through my O'Reilly Perl books and Googling and I'm
    having trouble finding my way. Normal regexp replace appears not to
    be the way to go and I'm having greediness issues. Ideas?

    Thanks,
    Ernest
    mxyzplk, Mar 5, 2008
    #1

  2. On Wed, 5 Mar 2008 12:35:43 -0800 (PST),
    mxyzplk wrote:
    > OK, so every way I've thought of doing this is really ugly. I'm using
    > Perl 5.8.4 and only have access to the stock libraries, mostly.
    >
    > What I need to do is parse through a text file and perform some
    > transformations on embedded link structures for a wiki content
    > conversion. A "link" is defined as anything wrapped in double
    > brackets - [[<string>]], which can appear anywhere in a line of text
    > and multiple links can appear in a line of text.


    This implies that they cannot span more than one line of text, which is
    what I assumed.

    > 1) If the link has a colon (":") in it, I need to strip out all
    > special characters and spaces (everything except [a-zA-Z0-9]) from the
    > portion before the colon but leave the part after the colon intact.


    > 2) If the link does not have a ":" in it, I need to insert the string
    > General: before the name of the page.


    > 3) Special case - don't change if it is an image link or if it is an
    > external link (only single [] enclosure).
    > Examples:
    > [[Image:BIhouse.jpg]] --> [[Image:BIhouse.jpg]]


    Removing everything except a-zA-Z0-9 from 'Image' doesn't change it.

    > [http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki] -->
    > [http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki]


    Avoiding looking at single brackets would be easiest.

    > I expect this is similar to some HTML parsing requirements, but I've
    > been hunting through my O'Reilly Perl books and Googling and I'm
    > having trouble finding my way. Normal regexp replace appears not to
    > be the way to go and I'm having greediness issues. Ideas?


    This is not nearly as complex as HTML, unless you haven't yet given us
    all possible problems. I'm pretty sure that a regex can do that, and
    greediness issues should, in this case, be simply fixable by using
    non-greedy quantifiers. If you have an example that doesn't get handled
    correctly by the code below, let us know.
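
    To see what the non-greedy quantifier buys you, here is a minimal
    sketch. The sample line is made up for illustration and is not taken
    from the real wiki data:

```perl
use strict;
use warnings;

# Hypothetical sample line containing two links.
my $line = '[[Technical FAQs|FAQs]] and [[UP Platform:Home|UP Platform]]';

# Greedy: .* runs on to the last ]] in the line, so both links end up
# inside a single capture.
my @greedy = $line =~ /\[\[(.*)\]\]/g;

# Non-greedy: .*? stops at the first ]] it reaches, one capture per link.
my @lazy = $line =~ /\[\[(.*?)\]\]/g;

print scalar(@greedy), " greedy capture\n";
print scalar(@lazy), " non-greedy captures\n";
print "  $_\n" for @lazy;
```

    The greedy pattern yields one capture spanning both links, while the
    non-greedy one yields each link's contents separately, which is what
    the substitution needs.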

    Next time, before you post here, show us what you have tried first. This
    is not a place where you can come to get free code all the time, and if
    you don't show us what you have tried, it looks like that is exactly
    what you're trying to do.

    For this time:

    #!/usr/bin/perl
    use warnings;
    use strict;

    while (<>)
    {
        s/\[\[(.*?)\]\]/'[[' . replace_link($1) . ']]'/ge;
        print;
    }

    sub replace_link
    {
        my @link = split ':', shift;
        if (@link == 1)
        {
            unshift @link, 'General';
        }
        else
        {
            $link[0] =~ tr/a-zA-Z0-9//dc;
        }

        return join ':', @link;
    }

    This can probably be made a bit faster by avoiding splitting and
    joining, but unless it's a problem I wouldn't worry about it. The
    mechanism remains the same, and you should easily be able to adjust
    replace_link to taste. You could also avoid having to put the brackets
    back by using look-(ahead|behind) assertions instead, but I generally
    find this more readable. If links can cross line boundaries, and files
    aren't too large, read the whole file in, and run the body of the while
    loop on that.
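
    A rough sketch of those two variations, for comparison only: the
    look-around form leaves the brackets out of the match, and the /s
    flag lets a link span a line break once the text is held in a single
    string. The sub name convert_text, the two-field split and the sample
    string are assumptions made for illustration, not part of the code
    above:

```perl
use strict;
use warnings;

# Same rules as replace_link above, but splitting on the first colon only.
sub replace_link {
    my ($prefix, $rest) = split /:/, shift, 2;
    return "General:$prefix" unless defined $rest;   # no colon in the link
    $prefix =~ tr/a-zA-Z0-9//cd;                     # keep only alphanumerics
    return "$prefix:$rest";
}

# Hypothetical helper: convert every link in a string that may hold
# several lines (for example a whole file read in one go).
sub convert_text {
    my ($text) = @_;
    # (?<=\[\[) and (?=\]\]) assert the brackets without consuming them,
    # so the replacement does not need to put them back.
    # /s lets . match a newline, so a link may cross a line boundary.
    $text =~ s/(?<=\[\[)(.*?)(?=\]\])/replace_link($1)/gse;
    return $text;
}

# Made-up sample with one link broken across two lines.
print convert_text(
    "See [[UP Platform:Home|UP Platform]] and [[Technical FAQs|\nTechnical FAQs]].\n"
);
```

    To run this on a real file you could read it in one go with
    my $text = do { local $/; <> }; and pass $text to convert_text. Note
    that with /s an unclosed [[ will match all the way to the next ]] in
    the file, so this only makes sense if the brackets are well formed.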

    Martien
    --
    Martien Verbruggen | Blessed are the Fundamentalists, for they
                       | shall inhibit the earth.
    Martien Verbruggen, Mar 5, 2008
    #2

  3. Martien Verbruggen wrote:
    > Next time, before you post here, show us what you have tried first.


    That's good advice.

    > This is not a place where you can come to get free code all the time,


    Isn't it? You just made me believe it is. ;-)

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Mar 5, 2008
    #3
  4. mxyzplk

    mxyzplk Guest

    Thanks man. Here's what I was trying to do, without splitting:

    #!/bin/perl
    #
    # Usage: linkxfer.pl file
    #

    ((($file) = @ARGV) == 1 && -f $file)
        || die "Usage: $0 file\n";

    open(IN,"$file");

    $|=1;

    while ($line=<IN>) {
        $line =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:\1|\2/g;
        print $line;
    }

    close IN;

    exit;

    It was working for the adding "General:" part, and I was trying to
    figure out how the heck to apply "tr" to the \1 in the output and came
    to a standstill. Apparently it's the magic /e flag plus a subroutine
    to the rescue; using your example I did:

    #!/bin/perl
    #
    # Usage: linkxfer.pl file
    #

    ((($file) = @ARGV) == 1 && -f $file)
        || die "Usage: $0 file\n";

    open(IN,"$file");

    $|=1;

    while ($line=<IN>) {
        $line =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:\1|$2/g;
        $line =~ s/\[\[([^:]+?):(.+?\|.+?\]\])/'[[' . transform($1) . ":$2"/ge;
        print $line;
    }

    close IN;

    sub transform
    {
        my $string = shift;
        $string =~ tr/[^a-zA-Z0-9]//cd;
        return $string;
    }

    exit;

    Although I do think your version's more elegant and extensible.
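
    One detail in transform that is easy to miss: tr takes a plain list
    of characters, not a regex character class, so the [, ^ and ] in
    tr/[^a-zA-Z0-9]//cd are treated as literal characters to keep. A
    small sketch of the difference, using a made-up prefix that happens
    to contain those characters:

```perl
use strict;
use warnings;

# Hypothetical prefix containing brackets and a caret.
my $sample = 'Op [x] ^2.0!';

# As written in transform: [, ^ and ] are part of the "keep" list.
(my $with_class = $sample) =~ tr/[^a-zA-Z0-9]//cd;

# Plain character list: only letters and digits survive.
(my $plain_list = $sample) =~ tr/a-zA-Z0-9//cd;

print "$with_class\n";   # prints Op[x]^20
print "$plain_list\n";   # prints Opx20
```

    For prefixes without brackets or carets both forms give the same
    result, which is why the script appears to work on the examples.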

    Thanks,
    Ernest
    mxyzplk, Mar 5, 2008
    #4
  5. mxyzplk

    mxyzplk Guest

    Sorry to come across as a code mooch, I was more looking for a
    direction to go with it than finished code, because I wasn't at all
    sure about my general approach and whether I should be instead doing
    something fancier with Text::Balanced or some other parser... Thanks
    to Martien and hugs to all the grouchy Europeans out there!
    mxyzplk, Mar 5, 2008
    #5
  6. mxyzplk wrote:
    > Thanks man. Here's what I was trying to do, without splitting:
    >
    > #!/bin/perl


    use warnings;
    use strict;

    > #
    > # Usage: linkxfer.pl file
    > #
    >
    > ((($file) = @ARGV) == 1 && -f $file)
    > || die "Usage: $0 file\n";


    Probably better written as:

    @ARGV == 1 && -f $ARGV[0] and my $file = shift or die "Usage: $0 file\n";


    > open(IN,"$file");


    You should *always* verify that the file opened correctly:

    open IN, '<', $file or die "Cannot open '$file' $!";


    > $|=1;
    >
    > while ($line=<IN>) {


    while ( my $line = <IN> ) {


    > $line =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:\1|\2/g;


    Backreferences \1 and \2 should only be used *inside* a regular
    expression; in the replacement part you should use $1 and $2 instead.


    > print $line;
    > }
    >
    > close IN;
    >
    > exit;
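
    Pulling these points together, a sketch of the "General:" step might
    look like the following. The sub name add_general and the sample text
    are made up for illustration, and the in-memory filehandle assumes
    Perl 5.8 or later:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Add "General:" to links that have no colon before the "|".
# Reads from $in and writes to $out, so any pair of filehandles will do.
sub add_general {
    my ($in, $out) = @_;
    while ( my $line = <$in> ) {
        # $1 and $2 (not \1 and \2) in the replacement part.
        $line =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:$1|$2/g;
        print {$out} $line;
    }
}

# Demo on an in-memory filehandle with hypothetical sample text.
my $sample = "See [[Technical FAQs|FAQs]] and [[Image:BIhouse.jpg]].\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
add_general($in, \*STDOUT);
close $in;
```

    With a real file, the open line becomes
    open my $in, '<', $file or die "Cannot open '$file': $!";
    and the rest stays the same.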




    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
    John W. Krahn, Mar 6, 2008
    #6
  7. mxyzplk

    mxyzplk Guest

    Thanks John, better style noted! (I'm often working off an old first
    edn. of the O'Reilly Programming Perl book so my idioms are sadly
    decrepit :)

    Ernest
    mxyzplk, Mar 6, 2008
    #7
  8. On Wed, 05 Mar 2008 22:46:00 +0100,
    Gunnar Hjalmarsson wrote:
    > Martien Verbruggen wrote:
    >> Next time, before you post here, show us what you have tried first.

    >
    > That's good advice.
    >
    >> This is not a place where you can come to get free code all the time,

    >
    > Isn't it? You just made me believe it is. ;-)


    Note that I said 'all the time' :)

    Martien
    --
    Martien Verbruggen | +++ Out of Cheese Error +++ Reinstall
                       | Universe and Reboot +++
    Martien Verbruggen, Mar 6, 2008
    #8
