Parsing blocks of text in Perl

mxyzplk · Mar 5, 2008

OK, so every way I've thought of doing this is really ugly. I'm using
Perl 5.8.4 and only have access to the stock libraries, mostly.

What I need to do is parse through a text file and perform some
transformations on embedded link structures for a wiki content
conversion. A "link" is defined as anything wrapped in double
brackets - [[<string>]], which can appear anywhere in a line of text
and multiple links can appear in a line of text.

1) If the link has a colon (":") in it, I need to strip out all
special characters and spaces (everything except [a-zA-Z0-9]) from the
portion before the colon but leave the part after the colon intact.
Examples:
[[Operation Intranet 2.0!:EvalHome|Eval Home]] -->
[[OperationIntranet20:EvalHome|Eval Home]]
[[UP Platform:Home|UP Platform]] --> [[UPPlatform:Home|UP Platform]]

2) If the link does not have a ":" in it, I need to insert the string
General: before the name of the page.
Examples:
[[Technical FAQs|Technical FAQs]] -->
[[General:Technical FAQs|Technical FAQs]]
[[Embedded - Top 5 content|Top 5 content]] -->
[[General:Embedded - Top 5 content|Top 5 content]]

3) Special case - don't change if it is an image link or if it is an
external link (only single [] enclosure).
Examples:
[[Image:BIhouse.jpg]] --> [[Image:BIhouse.jpg]]
[http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki] -->
[http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki]

I expect this is similar to some HTML parsing requirements, but I've
been hunting through my O'Reilly Perl books and Googling and I'm
having trouble finding my way. Normal regexp replace appears not to
be the way to go and I'm having greediness issues. Ideas?

Thanks,
Ernest

Martien Verbruggen · Mar 5, 2008

OK, so every way I've thought of doing this is really ugly. I'm using
Perl 5.8.4 and only have access to the stock libraries, mostly.

What I need to do is parse through a text file and perform some
transformations on embedded link structures for a wiki content
conversion. A "link" is defined as anything wrapped in double
brackets - [[<string>]], which can appear anywhere in a line of text
and multiple links can appear in a line of text.

This implies that they cannot span more than one line of text, which is
what I assumed.

1) If the link has a colon (":") in it, I need to strip out all
special characters and spaces (everything except [a-zA-Z0-9]) from the
portion before the colon but leave the part after the colon intact.

2) If the link does not have a ":" in it, I need to insert the string
General: before the name of the page.

3) Special case - don't change if it is an image link or if it is an
external link (only single [] enclosure).
Examples:
[[Image:BIhouse.jpg]] --> [[Image:BIhouse.jpg]]

Removing everything except a-zA-Z0-9 from 'Image' doesn't change it.
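
A minimal sketch of that point, assuming a helper called strip_special (a made-up name, not something from the original post). With the c and d flags, tr deletes every character outside the listed set, so 'Image' comes back unchanged. Note that tr takes a plain character list rather than a regex character class, so the set is written without square brackets.

```perl
use warnings;
use strict;

# strip_special is a hypothetical helper name used only for this sketch.
# It deletes every character that is NOT in a-zA-Z0-9
# (c = complement the set, d = delete the characters that match).
sub strip_special
{
    my $string = shift;
    $string =~ tr/a-zA-Z0-9//cd;
    return $string;
}

print strip_special('Image'), "\n";                    # Image
print strip_special('Operation Intranet 2.0!'), "\n";  # OperationIntranet20
print strip_special('UP Platform'), "\n";              # UPPlatform
```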

[http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki] -->
[http://spss.wikicities.com/wiki/SPSS_Wiki SPSS Wiki]

Avoiding looking at single brackets would be easiest.

I expect this is similar to some HTML parsing requirements, but I've
been hunting through my O'Reilly Perl books and Googling and I'm
having trouble finding my way. Normal regexp replace appears not to
be the way to go and I'm having greediness issues. Ideas?

This is not nearly as complex as HTML, unless you haven't yet given us
all possible problems. I'm pretty sure that a regex can do that, and
greediness issues should, in this case, be simply fixable by using
non-greedy quantifiers. If you have an example that doesn't get handled
correctly by the code below, let us know.

Next time, before you post here, show us what you have tried first. This
is not a place where you can come to get free code all the time, and if
you don't show us what you have tried, it looks like that is exactly
what you're trying to do.

For this time:

#!/usr/bin/perl
use warnings;
use strict;

while (<>)
{
    s/\[\[(.*?)\]\]/'[[' . replace_link($1) . ']]'/ge;
    print;
}

sub replace_link
{
    my @link = split ':', shift;
    if (@link == 1)
    {
        unshift @link, 'General';
    }
    else
    {
        $link[0] =~ tr/a-zA-Z0-9//dc;
    }

    return join ':', @link;
}

This can probably be made a bit faster, by avoiding splitting and
joining, but unless it's a problem I wouldn't worry about it. The
mechanism remains the same, and you should easily be able to adjust
replace_link to taste. You could also avoid having to put the brackets
back by using look-(ahead|behind) assertions instead, but I generally
find this more readable. If links can cross line boundaries, and files
aren't too large, read the whole file in, and run the body of the while
loop on that.
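
A rough sketch of those two variations, assuming links may wrap across lines. The names convert_text and convert_text_lookaround are made up for illustration, and a sample string stands in for a slurped file. The /s flag is added so that . can match a newline inside a link.

```perl
use warnings;
use strict;

# Same logic as replace_link in the script above.
sub replace_link
{
    my @link = split ':', shift;
    if (@link == 1)
    {
        unshift @link, 'General';
    }
    else
    {
        $link[0] =~ tr/a-zA-Z0-9//dc;
    }
    return join ':', @link;
}

# Hypothetical helper: run the substitution over a whole chunk of text.
# /s lets . match a newline, so a link may span more than one line.
sub convert_text
{
    my $text = shift;
    $text =~ s/\[\[(.*?)\]\]/'[[' . replace_link($1) . ']]'/gse;
    return $text;
}

# Hypothetical look-around variant: the brackets are asserted but not
# consumed, so they do not have to be put back in the replacement.
sub convert_text_lookaround
{
    my $text = shift;
    $text =~ s/(?<=\[\[)(.*?)(?=\]\])/replace_link($1)/gse;
    return $text;
}

# In a real script the whole input would be slurped instead, e.g.
#   my $text = do { local $/; <> };
my $text = "See [[UP Platform:Home|UP\nPlatform]] and "
         . "[[Technical FAQs|Technical FAQs]].\n";

print convert_text($text);
print convert_text_lookaround($text);
```

Both calls should print the same converted text for this sample; the explicit-bracket version is probably the easier one to maintain.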

Martien

Gunnar Hjalmarsson · Mar 5, 2008

Martien said:
Next time, before you post here, show us what you have tried first.

That's good advice.

This is not a place where you can come to get free code all the time,

Isn't it? You just made me believe it is. ;-)

mxyzplk · Mar 5, 2008

Thanks man. Here's what I was trying to do, without splitting:

#!/bin/perl
#
# Usage: linkxfer.pl file
#

((($file) = @ARGV) == 1 && -f $file)
    || die "Usage: $0 file\n";

open(IN,"$file");

$|=1;

while ($line=<IN>) {
    $line =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:\1|\2/g;
    print $line;
}

close IN;

exit;

It was working for the adding "General:" part, and I was trying to
figure out how the heck to apply "tr" to the \1 in the output and came
to a standstill. Apparently it's the magic /e flag plus a subroutine
to the rescue; using your example I did:

#!/bin/perl
#
# Usage: linkxfer.pl file
#

((($file) = @ARGV) == 1 && -f $file)
    || die "Usage: $0 file\n";

open(IN,"$file");

$|=1;

while ($line=<IN>) {
    # Links with no namespace get the "General:" prefix.
    $line =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:$1|$2/g;
    # Links with a namespace get it cleaned up by transform().
    $line =~ s/\[\[([^:]+?):(.+?\|.+?\]\])/'[[' . transform($1) . ":$2"/ge;
    print $line;
}

close IN;

sub transform
{
    my $string = shift;
    # tr takes a plain character list, not a regex class, so no [^...] here.
    $string =~ tr/a-zA-Z0-9//cd;
    return $string;
}

exit;

Although I do think your version's more elegant and extensible.

Thanks,
Ernest

mxyzplk · Mar 5, 2008

Sorry to come across as a code mooch, I was more looking for a
direction to go with it than finished code, because I wasn't at all
sure about my general approach and whether I should be instead doing
something fancier with Text::Balanced or some other parser... Thanks
to Martien and hugs to all the grouchy Europeans out there!

John W. Krahn · Mar 5, 2008

mxyzplk said:
Thanks man. Here's what I was trying to do, without splitting:

#!/bin/perl

use warnings;
use strict;

#
# Usage: linkxfer.pl file
#

((($file) = @ARGV) == 1 && -f $file)
|| die "Usage: $0 file\n";

Probably better written as:

@ARGV == 1 && -f $ARGV[0] and my $file = shift or die "Usage: $0 file\n";

open(IN,"$file");

You should *always* verify that the file opened correctly:

open IN, '<', $file or die "Cannot open '$file' $!";
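
As a small illustration of why the check matters, here is a sketch that uses a lexical filehandle and a throwaway file from File::Temp so that it is self-contained. The sample file contents and variable names are assumptions made for the example, not part of the original script.

```perl
use warnings;
use strict;
use File::Temp qw(tempfile);

# Create a small throwaway input file so the sketch is self-contained.
my ($tmp, $file) = tempfile(UNLINK => 1);
print {$tmp} "[[Image:BIhouse.jpg]]\n";
close $tmp or die "Cannot close '$file': $!";

# Three-argument open with a lexical filehandle. "or" binds loosely
# enough that die only fires when open itself fails.
open my $in, '<', $file or die "Cannot open '$file': $!";
my @lines = <$in>;
close $in;
print "read ", scalar(@lines), " line(s)\n";

# A failed open returns false and sets $!, which is what the check catches.
my $missing = "$file.does-not-exist";
my $opened  = open(my $fh, '<', $missing);
print "Cannot open '$missing': $!\n" unless $opened;
```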

$|=1;

while ($line=<IN>) {

$line =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:\1|\2/g;

Backreferences \1 and \2 should only be used *inside* a regular
expression, you should use $1 and $2 instead.
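
A short sketch of the difference, run against one of the sample links from the original question. The variable names are only illustrative.

```perl
use warnings;
use strict;

my $line = "[[Technical FAQs|Technical FAQs]]\n";

# $1 and $2 in the replacement. \1 would still work there, but
# "use warnings" reports it as "\1 better written as $1".
(my $fixed = $line) =~ s/\[\[([^:]+?)\|(.+?\]\])/[[General:$1|$2/g;
print $fixed;   # [[General:Technical FAQs|Technical FAQs]]

# Inside the pattern itself, \1 is the right tool: this matches a link
# whose page name and label are identical.
print "same name and label\n" if $line =~ /\[\[([^|\]]+)\|\1\]\]/;
```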

print $line;
}

close IN;

exit;

John

mxyzplk · Mar 5, 2008

Thanks John, better style noted! (I'm often working off an old first
edition of the O'Reilly Programming Perl book, so my idioms are sadly
decrepit.)

Ernest

Martien Verbruggen · Mar 6, 2008

That's good advice.

Isn't it? You just made me believe it is. ;-)

Note that I said 'all the time'

Martien
