Regexp to combine table cells

  • Thread starter Bart Van der Donck

Bart Van der Donck

Hello,

I'm having difficulties to find a regular expression starting from the following input:

my $row = '
<tr>
<td>UNIQUESTRING</td>
<td>A</td>
<td>A</td>
<td>A</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>A</td>
<td>A</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
';

What I would like to achieve:

<tr>
<td>UNIQUESTRING</td>
<td colspan="3">A</td>
<td colspan="5">B</td>
<td colspan="2">A</td>
<td colspan="2"></td>
<td>C</td>
<td></td>
</tr>

Thanks
 
Peter Makholm

Bart Van der Donck said:
Hello,

I'm having difficulties to find a regular expression starting from the
following input:

Regular expressions are really not the right tool for that task. It can
be done with Perl regular expressions, but it won't be pretty.

You should look at a real HTML parser instead. My personal preference
would be to use HTML::TreeBuilder to parse your HTML. This would leave
you with a navigable Perl structure where you can iterate over the
<td> elements and then generate a new list of <td> elements.

Should be quite simple...

//Makholm
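
The parser-based approach suggested above might look roughly like the following. This is only a sketch: the helper name merge_row is hypothetical, it assumes the CPAN module HTML::TreeBuilder is installed, and it assumes cells hold plain text (no nested markup or attributes that need to be preserved).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Hypothetical helper: parse one <tr>, group adjacent cells with equal text,
# and emit a fresh row. Cell text is re-emitted without entity escaping,
# which is enough for plain values such as A, B and C.
sub merge_row {
    my ($row_html) = @_;

    # Wrapping the row in <table> gives the parser a valid context for <tr>.
    my $tree = HTML::TreeBuilder->new_from_content("<table>$row_html</table>");

    my @runs;    # each entry: [ cell text, number of adjacent repeats ]
    for my $td ( $tree->look_down( _tag => 'td' ) ) {
        my $text = $td->as_text;
        if ( @runs && $runs[-1][0] eq $text ) { $runs[-1][1]++ }
        else                                  { push @runs, [ $text, 1 ] }
    }
    $tree->delete;    # free the parse tree

    my @cells = map {
        my ( $text, $count ) = @$_;
        $count > 1 ? qq{<td colspan="$count">$text</td>} : "<td>$text</td>";
    } @runs;
    return join "\n", '<tr>', @cells, '</tr>';
}
```

Calling merge_row($row) on the row from the question returns the merged row as a string. Building new <td> strings from the collected runs keeps the example independent of how the parser would serialise a modified tree.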
 
gamo

On 09/05/14 10:59, Bart Van der Donck wrote:
Hello,

I'm having difficulties to find a regular expression starting from the following input:

my $row = '
<tr>
<td>UNIQUESTRING</td>
<td>A</td>
<td>A</td>
<td>A</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>A</td>
<td>A</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
';

What I would like to achieve:

<tr>
<td>UNIQUESTRING</td>
<td colspan="3">A</td>
<td colspan="5">B</td>
<td colspan="2">A</td>
<td colspan="2"></td>
<td>C</td>
<td></td>
</tr>

Thanks

There is poetry out there remarking that html is not regex.
Here is how I would do this specific example, in 3 steps:
1) Deleting all html tags with tr///, and UNIQUESTRING
2) Counting the sequence which results
3) Printing the new table with the counters

HTH
 
Ben Bacarisse

Bart Van der Donck said:
Hello,

I'm having difficulties to find a regular expression starting from the
following input:

my $row = '
<tr>
<td>UNIQUESTRING</td>
<td>A</td>
<td>A</td>
<td>A</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>A</td>
<td>A</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
';

What I would like to achieve:

<tr>
<td>UNIQUESTRING</td>
<td colspan="3">A</td>
<td colspan="5">B</td>
<td colspan="2">A</td>
<td colspan="2"></td>
<td>C</td>
<td></td>
</tr>

As has been said, REs are probably not the right solution. They often
result in fragile code when used with HTML. Still, it's an interesting
question, so here's an answer:

$row =~ s! <td> ([^<]*) </td> ((?: \s* <td>\1</td> )+)
! "<td colspan='" . ((() = $2 =~ /<td>/g) + 1) . "'>$1</td>" !xeg;

If the A, B and C of your examples are actually very much more complex
then my [^<]* won't do and the whole thing will look a lot worse.
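
For readers who want to try this, here is a slightly more defensive sketch of the same idea. The function name merge_cells is hypothetical. It copies $1 and $2 into lexical variables before running the inner match, so the result does not depend on how the capture variables behave after a nested match, and it keeps the same restriction to simple cells of the form <td>TEXT</td>.

```perl
use strict;
use warnings;

# Collapse runs of identical simple cells into one cell with a colspan.
# Assumes cells are exactly <td>TEXT</td>: no attributes, no nested tags.
sub merge_cells {
    my ($row) = @_;
    $row =~ s{ <td> ([^<]*) </td> ( (?: \s* <td> \1 </td> )+ ) }{
        my ( $text, $rest ) = ( $1, $2 );     # copy before the inner match runs
        my $extra = () = $rest =~ /<td>/g;    # number of repeated cells
        sprintf '<td colspan="%d">%s</td>', $extra + 1, $text;
    }xeg;
    return $row;
}

my $row = join "\n", '<tr>', '<td>UNIQUESTRING</td>',
    ( map "<td>$_</td>", 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', '', '', 'C', '' ),
    '</tr>';
print merge_cells($row), "\n";
```

The sprintf form also uses double quotes around the colspan value, which matches the output asked for in the question.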
 
Tim McDaniel

gamo said:
my $row = '
<tr>
<td>UNIQUESTRING</td>
<td>A</td>
<td>A</td>
<td>A</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>A</td>
<td>A</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
';
....
There is poetry out there remarking that html is not regex.

Not regexpable in general. "This ... is wrong tool. Never use this."
1) Deleting all html tags with tr///, and UNIQUESTRING

tr/// does single-character substitution. You can't get rid of the
HTML tags with tr. s///, probably, if there's nothing like < inside
comments.
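
A small sketch of the difference, using a made-up cell value: tr/// with /d deletes individual characters wherever they occur, while s/// can delete whole tags.

```perl
use strict;
use warnings;

my $cell = '<td>data</td>';

# tr/// works on single characters: this deletes every '<', 't', 'd', '>' and '/'.
( my $with_tr = $cell ) =~ tr{<td>/}{}d;

# s/// works on patterns: this deletes each whole tag, assuming there is
# no '>' inside attribute values and no '<' inside comments.
( my $with_s = $cell ) =~ s/<[^>]*>//g;

print "tr: $with_tr\n";    # prints "tr: aa"
print "s:  $with_s\n";     # prints "s:  data"
```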
 
gamo

On 09/05/14 18:11, Tim McDaniel wrote:
tr/// does single-character substitution. You can't get rid of the
HTML tags with tr. s///, probably, if there's nothing like < inside
comments.

Thank you for the correction.
 
George Mpouras

# have fun !


use strict;
use warnings;


my $row = '
<tr>
<td>UNIQUESTRING</td>
<td>A</td>
<td>A</td>
<td>A</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>A</td>
<td>A</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
';



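my @data;                               # one { value => count } hash per run of equal cells
my $regex = qr/<td>(.*?)<\/td>/;        # non-greedy capture of one cell's content
my $i = 0;
my $out = "<tr>\n";
my $match_previous = 'UNIQUESTRING';    # assumes the first cell is UNIQUESTRING


while ( $row =~ /$regex/g )
{
    $i++ if $match_previous ne $1;      # value changed: start a new run
    $data[$i]->{ $1 }++;
    $match_previous = $1;
}


foreach (@data) {
    my ($k,$v) = each %{$_};            # each hash holds exactly one key
    $out .= $v == 1 ? "<td>$k</td>\n" : "<td colspan=\"$v\">$k</td>\n";
}

$out .= '</tr>';


print $out;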
 
Bart Van der Donck

Thanks to everyone for the input. Ben's solution is technically brilliant, but indeed the A/B/C are more complex; and the <td>'s have their arguments as well. George's approach fails at EOL, but appears to be okay with an initial

$row =~ tr/\015\012//d;
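
The post does not say why the end-of-line case fails. One plausible reason (an assumption) is that `.` does not match a newline unless the /s modifier is given, so a cell whose content spans lines is never captured. A minimal sketch with a made-up two-line cell:

```perl
use strict;
use warnings;

my $cell = "<td>line one\nline two</td>";

my @plain  = $cell =~ m{<td>(.*?)</td>}g;     # '.' refuses to cross "\n"
my @dotall = $cell =~ m{<td>(.*?)</td>}gs;    # /s lets '.' match "\n"

( my $flat = $cell ) =~ tr/\015\012//d;       # strip CR and LF first instead
my @flat = $flat =~ m{<td>(.*?)</td>}g;

# prints "plain: 0, with /s: 1, flattened: 1"
printf "plain: %d, with /s: %d, flattened: %d\n",
    scalar @plain, scalar @dotall, scalar @flat;
```

Stripping the line ends changes the cell content (the two lines are joined), whereas /s keeps it intact. Which one is preferable depends on whether the line breaks inside cells matter.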

Then still a trick to handle the arguments of each <td>. Fortunately I'm in a situation where identical cell values always have identical <td>-arguments:

# at init
my $row = '
<tr>
<td>§ class="c1" §A</td>
<td>§ class="c1" §A</td>
<td>§ class="c2" §B</td>
<td></td>
</tr>
';

# final regex
$out =~ s/>§ (.*?) §/ $1>/g;

And I believe that should do it...
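
To illustrate the last step, here is a sketch of the marker substitution applied to a row that has already been merged. The input lines are made up for the example, and the § markers are matched as plain bytes (no `use utf8`).

```perl
use strict;
use warnings;

my $merged = join "\n",
    '<td colspan="2">§ class="c1" §A</td>',
    '<td>§ class="c2" §B</td>',
    '<td></td>';

# The replacement starts with a space so the attributes do not run into
# "<td" or into a preceding colspan="...".
( my $final = $merged ) =~ s/>§ (.*?) §/ $1>/g;

print "$final\n";
```

This moves each marked attribute string into its opening tag and leaves cells without markers untouched.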
 
Kaz Kylheku

Hello,

I'm having difficulties to find a regular expression starting from the following input:

my $row = '
<tr>
<td>UNIQUESTRING</td>
<td>A</td>
<td>A</td>
<td>A</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>A</td>
<td>A</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
';

TXR language, version 89:

@(output :into str)
<tr>
<td>UNIQUESTRING</td>
<td>A</td>
<td>A</td>
<td>A</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>B</td>
<td>A</td>
<td>A</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
@(end)
@(next :list str)
<tr>
<td>@header</td>
@(collect :vars (item count))
@(bind count 1)
<td>@item</td>
@(collect :gap 0)
<td>@item</td>
@(do (inc count))
@(end)
@(until)
</tr>
@(end)
@(output)
<tr>
<td>@header</td>
@(repeat :vars (count))
<td@(if (> count 1) ` colspan="@count"` "")>@item</td>
@(end)
</tr>
@(end)


Output:

<tr>
<td>UNIQUESTRING</td>
<td colspan="3">A</td>
<td colspan="5">B</td>
<td colspan="2">A</td>
<td colspan="2"></td>
<td>C</td>
<td></td>
</tr>
 
