Regexp to combine table cells

Discussion in 'Perl Misc' started by Bart Van der Donck, May 9, 2014.

  1. Hello,

    I'm having difficulty finding a regular expression for the following input:

    my $row = '
    <tr>
    <td>UNIQUESTRING</td>
    <td>A</td>
    <td>A</td>
    <td>A</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>A</td>
    <td>A</td>
    <td></td>
    <td></td>
    <td>C</td>
    <td></td>
    </tr>
    ';

    What I would like to achieve:

    <tr>
    <td>UNIQUESTRING</td>
    <td colspan="3">A</td>
    <td colspan="5">B</td>
    <td colspan="2">A</td>
    <td colspan="2"></td>
    <td>C</td>
    <td></td>
    </tr>

    Thanks
     
    Bart Van der Donck, May 9, 2014
    #1

  2. Regular expressions are really not the right tool for that task. It can
    be done with Perl regular expressions, but it won't be pretty.

    You should look at a real HTML parser instead. My personal preference
    would be to use HTML::TreeBuilder to parse your HTML. This would leave
    you with a navigable Perl structure where you can iterate over the
    <td> elements and then generate a new list of <td> elements.

    Should be quite simple...

    //Makholm
     
    Peter Makholm, May 9, 2014
    #2

  3. On 09/05/14 10:59, Bart Van der Donck wrote:
    There is poetry out there remarking that HTML is not regex.
    Here is how I would do this specific example, in 3 steps:
    1) Deleting all HTML tags with tr///, along with UNIQUESTRING
    2) Counting the runs in the resulting sequence
    3) Printing the new table with the counters
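
    A minimal sketch of those three steps, assuming plain <td> cells
    without attributes. The name collapse_row is made up for the
    illustration, and step 1 uses a capturing match to pull out the cell
    texts instead of deleting tags. The header cell is not treated
    specially; it is assumed to differ from the first data cell.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: run-length encode the cell texts of one row.
    sub collapse_row {
        my ($row) = @_;

        # Step 1: collect the text of every cell, in order.
        my @cells = $row =~ m{<td>(.*?)</td>}g;

        # Step 2: count consecutive repeats as [text, count] pairs.
        my @runs;
        for my $text (@cells) {
            if (@runs && $runs[-1][0] eq $text) { $runs[-1][1]++ }
            else                                { push @runs, [$text, 1] }
        }

        # Step 3: build the new row, adding colspan only to runs longer than 1.
        my $out = "<tr>\n";
        for my $run (@runs) {
            my ($text, $count) = @$run;
            my $span = $count > 1 ? qq{ colspan="$count"} : '';
            $out .= "<td$span>$text</td>\n";
        }
        return $out . "</tr>\n";
    }

    print collapse_row("<tr>\n<td>U</td>\n<td>A</td>\n<td>A</td>\n<td></td>\n</tr>\n");
    ```

    Keeping the runs as an ordered list of pairs means a value that
    shows up again later (A, B, A) starts a fresh run.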

    HTH
     
    gamo, May 9, 2014
    #3
  4. As has been said, REs are probably not the right solution. They often
    result in fragile code when used with HTML. Still, it's an interesting
    question, so here's an answer:

    $row =~ s! <td> ([^<]*) </td> ((?: \s* <td>\1</td> )+)
    ! "<td colspan='" . ((() = $2 =~ /<td>/g) + 1) . "'>$1</td>" !xeg;

    If the A, B and C of your examples are actually very much more complex
    then my [^<]* won't do and the whole thing will look a lot worse.
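
    For readers unfamiliar with the "() =" counting idiom used above,
    here is a spelled-out sketch of the same substitution. The name
    merge_cells is made up for the illustration. The captures are copied
    into lexicals before the inner match runs, so that it cannot disturb
    $1 and $2, and the colspan value is written with double quotes.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: merge runs of identical <td> cells into one cell.
    sub merge_cells {
        my ($row) = @_;
        $row =~ s{ <td> ([^<]*) </td> ( (?: \s* <td> \1 </td> )+ ) }{
            # Copy the captures before running another match.
            my ($text, $repeats) = ($1, $2);
            # Assigning a list assignment to a scalar yields the match count.
            my $extra = () = $repeats =~ /<td>/g;
            sprintf '<td colspan="%d">%s</td>', $extra + 1, $text;
        }xeg;
        return $row;
    }

    print merge_cells("<tr>\n<td>A</td>\n<td>A</td>\n<td>B</td>\n</tr>\n");
    ```

    A cell that is not followed by a copy of itself never matches, so it
    is left alone and gets no colspan attribute.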
     
    Ben Bacarisse, May 9, 2014
    #4
  5. > There is poetry out there remarking that html is not regex.

    Not regexpable in general. "This ... is wrong tool. Never use this."
    tr/// does single-character substitution, so you can't get rid of the
    HTML tags with tr. s/// would probably work, as long as there's nothing
    like < inside comments.
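
    A small sketch of the difference, using a made-up cell for
    illustration: tr/// sees "<td>" as a set of four single characters,
    while s/// matches each tag as a unit.

    ```perl
    use strict;
    use warnings;

    my $cell = '<td>data</td>';

    # tr/// deletes every <, t, d and > character, wherever it occurs.
    (my $with_tr = $cell) =~ tr/<td>//d;

    # s/// removes whole tags and leaves the cell text untouched.
    (my $with_s = $cell) =~ s/<[^>]*>//g;

    print "tr: $with_tr\n";   # prints "tr: aa/"
    print "s:  $with_s\n";    # prints "s:  data"
    ```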
     
    Tim McDaniel, May 9, 2014
    #5
  6. On 09/05/14 18:11, Tim McDaniel wrote:
    Thank you for the correction.
     
    gamo, May 9, 2014
    #6
  7. # have fun !


    use strict;
    use warnings;


    my $row = '
    <tr>
    <td>UNIQUESTRING</td>
    <td>A</td>
    <td>A</td>
    <td>A</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>A</td>
    <td>A</td>
    <td></td>
    <td></td>
    <td>C</td>
    <td></td>
    </tr>
    ';



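    my @data;                              # one single-key hash per run: { text => count }
    my $regex = qr/<td>(.*?)<\/td>/;
    my $i = 0;
    my $out = "<tr>\n";
    my $match_previous = 'UNIQUESTRING';   # must equal the first cell so that $i starts at 0

    # $^N holds the text of the most recently closed capture group
    while ( $row =~ /$regex/g ) {
        $i++ if $match_previous ne $^N;    # a new run starts when the text changes
        $data[$i]->{ $^N }++;
        $match_previous = $^N;
    }

    foreach (@data) {
        my ($k, $v) = each %{$_};
        $out .= $v == 1 ? "<td>$k</td>\n" : "<td colspan=\"$v\">$k</td>\n";
    }

    $out .= '</tr>';

    print $out;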
     
    George Mpouras, May 9, 2014
    #7
  8. Thanks to everyone for the input. Ben's solution is technically brilliant, but the A/B/C are indeed more complex, and the <td>s have attributes as well. George's approach fails at EOL, but appears to be okay with an initial

    $row =~ tr/\015\012//d;

    That leaves a trick to handle the attributes of each <td>. Fortunately I'm in a situation where identical cell values always have identical <td> attributes:

    # at init
    my $row = '
    <tr>
    <td>§ class="c1" §A</td>
    <td>§ class="c1" §A</td>
    <td>§ class="c2" §B</td>
    <td></td>
    </tr>
    ';

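    # final regex: move each marked attribute string into its <td> tag
    # (the space before $1 keeps the attributes apart from what precedes them)
    $out =~ s/>§ (.*?) §/ $1>/g;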

    And I believe that should do it...
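
    A minimal sketch of the marker round trip on two already-merged
    cells, assuming the script is saved in the same encoding as its data
    so that § compares equal byte for byte. The name restore_attributes
    is made up for the illustration. Note the space before $1 in the
    replacement; without it the attributes would run straight into
    whatever precedes them in the tag.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: turn ">§ attrs §" back into real tag attributes.
    sub restore_attributes {
        my ($html) = @_;
        # Consume the ">" and both markers, then re-emit " attrs>".
        $html =~ s/>§ (.*?) §/ $1>/g;
        return $html;
    }

    print restore_attributes(qq{<td colspan="2">§ class="c1" §A</td>\n});
    print restore_attributes(qq{<td>§ class="c2" §B</td>\n});
    ```

    Cells without a marker, such as <td></td>, pass through unchanged.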
     
    Bart Van der Donck, May 10, 2014
    #8
  9. TXR language, version 89:

    @(output :into str)
    <tr>
    <td>UNIQUESTRING</td>
    <td>A</td>
    <td>A</td>
    <td>A</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    <td>A</td>
    <td>A</td>
    <td></td>
    <td></td>
    <td>C</td>
    <td></td>
    </tr>
    @(end)
    @(next :list str)
    <tr>
    <td>@header</td>
    @(collect :vars (item count))
    @(bind count 1)
    <td>@item</td>
    @(collect :gap 0)
    <td>@item</td>
    @(do (inc count))
    @(end)
    @(until)
    </tr>
    @(end)
    @(output)
    <tr>
    <td>@header</td>
    @(repeat :vars (count))
    <td@(if (> count 1) ` colspan="@count"` "")>@item</td>
    @(end)
    </tr>
    @(end)


    Output:

    <tr>
    <td>UNIQUESTRING</td>
    <td colspan="3">A</td>
    <td colspan="5">B</td>
    <td colspan="2">A</td>
    <td colspan="2"></td>
    <td>C</td>
    <td></td>
    </tr>
     
    Kaz Kylheku, May 12, 2014
    #9