Splitting paragraph into array.

Sandman

I am splitting a text block into paragraphs, to be able to add images and stuff
like that to a specific paragraph in a content management system.

Well, right now I'm splitting on two or more newlines, so this text block
(indentation added for clarity):

Hello, my nickname is Sandman and I am coding
some Perl

Call me

Would be split into two parts, with "Call me" being the second one.
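
For reference, the current rule could be sketched in Perl roughly like this. The variable names ($text, @parts) are made up for illustration and are not taken from the actual CMS code.

```perl
use strict;
use warnings;

# Hypothetical input: the two-paragraph example from above.
my $text = "Hello, my nickname is Sandman and I am coding\n"
         . "some Perl\n"
         . "\n"
         . "Call me\n";

# Split wherever two or more consecutive newlines occur.
my @parts = split /\n{2,}/, $text;

print scalar(@parts), " parts\n";
print "[$_]\n" for @parts;
```

A single newline inside a paragraph is left alone; only runs of two or more act as the delimiter.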

My problem now is that if I have a text block like below:

Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:

<code>
print "Hello World!";

print "Foo";
</code>

Call me

The above would, given the rules I use now, yield four parts, as such:

---------------------------------------------
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:
---------------------------------------------
<code>
print "Hello World!";
---------------------------------------------
print "Foo";
</code>
---------------------------------------------
Call me
---------------------------------------------

But I would want it to end up in three parts, as such:

---------------------------------------------
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:
---------------------------------------------
<code>
print "Hello World!";

print "Foo";
</code>
---------------------------------------------
Call me
---------------------------------------------

So, basically, what I want to do is to split the text block up with the
delimiter "\n{2,}" but not when it is inside an *unclosed* html tag. Some
examples:


<div class='quote'>
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:

<code>
print "Hello World!";

print "Foo";
</code>

Call me
</div>

Ends up in:

---------------------------------------------
<div class='quote'>
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:

<code>
print "Hello World!";

print "Foo";
</code>

Call me
</div>
---------------------------------------------

And

<div class='quote'>
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:

<code>
print "Hello World!";

print "Foo";
</code>
</div>

Call me

Ends up in:

---------------------------------------------
<div class='quote'>
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:

<code>
print "Hello World!";

print "Foo";
</code>
</div>
---------------------------------------------
Call me
---------------------------------------------


Hopefully you get the idea.

Any ideas on how to solve it?
 
gnari

[balanced tags]

Is the code HTML-encoded, or can this happen?


foo

<code>
$a = "</code>";

print "$a\n<code>";

print "\n";
</code>

bar

<code>
$b = "</code>";

print "$b\n<code>";

print "\n";
</code>

fubar


gnari
 
Sandman

[balanced tags]

Is the code HTML-encoded, or can this happen?


foo

<code>
$a = "</code>";

print "$a\n<code>";

print "\n";
</code>

bar

<code>
$b = "</code>";

print "$b\n<code>";

print "\n";
</code>

fubar

That could happen, but it's pretty unlikely.

I have a working version now that iterates through each line. If a line has a
start tag but no end tag, it adds 1 to a counter (and an end tag without a
start tag takes it back down), and the aggregated lines are only added as a
paragraph when this counter is zero.
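
A minimal sketch of that kind of line-by-line counter, to make the idea concrete. This is not the poster's actual code: the names split_paragraphs and %void, and the list of tags treated as never being closed, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumption: tags that never get a closing tag must not change the depth.
my %void = map { $_ => 1 } qw(br hr img input);

sub split_paragraphs {
    my ($text) = @_;
    my (@paragraphs, @buffer);
    my $depth = 0;

    for my $line (split /\n/, $text) {
        # A blank line outside any open tag ends the current paragraph.
        if ($line !~ /\S/ && $depth == 0) {
            push @paragraphs, join("\n", @buffer) if @buffer;
            @buffer = ();
            next;
        }
        push @buffer, $line;

        # Count the opening (non-void) and closing tags on this line.
        my $opened = grep { !$void{lc $_} }
                     $line =~ /<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
        my $closed = () = $line =~ m{</[a-zA-Z][a-zA-Z0-9]*\s*>}g;

        $depth += $opened - $closed;
        $depth = 0 if $depth < 0;    # tolerate stray closing tags
    }
    push @paragraphs, join("\n", @buffer) if @buffer;
    return @paragraphs;
}

# Usage with the <code> example from the first post.
my $example = join "\n",
    'Hello, my nickname is Sandman and I am coding',
    'some Perl. Here is an example:',
    '',
    '<code>',
    'print "Hello World!";',
    '',
    'print "Foo";',
    '</code>',
    '',
    'Call me';

my @result = split_paragraphs($example);
print "$_\n", '-' x 45, "\n" for @result;
```

Like any counter that looks at raw text, this sketch is fooled by tags that appear inside string literals, so it would still get the "</code>" in quotes case wrong.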

Your above example outputs this:

Debug:
0: foo
1: <code>
0: = "</code>";
0: print "
1: <code>";
1:
1: print "
1: ";
0: </code>
0: bar
1: <code>
0: = "</code>";
0: print "
1: <code>";
1:
1: print "
1: ";
0: </code>

Paragraphs:
---------------
foo
---------------
<code>
= "</code>";
---------------
print "
<code>";

print "
";
</code>
---------------
bar
---------------
<code>
= "</code>";
---------------
print "
<code>";

print "
";
</code>
---------------

Which is completely wrong. But this text:

-------------------------------------------------------------
Hello, my nickname is Sandman, and I like PHP, some examples:

<code>
print "Hello World";

print "Foobar";
</code>

Here are nested tags:

<quote>
<quote>
He said he liked flowers
</quote>

Well, he doesn't, ok.

<quote>I like them</quote>

Good for you
</quote>

<div class="paragraph">
Nice paragraph
</div>

<img src="foo.jpg"> <- Nice pic!
-------------------------------------------------------------

Outputs this:

Debug:
0: Hello, my nickname is Sandman, and I like PHP, some examples:
1: <code>
1: print "Hello World";
1:
1: print "Foobar";
0: </code>
0: Here are nested tags:
1: <quote>
2: <quote>
2: He said he liked flowers
1: </quote>
1:
1: Well, he doesn't, ok.
1:
1: <quote>I like them</quote>
1:
1: Good for you
0: </quote>
1: <div class="paragraph">
1: Nice paragraph
0: </div>
0: <img src="foo.jpg"> <- Nice pic!

Paragraphs:
---------------
Hello, my nickname is Sandman, and I like PHP, some examples:
---------------
<code>
print "Hello World";

print "Foobar";
</code>
---------------
Here are nested tags:
---------------
<quote>
<quote>
He said he liked flowers
</quote>

Well, he doesn't, ok.

<quote>I like them</quote>

Good for you
</quote>
 
Tad McClellan

Well, right now I'm splitting on two or more newlines,
My problem now is that if I have a text block like below:

Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:

<code>
print "Hello World!";

print "Foo";
</code>

Call me

The above would, given the rules I use now, yield four parts, as such:

---------------------------------------------
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:
---------------------------------------------
<code>
print "Hello World!";
---------------------------------------------
print "Foo";
</code>
---------------------------------------------
Call me
---------------------------------------------

But I would want it to end up in three parts, as such:

---------------------------------------------
Hello, my nickname is Sandman and I am coding
some Perl. Here is an example:
---------------------------------------------
<code>
print "Hello World!";

print "Foo";
</code>
Any ideas on how to solve it?


foreach ( grep {defined and length} split m#\n{2,}|(<code>.*?</code>)#s, $txt )


Buggy and fragile, but that is to be expected when processing HTML
without a real parser. (hint: you should use an HTML::* module
for processing HTML data).
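
To see what the split expression above does, here is a small self-contained sketch that feeds it the example text. The loop body and the sample $txt are assumptions added for illustration; only the split/grep expression comes from the post.

```perl
use strict;
use warnings;

# Hypothetical input: the example with a blank line inside <code>.
my $txt = join "\n",
    'Hello, my nickname is Sandman and I am coding',
    'some Perl. Here is an example:',
    '',
    '<code>',
    'print "Hello World!";',
    '',
    'print "Foo";',
    '</code>',
    '',
    'Call me';

# The capture group makes split return each <code>...</code> block as a
# field of its own; grep drops the undef and empty fields that the
# alternation leaves behind.
my @parts = grep { defined and length }
            split m#\n{2,}|(<code>.*?</code>)#s, $txt;

print "$_\n", '-' x 45, "\n" for @parts;
```

This only protects <code> blocks. Other containers such as <div> or nested tags are still split on blank lines, which is part of why it is called fragile.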
 
