Suggestions for improving a trivial tag parser

Gregory Brown

Hi folks,

I need to take strings that have embedded <i> and <b> tags and split
them out into an array.
Here are some examples:

# No need to ensure valid pairs, just need to break out tags from the text
#----------------------------------------------------------------------------------------------

describe "Inline style parsing" do
  it "should return an identical string if inline styles are not detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello World").should == "Hello World"
  end

  it "should return an array of segments when a style is detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello <i>Fine</i> World").should ==
      ["Hello ", "<i>", "Fine", "</i>", " World"]
  end

  it "should create an array of segments when multiple styles are detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello <i>Fine <b>World</b></i>").should ==
      ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]
  end
end

###

I fear my implementation (below) is showing one of my weakest areas in
Ruby, and probably even has some unforeseen problems. I'm sure it can
be done more accurately in less code. Any kind RubyTalk folks want to
school me?

###

def parse_inline_styles(text) #:nodoc:
  require "strscan"

  sc = StringScanner.new(text)
  output = []
  last_pos = 0

  loop do
    if sc.scan_until(/<\/?[ib]>/)
      pre = sc.pre_match[last_pos..-1]
      output << pre unless pre.empty?
      output << sc.matched
      last_pos = sc.pos
    else
      output << sc.rest if sc.rest?
      break output
    end
  end

  output.length == 1 ? output.first : output
end
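
A note on why last_pos is needed in the loop above: StringScanner#pre_match returns everything before the last match, measured from the start of the string rather than from the previous scan position. A minimal sketch of that behaviour follows; the variable names are illustrative and not from the original post.

```ruby
require "strscan"

sc = StringScanner.new("Hello <i>Fine</i> World")

sc.scan_until(/<\/?[ib]>/)   # consumes up to and including "<i>"
first_pre = sc.pre_match     # text before "<i>"
first_pos = sc.pos           # scan position just after "<i>"

sc.scan_until(/<\/?[ib]>/)   # consumes up to and including "</i>"
second_pre = sc.pre_match    # text before "</i>", counted from the string start

# Slicing from the previous scan position keeps only the new text.
new_text = second_pre[first_pos..-1]

puts first_pre.inspect
puts second_pre.inspect
puts new_text.inspect
```

Without the slice, every segment after the first would repeat the text and tags that came before it.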

###

Thanks,
-greg
 
Robert Dober

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

HTH
Robert
 
Gregory Brown

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

I needed something a little more restricted than that, but you gave me
almost exactly what I need:

def parse_inline_styles(text) #:nodoc:
segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
segments.size == 1 ? segments.first : segments
end

This passes my specs, and so long as people don't see any major issues
with it, it looks great.

I knew there had to be a way to do this with split. Thanks Robert.

-greg
 
Gregory Brown

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

I needed something a little more restricted than that, but you gave me
almost exactly what I need:

def parse_inline_styles(text) #:nodoc:
segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
segments.size == 1 ? segments.first : segments
end

Whoops, make that:

def parse_inline_styles(text) #:nodoc:
segments = text.split( %r{(</?[ib]>)} ).delete_if{|x| x.empty? }
segments.size == 1 ? segments.first : segments
end

*Slaps head*. I totally get what I was missing out on before: when
you use groupings, split includes the matched segments:

"kitten robot snake robot tree robot".split(/(robot)/)
=> ["kitten ", "robot", " snake ", "robot", " tree ", "robot"]
"kitten robot snake robot tree robot".split(/robot/)
=> ["kitten ", " snake ", " tree "]
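
As a small illustration of why the empty-string filter is still needed with a capturing split: adjacent tags, or a tag at the very start of the string, leave empty strings in the result. This is only a sketch; the constant TAG and the variable names are made up for the example.

```ruby
TAG = %r{(</?[ib]>)}

# Adjacent closing tags produce an empty string between them.
raw = "Hello <i>Fine <b>World</b></i>".split(TAG)

# A string that starts with a tag produces a leading empty string.
leading = "<b>Bold</b> text".split(TAG)

# Dropping the empty strings gives the segments the specs expect.
cleaned = raw.reject(&:empty?)

p raw
p leading
p cleaned
```

A non-capturing group such as (?:...) would behave like the plain /robot/ case and drop the tags from the result.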
 
Rolando Abarca

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

I think your regexp is wrong, since it (incorrectly) parses empty tags:

"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine", "<>", " ", "<b>", "World", "</b>", "</i>"]

I would try something like:

"Hello <i>Fine <b>World</b></i>".split( %r{(</?[bi]>)} ).delete_if{|x| x.empty? }
=> ["Hello ", ...
"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[bi]>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine<> ", "<b>", "World", "</b>", "</i>"]

Or:

"Hello <i>Fine <b>World</b></i>".split( %r{(</?[^>]+>)} ).delete_if{|x| x.empty? }
=> ["Hello ", ...
"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[^>]+>)} ).delete_if{|x| x.empty? }
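
To compare the three patterns mentioned in this post side by side, here is a minimal sketch. The constant names and the helper segments are hypothetical, introduced only for this example.

```ruby
LAZY_ANY  = %r{(</?.*?>)}    # the pattern from Robert's first reply
ONLY_I_B  = %r{(</?[bi]>)}   # only <i>, </i>, <b>, </b>
NON_EMPTY = %r{(</?[^>]+>)}  # any tag with at least one character inside

# Split on the pattern, keep the matched tags, drop empty strings.
def segments(text, pattern)
  text.split(pattern).reject(&:empty?)
end

input = "Hello <i>Fine<> <b>World</b></i>"

lazy      = segments(input, LAZY_ANY)
only_i_b  = segments(input, ONLY_I_B)
non_empty = segments(input, NON_EMPTY)

p lazy
p only_i_b
p non_empty
```

Only the first pattern treats "<>" as a tag. The other two leave it inside the text segment "Fine<> ". The third pattern would also split on tags other than <i> and <b>, which may or may not be wanted.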


Now, I would bet that this might be a little too expensive with large
strings.
regards,
 
ara.t.howard

my take:


cfp:~ > cat a.rb
require 'yaml'

strings =
  "Hello World",
  "Hello <i>Fine</i> World",
  "Hello <i>Fine <b>World</b></i>"

def parse_inline_styles string, tags = %w(<i> </i> <b> </b>)
  re = Regexp.new tags.flatten.map{|tag| "(#{ Regexp.escape tag })"}.join('|')
  tokens = string.split(re)
  tokens.delete_if{|token| token.empty?}
  ((tokens.size == 1 and tokens.first == string) ? string : tokens)
end

strings.each do |string|
  y string => parse_inline_styles(string)
end







cfp:~ > ruby a.rb
---
Hello World: Hello World
---
Hello <i>Fine</i> World:
- "Hello "
- <i>
- Fine
- </i>
- " World"
---
Hello <i>Fine <b>World</b></i>:
- "Hello "
- <i>
- "Fine "
- <b>
- World
- </b>
- </i>
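
For comparison, the same configurable-tag idea can be sketched with Regexp.union, which escapes each string and joins them with "|", so a single capture group is enough. This is an alternative sketch under that assumption, not the code from this post.

```ruby
# Hypothetical variant of parse_inline_styles using Regexp.union.
def parse_inline_styles(string, tags = %w(<i> </i> <b> </b>))
  # One capture group around the alternation keeps the tags in the result.
  re = /(#{Regexp.union(tags)})/
  tokens = string.split(re).reject(&:empty?)
  (tokens.size == 1 && tokens.first == string) ? string : tokens
end

p parse_inline_styles("Hello World")
p parse_inline_styles("Hello <i>Fine</i> World")
p parse_inline_styles("a<u>b</u>", %w(<u> </u>))
```

Passing a different tags array, as in the last call, changes which tags are split out without touching the method body.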




a @ http://codeforpeople.com/
 
Robert Dober

I think your regexp is wrong, since it (incorrectly) parses empty tags:

It passed the specs, did it not? I did not know what Gregory wanted
exactly; turns out he wanted %r{(</?[ib]>)}
but he got the message ;).

Now, I would bet that this might be a little too expensive with large
strings.

Hmm, why?
It would be different if we were treating files, but as the string is
already here we use the memory required by the
specification and nothing more.

Robert
 
Gregory Brown

I think your regexp is wrong, since it (incorrectly) parses empty tags:
It passed the specs did it not? I did not know what Gregory wanted
exactly, turns out he wanted %r{(</?[ib]>)}
but he got the message ;).

My implementation was tighter than my specs, but I added an extra one
to catch this. :)
It would be different if we were treating files, but as the string is
already here we use the memory required by the
specification and nothing more.

It does a double pass through the segments rather than a single pass,
and I guess that if I had a giant string with a ton of tags I needed
to parse, that'd make it less efficient.

However, I think it'll be okay for my purposes (PDF inline styling),
unless I missed some other concern Rolando had.
 
Robert Dober

It does a double pass through the segments rather than a single pass,
Yes but these two passes are quite fast, see below.
However, I think it'll be okay for my purposes (PDF inline styling),
unless I missed some other concern Rolando had.
Well trying to be useful I checked for some larger texts, I omitted
the conditional #first at the end of the parsing method for clarity.

Turns out that split is still much better than scanning for a string
of a size over one megabyte, but it is not
really fast either:
539/39 > cat split.rb && ruby split.rb

require 'benchmark'

def split_select txt
  txt.split(%r{(</?[bi]>)}).select{|x| ! x.empty? }
end

def split_delete txt
  txt.split(%r{(</?[bi]>)}).delete_if{|x| x.empty? }
end

def use_scan txt
  r = []
  txt.scan(%r{(.*?)(</?[ib]>)}) do | pr, po |
    r << pr unless pr.empty?
    r << po
  end
  r << $' unless $'.empty?
end

N = 400_000
text = "<b>bold<i>italicbold<i>bold<b>normal" * N

Benchmark.bmbm do | bm |
  bm.report("split_select") do
    split_select text
  end
  bm.report("split_delete") do
    split_delete text
  end
  bm.report("use_scan") do
    use_scan text
  end
end
Rehearsal ------------------------------------------------
split_select 6.844000 0.094000 6.938000 ( 7.063000)
split_delete 7.687000 0.109000 7.796000 ( 7.953000)
use_scan 20.063000 0.203000 20.266000 ( 20.109000)
-------------------------------------- total: 35.000000sec

user system total real
split_select 6.344000 0.031000 6.375000 ( 6.485000)
split_delete 6.500000 0.109000 6.609000 ( 6.359000)
use_scan 16.625000 0.265000 16.890000 ( 16.906000)

HTH
Robert
 
Gregory Brown

Yes but these two passes are quite fast, see below.
Well trying to be useful I checked for some larger texts, I omitted
the conditional #first at the end of the parsing method for clarity.

Turns out that split is still much better than scanning for a string
of a size over one megabyte, but it is not
really fast either:
539/39 > cat split.rb && ruby split.rb

Interesting that it's faster to do a double pass with split than a
single pass with StringScanner.
I didn't implement it that way for efficiency, just out of total
forgetfulness on how to get split() to work the way I want :)

-greg
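
Note that the benchmark earlier in the thread times String#scan, not the StringScanner loop from the original post. A rough sketch for timing the StringScanner version against split is below. It assumes a much smaller N than the 400_000 used in the thread, the method names are made up for the example, and no particular result is claimed.

```ruby
require "benchmark"
require "strscan"

TAG = %r{</?[ib]>}

# The StringScanner loop from the original post, without the final
# single-element check.
def with_scanner(text)
  sc = StringScanner.new(text)
  output = []
  last_pos = 0
  while sc.scan_until(TAG)
    pre = sc.pre_match[last_pos..-1]
    output << pre unless pre.empty?
    output << sc.matched
    last_pos = sc.pos
  end
  output << sc.rest if sc.rest?
  output
end

# The capturing-split version, also without the single-element check.
def with_split(text)
  text.split(%r{(</?[ib]>)}).reject(&:empty?)
end

N = 2_000 # assumed size for a quick run
text = "<b>bold<i>italicbold<i>bold<b>normal" * N

scanner_result = split_result = nil
scanner_time = Benchmark.realtime { scanner_result = with_scanner(text) }
split_time   = Benchmark.realtime { split_result = with_split(text) }

puts format("StringScanner: %.4fs", scanner_time)
puts format("split:         %.4fs", split_time)
puts "same output: #{scanner_result == split_result}"
```

Timings will depend on the Ruby version and on N, so the numbers are only meaningful when measured on the machine in question.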
 
Robert Dober

Interesting that it's faster to do a double pass with split than a
single pass with StringScanner.
I didn't implement it that way for efficiency, just out of total
forgetfulness on how to get split() to work the way I want :)
Anyway for small strings even scan will take only one split second,
sorry could not resist the pun ;).
Robert
 
