Suggestions for improving a trivial tag parser

Gregory Brown · Jul 30, 2008

Hi folks,

I need to take strings that have embedded <b> and <i> tags and split
them out into an array.
Here are some examples:

# No need to ensure valid pairs, just need to break out tags from the text
#----------------------------------------------------------------------------------------------

describe "Inline style parsing" do
  it "should return an identical string if inline styles are not detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello World").should == "Hello World"
  end

  it "should return an array of segments when a style is detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello <i>Fine</i> World").should ==
      ["Hello ", "<i>", "Fine", "</i>", " World"]
  end

  it "should create an array of segments when multiple styles are detected" do
    create_pdf
    @pdf.parse_inline_styles("Hello <i>Fine <b>World</b></i>").should ==
      ["Hello ", "<i>", "Fine ", "<b>", "World", "</b>", "</i>"]
  end
end

###

I fear my implementation (below) is showing one of my weakest areas in
Ruby, and probably even has some unforeseen problems. I'm sure it can
be done more accurately in less code. Any kind RubyTalk folks want to
school me?

###

def parse_inline_styles(text) #:nodoc:
  require "strscan"

  sc = StringScanner.new(text)
  output = []
  last_pos = 0

  loop do
    if sc.scan_until(/<\/?[ib]>/)
      pre = sc.pre_match[last_pos..-1]
      output << pre unless pre.empty?
      output << sc.matched
      last_pos = sc.pos
    else
      output << sc.rest if sc.rest?
      break output
    end
  end

  output.length == 1 ? output.first : output
end

###

Thanks,
-greg

Robert Dober · Jul 30, 2008

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

HTH
Robert

Gregory Brown · Jul 30, 2008

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

I needed something a little more restricted than that, but you gave me
almost exactly what I need:

def parse_inline_styles(text) #:nodoc:
segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
segments.size == 1 ? segments.first : segments
end

This passes my specs, and so long as people don't see any major issues
with it, it looks great.

I knew there had to be a way to do this with split. Thanks Robert.

-greg

Gregory Brown · Jul 30, 2008

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

I needed something a little more restricted than that, but you gave me
almost exactly what I need:

def parse_inline_styles(text) #:nodoc:
segments = text.split( %r{(</?[ib].*?>)} ).delete_if{|x| x.empty? }
segments.size == 1 ? segments.first : segments
end

Whoops, make that:

def parse_inline_styles(text) #:nodoc:
segments = text.split( %r{(</?[ib]>)} ).delete_if{|x| x.empty? }
segments.size == 1 ? segments.first : segments
end

*Slaps head*. I totally get what I was missing out on before, when
you use groupings, split includes the matched segments:

"kitten robot snake robot tree robot".split(/(robot)/)
=> ["kitten ", "robot", " snake ", "robot", " tree ", "robot"]

"kitten robot snake robot tree robot".split(/robot/)
=> ["kitten ", " snake ", " tree "]
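As a rough illustration of the same behavior applied to the tag case, here is a minimal sketch. The `TAG` constant name and the sample strings are assumptions made for the example, not something taken from the posts above; the point is only that the capturing group keeps each matched tag as its own element.

```ruby
# Sketch only: TAG and the sample strings are illustrative assumptions.
TAG = %r{(</?[ib]>)}

def parse_inline_styles(text)
  # Because TAG contains a capturing group, split keeps each matched tag
  # in the result instead of discarding it.
  segments = text.split(TAG).reject { |segment| segment.empty? }
  # A string without any tags comes back unchanged rather than wrapped in an array.
  segments.size == 1 ? segments.first : segments
end

puts parse_inline_styles("Hello World").inspect
puts parse_inline_styles("Hello <i>Fine</i> World").inspect
```

Adjacent tags such as "</b></i>" produce an empty string between them, which is why the empty segments are filtered out after the split.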

Rolando Abarca · Jul 30, 2008

What about

eles = split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
eles.size.one? ? eles.first : eles

I think your regexp is wrong, since it (incorrectly) parses empty tags:

"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?.*?>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine", "<>", " ", "<b>", "World", "</b>", "</i>"]

I would try something like:

"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[bi]>)} ).delete_if{|x| x.empty? }
=> ["Hello ", "<i>", "Fine<> ", "<b>", "World", "</b>", "</i>"]

Or:

"Hello <i>Fine<> <b>World</b></i>".split( %r{(</?[^>]+>)} ).delete_if{|x| x.empty? }

Now, I would bet that this might be a little too expensive with large
strings.
regards,
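The difference between the three patterns discussed in this post can be sketched side by side. The hash keys and the sample input below are assumptions chosen for illustration; only the regular expressions themselves come from the thread.

```ruby
# Sketch only: the pattern names and the sample input are illustrative assumptions.
PATTERNS = {
  any_tag:   %r{(</?.*?>)},    # lazy match, so it also accepts the empty "<>"
  bold_ital: %r{(</?[bi]>)},   # only <b>, </b>, <i> and </i>
  non_empty: %r{(</?[^>]+>)}   # any tag with at least one character inside
}

sample = "Hello <i>Fine<> <b>World</b></i>"

PATTERNS.each do |name, pattern|
  segments = sample.split(pattern).reject(&:empty?)
  puts "#{name}: #{segments.inspect}"
end
```

On this input the last two patterns should agree, since "<>" has nothing between the brackets. They would differ on something like "<em>", which only the third pattern treats as a tag.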

ara.t.howard · Jul 30, 2008

my take:

cfp:~ > cat a.rb
require 'yaml'

strings =
  "Hello World",
  "Hello <i>Fine</i> World",
  "Hello <i>Fine <b>World</b></i>"

def parse_inline_styles string, tags = %w( <i> </i> <b> </b> )
  re = Regexp.new tags.flatten.map{|tag| "(#{ Regexp.escape tag })"}.join('|')
  tokens = string.split(re)
  tokens.delete_if{|token| token.empty?}
  ((tokens.size == 1 and tokens.first == string) ? string : tokens)
end

strings.each do |string|
  y string => parse_inline_styles(string)
end

cfp:~ > ruby a.rb
---
Hello World: Hello World
---
Hello <i>Fine</i> World:
- "Hello "
- <i>
- Fine
- </i>
- " World"
---
Hello <i>Fine <b>World</b></i>:
- "Hello "
- <i>
- "Fine "
- <b>
- World
- </b>
- </i>

a @ http://codeforpeople.com/

Robert Dober · Jul 30, 2008

I think your regexp is wrong, since it (incorrectly) parses empty tags:

It passed the specs did it not? I did not know what Gregory wanted
exactly, turns out he wanted %r{(</?[ib]>)}
but he got the message.

Now, I would bet that this might be a little too expensive with large
strings.

Hmm why?
It would be different if we were treating files, but as the string is
already here we use the memory required by the
specification and nothing more.

Robert

Gregory Brown · Jul 30, 2008

I think your regexp is wrong, since it (incorrectly) parses empty tags:

It passed the specs did it not? I did not know what Gregory wanted
exactly, turns out he wanted %r{(</?[ib]>)}
but he got the message.

My implementation was tighter than my specs, but I added an extra one
to catch this.

It would be different if we were treating files, but as the string is
already here we use the memory required by the
specification and nothing more.

It does a double pass through the segments rather than a single pass,
and I guess that if I had a giant string with a ton of tags I needed
to parse, that'd make it less efficient.

However, I think it'll be okay for my purposes (PDF inline styling),
unless I missed some other concern Rolando had.
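For comparison, the second pass (dropping empty strings) can be avoided by matching tags and text runs in a single scan. This is only a sketch: `TOKEN` and `tokenize` are hypothetical names, and nothing in the thread measures whether this is faster than split, so it would need benchmarking before being preferred.

```ruby
# Sketch only: TOKEN and tokenize are hypothetical names, not from the thread.
# Each match is either a style tag or a run of characters in which no style
# tag begins, so no empty strings are produced and no cleanup pass is needed.
TOKEN = %r{</?[ib]>|(?:(?!</?[ib]>).)+}m

def tokenize(text)
  # The pattern has no capturing groups, so scan returns a flat array of strings.
  text.scan(TOKEN)
end

puts tokenize("Hello <i>Fine <b>World</b></i>").inspect
```

The negative lookahead is checked at every character of a text run, which may well cost more than the extra array pass it replaces.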

Robert Dober · Jul 30, 2008

It does a double pass through the segments rather than a single pass,

Yes but these two passes are quite fast, see below.

However, I think it'll be okay for my purposes (PDF inline styling),
unless I missed some other concern Rolando had.

Well trying to be useful I checked for some larger texts, I omitted
the conditional #first at the end of the parsing method for clarity.

Turns out that split is still much better than scanning for a string
of a size over one megabyte, but it is not really fast either:

539/39 > cat split.rb && ruby split.rb

require 'benchmark'

def split_select txt
  txt.split(%r{(</?[bi]>)}).select{|x| ! x.empty? }
end

def split_delete txt
  txt.split(%r{(</?[bi]>)}).delete_if{|x| x.empty? }
end

def use_scan txt
  r=[]
  txt.scan(%r{(.*?)(</?[ib]>)}) do | pr,po |
    r << pr unless pr.empty?
    r << po
  end
  # $' is nil when nothing matched, hence to_s
  r << $' unless $'.to_s.empty?
  r # return the array even when there is no trailing text
end

N = 400_000;
text = "bolditalicboldboldnormal" * N;

Benchmark.bmbm do | bm |
  bm.report("split_select") do
    split_select text
  end
  bm.report("split_delete") do
    split_delete text
  end
  bm.report("use_scan") do
    use_scan text
  end
end
Rehearsal ------------------------------------------------
split_select   6.844000   0.094000   6.938000 (  7.063000)
split_delete   7.687000   0.109000   7.796000 (  7.953000)
use_scan      20.063000   0.203000  20.266000 ( 20.109000)
-------------------------------------- total: 35.000000sec

                   user     system      total        real
split_select   6.344000   0.031000   6.375000 (  6.485000)
split_delete   6.500000   0.109000   6.609000 (  6.359000)
use_scan      16.625000   0.265000  16.890000 ( 16.906000)

HTH
Robert
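Before reading much into timings like these, it may be worth confirming that the candidates return the same segments. The sketch below reuses the method names from the benchmark; the `rest` variable and the sample strings are assumptions added for the example, and this `use_scan` tracks the trailing text explicitly instead of relying on `$'` after the loop.

```ruby
# Sketch only: a small agreement check, not part of the original benchmark.
def split_select(txt)
  txt.split(%r{(</?[bi]>)}).reject { |x| x.empty? }
end

def use_scan(txt)
  r = []
  rest = txt # whatever follows the last tag; the whole string if there is none
  txt.scan(%r{(.*?)(</?[ib]>)}) do |pre, tag|
    r << pre unless pre.empty?
    r << tag
    rest = $~.post_match
  end
  r << rest unless rest.empty?
  r
end

samples = ["Hello World", "Hello <i>Fine <b>World</b></i> done"]
samples.each do |s|
  puts "#{s.inspect}: #{split_select(s) == use_scan(s) ? 'same' : 'different'}"
end
```

Note that `.` does not match a newline here, so the two approaches can diverge on multi-line input unless the scan pattern is given the `m` flag.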

Gregory Brown · Jul 30, 2008

Yes but these two passes are quite fast, see below.
Well trying to be useful I checked for some larger texts, I omitted
the conditional #first at the end of the parsing method for clarity.

Turns out that split is still much better than scanning for a string
of a size over one megabyte, but it is not
really fast either:
539/39 > cat split.rb && ruby split.rb

Interesting that it's faster to do a double pass with split than a
single pass with StringScanner.
I didn't implement it that way for efficiency, just out of total
forgetfulness on how to get split() to work the way I want.

-greg

Robert Dober · Jul 31, 2008

Interesting that it's faster to do a double pass with split than a
single pass with StringScanner.
I didn't implement it that way for efficiency, just out of total
forgetfulness on how to get split() to work the way I want

Anyway, for small strings even scan will take only one split second,
sorry, could not resist the pun.
Robert
