Converting a string to an array of tokens


John W. Long

Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

___________________
John Long
www.wiseheartdesign.com
 

Hal Fulton

John said:
Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

irb comes with its own lexer. I've used that before.
(If it's Ruby you're tokenizing.)


Hal
 

John W. Long

Hal said:
irb comes with its own lexer. I've used that before.
(If it's Ruby you're tokenizing.)

Actually no, I'm not using it to parse Ruby. I'm just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.
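(An aside, not from the original post: String#split already gets close to this
when the pattern contains a capturing group, because split keeps the captured
separators in the result; only the empty fields between adjacent tokens need
filtering.)

"a= c+ a".split(/(=|\+|\s)/).reject { |s| s.empty? }
# => ["a", "=", " ", "c", "+", " ", "a"]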
___________________
John Long
www.wiseheartdesign.com
 

John W. Long

This is quick and dirty, but it demonstrates what I am looking for:

<code>
tokens = %w{ <html> <head> <title> <body> <p> </title> </head> </p> </body>
</html> }

string = <<-HERE
<html>
<head>
<title>test</title>
</head>
<body>
<p>wow it worked</p>
</body>
</html>
HERE

class String
  def tokenize(tokens)
    tokens.each { |t| self.gsub!(/#{t}/, "<<token>>#{t}<<token>>") }
    self.gsub!(/\A<<token>>/, '')
    split('<<token>>')
  end
end

p string.tokenize(tokens)
</code>

This should output:

["<html>", "\n ", "<head>", "\n ", "<title>", "test", "</title>", "\n ",
"</head>", "\n ", "<body>", "\n ", "<p>", "wow it worked", "</p>", "\n ",
"</body>", "\n", "</html>", "\n"]

Is there a better way?

___________________
John Long
www.wiseheartdesign.com
 

Dan Doel

I believe this also works:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map do |t|
      Regexp.escape(t)
    end.join("|"))

    do_tokenize(regex).delete_if { |str| str == "" }
  end

  def do_tokenize(regex)
    match = regex.match self

    if match
      [match.pre_match, match[0]] +
        match.post_match.do_tokenize(regex)
    else
      []
    end
  end
end

The recursion could cause problems if you have really long strings, in which
case it'd probably be wise to rewrite it as a loop (which is arguably somewhat
uglier). You might also want to make #do_tokenize private. I don't know if
this is the best way, but it's a way.

- Dan
 

Robert Klemme

John W. Long said:
Actually no, I'm not using it to parse Ruby. I'm just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.

I'm not sure that I understand what you mean by "leaves the tokens in place
in the array". String#scan is usually the method of choice:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

Regards

robert
 

Joey Gibson


irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

--
Never trust a girl with your mother's cow,
never let your trousers go falling down in the green grass...

http://www.joeygibson.com
http://www.joeygibson.com/blog/life/Wisdom.html


 

Robert Klemme

Joey Gibson said:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

It's bad because it splits at every character, not at tokens:

irb(main):001:0> "foo bar".split //
=> ["f", "o", "o", " ", "b", "a", "r"]

Definitely not a solution for the OP.

robert
 

nobu.nokada

Hi,

At Sun, 11 Jan 2004 23:06:39 +0900,
Robert said:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]
 

Robert Klemme

Hi,

At Sun, 11 Jan 2004 23:06:39 +0900,
Robert said:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]

I guess the OP won't like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into String:

irb(main):089:0> def tokenizer(*tokens)
irb(main):090:1> Regexp.new( tokens.map{|tk| tk.kind_of?( Regexp ) ? tk :
Regexp.escape(tk)}.join('|') )
irb(main):091:1> end
=> nil
irb(main):092:0> def tokenize(str, *tokens)
irb(main):093:1> str.scan tokenizer(*tokens)
irb(main):094:1> end
=> nil
irb(main):095:0> tokenize( "a= c+ a", "=", "+", /\w+/, /\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

That way one can reuse the tokenizer regexp for multiple passes.
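(A hypothetical usage sketch of the helpers above, with made-up sample
strings, illustrating how the compiled regexp can be reused across inputs:)

re = tokenizer("=", "+", /\w+/, /\s+/)
["a= c+ a", "b= a+ 1"].map { |s| s.scan(re) }
# => [["a", "=", " ", "c", "+", " ", "a"],
#     ["b", "=", " ", "a", "+", " ", "1"]]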

Regards

robert
 

John W. Long

Robert said:
I guess the OP won't like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into String:
...

Nice solution, but my idea of specifying tokens is to have the string split
before and after each token. I don't want to have to specify everything I'm
looking for, just the significant tokens. Sometimes whitespace is
significant:

'a = " this is a string "'.tokenize('"', "=")

should produce:

["a", " ", "=", " ", "\"", " this is a string ", "\""]

___________________
John Long
www.wiseheartdesign.com
 

John W. Long

Dan Doel said:
I believe this also works:
..snip!..

This is almost exactly what I was looking for.
The recursion could cause problems if you have really
long strings, in which case it'd probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end

  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added to
the language. It seems like it would be nice to have on the String class to
begin with, and written in C for speed.

___________________
John Long
www.wiseheartdesign.com
 

Robert Klemme

John W. Long said:
Dan Doel said:
I believe this also works:
..snip!..

This is almost exactly what I was looking for.
The recursion could cause problems if you have really
long strings, in which case it'd probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end

  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added to
the language. It seems like it would be nice to have on the String class to
begin with, and written in C for speed.

Oh, we can still tweak the solution provided:

- Use Array#push or Array#<< instead of "+=", which creates too many
temporary instances.

- Implement the iteration in each_token and make tokenize depend on that,
so that tokenizing large strings via each_token is more efficient because
no array is needed then.

- Don't add empty strings to the array.

- No need to dup.

That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t|
      t.kind_of?(Regexp) ? t : Regexp::escape(t)
    }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end
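(An editorial check, not part of the original message, showing the version
above with a mix of literal and regexp tokens:)

"a= c+ a".tokenize("=", "+", /\w+/, /\s+/)
# => ["a", "=", " ", "c", "+", " ", "a"]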

Kind regards

robert
 

Sabby and Tabby

John W. Long said:
Nice solution, but my idea of specifying your tokens is to have it separate
before and after each token. I don't want to have to specify everything I'm
looking for. Just the significant tokens. Sometimes whitespace is
significant:

'a = " this is a string "'.tokenize('"', "=")

should produce:

["a", " ", "=", " ", "\"", " this is a string ", "\""]

class String
  def tokenize(*tokens)
    tokens.map! { |t|
      case t
      when Regexp
        s = t.to_s
        s.gsub!(/\\./m, '')        # Remove escaped characters such as \(
        s.gsub!(/\(\?/, '')        # Remove (?
        s =~ /\(/ ? t : /(#{t})/   # Add capturing () if none.
      when Symbol
        /(\s*)(#{t})(\s*)/         # Significant whitespace.
      else
        /(#{t})/
      end
    }
    split(/#{tokens * '|'}/).reject { |x| x.empty? }
  end
end

str = 'a = " this is a string "'

p str.tokenize('"', '=') # Whitespace not a token.
p str.tokenize('"', /(\s*)(=)(\s*)/) # Significant whitespace.
p str.tokenize('"', :'=') # Same but less typing.
 

John W. Long

Robert Klemme said:
That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t|
      t.kind_of?(Regexp) ? t : Regexp::escape(t)
    }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe even
do a little ruby golf with it. If I have time this evening I may work on
creating a demanding timed test case for it.
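(A minimal sketch of such a timed test, not from the original post; it uses
the standard Benchmark library, assumes the String#tokenize/#each_token from
the previous message are loaded, and the sample data and iteration counts are
made up for illustration:)

require 'benchmark'

str = ("<p>" + "word " * 50 + "</p>\n") * 500

Benchmark.bm(12) do |bm|
  bm.report("tokenize")   { 50.times { str.tokenize("<p>", "</p>") } }
  bm.report("each_token") { 50.times { str.each_token("<p>", "</p>") { |tk| } } }
end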

___________________
John Long
www.wiseheartdesign.com
 

Robert Klemme

John W. Long said:
Robert Klemme said:
That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t|
      t.kind_of?(Regexp) ? t : Regexp::escape(t)
    }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe even
do a little ruby golf with it. If I have time this evening I may work on
creating a demanding timed test case for it.

Btw, this would be easier if Regexp#match supported additional arguments for
start and end, like

def match(str, start = 0, end = str.length)
  ...
end

and if MatchData exposed the index of the start of the match and the index of
the first element after the match. The loop above could then be written as:

regex = ...
start = 0

while (match = regex.match(string, start))
  yield self[start, match.start_index - start] if match.start_index - start > 0
  yield match[0] if match[0].length > 0
  start = match.end_index   # index of 1st element after match
end

yield self[start, self.length] if self.length - start > 0


What do others think, is this a reasonable extension?
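(An aside, not from the thread: later Ruby versions grew part of this.
Regexp#match accepts a start position as a second argument, and
MatchData#begin(0) / MatchData#end(0) expose the offsets of the match, so a
sketch of the loop along those lines would be:)

# Assumes Regexp#match(str, pos) and MatchData#begin/#end (later Ruby
# versions), and that no token pattern matches the empty string.
start = 0
while (match = regex.match(string, start))
  yield string[start, match.begin(0) - start] if match.begin(0) > start
  yield match[0] unless match[0].empty?
  start = match.end(0)
end
yield string[start..-1] if start < string.length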

Kind regards

robert
 

Minero Aoki

In mail "Re: Converting a string to an array of tokens"
Robert Klemme said:
class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?(Regexp) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Using ruby 1.8.1:

~ % cat t
require 'strscan'
require 'enumerator'

class String
def tokenize(*patterns)
enum_for:)each_token, *patterns).map
end

def each_token(*patterns)
re = Regexp.union(*patterns)
s = StringScanner.new(self)
until s.eos?
break unless s.skip_until(re)
yield s[0]
end
end
end

p "def m(a) 1 + a end".tokenize('def', 'end', /[a-z_]\w*/i, /\d+/, /\S/)

~ % ruby -v t
ruby 1.9.0 (2004-01-12) [i686-linux]
["def", "m", "(", "a", ")", "1", "+", "a", "end"]


-- Minero Aoki
 
