Converting a string to an array of tokens


John W. Long

Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

___________________
John Long
www.wiseheartdesign.com
 

Hal Fulton

John said:
Is there a fast way to convert a string into a list of tokens?

Something like:

"a= c+ a".tokenize(' ', '=', '+') #=> ['a', '=', ' ', 'c', '+', 'a']

irb comes with its own lexer. I've used that before.
(If it's Ruby you're tokenizing.)


Hal
 

John W. Long

Hal said:
irb comes with its own lexer. I've used that before.
(If it's Ruby you're tokenizing.)

Actually no, I'm not using it to parse Ruby. I'm just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.
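(An aside, not from the original post: String#split already gets close to this
when the pattern contains a capturing group, because split keeps the captured
separators in the result; only the empty fields between adjacent tokens need
filtering.)

"a= c+ a".split(/(=|\+|\s)/).reject { |s| s.empty? }
# => ["a", "=", " ", "c", "+", " ", "a"]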
___________________
John Long
www.wiseheartdesign.com
 

John W. Long

This is quick and dirty, but it demonstrates what I am looking for:

<code>
tokens = %w{ <html> <head> <title> <body> <p> </title> </head> </p> </body>
</html> }

string = <<-HERE
<html>
<head>
<title>test</title>
</head>
<body>
<p>wow it worked</p>
</body>
</html>
HERE

class String
  def tokenize(tokens)
    tokens.each { |t| self.gsub!(/#{t}/, "<<token>>#{t}<<token>>") }
    self.gsub!(/\A<<token>>/, '')
    split('<<token>>')
  end
end

p string.tokenize(tokens)
</code>

This should output:

["<html>", "\n ", "<head>", "\n ", "<title>", "test", "</title>", "\n ",
"</head>", "\n ", "<body>", "\n ", "<p>", "wow it worked", "</p>", "\n ",
"</body>", "\n", "</html>", "\n"]

Is there a better way?

___________________
John Long
www.wiseheartdesign.com
 

Dan Doel

I believe this also works:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map do |t|
      Regexp.escape(t)
    end.join("|"))

    do_tokenize(regex).delete_if { |str| str == "" }
  end

  def do_tokenize(regex)
    match = regex.match self

    if match
      [match.pre_match, match[0]] +
        match.post_match.do_tokenize(regex)
    else
      []
    end
  end
end

The recursion could cause problems if you have really long strings, in which
case it'd probably be wise to rewrite it as a loop (which is arguably somewhat
uglier). You might also want to make #do_tokenize private. I don't know if
this is the best way, but it's a way.

- Dan
 

Robert Klemme

John W. Long said:
Actually no, I'm not using it to parse Ruby. I'm just looking for something
general that works like String#scan, but leaves the tokens in place in the
array. This would be useful for parsing almost any language.

I'm not sure that I understand what you mean by "leaves the tokens in place
in the array". String#scan is usually the method of choice:

irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

Regards

robert
 

Joey Gibson


irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

--
Never trust a girl with your mother's cow,
never let your trousers go falling down in the green grass...

http://www.joeygibson.com
http://www.joeygibson.com/blog/life/Wisdom.html


 

Robert Klemme

Joey Gibson said:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What are you missing?

What about this? It produces what the original poster asked for in his
original message.

"a= c+ a".split //
=> ["a", "=", " ", "c", "+", " ", "a"]

It's bad because it splits at every character, not at tokens:

irb(main):001:0> "foo bar".split //
=> ["f", "o", "o", " ", "b", "a", "r"]

Definitely not a solution for the OP.

robert
 

nobu.nokada

Hi,

At Sun, 11 Jan 2004 23:06:39 +0900,
Robert said:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]
 

Robert Klemme

Hi,

At Sun, 11 Jan 2004 23:06:39 +0900,
Robert said:
irb(main):002:0> "a= c+ a".scan( /\w+|=|\+/ )
=> ["a", "=", "c", "+", "a"]
irb(main):003:0> "a= c+ a".scan( /\w+|=|\+|\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

What about this?

"a= c+ a".split(/\b/) # => ["a", "= ", "c", "+ ", "a"]

I guess the OP won't like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into String:

irb(main):089:0> def tokenizer(*tokens)
irb(main):090:1> Regexp.new( tokens.map{|tk| tk.kind_of?( Regexp ) ? tk :
Regexp.escape(tk)}.join('|') )
irb(main):091:1> end
=> nil
irb(main):092:0> def tokenize(str, *tokens)
irb(main):093:1> str.scan tokenizer(*tokens)
irb(main):094:1> end
=> nil
irb(main):095:0> tokenize( "a= c+ a", "=", "+", /\w+/, /\s+/ )
=> ["a", "=", " ", "c", "+", " ", "a"]

That way one can reuse the tokenizer regexp for multiple passes.
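(A hypothetical usage sketch of the helpers above, with made-up sample
strings, illustrating how the compiled regexp can be reused across inputs:)

re = tokenizer("=", "+", /\w+/, /\s+/)
["a= c+ a", "b= a+ 1"].map { |s| s.scan(re) }
# => [["a", "=", " ", "c", "+", " ", "a"],
#     ["b", "=", " ", "a", "+", " ", "1"]]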

Regards

robert
 

John W. Long

Robert said:
I guess the OP won't like it because the whitespace is not separated
properly. And he wanted to be able to provide the tokens conveniently.
Another solution, which can of course be tweaked and integrated into String:
...

Nice solution, but my idea of specifying tokens is to have the string split
before and after each token. I don't want to have to specify everything I'm
looking for, just the significant tokens. Sometimes whitespace is
significant:

'a = " this is a string "'.tokenize('"', "=")

should produce:

["a", " ", "=", " ", "\"", " this is a string ", "\""]

___________________
John Long
www.wiseheartdesign.com
 

John W. Long

Dan Doel said:
I believe this also works:
..snip!..

This is almost exactly what I was looking for.
The recursion could cause problems if you have really
long strings, in which case it'd probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end

  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added to
the language. It seems like it would be nice to have on the String class to
begin with, and written in C for speed.

___________________
John Long
www.wiseheartdesign.com
 

Robert Klemme

John W. Long said:
Dan Doel said:
I believe this also works:
..snip!..

This is almost exactly what I was looking for.
The recursion could cause problems if you have really
long strings, in which case it'd probably be wise to
rewrite it as a loop (which is arguably somewhat
uglier).

Depends on what you mean by ugly:

class String
  def tokenize(*tokens)
    regex = Regexp.new(tokens.map { |t| Regexp::escape(t) }.join("|"))
    string = self.dup
    array = []
    while match = regex.match(string)
      array += [match.pre_match, match[0]]
      string = match.post_match
    end
    array += [string]
    array.delete_if { |str| str == "" }
  end

  def each_token(*tokens, &b)
    tokenize(*tokens).each { |t| b.call(t) }
  end
end

Very nice. If only it would work with regular expressions as well.

I wonder what the odds are of getting this or something like this added to
the language. It seems like it would be nice to have on the String class to
begin with, and written in C for speed.

Oh, we can still tweak the solution provided:

- Use Array#push or Array#<< instead of "+=", which creates too many
temporary instances.

- Implement the iteration in each_token and make tokenize depend on that,
so that tokenizing large strings via each_token is more efficient because
no array is needed then.

- Don't add empty strings to the array.

- No need to dup.

That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t|
      t.kind_of?(Regexp) ? t : Regexp::escape(t)
    }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end
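(An editorial check, not part of the original message, showing the version
above with a mix of literal and regexp tokens:)

"a= c+ a".tokenize("=", "+", /\w+/, /\s+/)
# => ["a", "=", " ", "c", "+", " ", "a"]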

Kind regards

robert
 

Sabby and Tabby

John W. Long said:
Nice solution, but my idea of specifying your tokens is to have it separate
before and after each token. I don't want to have to specify everything I'm
looking for. Just the significant tokens. Sometimes whitespace is
significant:

'a = " this is a string "'.tokenize('"', "=")

should produce:

["a", " ", "=", " ", "\"", " this is a string ", "\""]

class String
  def tokenize(*tokens)
    tokens.map! { |t|
      case t
      when Regexp
        s = t.to_s
        s.gsub!(/\\./m, '')        # Remove escaped characters such as \(
        s.gsub!(/\(\?/, '')        # Remove (?
        s =~ /\(/ ? t : /(#{t})/   # Add capturing () if none.
      when Symbol
        /(\s*)(#{t})(\s*)/         # Significant whitespace.
      else
        /(#{t})/
      end
    }
    split(/#{tokens * '|'}/).reject { |x| x.empty? }
  end
end

str = 'a = " this is a string "'

p str.tokenize('"', '=') # Whitespace not a token.
p str.tokenize('"', /(\s*)(=)(\s*)/) # Significant whitespace.
p str.tokenize('"', :'=') # Same but less typing.
 

John W. Long

Robert Klemme said:
That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t|
      t.kind_of?(Regexp) ? t : Regexp::escape(t)
    }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe even
do a little ruby golf with it. If I have time this evening I may work on
creating a demanding timed test case for it.
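(A minimal sketch of such a timed test, not from the original post; it uses
the standard Benchmark library, assumes the String#tokenize/#each_token from
the previous message are loaded, and the sample data and iteration counts are
made up for illustration:)

require 'benchmark'

str = ("<p>" + "word " * 50 + "</p>\n") * 500

Benchmark.bm(12) do |bm|
  bm.report("tokenize")   { 50.times { str.tokenize("<p>", "</p>") } }
  bm.report("each_token") { 50.times { str.each_token("<p>", "</p>") { |tk| } } }
end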

___________________
John Long
www.wiseheartdesign.com
 

Robert Klemme

John W. Long said:
Robert Klemme said:
That's what I'd do:

class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t|
      t.kind_of?(Regexp) ? t : Regexp::escape(t)
    }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Good ideas. It would be fun to optimize this further for speed. Maybe even
do a little ruby golf with it. If I have time this evening I may work on
creating a demanding timed test case for it.

Btw, this would be easier if Regexp#match supported additional arguments for
start and end, like

def match(str, start = 0, end = str.length)
  ...
end

and if MatchData exposed the index of the start of the match and the index of
the first element after the match. The loop above could then be written as:

regex = ...
start = 0

while (match = regex.match(string, start))
  yield self[start, match.start_index - start] if match.start_index - start > 0
  yield match[0] if match[0].length > 0
  start = match.end_index   # index of 1st element after match
end

yield self[start, self.length] if self.length - start > 0


What do others think, is this a reasonable extension?
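(An aside, not from the thread: later Ruby versions grew part of this.
Regexp#match accepts a start position as a second argument, and
MatchData#begin(0) / MatchData#end(0) expose the offsets of the match, so a
sketch of the loop along those lines would be:)

# Assumes Regexp#match(str, pos) and MatchData#begin/#end (later Ruby
# versions), and that no token pattern matches the empty string.
start = 0
while (match = regex.match(string, start))
  yield string[start, match.begin(0) - start] if match.begin(0) > start
  yield match[0] unless match[0].empty?
  start = match.end(0)
end
yield string[start..-1] if start < string.length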

Kind regards

robert
 

Minero Aoki

In mail "Re: Converting a string to an array of tokens"
Robert Klemme said:
class String
  def tokenize(*tokens)
    array = []
    each_token(*tokens) { |tk| array << tk }
    array
  end

  def each_token(*tokens)
    regex = Regexp.new(tokens.map { |t| t.kind_of?(Regexp) ? t : Regexp::escape(t) }.join("|"))
    string = self

    while (match = regex.match(string))
      yield match.pre_match if match.pre_match.length > 0
      yield match[0] if match[0].length > 0

      string = match.post_match
    end

    yield string if string.length > 0
    self
  end
end

Using ruby 1.8.1:

~ % cat t
require 'strscan'
require 'enumerator'

class String
def tokenize(*patterns)
enum_for:)each_token, *patterns).map
end

def each_token(*patterns)
re = Regexp.union(*patterns)
s = StringScanner.new(self)
until s.eos?
break unless s.skip_until(re)
yield s[0]
end
end
end

p "def m(a) 1 + a end".tokenize('def', 'end', /[a-z_]\w*/i, /\d+/, /\S/)

~ % ruby -v t
ruby 1.9.0 (2004-01-12) [i686-linux]
["def", "m", "(", "a", ")", "1", "+", "a", "end"]


-- Minero Aoki
 
