Don Wood
I have a large file that I need to tokenize. The method I am using now
is fast, but eats up a ton of memory by reading in the entire file first
as a String. I would also like to reuse existing tokens for duplicates.
(I have no control over the file format, but this Regex works well for
what I need.)
Here is what I am doing today.
tokens = File.read(filename).scan(/'[^']*'|"[^"]*"|[()]|[^()\s]+/)
And here is what I would like to do.
tokens = []
File.open(filename) do |fh|
  fh.scan(/'[^']*'|"[^"]*"|[()]|[^()\s]+/) do |token|
    # Reuse the existing String when this token has been seen before.
    tokens << ((i = tokens.index(token)) ? tokens[i] : token)
  end
end
So what I would like to have is a scan method for File objects that
yields the tokens when called with a block, instead of returning an
array. (It would be nice if String#scan could do this as well.) This
isn’t a big issue, it just causes my machine to overflow to the swap
file periodically. I could easily fix that with a couple DIMMs, but I
can’t help thinking that there should be a better way.
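
For illustration, here is a minimal sketch of one way to approximate this without loading the whole file. It relies on String#scan, which already yields each match when given a block, and applies it one line at a time. The names each_token and tokenize are hypothetical helpers, not part of Ruby's File API, and the sketch assumes that no token (for example a quoted string) spans a line break. A Hash stands in for the tokens.index lookup so that reusing duplicates does not require a linear search per token.

```ruby
# Token pattern from the post: quoted strings, single parentheses,
# or runs of characters that are neither whitespace nor parentheses.
TOKEN_RE = /'[^']*'|"[^"]*"|[()]|[^()\s]+/

# Hypothetical helper (not a built-in File method): yields each token
# found in io, reading one line at a time instead of the whole file.
# Assumption: no token spans a line break.
def each_token(io)
  return enum_for(:each_token, io) unless block_given?

  io.each_line do |line|
    # String#scan yields each match when called with a block.
    line.scan(TOKEN_RE) { |token| yield token }
  end
end

# Collects all tokens, reusing the first String object seen for each
# distinct token text. The Hash lookup replaces Array#index.
def tokenize(io)
  seen = {}
  tokens = []
  each_token(io) { |token| tokens << (seen[token] ||= token) }
  tokens
end
```

With a real file this could be called as File.open(filename) { |fh| tokenize(fh) }. Only one line is held as a temporary String at a time, although the resulting tokens array still grows with the number of tokens in the file.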