Efficient storage of a temporary string

Randy Kramer

Background: In order to do the parsing I've talked about in another thread, in
many circumstances I need to know the number of spaces before and after the
current token. I'm trying to think about efficient ways to do that--one
might be to do a preprocess pass through the text to figure out how many
spaces separate various tokens then store the tokens and spaces between them
in a temporary in memory data structure, or I'll need a way to backtrack from
the found position of some token to find how many spaces separate it from the
previous token.

In another thread I asked about streams. In this thread I want to ask about
an efficient way to store the intermediate result if I do a preprocessing
pass.

What I envision as a result of the preprocessing pass is a new representation
of the file where all spaces or groups of spaces are replaced by a list of
"tokens" and the numbers of spaces between those tokens or between a token
and the last/next newline. For example, with the TWiki marked up text:

This is a two level bulleted list:
   * Level 1
      * Level 2

The result I'd see is something like this:

bof,0,"This is a two level bulleted list:",0,\n,3,*,1,"Level 1",0,
\n,6,*,1,"Level 2",eof

Aside: I don't necessarily have to break everything down into tokens of a
single word (I didn't in the above), but it might end up being easier.

What makes the most sense as temporary storage of that result? My guess is an
array, which will expand throughout the prescan process (unless I preallocate an
array of an appropriate size--can I do that in Ruby?), and then be destroyed
after the main processing pass. (I'll probably do the main processing pass
by essentially incrementing my way through that array.)

Is there a better approach?

(Aside: At some point I may rewrite the method to do this preprocessing pass
in C.)

Randy Kramer
 
Robert Klemme

Randy Kramer said:
What makes the most sense as temporary storage of that result? My guess is an
array, which will expand throughout the prescan process (unless I preallocate an
array of an appropriate size--can I do that in Ruby),

Yes, you can, for example with Array.new(10):
=> [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil]

But I'd do that only if array allocation / reallocation proves to be a
performance bottleneck.
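
As a rough illustration of the two options, here is a minimal sketch. The variable names prealloc and grown are made up for this example and do not come from the thread.

```ruby
# Option 1: preallocate ten slots (all nil) and fill them by index.
prealloc = Array.new(10)
prealloc[0] = "This"
prealloc[1] = 1

# Option 2: start empty and let the array grow as items are appended.
grown = []
grown << "This" << 1

# Assigning past the current end also grows the array, padding with nil.
grown[4] = "is"

p prealloc.length  # => 10
p grown            # => ["This", 1, nil, nil, "is"]
```

Since Ruby arrays resize themselves either way, the second form is usually the simpler starting point, with preallocation kept in reserve in case profiling points at array growth.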

and then be destroyed after the main processing pass. (I'll probably do the
main processing pass by essentially incrementing my way through that array.)

Is there a better approach?

(Aside: At some point I may rewrite the method to do this preprocessing pass
in C.)

Does this help?

s = <<EOF
This is a two level bulleted list:
   * Level 1
      * Level 2
EOF
a=[]; s.scan %r{"[^"]*"|\S+|\n|\s+}xo do |m| a << (/\A\s+\z/ =~ m ? m.length : m ) end
=> "This is a two level bulleted list:\n   * Level 1\n      * Level 2\n"
a
=> ["This", 1, "is", 1, "a", 1, "two", 1, "level", 1, "bulleted", 1,
"list:", 1, 3, "*", 1, "Level", 1, "1", 1, 6, "*", 1, "Level",
1, "2", 1]
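
For the main pass, one way to "increment through" such an array and read off the whitespace on either side of a token might look like the sketch below. The helper name spaces_around is hypothetical, and the sketch assumes the flat layout shown above, where strings are tokens and integers are whitespace run lengths (a newline shows up as a run of 1).

```ruby
# Flat array as produced above: Strings are tokens, Integers are
# whitespace run lengths (a newline is recorded as a run of length 1).
a = ["This", 1, "is", 1, "a", 1, "two", 1, "level", 1, "bulleted", 1,
     "list:", 1, 3, "*", 1, "Level", 1, "1", 1, 6, "*", 1, "Level", 1, "2", 1]

# Hypothetical helper: whitespace counts directly before and after the
# item at index i (0 when the neighbour is missing or is another token).
def spaces_around(a, i)
  before = i > 0 && a[i - 1].is_a?(Integer) ? a[i - 1] : 0
  after  = a[i + 1].is_a?(Integer) ? a[i + 1] : 0
  [before, after]
end

# Report the indentation of every bullet marker.
a.each_with_index do |item, i|
  next unless item == "*"
  before, after = spaces_around(a, i)
  puts "bullet at index #{i}: #{before} before, #{after} after"
end
```

Because the newline and the indentation that follows it are stored as two separate integers, the count directly before each "*" is the indentation (3 and 6 here).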

The quoting part of the regexp can be improved to accept escaped quotes
inside a string as well as single quotes, but I guess you get the picture.
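
One possible shape for that improvement is sketched below. The constant TOKEN and the method tokenize are names invented for this example, and the escape rule (a backslash protects the next character) is an assumption about what "escaped quotes" should mean here.

```ruby
# Extended (/x) regexp so each alternative can carry a comment.
# Non-capturing groups keep String#scan yielding the whole match.
TOKEN = %r{
  "(?:\\.|[^"\\])*"   # double-quoted string, backslash escapes allowed
| '(?:\\.|[^'\\])*'   # single-quoted string, same rule
| \S+                 # any other run of non-whitespace
| \n                  # a newline on its own
| \s+                 # a run of other whitespace
}x

# Hypothetical wrapper around the scan shown earlier in the thread:
# whitespace runs become their length, everything else is kept as is.
def tokenize(text)
  result = []
  text.scan(TOKEN) { |m| result << (m =~ /\A\s+\z/ ? m.length : m) }
  result
end

p tokenize(%q{say "a \"b\" c"  'd e'})
```

The quoted alternatives come first so that a quoted string containing spaces is taken as one token before the plain \S+ rule gets a chance to split it.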

Also, you can do any type of conversion on the matched string in the block
before you insert the match into the array. If you use grouping in the
regexp, then you probably can use that for discrimination of the action to
be taken.
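
A minimal sketch of that grouping idea follows. With capturing groups, String#scan passes the block one value per group, and only the group that matched is non-nil. The choice to strip the quotes, to store :nl for a newline, and to exclude newlines from the whitespace group are all assumptions made for this example.

```ruby
s = "This is a list:\n   * Level 1\n"
a = []

# Groups: 1 = quoted string, 2 = other token, 3 = newline,
# 4 = whitespace other than newline.
s.scan(/("[^"]*")|(\S+)|(\n)|([^\S\n]+)/) do |quoted, word, newline, spaces|
  if quoted
    a << quoted[1..-2]   # drop the surrounding quotes
  elsif word
    a << word
  elsif newline
    a << :nl             # mark line breaks explicitly
  else
    a << spaces.length   # store only the length of the run
  end
end

p a
# => ["This", 1, "is", 1, "a", 1, "list:", :nl, 3, "*", 1, "Level", 1, "1", :nl]
```

Keeping the newline in its own group means a line break is no longer indistinguishable from a single space in the resulting array.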

Kind regards

robert
 
