Randy Kramer
Background: To do the parsing I've talked about in another thread, in
many circumstances I need to know the number of spaces before and after the
current token. I'm trying to think of efficient ways to do that. One
might be to do a preprocessing pass through the text to figure out how many
spaces separate the various tokens, then store the tokens and the spaces
between them in a temporary in-memory data structure. Otherwise I'll need a
way to backtrack from the found position of some token to find how many
spaces separate it from the previous token.
In another thread I asked about streams. In this thread I want to ask about
an efficient way to store the intermediate result if I do a preprocessing
pass.
What I envision as a result of the preprocessing pass is a new representation
of the file where all spaces or groups of spaces are replaced by a list of
"tokens" and the numbers of spaces between those tokens or between a token
and the last/next newline. For example, with the TWiki marked-up text:
This is a two level bulleted list:
   * Level 1
      * Level 2
The result I'd see is something like this:
bof,0,"This is a two level bulleted list:",0,\n,3,*,1,"Level 1",0,
\n,6,*,1,"Level 2",eof
Aside: I don't necessarily have to break everything down into tokens of a
single word (I didn't in the above), but it might end up being easier.
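To make the idea concrete, here is a minimal sketch of such a prescan in Ruby. It is only an illustration under assumptions: the method name `prescan` and the `:bof`/`:newline`/`:eof` markers are hypothetical, it splits everything into single-word tokens (unlike the hand-written result above), and it counts only plain spaces, not tabs.

```ruby
# Hypothetical helper: turn text into a flat array in which each token is
# followed by the number of spaces after it, and each line starts with the
# number of leading spaces.
def prescan(text)
  result = [:bof]
  text.each_line do |line|
    has_newline = line.end_with?("\n")
    body = line.chomp

    # Count the spaces at the start of the line (the regexp always matches,
    # possibly with an empty string).
    result << body[/\A */].length

    # Each match is a run of non-whitespace followed by zero or more spaces.
    body.scan(/(\S+)( *)/) do |word, spaces|
      result << word << spaces.length
    end

    result << :newline if has_newline
  end
  result << :eof
end

sample = "A list:\n   * Level 1\n      * Level 2"
tokens = prescan(sample)
```

With this layout the indentation of a bullet is simply the number that follows a `:newline` marker, and the spacing around any word is the number on either side of it.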
What makes the most sense as temporary storage of that result? My guess is an
array, which will expand throughout the prescan process (unless I preallocate
an array of an appropriate size--can I do that in Ruby?), and then be
destroyed after the main processing pass. (I'll probably do the main
processing pass by essentially incrementing my way through that array.)
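A rough sketch of what I have in mind, under assumptions: the sample data and the method name `spaces_around` are hypothetical. As far as I can tell, `Array.new(n)` creates an array of `n` `nil` slots, which is the closest thing to preallocation, while a plain `[]` simply grows as items are appended with `<<`.

```ruby
# Hypothetical flat prescan result: tokens alternate with space counts.
entries = [:bof, 0, "*", 1, "Level", 1, "1", 0, :eof]

# Option 1: start empty and let the array grow as items are appended.
grown = []
entries.each { |e| grown << e }

# Option 2: preallocate nil-filled slots and fill them by index.
prealloc = Array.new(entries.length)
entries.each_with_index { |e, i| prealloc[i] = e }

# Main pass: step through by index so the neighbouring space counts
# are reachable without backtracking through the original text.
def spaces_around(entries)
  found = []
  i = 0
  while i < entries.length
    if entries[i].is_a?(String)
      # [spaces before, token, spaces after]
      found << [entries[i - 1], entries[i], entries[i + 1]]
    end
    i += 1
  end
  found
end

around = spaces_around(prealloc)
```

Letting the array drop out of scope after the main pass should be enough for the garbage collector to reclaim it.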
Is there a better approach?
(Aside: At some point I may rewrite the method to do this preprocessing pass
in C.)
Randy Kramer