dynamic file tokenization...

  • Thread starter Chris M. Thomasson

Chris M. Thomasson

Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c


Say I want to tokenize all the words in a file; the command line would look
like:

program.exe " \n" file.txt



Say I want to tokenize for all the tabs, newlines, and commas:

program.exe "\t\n," file.txt



Say I want to tokenize fields based on ";" and I want to use `stdin':

program.exe ";" file.txt




Anyway, I was wondering if anyone could offer suggestions or spot any
undefined behavior and/or buffer overruns in the code. One suggestion of my
own: I think I should provide a way to shrink the dynamic buffer; I don't
think it would be all that difficult... I should probably also make use of
`errno' to report specific error conditions.
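
For what it's worth, here is a minimal sketch of the shrink idea combined with `errno' reporting. It is not taken from the pastebin code: `struct dyn_buffer', `dyn_buffer_shrink()' and the "shrink only when more than half is idle" policy are all hypothetical names and choices made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical growable buffer; NOT the struct from the posted code. */
struct dyn_buffer {
    char* data;   /* heap storage */
    size_t size;  /* bytes currently allocated */
};

/*
 * Shrink the allocation down to `used' bytes (but never below `min_size')
 * when more than half of it is idle. Returns 0 on success or when there is
 * nothing to do; returns -1 with errno set to ENOMEM when realloc() fails,
 * in which case the original buffer is left untouched.
 */
int dyn_buffer_shrink(struct dyn_buffer* const self, size_t used,
                      size_t min_size)
{
    size_t target;
    char* tmp;

    assert(self != NULL);
    assert(used <= self->size);

    target = used > min_size ? used : min_size;
    if (target == 0) {
        target = 1;  /* realloc(p, 0) is implementation-defined; avoid it */
    }
    if (target >= self->size / 2) {
        return 0;    /* assumed policy: not worth shrinking */
    }

    tmp = realloc(self->data, target);
    if (!tmp) {
        errno = ENOMEM;  /* C itself does not promise realloc() sets errno */
        return -1;
    }
    self->data = tmp;
    self->size = target;
    return 0;
}
```

The caller would check the return value and, on -1, pass `errno' to something like `strerror()' for the message. Keeping the old pointer on failure means a failed shrink never loses data.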



Please, be gentle!


;^)
 

Chris M. Thomasson

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c
[...]

One point. The code works with both `fgets()' and `fread()'. The posted
version uses `fgets()' as-is. To force it to use `fread()' just alter the
following function from:
______________________________________________________________________
char*
file_tokenize(
    struct file_tokenize* const self,
    FILE* file,
    char const* tokens
) {
    return file
        ? file_tokenize_prv_read_and_parse_fgets(self, file, tokens)
        : file_tokenize_prv_parse(self, tokens);
}
______________________________________________________________________




to:
______________________________________________________________________
char*
file_tokenize(
    struct file_tokenize* const self,
    FILE* file,
    char const* tokens
) {
    return file
        ? file_tokenize_prv_read_and_parse_fread(self, file, tokens)
        : file_tokenize_prv_parse(self, tokens);
}
______________________________________________________________________




I think that `fread()' will give better performance... However, it seems
that `fgets()' gets along better with `stdin'.
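
One likely reason for the `stdin' difference: `fgets()' returns as soon as it has a complete line, while `fread()' keeps waiting until the requested byte count is satisfied or the stream hits end-of-file, so interactive input is not processed line by line. A rough sketch of the two read styles follows; the helper names `read_chunk_fread()' and `read_chunk_fgets()' are hypothetical and are not the private functions from the posted code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read up to cap-1 bytes with fread() and NUL-terminate the result.
 * Returns the number of bytes read; 0 means end-of-file or error.
 */
size_t read_chunk_fread(FILE* file, char* buf, size_t cap)
{
    size_t n;

    assert(file != NULL && buf != NULL && cap > 0);
    n = fread(buf, 1, cap - 1, file);
    buf[n] = '\0';
    return n;
}

/*
 * Read at most one line (or cap-1 bytes, whichever comes first) with
 * fgets(). Returns the number of bytes stored; 0 means end-of-file or
 * error. Note that strlen() undercounts if the input contains embedded
 * NUL bytes.
 */
size_t read_chunk_fgets(FILE* file, char* buf, size_t cap)
{
    assert(file != NULL && buf != NULL && cap > 1);
    if (!fgets(buf, (int)cap, file)) {
        buf[0] = '\0';
        return 0;
    }
    return strlen(buf);
}
```

With a regular file the two behave much the same apart from where the chunk boundaries fall; the difference only shows up when the input arrives a line at a time.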
 

Chris M. Thomasson

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c
[...]

One other point. The buffer size is hard coded as `1'. You can increase this
by changing the following line in the `main()' function:
_____________________________________________________________
int main(...) {
[...];

if (file_tokenize_create(&lines, 1)) {

[...];
}
_____________________________________________________________




to:
_____________________________________________________________
int main(...) {
[...];

if (file_tokenize_create(&lines, 8192)) {

[...];
}
_____________________________________________________________




or whatever. A larger buffer should generally give better performance, since
it means fewer read calls and fewer buffer shifts.
 

Chris M. Thomasson

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c
[...]

AFAICT, the following function harmlessly, but unnecessarily, subtracts 1
from the `file_tokenize::read' buffer size during the buffer shift and
"reuse" phase. It is left over from a sanity check in some debug code that I
forgot to remove before I posted the example code:
____________________________________________________________________
void
file_tokenize_prv_shift(
    struct file_tokenize* const self
) {
    /* shift buffer in order to "conserve" space */
    memmove(self->buffer, tokenize_get_buffer(&self->cursor),
            tokenize_get_size(&self->cursor));

    self->current = self->buffer + tokenize_get_size(&self->cursor);

    self->read = self->size - tokenize_get_size(&self->cursor) - 1;
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    /* perhaps shrink an overly large buffer? */
}
____________________________________________________________________








The code can instead read:
____________________________________________________________________
void
file_tokenize_prv_shift(
    struct file_tokenize* const self
) {
    /* shift buffer in order to "conserve" space */
    memmove(self->buffer, tokenize_get_buffer(&self->cursor),
            tokenize_get_size(&self->cursor));
    self->current = self->buffer + tokenize_get_size(&self->cursor);
    self->read = self->size - tokenize_get_size(&self->cursor);
    /* perhaps shrink an overly large buffer? */
}
____________________________________________________________________





Sorry about that nonsense. Like `strlen()', `tokenize_get_size()' does not
count the terminating null character, so there is no need to compensate.
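
To spell the arithmetic out, here is a small stand-alone sketch of the same shift. `struct shift_buf' and `shift_buf_shift()' are hypothetical stand-ins for the relevant fields of `struct file_tokenize', not the real definitions. After the shift, `current + read' lands exactly on the end of the storage, so a subsequent `fgets(current, read, file)' (which stores at most `read - 1' characters plus the terminator) stays in bounds without the extra -1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the relevant fields of `struct file_tokenize'. */
struct shift_buf {
    char* buffer;   /* start of the storage */
    size_t size;    /* total bytes of storage */
    char* current;  /* where the next read will land */
    size_t read;    /* bytes available to the next read */
};

/*
 * Move the `rem' unparsed bytes starting at `pending' to the front of the
 * buffer. `rem' excludes any terminator, like strlen().
 */
void shift_buf_shift(struct shift_buf* const self, char const* pending,
                     size_t rem)
{
    assert(self != NULL && pending != NULL);
    assert(rem <= self->size);

    memmove(self->buffer, pending, rem);  /* source and target may overlap */
    self->current = self->buffer + rem;
    self->read = self->size - rem;        /* no extra -1 required */
}
```

For example, with 8 bytes of storage holding "ab;cdef" and the 4 pending bytes "cdef", the shift leaves `current' at offset 4 and `read' at 4.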
 

Chris M. Thomasson

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c [...]
Say I want to tokenize fields based on ";" and I want to use `stdin':

program.exe ";" file.txt

Ummm... In order to use `stdin' I would need to omit the explicit file name
in the command line:


program.exe ";"



;^/
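
A sketch of how the input stream could be picked from the command line, assuming `argv[1]' holds the delimiters and an optional `argv[2]' names the file. `open_input()' is a hypothetical helper, not a function from the posted code:

```c
#include <assert.h>
#include <stdio.h>

/*
 * argv[1] is the delimiter set; argv[2], when present, names the input
 * file. Returns stdin when no file name was given, and NULL when the
 * delimiters are missing or fopen() fails.
 */
FILE* open_input(int argc, char* argv[])
{
    assert(argv != NULL);

    if (argc < 2) {
        return NULL;   /* no delimiter argument at all */
    }
    if (argc < 3) {
        return stdin;  /* program.exe ";" */
    }
    return fopen(argv[2], "r");  /* program.exe ";" file.txt */
}
```

The caller should only `fclose()' the result when it is not `stdin'.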


[...]
 
