dynamic file tokenization...

Chris M. Thomasson · Aug 9, 2009

Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c

Say I want to tokenize all the words in a file; the command line would look
like:

program.exe " \n" file.txt
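For readers without access to the pastebin, splitting on a delimiter set can be sketched roughly as follows. This is not the posted code: `next_token()' is a hypothetical helper built on the standard `strspn()' and `strcspn()', and it assumes the text is already in a writable, null-terminated buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the posted code.
 * Returns the next token in *cursor, or NULL when none remain.
 * The buffer is modified: the delimiter ending each token is
 * overwritten with '\0', and *cursor is advanced past it. */
char *next_token(char **cursor, const char *delims)
{
    char *start;
    char *end;

    assert(cursor != NULL && *cursor != NULL && delims != NULL);

    /* skip any run of leading delimiters */
    start = *cursor + strspn(*cursor, delims);
    if (*start == '\0') {
        *cursor = start;
        return NULL;
    }

    /* the token runs up to the next delimiter or the end of the text */
    end = start + strcspn(start, delims);
    if (*end != '\0')
        *end++ = '\0';

    *cursor = end;
    return start;
}
```

Calling it in a loop with `" \n"' yields one word per call. Note that runs of delimiters are collapsed, so empty fields are skipped; that may or may not be wanted for ";"-separated fields.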

Say I want to tokenize for all the tabs, newlines, and commas:

program.exe "\t\n," file.txt
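One thing to keep in mind with a command line like this: common shells pass "\t\n," through as the literal characters backslash, `t', backslash, `n' and comma, so the program has to translate the escapes itself. Whether the posted code does so is not shown here; a minimal sketch of such a translation, using a hypothetical `unescape_delims()' helper, might look like:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from the posted code.
 * Rewrites s in place, turning the two-character sequences
 * \n, \t and \\ into the single characters they name.
 * An unrecognised escape keeps the character after the backslash. */
char *unescape_delims(char *s)
{
    char *src = s;
    char *dst = s;

    assert(s != NULL);

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] != '\0') {
            switch (src[1]) {
            case 'n':  *dst++ = '\n'; break;
            case 't':  *dst++ = '\t'; break;
            case '\\': *dst++ = '\\'; break;
            default:   *dst++ = src[1]; break;
            }
            src += 2;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return s;
}
```

The result is never longer than the input, so rewriting in place is safe.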

Say I want to tokenize fields based on ";" and I want to use `stdin':

program.exe ";" file.txt

Anyway, I was wondering if anyone could offer suggestions or spot any
undefined behavior and/or buffer overruns in the code. One suggestion of my
own: I think I should provide a way to shrink the dynamic buffer; I don't
think it would be all that difficult... I should also probably make use of
`errno' to report specific error conditions.
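Both ideas can be sketched together. The following is only an illustration under assumptions: `struct dyn_buffer', `dyn_buffer_shrink()' and the quarter-full threshold are all hypothetical and do not come from the posted code. It shrinks with `realloc()' and reports failure through `errno'.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the tokenizer's growable buffer. */
struct dyn_buffer {
    char  *data;  /* heap allocation */
    size_t size;  /* bytes allocated */
    size_t used;  /* bytes currently holding data */
};

/* Shrink to twice the used size (but not below min_size) when less
 * than a quarter of the allocation is in use. Returns 0 on success or
 * when there was nothing to do. Returns -1 and sets errno on failure,
 * in which case the buffer is left untouched. */
int dyn_buffer_shrink(struct dyn_buffer *b, size_t min_size)
{
    size_t target;
    char *p;

    assert(b != NULL && b->data != NULL);

    if (b->used >= b->size / 4)
        return 0;               /* reasonably full; leave it alone */

    target = b->used * 2;       /* cannot overflow: used < size / 4 */
    if (target < min_size)
        target = min_size;
    if (target == 0 || target >= b->size)
        return 0;

    p = realloc(b->data, target);
    if (p == NULL) {
        errno = ENOMEM;         /* ISO C does not promise realloc sets it */
        return -1;
    }

    b->data = p;
    b->size = target;
    return 0;
}
```

Since `realloc()' may move the block, any pointers into the old buffer (a cursor, for example) would have to be recomputed after a successful shrink.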

Please, be gentle!

;^)

Chris M. Thomasson · Aug 9, 2009

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c

[...]

One point. The code works with both `fgets()' and `fread()'. The posted
version uses `fgets()' as-is. To force it to use `fread()' just alter the
following function from:
______________________________________________________________________
char*
file_tokenize(
    struct file_tokenize* const self,
    FILE* file,
    char const* tokens
) {
    return file
        ? file_tokenize_prv_read_and_parse_fgets(self, file, tokens)
        : file_tokenize_prv_parse(self, tokens);
}
______________________________________________________________________

to:
______________________________________________________________________
char*
file_tokenize(
    struct file_tokenize* const self,
    FILE* file,
    char const* tokens
) {
    return file
        ? file_tokenize_prv_read_and_parse_fread(self, file, tokens)
        : file_tokenize_prv_parse(self, tokens);
}
______________________________________________________________________

I think that `fread()' will give better performance... However, it seems
that `fgets()' gets along better with `stdin'.
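A likely reason for the difference: `fgets()' returns as soon as it has seen a newline and always null-terminates, whereas `fread()' keeps reading until the requested count is filled or end of file is reached, and does not terminate the data. On an interactive `stdin' that means `fread()' may sit waiting for more input after a line has been typed. A minimal sketch of a chunked read that adds the terminator by hand (`read_chunk()' is a hypothetical name, not from the posted code):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not taken from the posted code.
 * Reads up to cap - 1 bytes with fread() and null-terminates them so
 * that string functions can scan the chunk. Returns the number of
 * bytes read; 0 means end of file or an error (use feof()/ferror()
 * to tell which). */
size_t read_chunk(FILE *file, char *buf, size_t cap)
{
    size_t n;

    assert(file != NULL && buf != NULL && cap > 0);

    n = fread(buf, 1, cap - 1, file);
    buf[n] = '\0';
    return n;
}
```

Unlike a line from `fgets()', a chunk can end in the middle of a token, so the unparsed tail has to be carried over to the next read. A file containing embedded zero bytes would also cut the string short.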

Chris M. Thomasson · Aug 9, 2009

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c

[...]

One other point. The buffer size is hard coded as `1'. You can increase this
by changing the following line in the `main()' function:
_____________________________________________________________
int main(...) {
[...];

if (file_tokenize_create(&lines, 1)) {

[...];
}
_____________________________________________________________

to:
_____________________________________________________________
int main(...) {
[...];

if (file_tokenize_create(&lines, 8192)) {

[...];
}
_____________________________________________________________

or whatever. A larger buffer should give better performance, since the file
is consumed in fewer, larger reads...
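To get a feel for the effect of the buffer size, one can simply count how many read calls it takes to consume a file. The sketch below is an illustration only; `count_reads()' is a hypothetical helper and says nothing about how the posted code grows its buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from the posted code.
 * Consumes the rest of f in chunks of bufsize bytes and returns the
 * number of fread() calls that delivered data, or -1 if bufsize is 0
 * or the buffer could not be allocated. */
long count_reads(FILE *f, size_t bufsize)
{
    char *buf;
    long calls = 0;

    assert(f != NULL);

    if (bufsize == 0)
        return -1;

    buf = malloc(bufsize);
    if (buf == NULL)
        return -1;

    while (fread(buf, 1, bufsize, f) > 0)
        ++calls;

    free(buf);
    return calls;
}
```

For a 10-byte file this gives 10 calls with a buffer size of 1 and a single call with 8192. The stdio layer does its own buffering underneath, so the saving is mostly in per-call and parsing overhead rather than in system calls.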

Chris M. Thomasson · Aug 10, 2009

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c

[...]

AFAICT, the following function harmlessly but unnecessarily subtracts 1
from the `file_tokenize::read' buffer size during the buffer shift and
"reuse" phase. There was a sanity check for some debug code that I forgot to
remove before I posted the example code:
____________________________________________________________________
void
file_tokenize_prv_shift(
    struct file_tokenize* const self
) {
    /* shift buffer in order to "conserve" space */
    memmove(self->buffer, tokenize_get_buffer(&self->cursor),
            tokenize_get_size(&self->cursor));

    self->current = self->buffer + tokenize_get_size(&self->cursor);

    self->read = self->size - tokenize_get_size(&self->cursor) - 1;
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    /* perhaps shrink an overly large buffer? */
}
____________________________________________________________________

The code can simply read:
____________________________________________________________________
void
file_tokenize_prv_shift(
    struct file_tokenize* const self
) {
    /* shift buffer in order to "conserve" space */
    memmove(self->buffer, tokenize_get_buffer(&self->cursor),
            tokenize_get_size(&self->cursor));
    self->current = self->buffer + tokenize_get_size(&self->cursor);
    self->read = self->size - tokenize_get_size(&self->cursor);
    /* perhaps shrink an overly large buffer? */
}
____________________________________________________________________

Sorry about that nonsense. `tokenize_get_size()' does not count the null
terminator, just like `strlen()', so there is no need to compensate.
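The shift step can be tried in isolation with a small stand-alone version. This is a sketch under assumptions, not the posted code: `struct chunk_buffer' and its field names are hypothetical stand-ins for `struct file_tokenize' and its cursor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the tokenizer state. */
struct chunk_buffer {
    char  *buffer;       /* start of the allocation */
    size_t size;         /* total capacity in bytes */
    char  *pending;      /* first unconsumed byte, inside buffer */
    size_t pending_len;  /* unconsumed bytes, terminator not counted */
    char  *current;      /* where the next read should store data */
    size_t room;         /* bytes available at current */
};

/* Move the unconsumed tail to the front of the buffer and recompute
 * where the next read goes and how much space it has. */
void chunk_buffer_shift(struct chunk_buffer *b)
{
    assert(b != NULL && b->pending >= b->buffer);
    assert(b->pending_len <= b->size - (size_t)(b->pending - b->buffer));

    /* memmove, not memcpy: source and destination may overlap */
    memmove(b->buffer, b->pending, b->pending_len);

    b->pending = b->buffer;
    b->current = b->buffer + b->pending_len;
    b->room    = b->size - b->pending_len;   /* no extra "- 1" here */
}
```

With an 8-byte buffer holding "ab;cdef" and the four bytes "cdef" still pending, the shift leaves "cdef" at the front and 4 bytes of room. Whether the reader then reserves one of those bytes for a terminator is a separate decision made at the read call.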

Chris M. Thomasson · Aug 11, 2009

Chris M. Thomasson said:
Here is a link to some fairly crude example code:

http://clc.pastebin.com/f4b9f841c [...]
Say I want to tokenize fields based on ";" and I want to use `stdin':

program.exe ";" file.txt

Ummm... In order to use `stdin' I would need to omit the explicit file name
in the command line:

program.exe ";"

;^/
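The fallback to `stdin' when no file name is given could be sketched like this. `open_input()' is a hypothetical helper; the argument positions are assumed from the command lines shown in this thread (delimiters in `argv[1]', optional file name in `argv[2]').

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not taken from the posted code.
 * Returns stdin when no file name was given on the command line,
 * otherwise the result of fopen() on argv[2], which is NULL (with
 * errno typically set) if the file cannot be opened. */
FILE *open_input(int argc, char **argv)
{
    assert(argv != NULL);

    if (argc < 3)
        return stdin;

    return fopen(argv[2], "r");
}
```

The caller should only `fclose()' the stream when it is not `stdin'.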

[...]
