mmap parsing...

netbogus

hi,

I have a file stored in memory using mmap() and I'd like to parse it
line by line.
Also, there are several threads that read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.

thanks in advance,

lz.
 
Jack Klein

hi,

I have a file stored in memory using mmap() and I'd like to parse it
line by line.
Also, there are several threads that read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.

thanks in advance,

There is no mmap() and there are no threads in C, so this is off-topic
here. I suggest you take this to a group that supports your
compiler/OS combination.
 
netbogus

Please read my post again. My question is regarding strtok(), not
mmap() or threads. I want an ANSI solution to this problem, that's why I
came to CLC.

lz.
 
Artie Gold

Please read my post again. My question is regarding strtok(), not
mmap() or threads. I want an ANSI solution to this problem, that's why I
came to CLC.

lz.

1) Please quote an appropriate amount of context when replying. It's the
Usenet way.

2) Since your (well-placed) concern about using strtok() has to do with
its relationship to threads, there *is* no ANSI solution (as ANSI C
knows not of threads). Your platform probably supplies a solution.
Posting to an appropriate platform-specific forum will likely help you
find the information you need.

HTH,
--ag
 
websnarf

I have a file stored in memory using mmap() and I'd like to parse it
line by line.
Also, there are several threads that read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.

Indeed, strtok() is utter crap for situations like this. If you are
using gcc, you can use strtok_r(), which is reentrant and thread-safe.

You said you wanted portability in another post, though I don't know
how that fits with your mmap() usage. I'll assume you mean "mmap() or
equivalent" or that you intend to make it more general in the future.
Anyhow, for portable string manipulations, you can use "The Better
String Library": http://bstring.sf.net/ . It has string parsing
facilities that are equal to or better than any of C's built-in library
functions, it is a totally thread-safe and reentrant library, and it's
portable.

If, by portable, you mean portable to any system with mmap(), there is
another possibility: James Antill's Vstr,
http://www.and.org/vstr/ , which claims higher I/O performance by using
mmap(). However, it is not thread-safe (it claims to be fork()-safe, which
is not the same thing, but may be sufficient for you).
 
Ben Pfaff

I have a file stored in memory using mmap() and I'd like to parse it
line by line.
Also, there are several threads that read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.

strtok() is rarely a good choice for anything.
strtok() has at least these problems:

* It merges adjacent delimiters. If you use a comma as
  your delimiter, then "a,,b,c" is three tokens, not
  four. This is often the wrong thing to do. In fact,
  it is only the right thing to do, in my experience,
  when the delimiter set is limited to white space.

* The identity of the delimiter is lost, because it is
  changed to a null terminator.

* It modifies the string that it tokenizes. This is bad
  because it forces you to make a copy of the string if
  you want to use it later. It also means that you can't
  tokenize a string literal with it; this is not
  necessarily something you'd want to do all the time but
  it is surprising.

* It can only be used once at a time. If a sequence of
  strtok() calls is ongoing and another one is started,
  the state of the first one is lost. This isn't a
  problem for small programs but it is easy to lose track
  of such things in hierarchies of nested functions in
  large programs. In other words, strtok() breaks
  encapsulation.
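
To make the first point concrete, here is a minimal sketch that counts the pieces of a comma-delimited string two ways. The helper names count_tokens_strtok() and count_fields() are made up for illustration; only strtok() and strchr() come from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts tokens the way strtok() sees them. Modifies s, and a run of
   adjacent delimiters collapses into a single separator. */
static size_t count_tokens_strtok(char *s, const char *delims)
{
    size_t n = 0;

    assert(s != NULL && delims != NULL);
    for (char *tok = strtok(s, delims); tok != NULL;
         tok = strtok(NULL, delims))
        n++;
    return n;
}

/* Counts fields separated by one delimiter character without modifying
   s. An empty field between two adjacent delimiters is counted. */
static size_t count_fields(const char *s, char delim)
{
    size_t n = 1;   /* a string with no delimiter is one field */

    assert(s != NULL && delim != '\0');
    for (const char *p = strchr(s, delim); p != NULL;
         p = strchr(p + 1, delim))
        n++;
    return n;
}
```

Called on a writable copy of "a,,b,c" with "," as the delimiter set, count_tokens_strtok() returns 3, while count_fields() on the same text returns 4 because it keeps the empty field.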

Instead, use some substitute, e.g. strtok_r(). Here is an
implementation of strtok_r(). It may be SUSv3 compliant, but I
do not know for sure. If you use it, you should probably rename
it, because (most) names beginning with `str' are reserved:

/* Breaks a string into tokens separated by DELIMITERS. The
   first time this function is called, S should be the string to
   tokenize, and in subsequent calls it must be a null pointer.
   SAVE_PTR is the address of a `char *' variable used to keep
   track of the tokenizer's position. The return value each time
   is the next token in the string, or a null pointer if no
   tokens remain.

   This function treats multiple adjacent delimiters as a single
   delimiter. The returned tokens will never be length 0.
   DELIMITERS may change from one call to the next within a
   single string.

   strtok_r() modifies the string S, changing delimiters to null
   bytes. Thus, S must be a modifiable string. String literals,
   in particular, are *not* modifiable in C, even though for
   backward compatibility they are not `const'.

   Example usage:

   char s[] = " String to tokenize. ";
   char *token, *save_ptr;

   for (token = strtok_r (s, " ", &save_ptr); token != NULL;
        token = strtok_r (NULL, " ", &save_ptr))
     printf ("'%s'\n", token);

   outputs:

     'String'
     'to'
     'tokenize.'
*/
#include <assert.h>
#include <string.h>

char *
strtok_r (char *s, const char *delimiters, char **save_ptr)
{
  char *token;

  assert (delimiters != NULL);
  assert (save_ptr != NULL);

  /* If S is nonnull, start from it.
     If S is null, start from saved position. */
  if (s == NULL)
    s = *save_ptr;
  assert (s != NULL);

  /* Skip any DELIMITERS at our current position. */
  while (strchr (delimiters, *s) != NULL)
    {
      /* strchr() will always return nonnull if we're searching
         for a null byte, because every string contains a null
         byte (at the end). */
      if (*s == '\0')
        {
          *save_ptr = s;
          return NULL;
        }

      s++;
    }

  /* Skip any non-DELIMITERS up to the end of the string. */
  token = s;
  while (strchr (delimiters, *s) == NULL)
    s++;
  if (*s != '\0')
    {
      *s = '\0';
      *save_ptr = s + 1;
    }
  else
    *save_ptr = s;
  return token;
}
 
Peter Nilsson

Ben said:
I have a file stored in memory using mmap() and I'd like to parse it
line by line.
Also, there are several threads that read this buffer, so I think
strtok(p, "\n") wouldn't be a good choice. I'd like to hear from you
guys what would be a good implementation in this case.

strtok() is rarely a good choice for anything. ...
Instead, use some substitute, e.g. strtok_r(). Here is an
implementation of strtok_r().

<snip>

Example usage:

char s[] = " String to tokenize. ";
char *token, *save_ptr;

for (token = strtok_r (s, " ", &save_ptr); token != NULL;
     token = strtok_r (NULL, " ", &save_ptr))
  printf ("'%s'\n", token);

<snip>

I prefer...

char *alt_strtok(char **s, const char *del)
{
  char *t;
  if (!*s) return 0;
  *s += strspn(*s, del);
  if (!**s) return *s = 0;
  *s += strcspn(t = *s, del);
  if (**s) *(*s)++ = 0; else *s = 0;
  return t;
}

Usage:

char s[] = " String to tokenize. ";
char *tok, *sp;

for (sp = s; (tok = alt_strtok(&sp, " ")) != NULL; )
  printf("'%s'\n", tok);
 
CBFalconer

Please read my post again. My question is regarding strtok(), not
mmap() or threads. I want an ANSI solution to this problem, that's
why I came to CLC.

Well, you didn't bother to quote things properly, so it is
impossible to write a sane reply (see my sig below). In general,
don't mess with things and use the fundamental file system to
access your data. Whether the underlying system uses mmap or
threads is its business. There certainly is no reason to mention
them in this group.
 
Dave Thompson

Indeed, strtok() is utter crap for situations like this. If you are
using gcc, you can use strtok_r(), which is reentrant and thread-safe.

gcc as such is not relevant, unless this is one of the functions
chosen for inlining, which I haven't seen on any platform I've used.
You have strtok_r() if you use _glibc_, which is sometimes but not
always used in conjunction with gcc, or some other systems. Or, of
course, if you provide it in user code, as in Ben's post in this thread.

And (the usual though not officially standard) strtok_r() is safe for
multiple threads running concurrently, or for different parts of a
program (e.g. loop levels) interleaving their parsing of _different_
strings; it is of no help for multiple threads accessing the same
string, which appears to me to be what the OP is asking. And, as noted
by Ben, like strtok() it collapses adjacent delimiters, so here it
skips empty lines, which may or may not be a problem for the OP.
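
As a rough illustration of the "loop levels" case, the sketch below runs two strtok_r() tokenizers at once, one per nesting level, each with its own save pointer. It assumes a POSIX system that declares strtok_r() in <string.h>; the function name count_tokens_per_line() and its parameters are hypothetical. Note that it still writes to the buffer, so it does not help several threads sharing one string.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits text into lines and each line into comma-separated tokens,
   storing the token count of each line in counts[]. Returns the number
   of lines recorded (at most max_lines). Modifies text. Empty lines and
   empty fields are skipped because strtok_r() merges adjacent
   delimiters. */
static size_t count_tokens_per_line(char *text, size_t *counts,
                                    size_t max_lines)
{
    char *line_state;       /* position of the outer (line) tokenizer */
    size_t nlines = 0;

    assert(text != NULL && counts != NULL);
    for (char *line = strtok_r(text, "\n", &line_state);
         line != NULL && nlines < max_lines;
         line = strtok_r(NULL, "\n", &line_state)) {
        char *field_state;  /* independent position of the inner one */
        size_t nfields = 0;

        for (char *field = strtok_r(line, ",", &field_state);
             field != NULL;
             field = strtok_r(NULL, ",", &field_state))
            nfields++;
        counts[nlines++] = nfields;
    }
    return nlines;
}
```

With the input "a,b\n\nc,,d,e\n" this records two lines with 2 and 3 tokens: the empty line and the empty field both disappear, which is the collapsing behaviour described above.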

Given the lines in the file are delimited by a single known character
like '\n', which is a good bet on most if not all systems that support
mmap _under that name_, and is also needed for strtok() or _r(), then
strchr() does much of the job -- or memchr() if the file contents
aren't (necessarily) terminated or followed by a null character, which
they might not be depending on file size and page size.
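
A minimal sketch of the memchr() approach, under the assumption that the mapped region is only read while it is being parsed. The names struct line_ref and next_line() are invented for this example. Nothing is written to the buffer and all position state lives in the caller's own variable, so each thread can walk the same mapping with its own offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A line inside the buffer. It is not null-terminated, so the length
   travels with the pointer. */
struct line_ref {
    const char *start;
    size_t len;             /* excludes the trailing '\n' */
};

/* Finds the line starting at offset *pos in buf[0..size) and advances
   *pos past it. Returns 1 and fills *out if a line was found, 0 once
   the end of the buffer is reached. Never writes to buf. */
static int next_line(const char *buf, size_t size, size_t *pos,
                     struct line_ref *out)
{
    assert(buf != NULL && pos != NULL && out != NULL);
    if (*pos >= size)
        return 0;

    const char *start = buf + *pos;
    size_t remaining = size - *pos;
    const char *nl = memchr(start, '\n', remaining);

    out->start = start;
    if (nl != NULL) {
        out->len = (size_t)(nl - start);
        *pos += out->len + 1;   /* step over the '\n' */
    } else {
        out->len = remaining;   /* final line without a trailing '\n' */
        *pos = size;
    }
    return 1;
}
```

A caller would start with a local size_t pos = 0 and loop while next_line() returns 1, printing each line with something like printf("%.*s\n", (int)ln.len, ln.start). Unlike strtok() and strtok_r(), empty lines come back as zero-length lines rather than being skipped.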

- David.Thompson1 at worldnet.att.net
 
