tokenising a string using another string

Mark

I've got a really messy text file that I need to work on, and the only
thing separating the records is either "\r\n\r\n" or "Total:".

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.
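To make that concrete, here is a small sketch. The helper name count_strtok_pieces is made up for illustration; strtok() itself is standard C. It only counts how many pieces strtok() produces when the second argument is treated, as strtok() always treats it, as a set of single-character delimiters.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: counts the pieces strtok() produces.
   strtok() treats delims as a SET of characters, not as one string,
   and it overwrites delimiters in buf with '\0' as it goes. */
static int count_strtok_pieces(char *buf, const char *delims)
{
    int n = 0;
    char *tok;

    assert(buf != NULL && delims != NULL);
    for (tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

Called on a writable copy of "Subtotal: 10" with "Total:" as the second argument, this finds two pieces ("Sub" and " 10"), even though the string "Total:" never occurs in the input: the lowercase 't', 'o', 'a', 'l' and ':' each count as delimiters on their own.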

Is there an easy way to split it based on a string (read char*) rather
than a char?

TIA
Mark
 
Suman

Mark said:
I've got a really messy text file that I need to work on and the only
things separating each record is either "\r\n\r\n" or "Total:".

Can we have some more information here? It sure is messy,
but my premise is that it contains some information, otherwise you
wouldn't be splitting hairs over this. And if it contains
some specific information, then there will be some structure
to it. Maybe then you can read a char at a time, build some
tokens out of them, take the ones you need and do whatever
needs to be done.

Or am I mistaken, and you have already tried all of this and failed?

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.

This can probably wait until we have identified which tokens
we have to find; then we can proceed accordingly.

Is there an easy way to split it based on a string (read char*) rather
than a char?

Read the lines via fgets() and use sscanf() or your own hand-spun lexer.
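A rough sketch of the fgets() route, under some loudly stated assumptions: the helper names is_blank_line and read_record are invented here, every line fits in the 256-byte line buffer, a record ends at the first blank line, and any line starting with "Total:" is a separator to be skipped. None of that is confirmed by the data description so far, so treat it as a starting point only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if the line holds nothing but CR/LF. */
static int is_blank_line(const char *line)
{
    return line[strspn(line, "\r\n")] == '\0';
}

/* Hypothetical helper: reads one record into rec (capacity cap).
   Returns 1 if a record was read, 0 at end of input.
   Assumes each line fits in the line buffer; text that does not fit
   in rec is silently dropped in this sketch. */
static int read_record(FILE *in, char *rec, size_t cap)
{
    char line[256];
    size_t used = 0, len;
    int got_any = 0;

    assert(in != NULL && rec != NULL && cap > 0);
    rec[0] = '\0';
    while (fgets(line, sizeof line, in) != NULL) {
        if (is_blank_line(line) || strncmp(line, "Total:", 6) == 0) {
            if (got_any)
                break;      /* a separator ends the current record */
            continue;       /* skip separators before a record starts */
        }
        got_any = 1;
        len = strlen(line);
        if (len > cap - 1 - used)
            len = cap - 1 - used;   /* truncate rather than overflow */
        memcpy(rec + used, line, len);
        used += len;
        rec[used] = '\0';
    }
    return got_any;
}
```

Calling read_record() in a loop until it returns 0 hands back one multi-line record at a time, with the embedded line breaks of the memo fields left in place for a later pass to deal with.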
 
Mark

Suman said:
Can we have some more information, here?
[snip]

It's supposed to be a CSV export from MYOB, but there are a few memo
fields that contain carriage returns etc., so I can't simply read until
\r\n and assume that that is one record.

It might go something like this...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

Mark
 
Suman

Mark said:
Suman said:
Can we have some more information, here?
[snip]

It's supposed to be a CSV export from MYOB but there are a few memo

CSV = comma-separated values? What is MYOB?

field that have carriage returns etc so I can't easily read until \r\n
and assume that that is one record.

It might go something like this...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun
Total: <-- this is always at the end

This is what I was talking about :)
So maybe you can actually write your own crude grammar, viz.:

Record_set  -> Record Record_set
             | 'Total:'

Record      -> Cust_name ',' Date ',' Memo_fields

Memo_fields -> Memo_field ',' Memo_fields
             | Memo_field

Memo_field  -> ...
Cust_name   -> ...

... and then find what the *tokens* are. And then write your own
lexer -- that will scan the input for The Chosen Ones!

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.

Are you sure you are not missing the forest for the trees?
I mean, I do not understand your preoccupation with `\r\n'.
I don't mean to demean you; I just can't fathom why it
is so important.

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

I don't have any :/
 
Richard Bos

Mark said:
I've got a really messy text file that I need to work on and the only
things separating each record is either "\r\n\r\n" or "Total:".

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.

Is there an easy way to split it based on a string (read char*) rather
than a char?

Not pre-made. You'll have to search for the strings yourself, using
strstr().

Richard
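A minimal sketch of the strstr() approach, assuming the whole file has already been read into one writable, NUL-terminated buffer. The helper names find_delim and split_on_strings are made up for illustration; only strstr() and strlen() come from the standard library. It also assumes that "Total:" never appears inside a memo field, which the thread does not establish.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the earliest occurrence in s of any of the
   n delimiter strings. Returns a pointer to the match and stores its
   length in *dlen, or returns NULL if none occurs.
   Delimiters must be non-empty strings. */
static char *find_delim(char *s, const char *const delims[], size_t n,
                        size_t *dlen)
{
    char *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        char *hit = strstr(s, delims[i]);
        if (hit != NULL && (best == NULL || hit < best)) {
            best = hit;
            *dlen = strlen(delims[i]);
        }
    }
    return best;
}

/* Hypothetical helper: splits buf in place into at most max pieces by
   writing '\0' over the start of each delimiter. Empty pieces (such as
   the gap between "\r\n\r\n" and "Total:") are skipped.
   Returns the number of pieces stored. */
static size_t split_on_strings(char *buf, const char *const delims[],
                               size_t n, char *pieces[], size_t max)
{
    size_t count = 0, dlen = 0;
    char *p = buf, *hit;

    assert(buf != NULL && delims != NULL && pieces != NULL);
    while (count < max && (hit = find_delim(p, delims, n, &dlen)) != NULL) {
        *hit = '\0';            /* terminate the piece before the match */
        if (*p != '\0')
            pieces[count++] = p;
        p = hit + dlen;         /* resume just past the whole delimiter */
    }
    if (count < max && *p != '\0')
        pieces[count++] = p;    /* trailing text after the last match */
    return count;
}
```

Unlike strtok(), a delimiter here only matches when all of its characters appear in sequence, so a lone "\r\n" inside a memo field does not end the record.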
 
Nick Keighley

Mark said:
Suman wrote:
It's supposed to be a CSV export from MYOB but there are a few memo
field that have carriage returns etc so I can't easily read until \r\n
and assume that that is one record.

It might go something like this...

"might" is not a word I like to see in interface specifications...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

so how do you know when one "memo field" ends and the next one begins?

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

Stop writing code (of whatever pasta variety). You have *got* to work
out the format of the data. The reason it has turned to spaghetti is that
you don't know what it's supposed to do. How can you write a program to do
something you can't do yourself?
 
Pramod Subramanyan

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end


The plan goes like this:

1. Use a state variable to keep track of what you're reading now.
2. Use a switch to handle similar states.
3. Inside the switch, read on until you reach the terminating condition
for this state.

OK, I'd write some rough code based on this as:

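enum lexer_state { CNAME, LDATE, MEMO1, MEMO2, MEMO3, LDONE };
enum lexer_state cstate = CNAME;
int c;

/* infile is an already opened FILE *. Test the result of getc()
   rather than feof(), which only becomes true after a read has failed. */
while (cstate != LDONE && (c = getc(infile)) != EOF) {
    switch (cstate) {
    case CNAME:
    case LDATE:
        /* Read on until a ',' is reached and increment your state. */
        if (c == ',')
            cstate = (cstate == CNAME) ? LDATE : MEMO1;
        break;
    case MEMO1:
        /* Code to read memo 1 */
        break;
    default:
        /* Write the rest of the code yourself :) */
        break;
    }
}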
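For what it's worth, here is one way the plan above could be filled in as a self-contained sketch. Everything beyond the plan itself is an assumption: the names field_state, record, put_char and parse_record are invented, the input is an in-memory string rather than a file, the first two commas end the customer name and the date, and all the memo fields are kept together as one blob up to the first "\r\n\r\n", because the thread never settles how to tell one memo field from the next.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum field_state { ST_NAME, ST_DATE, ST_MEMO, ST_DONE };

struct record {             /* hypothetical layout, sizes are arbitrary */
    char name[64];
    char date[32];
    char memo[512];
};

/* Appends c to a fixed-size field, dropping it if the field is full. */
static void put_char(char *field, size_t cap, char c)
{
    size_t len = strlen(field);

    if (len + 1 < cap) {
        field[len] = c;
        field[len + 1] = '\0';
    }
}

/* Parses one record starting at s and returns a pointer just past the
   text it consumed (past the "\r\n\r\n" if one was found). */
static const char *parse_record(const char *s, struct record *r)
{
    enum field_state st = ST_NAME;

    assert(s != NULL && r != NULL);
    r->name[0] = r->date[0] = r->memo[0] = '\0';
    while (st != ST_DONE && *s != '\0') {
        switch (st) {
        case ST_NAME:       /* up to the first comma */
            if (*s == ',') st = ST_DATE;
            else put_char(r->name, sizeof r->name, *s);
            s++;
            break;
        case ST_DATE:       /* up to the second comma */
            if (*s == ',') st = ST_MEMO;
            else put_char(r->date, sizeof r->date, *s);
            s++;
            break;
        case ST_MEMO:       /* everything up to the blank line */
            if (strncmp(s, "\r\n\r\n", 4) == 0) {
                s += 4;
                st = ST_DONE;
            } else {
                put_char(r->memo, sizeof r->memo, *s);
                s++;
            }
            break;
        case ST_DONE:
            break;
        }
    }
    return s;
}
```

The returned pointer lets the caller skip a following "Total:" and call parse_record() again for the next record.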
 
