tokenising a string using another string

Mark · Aug 24, 2005

I've got a really messy text file that I need to work on, and the only
things separating the records are either "\r\n\r\n" or "Total:".

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.

Is there an easy way to split it based on a string (read char*) rather
than a char?

TIA
Mark

Suman · Aug 24, 2005

Mark said:
I've got a really messy text file that I need to work on, and the only
things separating the records are either "\r\n\r\n" or "Total:".

Can we have some more information, here? It sure is messy,
but my premise is it contains some information, otherwise you
wouldn't be splitting your hairs on this. And if it contains
some specific information, then there will be some structure
to it. Maybe then you can read a char at a time, build some
tokens out of them, take the ones you need and do whatever
that needs to be done.

Or, am I mistaken, and you have tried all of this out and failed?

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.

This can probably wait until we have identified which tokens
we have to find; then we can proceed accordingly.

Is there an easy way to split it based on a string (read char*) rather
than a char?

Read them via fgets() and use sscanf() or your own hand spun lexer().

Mark · Aug 24, 2005

Suman said:
Can we have some more information, here?

[snip]

It's supposed to be a CSV export from MYOB but there are a few memo
fields that have carriage returns etc., so I can't simply read until \r\n
and assume that this is one record.

It might go something like this...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

Mark

Suman · Aug 24, 2005

Mark said:
Suman said:

Can we have some more information, here?

[snip]

It's supposed to be a CSV export from MYOB but there are a few memo

CSV = Comma separated values? What is MYOB?

fields that have carriage returns etc., so I can't simply read until \r\n
and assume that this is one record.

It might go something like this...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun
Total: <-- this is always at the end

This is what I was talking about

So maybe you can actually write your own crude grammar, viz.

Record_Set  -> Record Record_Set
             | 'Total:'

Record      -> Cust_name ',' Date ',' Memo_fields

Memo_fields -> Memo_field ',' Memo_fields
             | Memo_field

Memo_field  -> ...
Cust_name   -> ...

... and then find what the *tokens* are. And then write your own
lexer -- that will scan the input for The Chosen Ones!

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.

Are you sure you are not missing the forest for the trees?
I mean I do not understand your preoccupation with `\r\n'.
Not to demean you or something, just that I can't fathom why it
is so important.

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

I don't have any :/

Richard Bos · Aug 24, 2005

Mark said:
I've got a really messy text file that I need to work on, and the only
things separating the records are either "\r\n\r\n" or "Total:".

I figure I won't be able to use strtok because it will split the string
when it matches any rather than all of the chars in the delimiter.

Is there an easy way to split it based on a string (read char*) rather
than a char?

Not pre-made. You'll have to search for the strings yourself, using
strstr().

Richard
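
A minimal sketch of the strstr() approach suggested above, for readers who
want a starting point. The names find_first_delim and next_record are
hypothetical helpers, not standard library functions, and the sketch
assumes the whole file has already been read into one writable,
NUL-terminated buffer. Unlike strtok, it matches each delimiter as a whole
string and returns empty tokens when two delimiters are adjacent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the earliest occurrence of any of the
   delimiter strings in s. On success, store the length of the matched
   delimiter in *dlen and return a pointer to the match; return NULL
   if none of the delimiters occurs in s. */
static char *find_first_delim(char *s, const char *const delims[],
                              size_t ndelims, size_t *dlen)
{
    char *best = NULL;
    size_t i;

    for (i = 0; i < ndelims; i++) {
        char *hit = strstr(s, delims[i]);
        if (hit != NULL && (best == NULL || hit < best)) {
            best = hit;
            *dlen = strlen(delims[i]);
        }
    }
    return best;
}

/* Hypothetical strtok-like splitter. *cursor points at the unread part
   of a writable buffer. Returns the next token (NUL-terminated in
   place) and advances *cursor past the delimiter, or returns NULL
   once the buffer is exhausted. */
static char *next_record(char **cursor, const char *const delims[],
                         size_t ndelims)
{
    char *start;
    char *hit;
    size_t dlen = 0;

    assert(cursor != NULL);
    start = *cursor;
    if (start == NULL)
        return NULL;

    hit = find_first_delim(start, delims, ndelims, &dlen);
    if (hit == NULL) {
        *cursor = NULL;          /* last token: nothing left after it */
    } else {
        *hit = '\0';             /* cut the token off at the delimiter */
        *cursor = hit + dlen;    /* resume after the whole delimiter */
    }
    return start;
}
```

To use it, point a char * at the start of the buffer, define an array such
as { "\r\n\r\n", "Total:" }, and call next_record(&cursor, delims, 2) in a
loop until it returns NULL. Empty tokens (for example the one between a
blank line and "Total:") can simply be skipped by the caller.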

Nick Keighley · Aug 24, 2005

Mark said:
Suman wrote:

It's supposed to be a CSV export from MYOB but there are a few memo
fields that have carriage returns etc., so I can't simply read until \r\n
and assume that this is one record.

It might go something like this...

"might" is not a word I like to see in interface specifications...

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

so how do you know when one "memo field" ends and the next one begins?

I can't change the data coming out and I can't really change the data
going in because it's coming out of an accounting system.

What I thought I could probably do was either read up until the first
\r\n\r\n and completely ignore Total: (it's never used) or read up until
Total: and discard it later.

What I was hoping is that someone has already done a generic split
string on string kinda thing so that when someone eventually takes a
look at my spaghetti code they won't decide to fire me on the spot ;-)

Stop writing code (of whatever pasta variety). You have *got* to work
out the format of the data. The reason it has turned to spaghetti is that
you don't know what it's supposed to do. How can you write a program to do
something you can't do yourself?

Pramod Subramanyan · Aug 24, 2005

Customer name, date, first memo
field, another memo field that has no CR's, and another
memo field that
will be split across a
number of
lines and may well have any unquoted comma thrown
in just for fun

Total: <-- this is always at the end

The plan goes like this:

1. Use a state variable to keep track of what you're reading now.
2. Use a switch to handle similar states.
3. Inside the switch, read on until you reach the terminating condition
for this state.

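OK, I'd write some rough code based on this as:

enum lexer_state { CNAME, LDATE, MEMO1, MEMO2, MEMO3, LDONE };
enum lexer_state cstate = CNAME;
int ch;

/* infile is assumed to be a FILE * already opened for reading. */
while (cstate != LDONE && (ch = fgetc(infile)) != EOF) {
    switch (cstate) {
    case CNAME:
    case LDATE:
        /* Read on until a ',' is reached, then move to the next state. */
        if (ch == ',')
            cstate = (cstate == CNAME) ? LDATE : MEMO1;
        break;
    case MEMO1:
        /* Code to read memo 1 */
        break;
    default:
        /* Write the rest of the code yourself */
        break;
    }
}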
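
The plan above can be made concrete as a small, self-contained sketch. It
is not the poster's code: it works on a string already in memory rather
than on a FILE *, it collapses the three memo states into one MEMO state,
and it assumes (since the thread never settles the real format) that only
the first two commas separate fields and that a blank line ("\r\n\r\n")
ends the record. The names record, append_char and parse_record and the
buffer sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum lexer_state { CNAME, LDATE, MEMO, LDONE };

struct record {
    char name[64];
    char date[32];
    char memo[256];
};

/* Append c to a NUL-terminated buffer of capacity cap.
   Characters that do not fit are silently dropped. */
static void append_char(char *dst, size_t cap, char c)
{
    size_t n = strlen(dst);
    if (n + 1 < cap) {
        dst[n] = c;
        dst[n + 1] = '\0';
    }
}

/* Parse one record from text into *out and return the state the
   lexer ended in: LDONE if the blank-line terminator was seen,
   otherwise the state in which the input ran out. */
static enum lexer_state parse_record(const char *text, struct record *out)
{
    enum lexer_state state = CNAME;
    const char *p;

    assert(text != NULL && out != NULL);
    out->name[0] = out->date[0] = out->memo[0] = '\0';

    for (p = text; *p != '\0' && state != LDONE; p++) {
        switch (state) {
        case CNAME:
            if (*p == ',')
                state = LDATE;
            else
                append_char(out->name, sizeof out->name, *p);
            break;
        case LDATE:
            if (*p == ',')
                state = MEMO;
            else
                append_char(out->date, sizeof out->date, *p);
            break;
        case MEMO:
            /* Assumption: "\r\n\r\n" ends the record. Commas and single
               line breaks are kept as part of the memo text. */
            if (strncmp(p, "\r\n\r\n", 4) == 0)
                state = LDONE;
            else
                append_char(out->memo, sizeof out->memo, *p);
            break;
        case LDONE:
            break;
        }
    }
    return state;
}
```

A return value other than LDONE tells the caller that the input ended in
the middle of a record. Leading spaces after a comma are not trimmed here,
and telling one memo field from the next would need a rule that the sample
data does not supply.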
