tokenising a string using another string

Discussion in 'C Programming' started by Mark, Aug 24, 2005.

  1. Mark

    Mark Guest

    I've got a really messy text file that I need to work on, and the only
    thing separating the records is either "\r\n\r\n" or "Total:".

    I figure I won't be able to use strtok because it will split the string
    when it matches any rather than all of the chars in the delimiter.

    Is there an easy way to split it based on a string (read char*) rather
    than a char?

    TIA
    Mark
    Mark, Aug 24, 2005
    #1
  2. Mark

    Suman Guest

    Mark wrote:
    > I've got a really messy text file that I need to work on and the only
    > things separating each record is either "\r\n\r\n" or "Total:".


    Can we have some more information, here? It sure is messy,
    but my premise is it contains some information, otherwise you
    wouldn't be splitting your hairs on this. And if it contains
    some specific information, then there will be some structure
    to it. Maybe then you can read a char at a time, build some
    tokens out of them, take the ones you need and do whatever
    that needs to be done.

    Or, am I mistaken, and you have tried all of this out and failed?

    > I figure I won't be able to use strtok because it will split the string
    > when it matches any rather than all of the chars in the delimiter.


    This can probably wait until we have identified which tokens
    we have to find; then we can proceed accordingly.

    > Is there an easy way to split it based on a string (read char*) rather
    > than a char?


    Read them via fgets() and use sscanf() or your own hand-spun lexer.

    > TIA
    > Mark
    Suman, Aug 24, 2005
    #2
  3. Mark

    Mark Guest

    Suman wrote:
    > Can we have some more information, here?

    [snip]

    It's supposed to be a CSV export from MYOB, but a few memo fields
    contain carriage returns etc., so I can't simply read up to \r\n
    and assume that that is one record.

    It might go something like this...

    Customer name, date, first memo
    field, another memo field that has no CR's, and another
    memo field that
    will be split across a
    number of
    lines and may well have any unquoted comma thrown
    in just for fun

    Total: <-- this is always at the end

    I can't change the data coming out and I can't really change the data
    going in because it's coming out of an accounting system.

    What I thought I could probably do was either read up until the first
    \r\n\r\n and completely ignore Total: (it's never used) or read up until
    Total: and discard it later.

    What I was hoping is that someone has already done a generic split
    string on string kinda thing so that when someone eventually takes a
    look at my spaghetti code they won't decide to fire me on the spot ;-)

    Mark
    Mark, Aug 24, 2005
    #3
  4. Mark

    Suman Guest

    Mark wrote:
    > Suman wrote:
    > > Can we have some more information, here?

    > [snip]
    >
    > It's supposed to be a CSV export from MYOB but there are a few memo


    CSV = Comma separated values? What is MYOB?

    > field that have carriage returns etc so I can't easily read until \r\n
    > and assume that that is one record.
    >
    > It might go something like this...
    >
    > Customer name, date, first memo
    > field, another memo field that has no CR's, and another
    > memo field that
    > will be split across a
    > number of
    > lines and may well have any unquoted comma thrown
    > in just for fun
    > Total: <-- this is always at the end


    This is what I was talking about :)
    So maybe you can actually write your own crude grammar: viz.
    Record_Set -> Record Record_set
    |'Total:'

    Record -> Cust_name ',' Date ',' Memo_fields

    Memo_fields -> Memo_field ',' Memo_fields
    | Memo_field

    Memo_field -> ...
    Cust_name -> ...

    ... and then find what the *tokens* are. And then write your own
    lexer -- that will scan the input for The Chosen Ones!
    > I can't change the data coming out and I can't really change the data
    > going in because it's coming out of an accounting system.
    >
    > What I thought I could probably do was either read up until the first
    > \r\n\r\n and completely ignore Total: (it's never used) or read up until
    > Total: and discard it later.


    Are you sure you are not missing the forest for the trees?
    I mean I do not understand your preoccupation with `\r\n'.
    Not to demean you or something, just that I can't fathom why it
    is so important.

    > What I was hoping is that someone has already done a generic split
    > string on string kinda thing so that when someone eventually takes a
    > look at my spaghetti code they won't decide to fire me on the spot ;-)


    I don't have any :/
    > Mark
    Suman, Aug 24, 2005
    #4
  5. Mark

    Richard Bos Guest

    Mark wrote:

    > I've got a really messy text file that I need to work on and the only
    > things separating each record is either "\r\n\r\n" or "Total:".
    >
    > I figure I won't be able to use strtok because it will split the string
    > when it matches any rather than all of the chars in the delimiter.
    >
    > Is there an easy way to split it based on a string (read char*) rather
    > than a char?


    Not pre-made. You'll have to search for the strings yourself, using
    strstr().

    Richard
    Richard Bos, Aug 24, 2005
    #5
  6. Mark wrote:
    > Suman wrote:


    > > Can we have some more information, here?


    > It's supposed to be a CSV export from MYOB but there are a few memo
    > field that have carriage returns etc so I can't easily read until \r\n
    > and assume that that is one record.
    >
    > It might go something like this...


    "might" is not a word I like to see in interface specifications...


    > Customer name, date, first memo
    > field, another memo field that has no CR's, and another
    > memo field that
    > will be split across a
    > number of
    > lines and may well have any unquoted comma thrown
    > in just for fun
    >
    > Total: <-- this is always at the end


    So how do you know when one "memo field" ends and the next one begins?


    > I can't change the data coming out and I can't really change the data
    > going in because it's coming out of an accounting system.
    >
    > What I thought I could probably do was either read up until the first
    > \r\n\r\n and completely ignore Total: (it's never used) or read up until
    > Total: and discard it later.
    >
    > What I was hoping is that someone has already done a generic split
    > string on string kinda thing so that when someone eventually takes a
    > look at my spaghetti code they won't decide to fire me on the spot ;-)


    Stop writing code (of whatever pasta variety). You have *got* to work
    out the format of the data. The reason it has turned to spaghetti is that
    you don't know what it's supposed to do. How can you write a program to
    do something you can't do yourself?


    --
    Nick Keighley
    Nick Keighley, Aug 24, 2005
    #6
  7. > Customer name, date, first memo
    > field, another memo field that has no CR's, and another
    > memo field that
    > will be split across a
    > number of
    > lines and may well have any unquoted comma thrown
    > in just for fun
    >
    > Total: <-- this is always at the end



    The plan goes like this:

    1. Use a state variable to keep track of what you're reading now.
    2. Use a switch to handle similar states.
    3. Inside the switch, read on until you reach the terminating condition
    for this state.

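    OK, rough code based on this might look like:

    enum lexer_state { CNAME, LDATE, MEMO1, MEMO2, MEMO3, LDONE };
    enum lexer_state cstate = CNAME;
    int c;

    /* Test the result of getc() rather than feof(), which only
       becomes true after a read has already failed. */
    while (cstate != LDONE && (c = getc(infile)) != EOF) {
        switch (cstate) {
        case CNAME:
        case LDATE:
            /* Read on until a ',' is reached, then move to the next state. */
            if (c == ',')
                cstate = (cstate == CNAME) ? LDATE : MEMO1;
            break;
        case MEMO1:
            /* Code to read memo 1 */
            break;
        default:
            /* Write the rest of the code yourself :) */
            break;
        }
    }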
    Pramod Subramanyan, Aug 24, 2005
    #7