Re: Handling delimited strings

Discussion in 'C++' started by michael@preece.net, Oct 29, 2005.

  1. Guest

    wrote:
    > One problem I see: your delimiters being in the range 128..255 rather
    > assume that "real" file identifiers will not contain characters with
    > these values, and this is not the case in Windows.
    >
    > International experience (the use of Chinese characters in Windows file
    > identifiers) has shown me that owing to the proprietary character of
    > Windows, the file identifier's syntax was never defined, to my
    > knowledge, formally and instead a minimal file syntax applies where ANY
    > unicode character other than the semicolon, backslash, asterisk and
    > question mark can be and will be accepted by most Windows installations
    > as part of the file id.
    >
    > It is well known also that the period doesn't left-delimit the file
    > type; instead the file name to the left of the type can contain
    > multiple periods, with the rightmost period delimiting the type.
    >
    > If M$ means Microsoft, then I suggest you BNF formulate the minimal
    > syntax of a file identifier and use this to parse the file identifier.
    >


    Sorry. I'm a bit confused. I was only looking for something to handle
    delimited text strings within a single file. How do M$'s file naming
    "conventions" come into it? Were you expanding on the idea of using
    ReiserFS instead of a program? I realise that the characters within
    each string will be limited to the ASCII chars 0-127 inclusive (except
    that I'd also like to exclude char0).

    If you're wondering where I'm heading with this, think of nested data -
    like XML (only far more compact). I guess you could say that any
    characters allowed in XML should be allowed. Further, think of two
    associated delimited strings - one to hold markup etc., the other the
    data.

    Mike.
    , Oct 29, 2005
    #1
  2. On 28 Oct 2005 23:16:53 -0700
    wrote:

    > If you're wondering where I'm heading with this, think of nested data -
    > like XML (only far more compact).


    If that's the goal look into ASN1.

    --
    C:>WIN                       | Directable Mirror Arrays
    The computer obeys and wins. | A better way to focus the sun
    You lose and Bill collects.  | licences available see
                                 | http://www.sohara.org/
    Steve O'Hara-Smith, Oct 29, 2005
    #2
  3. Guest

    wrote:
    > wrote:
    > > One problem I see: your delimiters being in the range 128..255 rather
    > > assume that "real" file identifiers will not contain characters with
    > > these values, and this is not the case in Windows.
    > >
    > > International experience (the use of Chinese characters in Windows file
    > > identifiers) has shown me that owing to the proprietary character of
    > > Windows, the file identifier's syntax was never defined, to my
    > > knowledge, formally and instead a minimal file syntax applies where ANY
    > > unicode character other than the semicolon, backslash, asterisk and
    > > question mark can be and will be accepted by most Windows installations
    > > as part of the file id.
    > >
    > > It is well known also that the period doesn't left-delimit the file
    > > type; instead the file name to the left of the type can contain
    > > multiple periods, with the rightmost period delimiting the type.
    > >
    > > If M$ means Microsoft, then I suggest you BNF formulate the minimal
    > > syntax of a file identifier and use this to parse the file identifier.
    > >

    >
    > Sorry. I'm a bit confused. I was only looking for something to handle
    > delimited text strings within a single file. How do M$'s file naming
    > "conventions" come into it? Were you expanding on the idea of using
    > ReiserFS instead of a program? I realise that the characters within
    > each string will be limited to the ASCII chars 0-127 inclusive (except
    > that I'd also like to exclude char0).


    OK, my mistake. Thought you were parsing a file name. You said "in
    filename" and not "in the file".
    >
    > If you're wondering where I'm heading with this, think of nested data -
    > like XML (only far more compact). I guess you could say that any
    > characters allowed in XML should be allowed. Further, think of two
    > associated delimited strings - one to hold markup etc., the other the
    > data.
    >



    > Mike.
    , Oct 30, 2005
    #3
  4. Guest

    wrote:

    >
    > OK, my mistake. Thought you were parsing a file name. You said "in
    > filename" and not "in the file".
    >


    I guess you see now that I meant "in the file called FILENAME". Sorry
    for the confusion. The capitals aren't meant to be read aloud btw - it's
    just a kind of notation that has become a habit, where the capitalized
    word relates to a declared variable (or constant). Well - I know what I
    mean ;-)

    Cheers
    Mike.
    , Oct 31, 2005
    #4
  5. Guest

    Steve O'Hara-Smith wrote:

    > > If you're wondering where I'm heading with this, think of nested data -
    > > like XML (only far more compact).

    >
    > If that's the goal look into ASN1.
    >


    Isn't ASN1 mostly about encoding data *type* - along with the data?
    That's a separate issue. I'm looking to handle nested delimited strings
    of any, or no specified, type. The data type (required for conversion
    to/from ASN1, say) of each delimited string, or group of strings, along
    with any other metadata such as markup, can be described or defined in
    an associated nested delimited string, or two, or three, or whatever.

    Nested data is all around: indented program code, newsgroups and the
    threads within them, folders/directories, etc. It would be nice
    to have a really simple way to represent and manipulate nested
    structures up to 128 levels deep - much simpler than ASN1 and much more
    compact than XML, and yet easily transformable into either, or any
    other, format.

    If you take any nested data in any format - XML is an obvious example -
    it should be possible to represent it as a simple delimited string as I
    described in my OP. It would be good, I reckon, if I (with a little
    help) can come up with simple cross-platform tools to perform the
    functions also described in my OP.
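
    As a rough illustration of the idea, here is a minimal Python sketch.
    It assumes a scheme this thread does not spell out: byte 255 separates
    items at the outermost level, 254 at the next level down, and so on to
    128, with data bytes restricted to 1..127. The names join_nested and
    split_nested are made up for the example.

```python
TOP_DELIMITER = 255     # assumed: delimiter for the outermost level
LOWEST_DELIMITER = 128  # assumed: 128 levels in total (255 down to 128)


def join_nested(tree, delimiter=TOP_DELIMITER):
    """Flatten a nested list of byte strings into one delimited byte string."""
    if isinstance(tree, bytes):
        if any(b == 0 or b > 127 for b in tree):
            raise ValueError("data bytes must be in the range 1..127")
        return tree
    if delimiter < LOWEST_DELIMITER:
        raise ValueError("nesting deeper than 128 levels")
    separator = bytes([delimiter])
    return separator.join(join_nested(item, delimiter - 1) for item in tree)


def split_nested(data, delimiter=TOP_DELIMITER):
    """Rebuild the nested list; a part with no delimiter bytes is a leaf."""
    if delimiter < LOWEST_DELIMITER:
        raise ValueError("delimiter bytes out of order")
    result = []
    for part in data.split(bytes([delimiter])):
        if any(b >= LOWEST_DELIMITER for b in part):
            # The part still contains lower-level delimiters: recurse.
            result.append(split_nested(part, delimiter - 1))
        else:
            result.append(part)
    return result


tree = [b"a", [b"b", b"c"]]
flat = join_nested(tree)       # b"a\xffb\xfec"
restored = split_nested(flat)  # [b"a", [b"b", b"c"]]
```

    One limitation of this particular scheme: a nested list holding a
    single leaf, such as [b"a", [b"b"]], flattens to the same bytes as
    [b"a", b"b"], so the two cannot be told apart when splitting.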

    Cheers
    Mike.
    , Oct 31, 2005
    #5
  6. On 30 Oct 2005 17:02:25 -0800
    wrote:

    >
    > Steve O'Hara-Smith wrote:
    >
    > > > If you're wondering where I'm heading with this, think of nested data -
    > > > like XML (only far more compact).

    > >
    > > If that's the goal look into ASN1.
    > >

    >
    > Isn't ASN1 mostly about encoding data *type* - along with the data?


    I looked around for some references to give you, but I found it
    hard to spot, in the current ASN.1 documentation, the nested
    tag-length-value mechanism I met as ASN.1 around 1990. I think it's
    still there under the hood of the standard types and constructions,
    though.

    The essence of what I was thinking about was nested TLV
    structures which always seemed to me to be more robust than the
    paired delimiters of XML.
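
    A minimal sketch of what nested TLV could look like, in Python. The
    layout is an assumption made for illustration (one type byte, a
    two-byte big-endian length, then the value) and is not the ASN.1
    encoding; the type codes and function names are hypothetical.

```python
import struct

TYPE_STRING = 0x01     # hypothetical type code for a leaf string
TYPE_CONTAINER = 0x02  # hypothetical type code for a nested list


def encode_tlv(node):
    """Encode a str, or a (nested) list of str, as type + length + value."""
    if isinstance(node, str):
        type_code, value = TYPE_STRING, node.encode("ascii")
    else:
        type_code = TYPE_CONTAINER
        value = b"".join(encode_tlv(child) for child in node)
    return struct.pack(">BH", type_code, len(value)) + value


def decode_tlv(data, offset=0):
    """Decode one element starting at offset; return (node, next_offset)."""
    type_code, length = struct.unpack_from(">BH", data, offset)
    start = offset + 3
    end = start + length
    if end > len(data):
        raise ValueError("length field runs past the end of the data")
    if type_code == TYPE_STRING:
        return data[start:end].decode("ascii"), end
    if type_code == TYPE_CONTAINER:
        children = []
        pos = start
        while pos < end:
            # Slicing to `end` stops a child from overrunning its container.
            child, pos = decode_tlv(data[:end], pos)
            children.append(child)
        return children, end
    raise ValueError(f"unknown type code {type_code}")


tree = ["a", ["b", "c"], "d"]
encoded = encode_tlv(tree)
decoded, consumed = decode_tlv(encoded)
```

    Because every value carries its own length, no byte values have to be
    reserved as delimiters, and the nesting depth is not limited by the
    number of spare byte values.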

    --
    C:>WIN                       | Directable Mirror Arrays
    The computer obeys and wins. | A better way to focus the sun
    You lose and Bill collects.  | licences available see
                                 | http://www.sohara.org/
    Steve O'Hara-Smith, Oct 31, 2005
    #6
  7. [Followups restricted to comp.programming.]

    In article <>, Steve O'Hara-Smith <> writes:
    >
    > The essence of what I was thinking about was nested TLV
    > structures which always seemed to me to be more robust than the
    > paired delimiters of XML.


    What would make TLV (by which I assume you mean type-length-value
    vectors, presumably with binary, fixed-length encodings for type and
    length) more robust than XML? It has less redundancy, and therefore
    less capacity for error detection and correction.

    A trivial example: say type is a single octet, and all 256 type codes
    are defined. Then it is impossible to detect if a type value is
    wrong (for whatever reason - program error, transmission error, etc),
    without additional context.
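
    The trivial example can be made concrete with a toy format, assumed
    here purely for illustration: one type octet, one length octet, then
    the value, with every type code treated as valid. For contrast, the
    same kind of one-character slip in an XML close tag is caught by
    Python's standard-library parser.

```python
import xml.etree.ElementTree as ET


def parse_record(record):
    """Split a toy record into (type_code, value)."""
    type_code, length = record[0], record[1]
    value = record[2:2 + length]
    if len(value) != length:
        raise ValueError("truncated value")
    return type_code, value


original = bytes([0x10, 3]) + b"abc"
# Flip a single bit in the type octet.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

# Both records parse without complaint; nothing in the record itself
# says which type code was intended.
good = parse_record(original)
bad = parse_record(corrupted)

# The open tag and close tag are redundant copies of each other, so a
# damaged close tag is detectable.
try:
    ET.fromstring("<name>abc</nbme>")
    xml_error_detected = False
except ET.ParseError:
    xml_error_detected = True
```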

    XML makes many tradeoffs, and there are certainly applications where
    a TLV encoding of some sort is preferable due to various plausible
    constraints. But TLV is not "more robust" than XML in general.

    That said, I agree that nested TLV structures look like a better
    choice for representing arbitrarily structured data than the OP's
    proposal of in-band signalling with special flag bytes. That means
    restricting the domain of ordinary data values, which means some
    kind of shift-encoding of values that are outside that domain, and
    that's invariably a mess, error-prone, difficult to enhance while
    maintaining backward compatibility, and inefficient.
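
    To show what such shift-encoding involves, here is a small Python
    sketch. The byte values are assumptions chosen for the example: 0xFF
    as the flag byte between fields, and 0xFE as a shift byte meaning
    "the next byte has had its top bit cleared". All function names are
    hypothetical.

```python
DELIMITER = 0xFF  # assumed flag byte separating fields
ESCAPE = 0xFE     # assumed shift byte


def escape_field(raw):
    """Rewrite data bytes that collide with the flag or shift byte."""
    out = bytearray()
    for b in raw:
        if b in (DELIMITER, ESCAPE):
            out += bytes([ESCAPE, b & 0x7F])
        else:
            out.append(b)
    return bytes(out)


def unescape_field(encoded):
    """Undo escape_field; a shift byte at the very end is an error."""
    out = bytearray()
    it = iter(encoded)
    for b in it:
        if b == ESCAPE:
            try:
                out.append(next(it) | 0x80)
            except StopIteration:
                raise ValueError("dangling escape byte") from None
        else:
            out.append(b)
    return bytes(out)


def pack_fields(fields):
    return bytes([DELIMITER]).join(escape_field(f) for f in fields)


def unpack_fields(packed):
    return [unescape_field(p) for p in packed.split(bytes([DELIMITER]))]


fields = [b"plain", b"\xff\xfe\x7f"]
packed = pack_fields(fields)
restored = unpack_fields(packed)
```

    Even in this small case the encoded length depends on the data, and
    adding another reserved byte later would change the meaning of
    already-stored values, which is the backward-compatibility problem
    mentioned above.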

    --
    Michael Wojcik

    Unfortunately, as a software professional, tradition requires me to spend New
    Years Eve drinking alone, playing video games and sobbing uncontrollably.
    -- Peter Johnson
    Michael Wojcik, Nov 1, 2005
    #7
  8. On 30 Oct 2005 17:02:25 -0800, wrote:

    >
    > Steve O'Hara-Smith wrote:
    >
    > > > If you're wondering where I'm heading with this, think of nested data -
    > > > like XML (only far more compact).

    > >
    > > If that's the goal look into ASN1.
    > >

    >
    > Isn't ASN1 mostly about encoding data *type* - along with the data?
    > That's a separate issue. I'm looking to handle nested delimited strings
    > of any, or no specified, type. The data type (required for conversion
    > to/from ASN1, say) of each delimited string, or group of strings, along
    > with any other metadata such as markup, can be described or defined in
    > an associated nested delimited string, or two, or three, or whatever.
    >

    Not inherently. ASN.1 is about encoding any structure defined in a
    (specified) data language.

    You could certainly do n-ary trees of character strings as an array
    whose elements are each (discriminated) either a string or
    (recursively) a tree of strings. And since these types have different
    primitive tags, you don't need any added application tags. IIRC (this
    may not be exactly right; I don't currently have tools or references
    at hand to check):

    -- alternative identifiers (leaf, subtree, branch) are illustrative
    StringTree ::= SEQUENCE OF CHOICE { leaf IA5String, subtree StringTree }

    or to include the (trivial) case of only one string

    StringTree ::= CHOICE { leaf IA5String, branch SEQUENCE OF StringTree }
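
    For a feel of what the second definition produces on the wire, here
    is a hand-rolled Python sketch of a DER-style encoder. It is not
    taken from any ASN.1 library: it only uses the universal tags for
    IA5String (0x16) and SEQUENCE (0x30) with definite lengths, and the
    function names are made up for the example.

```python
TAG_IA5STRING = 0x16  # universal tag 22, primitive
TAG_SEQUENCE = 0x30   # universal tag 16, constructed


def der_length(n):
    """Definite length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_string_tree(tree):
    """Encode a str leaf as IA5String, a list as SEQUENCE OF StringTree."""
    if isinstance(tree, str):
        tag = TAG_IA5STRING
        content = tree.encode("ascii")  # IA5String holds 7-bit characters
    else:
        tag = TAG_SEQUENCE
        content = b"".join(encode_string_tree(child) for child in tree)
    return bytes([tag]) + der_length(len(content)) + content


single = encode_string_tree("hi")
nested = encode_string_tree(["a", ["b"]])
```

    The two alternatives need no extra tagging because a decoder can
    tell them apart from the first octet alone: 0x16 for a leaf, 0x30
    for a subtree.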

    ASN.1 is frequently, I think probably more often than not, _used_ in
    applications where it is desirable to encode data with type to allow
    for extensibility and upgradability in distributed applications. For
    example in crypto applications, the ones I have mostly worked on, when
    we want to transmit or store a key, what is in the key depends on the
    algorithm used, and we know from experience that over time new
    algorithms will be created and wanted, so standards like X.509 and
    PKCS 4, 10, 8/12 have ASN.1 constructs roughly equivalent to:
    struct { OID-identifying-algorithm , data-depending-on-that-OID }

    That way when some subset of the users and systems add a new
    algorithm, the other ones can unambiguously recognize that it's
    something they don't know (yet); and with only a little care in
    defining the ASN.1 they can skip the data they don't understand, and
    as long as they don't actually need to process that data (only store
    or forward it etc.) can proceed OK without even being upgraded. This
    is useful for applications that want it, but not mandatory.
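
    The skipping behaviour can be sketched in Python as follows. This is
    a simplification under stated assumptions: real ASN.1 structures
    identify the algorithm with an OID, whereas this toy uses a one-byte
    type code and a two-byte big-endian length. The type codes, names
    and helper functions are all hypothetical.

```python
import struct

# Hypothetical type codes this (older) reader knows about.
KNOWN_TYPES = {0x01: "rsa-key", 0x02: "dsa-key"}


def make_record(type_code, value):
    """Build one type + length + value record."""
    return struct.pack(">BH", type_code, len(value)) + value


def read_known(stream):
    """Return (found, skipped): known records and unknown type codes."""
    found, skipped = [], []
    offset = 0
    while offset < len(stream):
        type_code, length = struct.unpack_from(">BH", stream, offset)
        value = stream[offset + 3:offset + 3 + length]
        if len(value) != length:
            raise ValueError("truncated record")
        if type_code in KNOWN_TYPES:
            found.append((KNOWN_TYPES[type_code], value))
        else:
            # Unknown type: the length alone is enough to step over it.
            skipped.append(type_code)
        offset += 3 + length
    return found, skipped


stream = (
    make_record(0x01, b"AAAA")
    + make_record(0x7F, b"data for a newer algorithm")
    + make_record(0x02, b"BB")
)
found, skipped = read_known(stream)
```

    The reader never interprets the unknown record's contents, so it can
    store or forward the whole stream unchanged without being upgraded.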

    That said, I basically concur with mwojcik: ASN.1 is _a_ choice, with
    advantages and disadvantages; there are others. One of the features,
    IMO often a disadvantage, it shares with XML is that both are designed
    very generally, to handle essentially everything anybody wants, so
    tools that handle that generality are usually complex and arguably
    bloated. But if you don't use those tools and develop your own more
    limited specific ones you (must) reimplement quite a few wheels.

    - David.Thompson1 at worldnet.att.net
    Dave Thompson, Nov 14, 2005
    #8