humongous flat file

Discussion in 'XML' started by Dennis Farr, Aug 7, 2003.

  1. Dennis Farr

    Dennis Farr Guest

    It has been suggested that rather than convert an already large flat
    file, with many similar rows, to XML, some type of header be attached
    to the file, containing some sort of meta-XML description of the rows
    that follow. The hope is that the result will not grow as large as a
    pure XML file, but still be easy to exchange. Multiple vendors would
    still be able to track format changes easily. The size of the flat
    file, without XML, is already an issue.

    If it is not already apparent, I'm new to XML. Does anything like this
    already exist? Thanks.

    Dennis Farr
    Treefrog Enterprises

    -- "Can giraffes swim?" --
     
    Dennis Farr, Aug 7, 2003
    #1

  2. Dennis Farr

    Andy Dingley Guest

    On 7 Aug 2003 07:39:57 -0700, (Dennis Farr) wrote:

    >If it is not already apparent, I'm new to XML. Does anything like this
    >already exist? Thanks.


    It's a bad idea, don't do it. These ideas were popular in the last
    century, when the "verbosity" of XML was seen as a problem.

    It isn't. Get over it.


    If you want to do XML, then do it. It's not rocket science.

    Don't invent some whacko new pseudo-XML protocol to fix problems that
    aren't there.

    If you hate XML, then just say so. Enjoy your punch cards.
     
    Andy Dingley, Aug 7, 2003
    #2

  3. "Dennis Farr" <> wrote in message
    news:...
    > It has been suggested that rather than convert an already large flat
    > file, with many similar rows, to XML, some type of header be attached
    > to the file, containing some sort of meta-XML description of the rows
    > that follow. The hope is that the result will not grow as large as a
    > pure XML file, but still be easy to exchange. Multiple vendors would
    > still be able to track format changes easily. The size of the flat
    > file, without XML, is already an issue.
    >
    > If it is not already apparent, I'm new to XML. Does anything like this
    > already exist? Thanks.
    >
    > Dennis Farr
    > Treefrog Enterprises
    >
    > -- "Can giraffes swim?" --


    If your flat file contains fixed-length records and the data is textual,
    you may already be carrying overhead in redundant trailing spaces. Those
    spaces would not be carried over to the XML file, so you may see a
    significant reduction in file size. There is also no need to be overly
    verbose in your XML tag names: for instance, a <CustomersSurname> tag can
    be reduced to <CS>, as long as tag names remain unique. Descriptive tag
    names are irrelevant to storing the data; an end application can supply
    the wordy descriptions.

    Denis
     
    Denis Saunders, Aug 8, 2003
    #3
  4. Dennis Farr

    Dennis Farr Guest

    "Denis Saunders" <> wrote in message news:<bgvbf4$ng$>...
    > If your flat file contains fixed length records and the data is textual then
    > you may already have existing overheads with redundant trailing spaces.
    > These spaces would not be carried over to the XML file, hence you may have a
    > large or some significant reduction in file size. There is no need to be
    > overly verbose in your XML tag names for instance <CustomersSurname> tag can
    > be reduced to <CS> as long as you keep uniqueness. Descriptive tag names are
    > irrelevant to storing the data. An end application can provide the wordy
    > descriptives.
    >
    > Denis


    Thanks. My data files are a mixture of rows from several database
    tables and for the most part there is no white space but tens of
    (mostly short and fixed length and encoded) columns per table, so the
    shortest tag names would at least double the size of the file.

    It would be nice to give an XML-like skeleton for each type of
    database row at the top of the file, and then just tag the records as
    to which table they come from, and then use the appropriate skeleton
    to parse the text in the tag. There may be thousands to tens of
    thousands of rows of each type, so the size savings would be
    considerable if we could do this, and if there is a way to do this and
    stay within established standards, that would make my day.

    I know it is a bit stone-age to complain about storage space, but that
    depends on the details of the applications, and quadrupling the size
    of a really large file can still be expensive. Size also affects
    transmission time, especially if encryption is involved. I'm not
    knocking XML, I'm hoping to make XML more attractive to more people.
     
    Dennis Farr, Aug 8, 2003
    #4
  5. "Dennis Farr" <> wrote in message
    news:...
    > "Denis Saunders" <> wrote in message news:<bgvbf4$ng$>...
    > > If your flat file contains fixed length records and the data is textual then
    > > you may already have existing overheads with redundant trailing spaces.
    > > These spaces would not be carried over to the XML file, hence you may have a
    > > large or some significant reduction in file size. There is no need to be
    > > overly verbose in your XML tag names for instance <CustomersSurname> tag can
    > > be reduced to <CS> as long as you keep uniqueness. Descriptive tag names are
    > > irrelevant to storing the data. An end application can provide the wordy
    > > descriptives.
    > >
    > > Denis
    >
    > Thanks. My data files are a mixture of rows from several database
    > tables and for the most part there is no white space but tens of
    > (mostly short and fixed length and encoded) columns per table, so the
    > shortest tag names would at least double the size of the file.


    It seems like you really really really want to use csv,
    but also get the seal of approval as xml.
    Advantage of xml is that there are a lot of parsers for reading it.
    If you kludge up the content, you lose that.
    However, you can do

    <everything>
    <file1>
    <row-csv>1,2,3333</row-csv>
    </file1>
    </everything>

    Also, you can add the csv headings.
    Highly unrecommended.
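    For what it's worth, a hybrid file in that shape can still be read with a
    standard XML parser plus the csv module. A minimal sketch (element names
    and sample values follow the snippet above; everything else is illustrative):

```python
import csv
import io
import xml.etree.ElementTree as ET

# A document in the hybrid shape described above: each <row-csv>
# element carries one comma-separated record as plain text.
doc = """<everything>
  <file1>
    <row-csv>1,2,3333</row-csv>
    <row-csv>4,5,6666</row-csv>
  </file1>
</everything>"""

root = ET.fromstring(doc)
rows = []
for elem in root.iter("row-csv"):
    # Reuse the csv module so quoting and escaping rules stay consistent.
    rows.extend(csv.reader(io.StringIO(elem.text)))

print(rows)  # [['1', '2', '3333'], ['4', '5', '6666']]
```

    The XML parser still validates the envelope, but the field values inside
    each row are opaque to it, which is exactly the loss being warned about.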

    > I know it is a bit stone-age to complain about storage space, but that
    > depends on the details of the applications, and quadrupling the size
    > of a really large file can still be expensive. Size also affects
    > transmission time, especially if encryption is involved. I'm not
    > knocking XML, I'm hoping to make XML more attractive to more people.


    Don't forget compression. All the repetitive tags are reduced to a few bits
    each.
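    The effect is easy to check. A small sketch (the tag names and repeat
    count are made up for illustration) comparing verbose and terse tags
    before and after compression:

```python
import zlib

# 10,000 identical records with a long tag name vs. a 2-character one.
verbose = ("<rows>" + "<CustomersSurname>Smith</CustomersSurname>" * 10000
           + "</rows>").encode()
terse = ("<rows>" + "<CS>Smith</CS>" * 10000 + "</rows>").encode()

vc = len(zlib.compress(verbose, 9))
tc = len(zlib.compress(terse, 9))

print(len(verbose), len(terse))  # raw sizes differ by a large factor
print(vc, tc)                    # compressed sizes end up far closer
```

    The repeated long tag costs almost nothing once deflate has seen it a
    few times, which is the point being made here.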
     
    Steven Dilley, Aug 8, 2003
    #5
  6. Dennis Farr

    Andy Dingley Guest

    On 8 Aug 2003 09:15:09 -0700, (Dennis Farr) wrote:

    >Size also affects
    >transmission time, especially if encryption is involved.


    No it doesn't. If there are repeated strings in the file, then it
    improves compression efficiency. All significant transmissions are
    compressed these days, so this verbosity just doesn't matter in
    practice. This "XML is inefficient, so use cryptic 2-character element
    names" approach is completely bogus.
     
    Andy Dingley, Aug 8, 2003
    #6
  7. Dennis Farr

    Ed Beroset Guest

    compressing XML (was: humongous flat file)

    Andy Dingley wrote:
    > On 8 Aug 2003 09:15:09 -0700, (Dennis Farr) wrote:
    >
    >
    >>Size also affects
    >>transmission time, especially if encryption is involved.

    >
    >
    > No it doesn't. If there are repeated strings in the file, then it
    > improves compression efficiency. All significant transmissions are
    > compressed these days, so this verbosity just doesn't matter in
    > practice. This "XML is inefficient, so use cryptic 2-character element
    > names" approach is completely bogus.



    Have you tried testing that hypothesis? I have, and although I hate
    cryptic 2-character element names just as much as you do, the fact is
    that it actually does compress better. Here's a link to an IBM site
    which illustrates this using test data:

    http://www-106.ibm.com/developerworks/xml/library/x-matters13.html

    Note, however, that there are probably better ways to address this than
    the method mentioned in the article. One possibility might be

    http://www.w3.org/TR/wbxml/

    It's worth noting that this is NOT a w3c recommendation. It's also
    worth noting that I haven't actually ever tried wbxml, so you can
    consider this my own untested hypothesis and treat it accordingly! :)

    I would be interested to hear from those who have successfully used
    alternative encodings for XML, especially ones for which the size
    reduction was a primary motivation.

    Ed
     
    Ed Beroset, Aug 9, 2003
    #7
  8. Dennis Farr

    Andy Dingley Guest

    Re: compressing XML (was: humongous flat file)

    On Fri, 08 Aug 2003 22:40:57 -0400, Ed Beroset
    <> wrote:

    >Have you tried testing that hypothesis?


    Yes, about 4 years ago - it's last century's problem.

    Even then, I was juggling XML and rich-media. XML is primarily a
    format for text content, so it's just _tiny_ in comparison to any
    image or video data. There's just no point in worrying over element
    name lengths, when there are JPEGs on the same server.

    Mainly I work in RDF. Fairly long names, lots of repetition of
    properties like "type", and honking great URIs all over the place.
    Switching <foo> to <fo> isn't going to make a blind bit of difference.

    Now encoding schemes for embedding binary data into XML content, now
    that's an issue worth saving bytes over.
     
    Andy Dingley, Aug 9, 2003
    #8
  9. Dennis Farr

    Ed Beroset Guest

    Re: compressing XML (was: humongous flat file)

    Andy Dingley wrote:
    > On Fri, 08 Aug 2003 22:40:57 -0400, Ed Beroset
    > <> wrote:
    >
    >
    >>Have you tried testing that hypothesis?

    >
    > Yes, about 4 years ago - it's last century's problem.
    >
    > Even then, I was juggling XML and rich-media. XML is primarily a
    > format for text content, so it's just _tiny_ in comparison to any
    > image or video data.


    I don't think that's the kind of data the OP had in mind. In the
    context of video data, it might indeed be tiny by comparison, but I
    suspect that most of us work with "last century's data" and so we still
    think about things like bandwidth, efficiency, and other anachronistic
    concepts of engineering.

    > Mainly I work in RDF. Fairly long names, lots of repetition of
    > properties like "type", and honking great URIs all over the place.
    > Switching <foo> to <fo> isn't going to make a blind bit of difference.


    In that context, maybe not, but let's try an experiment with real data
    of the non-RDF variety.

    The experiment:

    I chose the Wake County, North Carolina voter database as the source for
    my sample data. It's freely downloadable from the web, contains very
    typical kind of name and address data, and is large enough (with 415613
    records) to be able to draw some useful conclusions. I extracted the
    first five fields of each record of that plain-text database which the
    state government labels voter_reg_number, last_name, first_name,
    midl_name, and name_sufx. I think those are sufficiently expressive
    names that we'd all be able to guess their meanings without a second
    thought, so I used them as tag names, too. Wrapping each record up in
    <voter></voter> delimiters and the whole thing in <voters></voters>
    tags, and minimal other stuff, my test file turns out to be 60685379
    bytes long using an 8-bit encoding and Unix-style line endings (one per
    record).
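    A sketch of how such a test file could be generated (the field names are
    the ones listed above; the record-building function and sample values
    are my own illustration, not Ed's actual script):

```python
import xml.etree.ElementTree as ET

# Field names as labelled by the state government, reused as tag names.
FIELDS = ["voter_reg_number", "last_name", "first_name",
          "midl_name", "name_sufx"]

def build_voters_xml(records):
    """Wrap each record in <voter> and the whole set in <voters>."""
    root = ET.Element("voters")
    for rec in records:
        voter = ET.SubElement(root, "voter")
        for name, value in zip(FIELDS, rec):
            ET.SubElement(voter, name).text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample record, just to show the resulting shape.
sample = [("000123", "Farr", "Dennis", "", "")]
print(build_voters_xml(sample))
```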

    Compression:

    First, I tried various techniques to reduce the size of the XML file.

    The original file is voters1.xml and each voter record has these fields:
    voter_reg_number, last_name, first_name, midl_name, and name_sufx

    The second file is voters2.xml and each voter record has these fields:
    reg_number, last_name, first_name, midl_name, and name_sufx
    (The change is that voter_reg_number became just reg_number.)

    The third file is voters3.xml and each voter record has these fields:
    reg_number, name
    Within name there are four fields: last, first, midl, and sufx
    (The change is that name now has subfields.)

    The fourth file is voters4.xml and each voter record has these fields:
    reg_number, foo
    Within foo there are four fields: last, first, midl, and sufx
    (The change is that name is changed to foo.)

    The fifth file is voters5.xml and each voter record has these fields:
    reg_number, fo
    Within fo there are four fields: last, first, midl, and sufx
    (The change is that foo is changed to fo.)

    Here are the sizes and names of the files generated:

    60685379 voters1.xml
    55697543 voters2.xml
    44474912 voters3.xml
    43643606 voters4.xml
    42812300 voters5.xml

    18250 voters1.xml.bz2
    17519 voters2.xml.bz2
    14251 voters3.xml.bz2
    13921 voters4.xml.bz2
    12520 voters5.xml.bz2

    I'll leave it to you to analyze all the details, since I've provided all
    the data to do that, but I thought I'd point out a couple of salient
    points. Just a judicious use of shorter tags gives a compressed file
    that's 22% smaller (voters3.xml.bz2 compared to voters1.xml.bz2) and no
    less comprehensible by humans. Also, note that contrary to your guess,
    a change of a single tag from <foo> to <fo> yields a 10% decrease in
    size in the compressed files (voters5.xml.bz2 compared to
    voters4.xml.bz2) even though the uncompressed versions of those files
    only decreased in size by less than 2%.

    Conclusions:
    1. Using shorter tags may indeed save transmission time.
    2. Restructuring "flat" data may give better results without sacrificing
    clarity to human readers.
    3. Sometimes results are counterintuitive and data-dependent. Measuring
    effects on your actual data and comparing those to the engineering
    problem to be solved is the only sure way to proceed.

    I hope that helps clarify things. If anyone would like to duplicate
    this experiment, you can find the raw data at
    http://msweb03.co.wake.nc.us/bordelec/Waves/WavesOptions.asp

    Ed
     
    Ed Beroset, Aug 9, 2003
    #9
  10. Dennis Farr

    Andy Dingley Guest

    On 11 Aug 2003 06:55:01 -0700, (Dennis Farr) wrote:

    >When the data is as voluminous as, for example, an individual's
    >genetic makeup on the back of a health card, what if the space taken
    >up by the XML tags is much larger


    What indeed. Moore's Law. Throw some hardware at it.

    The problem is not about storing this stuff. My mobile phone gives a
    gazillion bytes over to just storing ring tones. I don't even know how
    big the HD in my laptop is, it's just "big". Storage is not today's
    big problem.

    Now go to a library and work with MARC records for a while (or SS7, or
    almost anything where ASN.1 has played a part). Then find some old
    records from such a system and try to make sense of them. Chances are
    you can't. This is a serious problem. Find a digital dataset that's
    over 10 years old and read it. The failure rate is terrifying (read up
    on the BBC's Domesday Disk project).

    I don't give a damn about storage size - not my problem, I've got
    computers to do that for me. What I care about is future human
    understandability, or if I'm really lucky, machine understandability.

    >Is that the next logical step of evolution after XML?
    >Bioinformatics is just one example of really huge data files


    Go take a look at Stanford's Protege project.

    Or RDF, or DAML, or OWL


    >http://msdn.microsoft.com/library/d...-us/csvr2002/htm/cs_rp_xmlrefbizdesk_riqj.asp
    >seems to be on the right track. But I would prefer open source.


    Right track ? It's not even leaving the station.

    This is a regular approach to the problem and it's more bogus than a
    Cayman Islands $3 bill. Taking the dataset (with the implicit
    assumption that all XML data is extracted from an RDBMS) and then
    labelling it as "row/column" adds nothing to the semantics of the
    representation; it just perpetuates the database structure you've
    pulled it from. It's no better than CSV!

    XML has a restrictive data model. It's a single-rooted tree, when the
    real world is more like a directed graph. But even so, it's a lot more
    expressive than this narrow "everything is a rectangular grid"
    approach.
     
    Andy Dingley, Aug 11, 2003
    #10
  11. Re: compressing XML (was: humongous flat file)

    "Ed Beroset", Andy Dingley and Dennis Farr wrote:
    > >
    > > >Size also affects
    > > >transmission time, especially if encryption is involved.

    > >
    > > No it doesn't. If there are repeated strings in the file, then it
    > > improves compression efficiency. All significant transmissions are
    > > compressed these days, so this verbosity just doesn't matter in
    > > practice. This "XML is inefficient, so use cryptic 2-character element
    > > names" approach is completely bogus.

    >


    This depends on the sequence: encrypt-then-compress does poorly, because
    the repetitive tags are transformed into dissimilar strings, and those
    don't compress. Compress-then-encrypt is as good as plain compression.
    Q: Which order is actually used? What does https do? What if the
    source files are encrypted already?
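    The ordering effect is easy to demonstrate. A sketch using XOR with a
    random keystream as a stand-in for encryption (real ciphers likewise make
    their output look random, which is all that matters for this comparison;
    the data and sizes are made up):

```python
import os
import zlib

data = b"<voter><last_name>Smith</last_name></voter>" * 1000

# Stand-in "encryption": XOR with a one-time random keystream.
key = os.urandom(len(data))
encrypted = bytes(a ^ b for a, b in zip(data, key))

compress_first = len(zlib.compress(data))       # compress-then-encrypt path
encrypt_first = len(zlib.compress(encrypted))   # encrypt-then-compress path

print(len(data), compress_first, encrypt_first)
# The plaintext shrinks dramatically; the pre-encrypted stream barely
# compresses at all.
```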

    >
    > Have you tried testing that hypothesis? I have, and although I hate
    > cryptic 2-character element names just as much you do, the fact is that
    > it actually does compress better. Here's a link to an IBM site which
    > illustrates this using test data:
    >
    > http://www-106.ibm.com/developerworks/xml/library/x-matters13.html
    >


    Very interesting analysis. To get the max compression, it looks like
    we need to compress before sending, rather than relying on the comm
    link to choose compression for us.

    --
    Steve
     
    Steven Dilley, Aug 11, 2003
    #11
