logic question for text file updates

Discussion in 'Perl Misc' started by ccc31807, Mar 26, 2010.

  1. ccc31807

    ccc31807 Guest

    We have a csv source file of many thousands of records, with two
    columns, the ID and a
    status field. It has very recently come to my attention that
    occasionally the status of a record will change, with the change being
    significant enough that the record must be updated before the process
    runs. The update files consist of a small subset, sometimes a very
    small subset, of the records in the source file. (The update file has
    a number of other fields that can change also, but I'm only concerned
    with the status field.)

    My first inclination is to open the update file, create a hash with
    the ID as the key and the status as value, then open the source file,
    read each line, update the line if it exists in the hash, and write
    each line to a new output file. However, I can think of several
    different ways to do this -- I just don't know which way would be
    best. I don't particularly want to read every line and write every
    line of a source file when only a few lines (if any) need to be
    modified.
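    For concreteness, that first approach would be something like this
    (just a sketch -- file names invented, and assuming plain
    comma-separated ID,status lines, which the real files may not
    exactly be):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of the first inclination: hash the update file, stream the
    # source. Assumes ID in column 1 and status in column 2 of both files.
    my %new_status;

    open my $upd, '<', 'update.csv' or die "update.csv: $!";
    while (<$upd>) {
        chomp;
        my ($id, $status) = (split /,/)[0, 1];   # ignore the other update fields
        $new_status{$id} = $status;
    }
    close $upd;

    open my $src, '<', 'source.csv'     or die "source.csv: $!";
    open my $out, '>', 'source.csv.new' or die "source.csv.new: $!";
    while (<$src>) {
        chomp;
        my ($id, $status) = split /,/;
        $status = $new_status{$id} if exists $new_status{$id};   # apply update, if any
        print {$out} "$id,$status\n";
    }
    close $src;
    close $out;
    # rename 'source.csv.new', 'source.csv';   # once the output looks right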

    My second inclination would be to use a database and write an update
    query for the records in the update file. But this seems a heavyweight
    solution to a lightweight problem -- I would only be using the
    database to modify records, not to do any of the things we ordinarily
    use databases for.

    I've never had to do a small number of updates to a large file before,
    and it seems too trivial a task to use a database for. Any suggestions
    on a better way to do this?

    Thanks, CC.

    P.S. - The end product of this process is a data file with
    approximately 20 fields, written as comma separated, double quote
    delimited text, designed to be imported into Excel and Access by end
    users in performance of their duties.
     
    ccc31807, Mar 26, 2010
    #1

  2. ccc31807

    Guest

    On Fri, 26 Mar 2010 11:14:36 -0700 (PDT), ccc31807 <> wrote:

    >We have a csv source file of many thousands of records, with two
    >columns, the ID and a status field.

    [snip]


    You don't have to write a new file back out to disk for a small change.
    You could design a disk that can grow or shrink its magnetic material
    on the fly, and just insert/remove metal sectors as needed.

    But, I think they trashed that idea when they invented fragmentation
    capabilities.

    To avoid fragmentation at the data level, you could redesign the
    file record so that a particular field of a record is fixed width
    relative to the surrounding fields, wide enough to hold the largest
    value that field could possibly encounter.

    And if the field is kept just large enough to accommodate all
    possible values, there is not that much "air" involved relative to
    the overall file size.

    Since the field is fixed width, its offset within the record is
    known and can be added to the position of the previous record's end,
    allowing you to write a new fixed-width value without overwriting
    the next field in that record.
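    In Perl terms, the mechanics might look something like this (purely a
    sketch -- the record length, offsets, and file name are invented here,
    just to show the seek-and-overwrite idea):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumed layout (invented): every record is exactly $RECLEN bytes,
    # and the status field occupies $STATUS_LEN bytes starting at byte
    # $STATUS_OFF within the record.
    my $RECLEN     = 64;
    my $STATUS_OFF = 8;
    my $STATUS_LEN = 20;

    # Overwrite the status of record number $recno (0-based) in place.
    sub set_status {
        my ($fh, $recno, $status) = @_;
        my $padded = sprintf '%-*s', $STATUS_LEN, $status;  # left-justify, space-pad
        die "status too long" if length($padded) > $STATUS_LEN;
        seek $fh, $recno * $RECLEN + $STATUS_OFF, 0 or die "seek: $!";
        print {$fh} $padded;
    }

    open my $fh, '+<', 'source.dat' or die "source.dat: $!";  # read/write, no truncate
    set_status($fh, 42, 'Current');                           # e.g. record 43
    close $fh;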

    -sln
     
    , Mar 26, 2010
    #2

  3. ccc31807

    ccc31807 Guest

    On Mar 26, 2:35 pm, wrote:
    > On Fri, 26 Mar 2010 11:14:36 -0700 (PDT), ccc31807 <> wrote:
    >
    > [snip]
    >
    > To avoid fragmentation at the data level, you could redesign the
    > file record so that a particular field of a record is fixed width
    > relative to the surrounding fields, wide enough to hold the largest
    > value that field could possibly encounter.
    >
    > [snip]
    >
    > Since the field is fixed width, its offset within the record is
    > known and can be added to the position of the previous record's end,
    > allowing you to write a new fixed-width value without overwriting
    > the next field in that record.
    >
    > -sln


    The key will always be a seven character integer. The value will
    always be a string with fewer than 20 characters. I COULD use a fixed
    width format, but my current format (for the source file) is pipe
    separated (e.g. 0059485|Current) and all my logic splits input on the
    pipe symbol.

    The keys are not consecutive, not ordered, and have large gaps: the
    seven-digit IDs span several million possible values, but the source
    file might hold only ten thousand records, in no particular order (is
    "randomly ordered" an oxymoron?). Treating the source file as an
    array would therefore require many more array elements than there are
    records in the file.

    CC.
     
    ccc31807, Mar 26, 2010
    #3
  4. ccc31807

    Guest

    On Fri, 26 Mar 2010 12:00:17 -0700 (PDT), ccc31807 <> wrote:

    >On Mar 26, 2:35 pm, wrote:
    >> On Fri, 26 Mar 2010 11:14:36 -0700 (PDT), ccc31807 <> wrote:


    [snip]

    >> >My first inclination is to open the update file, create a hash with
    >> >the ID as the key and the status as value, then open the source file,
    >> >read each line, update the line if it exists in the hash, and write
    >> >each line to a new output file.


    >> >P.S. - The end product of this process is a data file with
    >> >approximately 20 fields, written as comma separated, double quote
    >> >delimited text, designed to be imported into Excel and Access by end
    >> >users in performance of their duties.

    >>


    >> Since the field is fixed, the offset into the record to the
    >> field can be surmised and added to the location of the last
    >> record end position, allowing you to write a new fixed width
    >> value, guaranteeing not to overwrite the next field in that
    >> record.
    >>

    >
    >The key will always be a seven character integer. The value will
    >always be a string with fewer than 20 characters. I COULD use a fixed
    >width format, but my current format (for the source file) is pipe
    >separated (e.g. 0059485|Current) and all my logic splits input on the
    >pipe symbol.
    >
    >The keys are not consecutive, not ordered, and have large gaps: the
    >seven-digit IDs span several million possible values, but the source
    >file might hold only ten thousand records, in no particular order (is
    >"randomly ordered" an oxymoron?). Treating the source file as an
    >array would therefore require many more array elements than there are
    >records in the file.
    >
    >CC.


    So, if you have a source file delimited by '|' that you eventually
    turn into a double-quoted, comma-delimited CSV file, you could make
    that status field fixed width (what, 20 chars tops?) in the source.
    When you generate the CSV file, just strip whitespace from the
    beginning and end of the field before you double-quote it into the
    CSV.

    Source file:
    - fields all dynamic width except status (which is fixed 20 char).
    - format

    <field1>|<field2>|<field3>|<field4>|<- status, 20 char ->|<field6>|<field7>|<field_last>\n

    You know the file position of the previous EOR (end of record). Use
    index() to find the pipe '|' in front of the status field of the
    current record (the 4th field in the example), and add that offset to
    the previous EOR to get the write position for the new status (if it
    changed).

    To find out whether the status changed, do your split /\|/ to get all
    the fields, check the ID/status against the update file, and write
    the new "fixed width" status (formatted with printf or something)
    back to the source file.

    When it comes time to generate the csv from the source, just trim spaces before you
    write it out.
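    A rough sketch of that, assuming the source really is just id|status
    lines with the status padded to a fixed 20 characters, and that
    %new_status has already been built from the update file (the file
    name is invented):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $STATUS_LEN = 20;
    my %new_status = ( '0059485' => 'Inactive' );   # normally read from the update file

    open my $fh, '+<', 'source.txt' or die "source.txt: $!";
    my $rec_start = 0;                       # file position where the current record begins
    while (my $line = <$fh>) {
        my $next_start = tell $fh;           # start of the next record (previous EOR)
        chomp(my $rec = $line);
        my ($id, $status) = split /\|/, $rec;
        $status =~ s/\s+$//;                 # strip the padding before comparing
        if (exists $new_status{$id} and $new_status{$id} ne $status) {
            my $field_pos = $rec_start + index($rec, '|') + 1;   # offset of the status field
            seek $fh, $field_pos, 0 or die "seek: $!";
            printf {$fh} '%-*s', $STATUS_LEN, $new_status{$id};  # overwrite, space-padded
            seek $fh, $next_start, 0 or die "seek: $!";          # resume reading
        }
        $rec_start = $next_start;
    }
    close $fh;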

    -sln
     
    , Mar 26, 2010
    #4
  5. ccc31807

    Guest

    On Fri, 26 Mar 2010 19:38:03 +0000, Ben Morrow <> wrote:

    >

    [snip]
    >Is there any way of 'blanking' a record? Normal CSV doesn't support
    >comments, and if you're importing into Excel you can't extend it to do
    >so; what does Excel do if you give it a file like
    >
    > one|two|three
    > ||||||||||||||
    > four|five|six


    In this case, the newline is the record delimiter and '|' is the
    field delimiter. You can have Excel treat consecutive field
    delimiters as one, in which case the ||||||||||| line produces a
    blank record. This is the way Excel 2002 works; I don't know whether
    you can auto-remove blank records in newer versions, though.

    -sln
     
    , Mar 26, 2010
    #5
  6. ccc31807

    ccc31807 Guest

    On Mar 26, 3:42 pm, wrote:
    > So, if you have a source file delimited by '|' that you eventually
    > turn into a double-quoted, comma-delimited CSV file, you could make
    > that status field fixed width (what, 20 chars tops?) in the source.
    > When you generate the CSV file, just strip whitespace from the
    > beginning and end of the field before you double-quote it into the
    > CSV.


    Working backwards, my ultimate output file looks like this:
    "id","field2","field3","status","field5","field6"\n

    The 'status' field should be the current status, which rarely changes,
    but it's critical to use the most current status.

    I get about ten update files a year with the current status and a
    number of other fields that I don't care about. I take these files,
    strip out everything except the ID and the STATUS, and write that data
    into memory.

    Working forwards, I build a source file with the two fields I
    referenced, like this: [id|status]

    I went ahead and bit the bullet, since I had to do something. I (1)
    save the source file to a backup, (2) read the source file into a
    hash keyed on the IDs, (3) read in the update file the same way, and
    (4) print the hash back out to the source file. It's reasonably
    quick, less than a second (although I haven't benchmarked it), and
    seems to be reliable.
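    In outline the code amounts to something like this (a sketch -- the
    file names are invented, and the update file is assumed to already be
    cut down to id|status lines):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Copy qw(copy);

    copy 'source.txt', 'source.txt.bak' or die "backup failed: $!";  # (1) back up

    my %status;
    for my $file ('source.txt', 'update.txt') {                      # (2) and (3)
        open my $in, '<', $file or die "$file: $!";
        while (<$in>) {
            chomp;
            my ($id, $st) = split /\|/;
            $status{$id} = $st;        # the update file is read second, so it wins
        }
        close $in;
    }

    open my $out, '>', 'source.txt' or die "source.txt: $!";         # (4) write back
    print {$out} "$_|$status{$_}\n" for sort keys %status;           # sorted for determinism
    close $out;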

    That said, I'd like to learn a more elegant way to do it.

    CC.

    > You know the file position of the previous EOR (end of record). Use
    > index() to find the pipe '|' in front of the status field of the
    > current record (the 4th field in the example), and add that offset to
    > the previous EOR to get the write position for the new status (if it
    > changed).
    >
    > To find out whether the status changed, do your split /\|/ to get all
    > the fields, check the ID/status against the update file, and write
    > the new "fixed width" status (formatted with printf or something)
    > back to the source file.
    >
    > When it comes time to generate the csv from the source, just trim spaces before you
    > write it out.


    That sounds a lot more complicated than the brute force approach I
    used. But I appreciate your suggestion of treating the files as fixed
    width, and I will explore that later.

    CC
     
    ccc31807, Mar 26, 2010
    #6
  7. ccc31807

    ccc31807 Guest

    On Mar 26, 3:38 pm, Ben Morrow <> wrote:
    > Is there any way of 'blanking' a record? Normal CSV doesn't support
    > comments, and if you're importing into Excel you can't extend it to do
    > so; what does Excel do if you give it a file like


    Actually, I comment CSV files all the time, not for use by Excel, but
    for use by my scripts. The 'comments' are on interspersed lines
    beginning with #, so I can do this:
    while (<INPUT>)
    {
        next if /^#/;
        ...
    }

    >     - read the update file(s) into a hash,
    >     - open the source file read/write,
    >     - go through it looking for the appropriate records,
    >     - when you find one, wipe it out without changing the length or
    >       removing the newline,
    >     - add the changed records onto the end of the file, since the
    >       records weren't in order anyway.


    I don't see any real difference between this and reading the entire
    file into memory, at least for the size of files I'm dealing with. IO
    is always a bottleneck, and unless space is limited it's better to
    use space than time.
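    For the record, my reading of that blank-and-append idea, as a sketch
    (invented file names, id|status lines, %new_status built from the
    update file):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %new_status = ( '0059485' => 'Inactive' );   # normally read from the update file

    open my $fh, '+<', 'source.txt' or die "source.txt: $!";
    my @changed;
    my $pos = 0;                               # start of the current line
    while (my $line = <$fh>) {
        my $next = tell $fh;
        chomp(my $rec = $line);
        my ($id) = split /\|/, $rec;
        if (exists $new_status{$id}) {
            push @changed, "$id|$new_status{$id}";
            seek $fh, $pos, 0 or die "seek: $!";
            print {$fh} ' ' x length($rec);    # wipe the record, keep its length and newline
            seek $fh, $next, 0 or die "seek: $!";
        }
        $pos = $next;
    }
    seek $fh, 0, 2 or die "seek: $!";          # 2 = SEEK_END
    print {$fh} "$_\n" for @changed;           # append the corrected records
    close $fh;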

    > It's generally not worth messing around with approaches like this,
    > though. Rewriting a file of a few MB doesn't exactly take long, and it's
    > much easier to get right.


    Yeah, I'm beginning to think that the investment of my time isn't
    worth the results.

    Thanks, CC.
     
    ccc31807, Mar 26, 2010
    #7
  8. ccc31807

    Guest

    On Fri, 26 Mar 2010 13:42:08 -0700 (PDT), ccc31807 <> wrote:

    >
    >That sounds a lot more complicated than the brute force approach I
    >used. But I appreciate your suggestion of treating the files as fixed
    >width, and I will explore that later.
    >
    >CC


    Actually, the brute force method you cite is far and away the
    much more complicated approach.

    Good luck!

    -sln
     
    , Mar 26, 2010
    #8
  9. ccc31807

    ccc31807 Guest

    On Mar 26, 5:15 pm, wrote:
    > Actually, the brute force method you cite is far and away the
    > much more complicated approach.


    I would be very interested in why you think this. It may depend on
    your definition of 'complicated.'

    In terms of writing the code, it was pretty simple. First, open the
    source file and read it into a hash. Second, open the update file and
    read it into the SAME(!) hash (thereby overwriting the old values
    where the hash keys are duplicated). Third, write the hash back out
    to the source file.

    As to 'complicated,' I have known people who use Access for data
    processing files, spending hours on end creating and building Access
    databases, queries, and reports to manipulate data. It takes them a
    lot longer to generate a report using Access than it does me, using
    Perl to munge the data. They say that my way is more 'complicated'
    because I use Perl (which is 'harder') and Access is easier. I say my
    way is less 'complicated' because I don't have to mess around with
    Access. Frankly, when I read some of the scripts you post to c.l.p.m.,
    I have a very hard time understanding them, and (from my POV) I would
    say that you have a weird conception of 'complicated.'

    CC.
     
    ccc31807, Mar 26, 2010
    #9
  10. On Fri, 26 Mar 2010 12:00:17 -0700, ccc31807 wrote:

    > The key will always be a seven character integer. The value will always
    > be a string with fewer than 20 characters. I COULD use a fixed width
    > format, but my current format (for the source file) is pipe separated
    > (e.g. 0059485|Current) and all my logic splits input on the pipe symbol.


    Hurray for encapsulation. If you had encapsulated this from the start
    (i.e. get a line, call a sub split_to_fields), you would only have
    had to update one sub.
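    Something as small as this would have done it (a sketch -- the sub
    name is the one suggested above, the file name is invented):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One sub owns the record format, so a format change touches one place.
    sub split_to_fields {
        my ($line) = @_;
        chomp $line;
        return split /\|/, $line;    # the only line that knows about the pipes
    }

    open my $src, '<', 'source.txt' or die "source.txt: $!";
    while (my $line = <$src>) {
        my ($id, $status) = split_to_fields($line);
        print "$id -> $status\n";    # stand-in for the real per-record work
    }
    close $src;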

    Not much help now, but something to keep in the back of your mind when
    designing your next program.

    M4
     
    Martijn Lievaart, Mar 26, 2010
    #10
  11. ccc31807

    Guest

    On Fri, 26 Mar 2010 14:33:36 -0700 (PDT), ccc31807 <> wrote:

    >On Mar 26, 5:15 pm, wrote:
    >> Actually, the brute force method you cite is far and away the
    >> much more complicated approach.

    [snip]

    >Frankly, when I read some of the scripts you post to c.l.p.m.,
    >I have a very hard time understanding them, and (from my POV) I would
    >say that you have a weird conception of 'complicated.'
    >
    >CC.


    Unfortunately, I can't find any scripts you have posted here.

    However, if you would like to make a technical comment on anything
    I write and post here, feel free to do so. And I will be glad to help
    if you run into difficulty understanding it.

    Cheers.

    -sln
     
    , Mar 26, 2010
    #11
  12. ccc31807 wrote:
    > We have a csv source file of many thousands of records, with two
    > columns, the ID and a
    > status field. It has very recently come to my attention that
    > occasionally the status of a record will change, with the change being
    > significant enough that the record must be updated before the process
    > runs. The update files consist of a small subset, sometimes a very
    > small subset, of the records in the source file. (The update file has
    > a number of other fields that can change also, but I'm only concerned
    > with the status field.)


    Is the status field of fixed length? If so, it can be changed in place.

    > My first inclination is to open the update file, create a hash with
    > the ID as the key and the status as value, then open the source file,
    > read each line, update the line if it exists in the hash, and write
    > each line to a new output file. However, I can think of several
    > different ways to do this -- I just don't know which way would be
    > best. I don't particularly want to read every line and write every
    > line of a source file when only a few lines (if any) need to be
    > modified.


    When you say the source file has "many thousand" records, how many
    thousand are you talking? Unless you are talking hundreds of thousands
    or thousands of thousands, I think that even spending the time to worry
    about alternatives to rewriting the file, much less implementing those
    alternatives, is a false economy.

    But why write it out at all? Read the exception file into a hash,
    read in the source file applying the exceptions in memory, and do
    whatever you need to do with the now-accurate in-memory records.
    Leave the source file as it is, and the next time you need to do
    something with it, just re-apply the exception file to it, again in
    memory.
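    A sketch of that (file names invented, assuming id|status lines and
    that the end product is the quoted CSV):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build the exception hash once per run.
    my %new_status;
    open my $upd, '<', 'update.txt' or die "update.txt: $!";
    while (<$upd>) {
        chomp;
        my ($id, $status) = split /\|/;
        $new_status{$id} = $status;
    }
    close $upd;

    # Read the source, correcting records in memory only; the file on
    # disk is never rewritten.
    open my $src, '<', 'source.txt' or die "source.txt: $!";
    while (<$src>) {
        chomp;
        my ($id, $status) = split /\|/;
        $status = $new_status{$id} if exists $new_status{$id};
        print qq("$id","$status"\n);   # stand-in for whatever the process needs
    }
    close $src;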

    > My second inclination would be to use a database and write an update
    > query for the records in the update file. But this seems a heavyweight
    > solution to a lightweight problem -- I would only be using the
    > database to modify records, not to do any of the things we ordinarily
    > use databases for.


    If you already have a database that is maintained and backed up,
    etc., using it for an additional use may not be very heavyweight.

    Xho
     
    Xho Jingleheimerschmidt, Mar 27, 2010
    #12
