Parsing 'dirty/corrupt data'. Advice wanted

Discussion in 'Perl Misc' started by burlo_stumproot@yahoo.se, Oct 29, 2004.

  1. Guest

    I'm finding myself in a position where I have to extract data from a
    file possibly filled with a lot of other junk of unknown length and
    format.

    The data has a strict format: a header line followed by lines of data
    that go on for a fixed number of lines in some cases and in other
    cases until the next header line.

    My problem is that the data can at any point contain one or more lines
    with junk/data I don't want. It looks like the data is collected from
    an output device that listens to more than one application. (And I
    can't do anything about that.) Some (or most) of the junk can be easily
    identified as such and can be removed, but how to deal with the rest?


    I'm not looking for code examples but rather advice on how to solve a
    problem like this in a robust and secure way.


    Currently I'm doing multiple passes over the data, removing the obvious
    junk first. I then try to piece together the data by looking ahead in
    the file (if I don't find what I expect), trying to find a line that
    matches the line I want. It works most of the time, but I'm concerned
    about the validity of the data and would of course want it to work all
    the time.

    Another problem is that I don't know how much data I will receive in
    one file, so it's hard to know if I missed anything.


    Some short data examples:

    <example> # Can't know how many lines this block will contain
    0000 TFS001

    000 TERM 00000 0000001 00001 00000 0000043 00053 S
    005 TERM 00000 0000000 00000 00000 0000000 00000
    006 TDMF 00000 0000000 00000 00000 0000048 01305
    007 CONF 00000 0000000 00000 00000 0000000 00000
    009 TERM 00000 0000000 00000 00000 0000005 00006
    PRI265 DCH: 9 DATA: Q+P NOXLAN 47000 99000 0
    010 TERM 00000 0000001 00002 00000 0000107 00120
    021 TDMF 00000 0000000 00000 00000 0000040 00797
    022 CONF 00000 0000000 00000 00000 0000000 00004
    TRK136 93 11

    023 TERM 00000 0000001 00002 00000 0000041 00041 S CARR
    024 TERM 00000 0000000 00000 00000 0000007 00006
    </example>


    <example> # Block is 9 lines, line nr of data added, the rest is junk
    1: 030 RAN
    2:
    3: 00002 00002
    BUG440
    BUG440 : 00AC76B2 00001002 00008018 00004913 0000 19 0001 001
    000 0 73168 000020A5 00006137 00000008 00000000 0000 0001 000
    BUG440 + 0471C390 044C8418 044C5340 044C5016 04366226
    <<<< Here there can be many more lines like these >>>>
    BUG440 + 04365EB2 04365E10 0435E0A8 04B486AA 04B4837A
    BUG440 + 04B48306

    4:
    5: 0000000 00000
    6: 0000000 00003
    7: 00000 00000
    8: 00000
    9: 0000000 00000
    </example>


    In one file I found what appears to be a login session complete with
    commands and output. *sigh*



    Any help, pointers, reading suggestions???

    /PM
    From address valid but rarely read.
    , Oct 29, 2004
    #1

  2. Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    >
    >
    > I'm finding myself in a position where I have to extract data from a
    > file possibly filled with a lot of other junk of unknown length and
    > format.
    >
    > The data has a strict format: a header line followed by lines of data
    > that go on for a fixed number of lines in some cases and in other
    > cases until the next header line.
    >
    > My problem is that the data can at any point contain one or more lines
    > with junk/data I don't want. It looks like the data is collected from
    > an output device that listens to more than one application. (And I
    > can't do anything about that.) Some (or most) of the junk can be easily
    > identified as such and can be removed, but how to deal with the rest?


    Are you sure that only complete lines can intervene? In general,
    one process can overwrite parts of what another process writes to
    the same file.

    > I'm not looking for code examples but rather advice on how to solve a
    > problem like this in a robust and secure way.


    That makes it a non-perl question.

    There is no such way. The intervening junk could happen to look exactly
    like a valid line of data. If you don't have means to check the validity
    of a data block you (think you) received, you'll never know.

    I snipped your example data below. Since you haven't explained how
    to tell valid lines from intervening ones, there is nothing we can
    learn from it.

    If you can control the output of "good" data, you could add line counts
    or checksums and other means of ensuring data integrity. That way
    you would at least *know* if data is corrupted.

    If you can't control the output, reasonable data processing is impossible
    in that environment.

    Anno
    Anno Siegel, Oct 29, 2004
    #2
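The line-count and checksum idea from post #2 can be sketched as follows. This is only an illustration in Python, and it assumes the producer of the "good" data can be changed to append a trailer line, which the original poster says is not possible in his case. The trailer format `END <count> <crc32>` and the function names are hypothetical.

```python
import zlib

def make_trailer(data_lines):
    """Build a hypothetical trailer: line count plus CRC-32 of the block."""
    payload = "\n".join(data_lines).encode("ascii")
    return f"END {len(data_lines)} {zlib.crc32(payload):08X}"

def verify_block(data_lines, trailer):
    """True if the received lines reproduce the trailer the writer emitted."""
    return trailer == make_trailer(data_lines)
```

With such a trailer, an intervening junk line changes both the count and the checksum, so the reader at least knows the block is damaged, even though it cannot tell which line is the intruder.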

  3. Lord Ireland Guest

    <> wrote in message
    news:...
    >
    >
    > I'm finding myself in a position where I have to extract data from a
    > file possibly filled with a lot of other junk of unknown length and
    > format.
    >
    > The data has a strict format: a header line followed by lines of data
    > that go on for a fixed number of lines in some cases and in other
    > cases until the next header line.
    >
    > My problem is that the data can at any point contain one or more lines
    > with junk/data I don't want. It looks like the data is collected from
    > an output device that listens to more than one application. (And I
    > can't do anything about that.) Some (or most) of the junk can be easily
    > identified as such and can be removed, but how to deal with the rest?
    >
    >
    > I'm not looking for code examples but rather advice on how to solve a
    > problem like this in a robust and secure way.
    >
    >
    > Currently I'm doing multiple passes over the data, removing the obvious
    > junk first. I then try to piece together the data by looking ahead in
    > the file (if I don't find what I expect), trying to find a line that
    > matches the line I want. It works most of the time, but I'm concerned
    > about the validity of the data and would of course want it to work all
    > the time.
    >
    > Another problem is that I don't know how much data I will receive in
    > one file, so it's hard to know if I missed anything.
    >
    >
    > Some short data examples:
    >
    > <example> # Can't know how many lines this block will contain
    > 0000 TFS001
    >
    > 000 TERM 00000 0000001 00001 00000 0000043 00053 S
    > 005 TERM 00000 0000000 00000 00000 0000000 00000
    > 006 TDMF 00000 0000000 00000 00000 0000048 01305
    > 007 CONF 00000 0000000 00000 00000 0000000 00000
    > 009 TERM 00000 0000000 00000 00000 0000005 00006
    > PRI265 DCH: 9 DATA: Q+P NOXLAN 47000 99000 0
    > 010 TERM 00000 0000001 00002 00000 0000107 00120
    > 021 TDMF 00000 0000000 00000 00000 0000040 00797
    > 022 CONF 00000 0000000 00000 00000 0000000 00004
    > TRK136 93 11
    >
    > 023 TERM 00000 0000001 00002 00000 0000041 00041 S CARR
    > 024 TERM 00000 0000000 00000 00000 0000007 00006
    > </example>
    >
    >
    > <example> # Block is 9 lines, line nr of data added, the rest is junk
    > 1: 030 RAN
    > 2:
    > 3: 00002 00002
    > BUG440
    > BUG440 : 00AC76B2 00001002 00008018 00004913 0000 19 0001 001
    > 000 0 73168 000020A5 00006137 00000008 00000000 0000 0001 000
    > BUG440 + 0471C390 044C8418 044C5340 044C5016 04366226
    > <<<< Here there can be many more lines like these >>>>
    > BUG440 + 04365EB2 04365E10 0435E0A8 04B486AA 04B4837A
    > BUG440 + 04B48306
    >
    > 4:
    > 5: 0000000 00000
    > 6: 0000000 00003
    > 7: 00000 00000
    > 8: 00000
    > 9: 0000000 00000
    > </example>
    >
    >
    > In one file I found what appears to be a login session complete with
    > commands and output. *sigh*
    >
    >
    >
    > Any help, pointers, reading suggestions???
    >
    > /PM
    > From address valid but rarely read.


    Look mate, using perl for this is an act of insanity. Visual Basic is much
    better at this kind of stuff - here's a helpful article on how to purchase
    the 2003 version.

    http://msdn.microsoft.com/howtobuy/vbasic/default.aspx
    Lord Ireland, Oct 29, 2004
    #3
  4. Anno Siegel Guest

    Lord Ireland <> wrote in comp.lang.perl.misc:
    >
    > <> wrote in message
    > news:...
    > >
    > >
    > > I'm finding myself in a position where I have to extract data from a
    > > file possibly filled with a lot of other junk of unknown length and
    > > format.


    [...]

    > Look mate, using perl for this is an act of insanity. Visual Basic is much
    > better at this kind of stuff - here's a helpful article on how to purchase
    > the 2003 version.


    Care to explain how VB can solve the problem but Perl can't? And why
    you had to quote the entire article just to add this nonsense?

    Anno
    Anno Siegel, Oct 29, 2004
    #4
  5. D. Marxsen Guest

    <> schrieb im Newsbeitrag
    news:...

    > The data has a strict format: a header line followed by lines of data
    > that go on for a fixed number of lines in some cases and in other
    > cases until the next header line.


    Maybe you can give some more precise info on how you recognise a valid
    header or data line (x alphas here, x nums there, x blocks of y nums here,
    etc.). This may help to find a regexp which weeds out non-matching lines.


    Cheers,
    Detlef.


    --
    D. Marxsen, TD&DS GmbH
    (replace z with h, spam protection)
    D. Marxsen, Oct 29, 2004
    #5
  6. wrote:
    <snip>
    > <example> # Block is 9 lines, line nr of data added, the rest is junk
    > 1: 030 RAN
    > 2:
    > 3: 00002 00002
    > BUG440
    > BUG440 : 00AC76B2 00001002 00008018 00004913 0000 19 0001 001
    > 000 0 73168 000020A5 00006137 00000008 00000000 0000 0001 000
    > BUG440 + 0471C390 044C8418 044C5340 044C5016 04366226
    > <<<< Here there can be many more lines like these >>>>
    > BUG440 + 04365EB2 04365E10 0435E0A8 04B486AA 04B4837A
    > BUG440 + 04B48306
    >
    > 4:
    > 5: 0000000 00000
    > 6: 0000000 00003
    > 7: 00000 00000
    > 8: 00000
    > 9: 0000000 00000
    > </example>
    >
    >
    > In one file I found what appears to be a login session complete with
    > commands and output. *sigh*
    >
    >
    >
    > Any help, pointers, reading suggestions???


    Know your data. Know why one line is valid and another isn't. The
    data may appear to have no "logic" or "pattern" to it, but it's
    there somewhere.

    The first place I might start is to either split the line on whitespace
    or use unpack to get at least the first column. Then start testing for
    what is required for a valid line. That's at first glance and
    without having any clue as to what the data is supposed to be/represent.

    HTH

    Jim
    James Willmore, Oct 29, 2004
    #6
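The "split on whitespace, then test each column" step from post #6 can be sketched like this. It is a minimal illustration in Python (Perl's `split` and regexes map onto it directly), and the field layout is an assumption inferred only from the TERM/TDMF/CONF sample lines in the first example, not from any documentation of the format. The name `parse_data_line` is hypothetical.

```python
import re

# Assumed layout, guessed from the sample lines only:
# 3-digit index, 4-letter type, then six zero-padded counters.
FIELD_PATTERNS = [
    re.compile(r"\d{3}"),     # index, e.g. 005
    re.compile(r"[A-Z]{4}"),  # type, e.g. TERM
    re.compile(r"\d{5}"),
    re.compile(r"\d{7}"),
    re.compile(r"\d{5}"),
    re.compile(r"\d{5}"),
    re.compile(r"\d{7}"),
    re.compile(r"\d{5}"),
]

def parse_data_line(line):
    """Return the fields if the line looks like a data line, else None."""
    fields = line.split()
    if len(fields) < len(FIELD_PATTERNS):
        return None
    # Only the first eight columns are checked; trailing flags such as
    # "S" or "S CARR" are kept but not validated in this sketch.
    for pattern, field in zip(FIELD_PATTERNS, fields):
        if not pattern.fullmatch(field):
            return None
    return fields
```

Checking column by column makes it easy to report which field caused a rejection later on, which a single monolithic regexp hides.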
  7. Lord Ireland <> wrote:
    ><> wrote in message
    > news:...
    >>
    >>
    >> I'm finding myself in a position where I have to extract data from a
    >> file possibly filled with a lot of other junk of unknown length and
    >> format.



    > Look mate, using perl for this is an act of insanity.



    Why is that?


    > Visual Basic is much
    > better at this kind of stuff - here's a helpful article on how to purchase
    > the 2003 version.



    You should put in a smiley when you make a joke.

    Otherwise people might think you are being serious.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Oct 29, 2004
    #7
  8. On Fri, 29 Oct 2004, Lord Ireland scribbled furiously:

    [comprehensive quote of problem, including signature, now removed]

    > Look mate, using perl for this is an act of insanity. Visual Basic
    > is much better at this kind of stuff - here's a helpful article on
    > how to purchase the 2003 version.


    Damn! Now my irony detector is in ruins. Where am I supposed to
    get a replacement, this late in the week? Have a care for your fellow
    usenauts, please.
    Alan J. Flavell, Oct 29, 2004
    #8
  9. wrote:

    >
    >
    > I'm finding myself in a position where I have to extract data from a
    > file possibly filled with a lot of other junk of unknown length and
    > format.
    >
    > The data has a strict format: a header line followed by lines of data
    > that go on for a fixed number of lines in some cases and in other
    > cases until the next header line.
    >
    > My problem is that the data can at any point contain one or more lines
    > with junk/data I don't want. It looks like the data is collected from
    > an output device that listens to more than one application. (And I
    > can't do anything about that.) Some (or most) of the junk can be easily
    > identified as such and can be removed, but how to deal with the rest?
    >
    >
    > I'm not looking for code examples but rather advice on how to solve a
    > problem like this in a robust and secure way.
    >


    My computer scientist brainspace pops up with this question. It is
    analogous to parsing in the presence of syntax errors.

    Perhaps try http://search.cpan.org/~dconway/Parse-RecDescent-1.94/

    gtoomey
    Gregory Toomey, Oct 29, 2004
    #9
  10. Guest

    -berlin.de (Anno Siegel) writes:

    > <> wrote in comp.lang.perl.misc:
    > >
    > >
    > > I'm finding myself in a position where I have to extract data from a
    > > file possibly filled with a lot of other junk of unknown length and
    > > format.

    <snip>
    > > My problem is that the data can at any point contain one or more lines
    > > with junk/data I don't want. It looks like the data is collected from
    > > an output device that listens to more than one application. (And I
    > > can't do anything about that.) Some (or most) of the junk can be easily
    > > identified as such and can be removed, but how to deal with the rest?

    >
    > Are you sure that only complete lines can intervene? In general,
    > one process can overwrite parts of what another process writes to
    > the same file.


    No, but as yet I have not found any junk inside a line. And if I later
    find lines like that, I'm going to report it in the app and suggest a
    manual edit of the file. I have to draw the line somewhere, and my line
    is that I have to trust my lines :)

    > > I'm not looking for code examples but rather advice on how to solve a
    > > problem like this in a robust and secure way.

    >
    > That makes it a non-perl question.
    >
    > There is no such way. The intervening junk could happen to look exactly
    > like a valid line of data. If you don't have means to check the validity
    > of a data block you (think you) received, you'll never know.


    I can make "some" validity checks on the data. If it varies too much
    from earlier and later data, I can report it as questionable.

    > If you can't control the output, reasonable data processing is impossible
    > in that environment.


    And yet I have to. I suppose it depends on your value of "reasonable" :)

    As I wrote above, I'm doing extensive look-ahead in the file for the
    missing data lines. If I don't find a line that matches[1] (I stop looking
    when I find another block header), I mark the data block as invalid. This
    happens very rarely; most of the time the look-ahead works.

    What I wanted to know was if anyone had a better idea or suggestions for
    improving this strategy.


    [1]
    The regexps for valid lines are very simple but vary between the lines
    in a block.
    line 1 in block matches regexp_A
    line 2 in block matches regexp_B
    line 3 in block matches regexp_A
    line 4 in block matches regexp_C
    ....



    /PM
    From address valid but rarely read.
    , Oct 30, 2004
    #10
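The strategy described in post #10 (each line position in a block has its own regexp, junk in between is skipped, and the search stops at the next block header) can be sketched as below. This is a Python illustration under stated assumptions: the patterns `REGEXP_A`, `REGEXP_B`, `REGEXP_C` and `HEADER` are made-up stand-ins, since the thread never shows the real ones, and `read_block` is a hypothetical name.

```python
import re

# Hypothetical stand-ins for the poster's regexp_A, regexp_B, regexp_C.
REGEXP_A = re.compile(r"\d{5} \d{5}")
REGEXP_B = re.compile(r"\d{7} \d{5}")
REGEXP_C = re.compile(r"\d{5}")
HEADER = re.compile(r"\d{3} [A-Z]{3}")  # assumed shape, e.g. "030 RAN"

def read_block(lines, start, expected, header=HEADER):
    """Match expected[0], expected[1], ... in order, skipping junk lines.

    Returns (matched_lines, next_index, complete). The block is reported
    incomplete if a new header or the end of input arrives too early.
    """
    matched = []
    i = start
    for pattern in expected:
        while i < len(lines):
            if header.fullmatch(lines[i]):
                return matched, i, False  # next block started too early
            if pattern.fullmatch(lines[i]):
                matched.append(lines[i])
                i += 1
                break
            i += 1  # not what we expect here: treat as junk and skip
        else:
            return matched, i, False  # ran out of input
    return matched, i, True
```

Note the limitation raised in post #2 still applies: a junk line that happens to match the pattern expected at that position is accepted as data, so a separate plausibility check on the values remains necessary.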
  11. Guest

    James Willmore <> writes:

    > wrote:
    > <snip>


    <snip of example data>

    > Know your data. Know why one line is valid and another isn't. The
    > data may appear to have no "logic" or "pattern" to it, but it's there
    > somewhere.


    I know how my data looks, and the regular expressions to find it are
    simple. What I was hoping for by asking here was if anyone had a better
    strategy than the one I have now. What I do now is extensive lookahead
    in the file until I find a line that matches. If I can't find it before
    a new "block header" line is found, I report the block as broken.

    > what is required for a valid line. That's at first glance and without
    > having any clue as to what the data is supposed to be/represent.


    I wish I knew what it represented too. I have it in the documentation
    but have not had the time to read it yet. I have to do that soon since
    I plan to do a reasonability test[1] on the data.

    And later I have to make pretty pictures of it in Excel and PowerPoint!
    Ooh JOY!!


    [1] Hmm, can't find a good word for this right now.


    /PM
    From address valid but rarely read.
    , Oct 30, 2004
    #11
  12. Eric Bohlman Guest

    wrote in news::

    > I know how my data looks, and the regular expressions to find it are
    > simple. What I was hoping for by asking here was if anyone had a better
    > strategy than the one I have now. What I do now is extensive lookahead
    > in the file until I find a line that matches. If I can't find it before
    > a new "block header" line is found, I report the block as broken.


    Draw a state diagram and implement a state machine. You might find some of
    the DFA::* modules on CPAN to be helpful (and there's a writeup on them
    somewhere in the archives of perl.com) though from what it sounds like it's
    simple enough to implement in straight code. That way you should be able
    to deal with only one line at a time without lookahead.
    Eric Bohlman, Oct 30, 2004
    #12
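The state-machine suggestion in post #12 can be sketched in straight code as follows. It is a two-state illustration in Python for the open-ended block type (data runs until the next header), handling one line at a time with no lookahead. The `HEADER` and `DATA` patterns are assumptions guessed from the first sample block, and `parse_stream` is a hypothetical name; nothing here uses the DFA::* modules mentioned in the post.

```python
import re

# Assumed shapes, guessed from the first example in the thread.
HEADER = re.compile(r"\d{4} [A-Z]{3}\d{3}")           # e.g. "0000 TFS001"
DATA = re.compile(r"\d{3} [A-Z]{4}( \d+){6}( \S+)*")  # e.g. "005 TERM ..."

def parse_stream(lines):
    """Yield (header, data_lines, junk_count) per block, one line at a time."""
    state = "SEEK_HEADER"
    header, data, junk = None, [], 0
    for raw in lines:
        line = raw.strip()
        if HEADER.fullmatch(line):
            if state == "IN_BLOCK":
                yield header, data, junk  # a new header closes the old block
            state, header, data, junk = "IN_BLOCK", line, [], 0
        elif state == "IN_BLOCK" and DATA.fullmatch(line):
            data.append(line)
        elif state == "IN_BLOCK" and line:
            junk += 1  # non-blank line that is neither header nor data
        # Blank lines, and anything seen before the first header, are ignored.
    if state == "IN_BLOCK":
        yield header, data, junk
```

Counting the junk lines per block gives a cheap signal for which blocks deserve a closer look. Fixed-length blocks would need one state per expected line position instead of the single IN_BLOCK state used here.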
  13. Guest

    Eric Bohlman <> writes:

    > wrote in news::
    >
    > > I know how my data looks, and the regular expressions to find it are
    > > simple. What I was hoping for by asking here was if anyone had a better
    > > strategy than the one I have now. What I do now is extensive lookahead
    > > in the file until I find a line that matches. If I can't find it before
    > > a new "block header" line is found, I report the block as broken.

    >
    > Draw a state diagram and implement a state machine. You might find some of
    > the DFA::* modules on CPAN to be helpful (and there's a writeup on them
    > somewhere in the archives of perl.com) though from what it sounds like it's
    > simple enough to implement in straight code. That way you should be able
    > to deal with only one line at a time without lookahead.


    Looks interesting. I'll have a closer look at this. Thanks for the tip.

    /PM

    From address valid but rarely read.
    , Oct 30, 2004
    #13
