burlo_stumproot
I'm finding myself in a position where I have to extract data from a
file possibly filled with a lot of other junk of unknown length and
format.
The data has a strict format: a header line followed by lines of data
that go on for a fixed number of lines in some cases and, in other
cases, until the next header line.
My problem is that the data can at any point contain one or more lines
of junk/data I don't want. It looks like the data is collected from
an output device that listens to more than one application (and I
can't do anything about that). Some (or most) of the junk can be easily
identified as such and removed, but how do I deal with the rest?
I'm not looking for code examples but rather advice on how to solve a
problem like this in a robust and secure way.
Currently I'm doing multiple passes over the data, removing the obvious
junk first. I then try to piece together the data by looking ahead in
the file (if I don't find what I expect), trying to find a line that
matches the line I want. It works most of the time, but I'm concerned
about the validity of the data and would of course want it to work all
the time.
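For illustration only, the "remove the obvious junk, then piece the data
together" idea can be sketched as a single pass that classifies every
line instead of several passes. This is a rough sketch in Python, not
the code I actually run: the two patterns are guesses based on the first
sample further down (a `TFS` header and rows with six numeric fields),
and the names `classify` and `parse_blocks` are made up for the example.
Junk is kept per block, with its line number, rather than thrown away,
so a suspicious block can be checked by hand later.

```python
import re

# Hypothetical patterns, inferred only from the first sample in this post.
HEADER_RE = re.compile(r"^\d{4}\s+TFS\d{3}$")
DATA_RE = re.compile(
    r"^\d{3}\s+[A-Z]{4}\s+\d{5}\s+\d{7}\s+\d{5}\s+\d{5}\s+\d{7}\s+\d{5}(?:\s+.*)?$"
)


def classify(line):
    """Label one line as 'header', 'data' or 'junk'."""
    text = line.strip()
    if HEADER_RE.match(text):
        return "header"
    if DATA_RE.match(text):
        return "data"
    return "junk"


def parse_blocks(lines):
    """Group data rows under the most recent header line.

    Lines seen before the first header are ignored. Junk inside a block
    is recorded as (line number, text) so it can be audited afterwards.
    """
    blocks = []
    current = None
    for number, line in enumerate(lines, start=1):
        kind = classify(line)
        if kind == "header":
            current = {"header": line.strip(), "rows": [], "junk": []}
            blocks.append(current)
        elif current is None:
            continue
        elif kind == "data":
            current["rows"].append(line.strip())
        else:
            current["junk"].append((number, line.strip()))
    return blocks
```

This only helps with junk that cannot be mistaken for data. A junk line
that happens to match the row pattern would still be accepted, which is
exactly the validity problem I'm worried about.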
Another problem is that I don't know how much data I will receive in
one file, so it's hard to know if I missed anything.
Some short data examples:
<example> # Can't know how many lines this block will contain
0000 TFS001
000 TERM 00000 0000001 00001 00000 0000043 00053 S
005 TERM 00000 0000000 00000 00000 0000000 00000
006 TDMF 00000 0000000 00000 00000 0000048 01305
007 CONF 00000 0000000 00000 00000 0000000 00000
009 TERM 00000 0000000 00000 00000 0000005 00006
PRI265 DCH: 9 DATA: Q+P NOXLAN 47000 99000 0
010 TERM 00000 0000001 00002 00000 0000107 00120
021 TDMF 00000 0000000 00000 00000 0000040 00797
022 CONF 00000 0000000 00000 00000 0000000 00004
TRK136 93 11
023 TERM 00000 0000001 00002 00000 0000041 00041 S CARR
024 TERM 00000 0000000 00000 00000 0000007 00006
</example>
<example> # Block is 9 lines; I added line numbers to the data lines, the rest is junk
1: 030 RAN
2:
3: 00002 00002
BUG440
BUG440 : 00AC76B2 00001002 00008018 00004913 0000 19 0001 001
000 0 73168 000020A5 00006137 00000008 00000000 0000 0001 000
BUG440 + 0471C390 044C8418 044C5340 044C5016 04366226
<<<< Here there can be many more lines like these >>>>
BUG440 + 04365EB2 04365E10 0435E0A8 04B486AA 04B4837A
BUG440 + 04B48306
4:
5: 0000000 00000
6: 0000000 00003
7: 00000 00000
8: 00000
9: 0000000 00000
</example>
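For the fixed-length case, the "look ahead for the line I expect" step
can be sketched as matching the input against an ordered list of
per-line patterns and skipping anything that does not fit the next
expected line. Again, this is only a sketch in Python under assumptions:
`LAYOUT` is guessed from the single 9-line sample above, with the `N:`
prefixes removed since I added those by hand, and `match_fixed_block` is
a made-up name.

```python
import re

# Hypothetical layout of the 9-line block, guessed from one sample.
LAYOUT = [
    re.compile(r"^\d{3}\s+RAN$"),    # line 1
    re.compile(r"^$"),               # line 2 (blank)
    re.compile(r"^\d{5}\s+\d{5}$"),  # line 3
    re.compile(r"^$"),               # line 4 (blank)
    re.compile(r"^\d{7}\s+\d{5}$"),  # line 5
    re.compile(r"^\d{7}\s+\d{5}$"),  # line 6
    re.compile(r"^\d{5}\s+\d{5}$"),  # line 7
    re.compile(r"^\d{5}$"),          # line 8
    re.compile(r"^\d{7}\s+\d{5}$"),  # line 9
]


def match_fixed_block(lines, layout=LAYOUT):
    """Walk the lines, consuming one layout entry per matching line.

    Returns (matched, skipped, complete). 'skipped' holds every line
    that did not fit the next expected pattern; 'complete' is False if
    the input ran out before the whole layout was seen.
    """
    matched, skipped = [], []
    position = 0
    for line in lines:
        if position == len(layout):
            break
        text = line.strip()
        if layout[position].match(text):
            matched.append(text)
            position += 1
        else:
            skipped.append(text)
    return matched, skipped, position == len(layout)
```

The weak spot is obvious: a blank junk line, or junk that looks like
`00000 00000`, is accepted as data if it arrives when such a line is
expected. The `complete` flag at least tells me when a block was cut
short.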
In one file I found what appears to be a login session complete with
commands and output. *sigh*
Any help, pointers, reading suggestions???
/PM
From address valid but rarely read.