SAX - is there an equivalent to the DOM .nodeTypedValue for reading the whole node data at once?

jimmyfishbean

Hi,

I am using VB6, SAX (implementing IVBSAXContentHandler).

I need to extract binary encoded data (images) from large XML files and
decode this data and generate the appropriate images onto disk. My XML
files have the following structure:

<?xml version="1.0" encoding="utf-8" ?>
<imagepla xmlns:dt="urn:schemas-microsoft-com:datatypes">
<attachment>
<primary_id>28899</primary_id>
<filename>userguide3.pdf</filename>
<file
dt:dt="bin.base64">JVBERi0xLjMNJeLjz9MNCjU5NTAgMCBvYmoNPDwgDS9MaW5lYXJpemVkIDEgDS9PIDU5NTMgDS9I
IFsgMTM4OSAzODY0IF0gDS9MIDUwNTEyOTggDS9FIDEwMTQ3NCANL04gMTUzIA0vVCA0OTMyMTc4
.........
...................
</file>
</attachment>
<attachment>
......
......
</attachment>
</imagepla>

The encoded data (in the <file> element) needs to be extracted and then
decoded. I am trying to use SAX but I cannot read the whole of the
<file> element data at once (whereas with DOM I would use
DOMDoc.nodeTypedValue). I understand that the DOM loads the whole
document into memory, which is why nodeTypedValue can be used.

I am using the following extract of code:

Dim strTmp As String
Dim btArr() As Byte

Private Sub IVBSAXContentHandler_characters(text As String)
...
strTmp = strTmp & text
...
btArr = strTmp
Open strAttFile For Binary As #1
Put #1, 1, btArr
Close #1
...
End Sub

The problem is that only 1 line at a time of the <file> node data is
passed to this sub. Therefore I need to reconstruct the whole of the
binary data for the image in a temp variable (strTmp), before I
determine the end of the file and then write it to disk.

This takes a vast amount of time (e.g. 20 minutes to decode a
4MB image). The XML file will contain hundreds of images, so the
current way of processing is no good at all.


Is there a way to read the whole of the data from the <file> node in
one go?
Also, I will be extracting the binary data and then use DOM to rewrite
the XML file without the binary data (so the user has a copy of the
original XML file - but a much smaller one since no binary in it).
Should I use DOM or SAXReader/SAXWriter?

Greatly appreciated. Thanks.

Jimmy
 
Malcolm Dew-Jones

(e-mail address removed) wrote:
: Is there a way to read the whole of the data from the <file> node in
: one go?

In SAX generally, you can never count on receiving all of an element's
character data in a single call; the parser is free to split it across
several characters events. There is a slim chance that the SAX
implementation available in VB has an option to deliver it in one piece
(I have no idea, and I wouldn't count on it).

But why do you need to read the whole thing into memory? Base64 can be
decoded on the fly. Each sequence of four characters gives you three
bytes of data. Read a chunk, decode as many complete groups of four
characters as it contains, and write the result out. You will have to
hold over the last few characters (up to three) from one read to the
next so that each decode starts on a multiple of four, and skip the
line breaks inside the data.
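To make the chunked decoding concrete, here is a minimal sketch. It is written in Python rather than VB6 only so that it can be run and checked; the class name StreamingBase64Decoder and its methods are made up for illustration, and the same carry-over logic should translate to a VB6 class with a module-level String for the leftover characters.

```python
import base64


class StreamingBase64Decoder:
    """Decode base64 text that arrives in arbitrary chunks (hypothetical helper)."""

    def __init__(self):
        self._carry = ""  # 0 to 3 characters held over from the previous chunk

    def feed(self, chunk: str) -> bytes:
        # Line breaks and other whitespace are not part of the base64 data.
        data = self._carry + "".join(chunk.split())
        # Decode only complete groups of four characters.
        usable = len(data) - (len(data) % 4)
        self._carry = data[usable:]
        return base64.b64decode(data[:usable])

    def finish(self) -> None:
        # Valid base64 always ends on a group boundary.
        if self._carry:
            raise ValueError("base64 input length is not a multiple of four")
```

Each call to feed() returns the bytes that can be written straight to the output file, so memory use stays proportional to the chunk size and not to the size of the image.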

And where is the slowdown? I suspect that the string concatenation is to
blame. VB may be allocating a longer string each time and then copying
all the existing data plus the appended data into it. If you keep doing
that for an eventually large string, it gets very slow. Can you
preallocate a much larger string and use the Mid statement to write the
data into that single large string? (In VB that is
Mid(buffer, offset, length) = data_to_insert.)
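The preallocation idea can be sketched as follows, again in Python purely for illustration. GrowableBuffer is a hypothetical name, and Python's own bytearray already over-allocates internally, so the point here is only to show the pattern one would write by hand in VB6: allocate once, write each piece into place, and double the capacity when it runs out.

```python
class GrowableBuffer:
    """Append-only byte buffer that over-allocates so appends rarely copy."""

    def __init__(self, capacity: int = 1 << 20):
        self._buf = bytearray(capacity)  # allocated once, up front
        self._used = 0                   # number of bytes actually filled

    def append(self, data: bytes) -> None:
        needed = self._used + len(data)
        if needed > len(self._buf):
            # Grow to at least double, so the total copying stays linear.
            new_capacity = max(needed, 2 * len(self._buf))
            self._buf.extend(bytes(new_capacity - len(self._buf)))
        # Write into place, much like VB's Mid(buffer, pos, n) = text.
        self._buf[self._used:needed] = data
        self._used = needed

    def value(self) -> bytes:
        return bytes(self._buf[:self._used])
```

With repeated concatenation, every append copies everything collected so far, so the work grows with the square of the final size. With doubling, the buffer is recopied only a handful of times.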


: Also, I will be extracting the binary data and then use DOM to rewrite
: the XML file without the binary data (so the user has a copy of the
: original XML file - but a much smaller one since no binary in it).
: Should I use DOM or SAXReader/SAXWriter?

If you are not changing anything else in the XML except removing the
file data (and possibly replacing that one tag), then I would think it
easiest to use a SAX approach. As you read the data you also spool it
back out, except for that one tag. I suppose a SAXWriter would help do
that.
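As an illustration of the spool-it-back-out approach, here is a small sketch using Python's standard xml.sax module in place of the MSXML SAXReader/SAXWriter pair. The names DropFileData and strip_file_data are invented for this example; the filter forwards every event to the writer unchanged, except the character data inside <file>.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropFileData(XMLFilterBase):
    """Forward every SAX event unchanged, except the text inside <file>."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._in_file = False

    def startElement(self, name, attrs):
        if name == "file":
            self._in_file = True
        super().startElement(name, attrs)

    def endElement(self, name):
        if name == "file":
            self._in_file = False
        super().endElement(name)

    def characters(self, content):
        # Swallow the base64 payload; pass all other text through.
        if not self._in_file:
            super().characters(content)


def strip_file_data(xml_text: str) -> str:
    """Return xml_text re-serialized with every <file> element emptied."""
    out = io.StringIO()
    filt = DropFileData(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.StringIO(xml_text))
    return out.getvalue()
```

Because nothing is buffered beyond the current event, this works on files far larger than memory, which a DOM rewrite would not.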


$0.10
 
kryptomoon

Try NOT to open/close the file on each "characters" event. Open it once
when the <file> element starts and close it when the element ends.
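A rough sketch of that handler layout, in Python for illustration only (AttachmentExtractor and its fields are hypothetical names, and it assumes <filename> always precedes <file>, as in the sample XML). The output file is opened in startElement, written to in characters, and closed in endElement.

```python
import base64
import os
import xml.sax


class AttachmentExtractor(xml.sax.ContentHandler):
    """Write each <file> payload to disk, opening the output file only once."""

    def __init__(self, out_dir):
        super().__init__()
        self._out_dir = out_dir
        self._element = None  # name of the element currently being read
        self._filename = ""   # text collected from <filename>
        self._out = None      # open output file while inside <file>
        self._carry = ""      # base64 characters not yet decoded

    def startElement(self, name, attrs):
        self._element = name
        if name == "filename":
            self._filename = ""
        elif name == "file":
            # basename() guards against path components in the XML.
            safe_name = os.path.basename(self._filename.strip())
            self._out = open(os.path.join(self._out_dir, safe_name), "wb")
            self._carry = ""

    def characters(self, content):
        if self._element == "filename":
            self._filename += content
        elif self._out is not None:
            # Decode complete groups of four; hold the remainder over.
            data = self._carry + "".join(content.split())
            usable = len(data) - len(data) % 4
            self._carry = data[usable:]
            self._out.write(base64.b64decode(data[:usable]))

    def endElement(self, name):
        if name == "file" and self._out is not None:
            self._out.close()  # closed once, at </file>
            self._out = None
        self._element = None
```

It would be driven with something like xml.sax.parse(path_to_xml, AttachmentExtractor(output_directory)). In VB6 terms, the Open statement would move to IVBSAXContentHandler_startElement and the Close to IVBSAXContentHandler_endElement.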
 
