How to read sequentially from a random point in a large XML file (200 - 2000 MB)

Discussion in 'ASP General' started by Schwartzberg, Apr 3, 2008.

  1. Schwartzberg

    Schwartzberg Guest

    Hello

    I have a huge XML file with a multitude of "LogEntry" nodes / text
    lines.
    A small sample of this XML/text content is below.
    The file could be anywhere between 200 and 2000 MB.
    My question comes in two parts.

    (A)
    I would like a solution (or ideas for one), in C#, to randomly
    access a really huge XML file, and to sequentially read only the
    "memory permitting" number of nodes into memory, from a place randomly
    selected in the huge file. If I try to load the entire file, the
    application returns an out of memory error or gets very slow. The
    user likes to "scroll" through the file, viewing different parts,
    like when scrolling through a huge Word document.

    How is this (best) done?

    (B)
    What is, and/or how would I estimate, the maximum amount of XML or text
    from the file that the application can hold in its memory? The
    application is both a web application and a Windows standalone.

    On a 32-bit machine with 2 GB RAM, the virtual memory is 2 GB, which gives
    one answer.
    But I have a Java app that already fails with an unhandled heap error when
    loading XML from a 200 MB file.
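
    One rough way to approach (B) is to measure how much managed memory a sample of rows costs and divide a chosen memory budget by that figure. The sketch below is only an illustration: the class MemoryEstimate, its method names and the three-strings-per-row layout are assumptions made for the example, and GC.GetTotalMemory gives an approximation rather than an exact figure.

```csharp
using System;

public static class MemoryEstimate
{
    // Approximate managed-heap cost of one row holding three short strings,
    // a hypothetical stand-in for the attributes of one LogEntry.
    public static long BytesPerEntry(int sampleCount)
    {
        long before = GC.GetTotalMemory(true);
        string[][] rows = new string[sampleCount][];
        for (int n = 0; n < sampleCount; n++)
        {
            rows[n] = new string[] { "name" + n, "time" + n, "location" + n };
        }
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(rows); // keep the sample alive until after the measurement
        return Math.Max(1, (after - before) / sampleCount);
    }

    // Number of rows that fit into a memory budget chosen by the caller.
    public static long EntriesThatFit(long budgetBytes, long bytesPerEntry)
    {
        return budgetBytes / bytesPerEntry;
    }
}
```

    For example, MemoryEstimate.EntriesThatFit(200L * 1024 * 1024, MemoryEstimate.BytesPerEntry(100000)) would give a row count for a 200 MB budget. Since .NET strings are stored as UTF-16, text that is ASCII on disk takes roughly twice as many bytes in memory, before any per-object overhead.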

    Any ideas, solutions, or links concerning the above (especially (A))?

    One avenue is to try to base a sequential reader on a random access
    stream.

    I tried this idea. I based the XmlTextReader (for sequential read) on
    the FileStream (for random access), but this didn't work. There is
    some test code at the bottom of this email that shows some of this.

    I used the FileStream for random access via the FileStream.Seek(..)
    method.
    But XmlTextReader.Read() didn't start reading from the new
    position.

    The following:
    FileStream.Seek(<Random NewPosition>, SeekOrigin.Begin);
    FileStream.Read();
    would read from the new position, but it didn't affect the
    positioning of XmlTextReader.Read(),
    even though the XmlTextReader is based on the same FileStream.

    It did, however, cause the last read of the XmlTextReader to report the
    XML as invalid (when the XML was actually OK).
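
    A likely explanation is that the reader pulls the stream in buffered blocks and keeps parsing from its own buffer, so a later Seek only changes what the next buffer refill returns. One possible workaround, sketched here under assumptions, is to seek first and then create a fresh reader in fragment mode. The class FragmentReading and the method ReadEntries are hypothetical names, and the sketch assumes the offset points exactly at the first byte of a "<LogEntry" tag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentReading
{
    // Reads the AtrA attribute of up to maxEntries <LogEntry> elements.
    // Assumption: byteOffset is the first byte of a "<LogEntry" tag.
    public static List<string> ReadEntries(Stream stream, long byteOffset, int maxEntries)
    {
        List<string> names = new List<string>();
        stream.Seek(byteOffset, SeekOrigin.Begin);

        XmlReaderSettings settings = new XmlReaderSettings();
        // Fragment mode accepts several top-level elements and no XML declaration.
        settings.ConformanceLevel = ConformanceLevel.Fragment;
        settings.CloseInput = false; // leave the stream open for the next seek

        try
        {
            // A new reader starts with an empty buffer, so it honours the seek.
            using (XmlReader reader = XmlReader.Create(stream, settings))
            {
                while (names.Count < maxEntries && reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "LogEntry")
                    {
                        names.Add(reader.GetAttribute("AtrA"));
                    }
                }
            }
        }
        catch (XmlException)
        {
            // Reached a closing tag such as </Log> whose opening tag lies
            // before byteOffset; keep the entries read so far.
        }
        return names;
    }
}
```

    Creating a reader per page is cheap compared with parsing the file from the start, but it only works when the offset of an element start is already known.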

    An alternative is to base a StreamReader on a FileStream.
    The StreamReader.BaseStream is available for random access, and the
    StreamReader is there for sequential read.
    But I think the same problem is there as when basing the
    XmlTextReader on the FileStream.

    As a side thought, the problem could be solved more easily if
    Microsoft offered an indexing mechanism (for application purposes) on
    NTFS files. But this isn't the case. Or if I could load the huge
    file into a database table, but the requirement is to use only XML
    files (or flat files), so this isn't an option.
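
    In the absence of a file-system index, an application can build its own sparse index with one sequential pass over the raw bytes. The sketch below is a minimal illustration, not a definitive implementation: OffsetIndex and Build are hypothetical names, and it assumes an ASCII-compatible encoding such as UTF-8 and that the text "<LogEntry" appears only as the start of a LogEntry element (not inside comments or CDATA).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class OffsetIndex
{
    // Returns the byte offset of every stride-th "<LogEntry" tag in the stream.
    public static List<long> Build(Stream stream, int stride)
    {
        byte[] pattern = Encoding.ASCII.GetBytes("<LogEntry");
        List<long> offsets = new List<long>();
        byte[] buffer = new byte[64 * 1024];
        long bufferStart = 0; // file offset of buffer[0]
        int matched = 0;      // pattern bytes matched so far
        long entryCount = 0;
        int read;

        stream.Seek(0, SeekOrigin.Begin);
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int k = 0; k < read; k++)
            {
                byte b = buffer[k];
                if (b == pattern[matched])
                {
                    matched++;
                    if (matched == pattern.Length)
                    {
                        if (entryCount % stride == 0)
                        {
                            // Offset of the '<' that starts this tag.
                            offsets.Add(bufferStart + k - pattern.Length + 1);
                        }
                        entryCount++;
                        matched = 0;
                    }
                }
                else
                {
                    // '<' occurs only at the start of the pattern, so a
                    // simple restart is enough after a mismatch.
                    matched = (b == pattern[0]) ? 1 : 0;
                }
            }
            bufferStart += read;
        }
        return offsets;
    }
}
```

    With a stride of, say, 1000, the index stays small even for a very large file, and scrolling to row N becomes a seek to offsets[N / stride] followed by a short sequential read. The index would have to be rebuilt whenever the file changes.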

    This question involves several "technologies", so I am posting it on
    several newsgroups.

    Here's a sample of the XML:
    Each "LogEntry" node is viewed as a line of text in a GridView
    control.

    <Logs AtrA="AllTheLogs">
      <Log AtrA="log1" AtrB="Machine nr 1">
        <LogEntry AtrA="name1" AtrB="time" AtrC="location" />
        <LogEntry AtrA="name2" AtrB="time" AtrC="location" />
        <LogEntry AtrA="name3" AtrB="time" AtrC="location" />
        <LogEntry AtrA="name4" AtrB="time" AtrC="location" />
      </Log>
      <Log AtrA="log2" AtrB="Machine nr 1">
        <LogEntry AtrA="name5" AtrB="time" AtrC="location" />
        <LogEntry AtrA="name6" AtrB="time" AtrC="location" />
      </Log>
    </Logs>



    Some test code using XmlTextReader(FileStream) based on a file with
    the above xml.
    I used the VS debugger to look into the variables.

    // Requires: using System; using System.IO; using System.Xml;
    FileStream fs = null;
    int i = 0;
    long[] bookMarks = new long[4000];
    String[] linesOfText = new String[4000];
    byte[] aBuffer = new byte[1000];
    char[] charBuffer = new char[1000];
    try
    {
        // Open an existing file read-only; do not create one if it is missing.
        fs = new FileStream("c:\\aXMLfile.xml", FileMode.Open, FileAccess.Read);
        XmlTextReader reader = new XmlTextReader(fs);

        long lngthOfFS = fs.Length;

        Boolean a = false;
        // Stop before the fixed-size arrays overflow.
        while (i < bookMarks.Length && reader.Read())
        {
            // fs.Position reflects how far the reader has buffered ahead,
            // not where the current node starts.
            bookMarks[i] = fs.Position;

            if (i == 2)
            {
                // Experiment: move the stream while the reader is active.
                fs.Read(aBuffer, 0, aBuffer.Length);
                fs.Position = 0;
                int bytesRead = fs.Read(aBuffer, 0, aBuffer.Length);
                for (int g = 0; g < bytesRead; g++)
                {
                    charBuffer[g] = (char)aBuffer[g];
                }
            }

            linesOfText[i] = "Attribute count: "
                + reader.AttributeCount
                + ", NodeType: "
                + reader.NodeType
                + ", Name: "
                + reader.Name
                + ", value: "
                + reader.Value;
            a = reader.HasAttributes;

            if (reader.HasAttributes)
            {
                for (int ii = 0; ii < reader.AttributeCount; ii++)
                {
                    reader.MoveToAttribute(ii);
                    linesOfText[i] = linesOfText[i]
                        + "Attribute " + ii.ToString() + ":"
                        + ", Name: "
                        + reader.Name
                        + ", value: "
                        + reader.Value;
                }
                // Return to the element that owns the attributes.
                reader.MoveToElement();
            }

            i++;
        }
    }
    catch (Exception e)
    {
        String message = e.ToString();
    }
    finally
    {
        // The stream was never locked, so close it instead of unlocking.
        if (fs != null)
        {
            fs.Close();
        }
    }

    Other references:
    Efficient Techniques for Modifying Large XML Files
    http://msdn2.microsoft.com/en-us/library/aa302289.aspx
    XML Reader with Bookmarks
    http://msdn2.microsoft.com/en-us/library/aa302292.aspx
    The Best of Both Worlds: Combining XPath with the XmlReader
    http://msdn2.microsoft.com/en-us/library/ms950778.aspx

    Comments to references:

    Helena Kupkova developed an XmlBookmarkReader class (based on
    XmlReader). But when XmlBookmarkReader sets a bookmark on a read
    node, it caches that node and the nodes that follow, to be able to
    "replay" the bookmark when it is needed. On huge files, an early
    bookmark will cache the XML content of the file until the application
    runs out of memory.

    Dare Obasanjo's XPathReader doesn't avoid a sequential read of the file,
    testing each node read for a match against one or more XPaths. For a new
    XPath, the code would have to start reading sequentially from the start
    of the file again.




    --
    Regards,
    Paul
     
    Schwartzberg, Apr 3, 2008

  2. Schwartzberg

    Guest

    On Apr 3, 4:02 pm, Schwartzberg wrote:

    > (A)
    > I would like a solution (or ideas for it), in C#, to randomly
    > access a really huge XML file, and to sequentially read only the
    > "memory permitting" number of nodes into memory, from a place randomly
    > selected in the huge file. The application otherwise returns an out
    > of memory error or gets very slow, if I try to load the entire file,
    > because the user likes to "scroll" through the file, viewing different
    > parts. Like when scrolling through a huge Word document.


    I don't know how much this might help, and there must certainly be a
    simpler solution around, but...

    You might try treating the large XML file as if it were just text.
    Don't read the whole file; just copy out the lines from line
    number Y to line number Z (or a range of bytes, if going by lines is
    problematic) and put them into a string or a temporary file. Parse
    that text to find the first beginning of a node and cut out what is
    before it, then find the last complete node and cut out what is
    after it. Then treat the remaining text as if it were the
    XML file and parse that into the app. The user can scroll through
    that subset of the XML file, and you can repeat the same procedure
    for the next section of the file. If the process is fast enough (or
    you do it in a separate thread while the user is scrolling), the user
    will never notice that the file was broken up. If you had to cut a
    partial node off the end, remember to start the next section before
    line number Z, so that you pick up the node that was cut off.

    This is the first solution that comes to mind. It is perhaps
    needlessly complex though.
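
    The procedure described above might look roughly like the following in C#. This is a sketch under assumptions rather than a tested solution: ChunkWindow, ReadWholeEntries, ToDocument and the wrapper element "Chunk" are hypothetical names, and the code assumes ASCII content, self-closing LogEntry elements as in the sample, no '>' inside attribute values, and a window larger than any single tag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class ChunkWindow
{
    // Returns the complete <LogEntry ... /> elements found in the window that
    // starts at offset, and reports where the next window should begin.
    public static List<string> ReadWholeEntries(Stream stream, long offset,
                                                int windowBytes, out long nextOffset)
    {
        byte[] buffer = new byte[windowBytes];
        stream.Seek(offset, SeekOrigin.Begin);
        int total = 0, read;
        while (total < windowBytes &&
               (read = stream.Read(buffer, total, windowBytes - total)) > 0)
        {
            total += read;
        }

        // ASCII decoding yields one char per byte, so string indexes equal
        // byte offsets within the window (assumes ASCII content).
        string text = Encoding.ASCII.GetString(buffer, 0, total);

        List<string> entries = new List<string>();
        int pos = 0; // index just past the last complete entry
        while (true)
        {
            // Anything before the first "<LogEntry" is a cut-off node: skip it.
            int start = text.IndexOf("<LogEntry", pos, StringComparison.Ordinal);
            if (start < 0) break;
            int end = text.IndexOf("/>", start, StringComparison.Ordinal);
            if (end < 0) break; // entry is cut off by the end of the window
            entries.Add(text.Substring(start, end + 2 - start));
            pos = end + 2;
        }

        // If the window ends inside a tag, restart the next window at that tag.
        int tail = text.LastIndexOf('<');
        bool partial = tail >= pos && text.IndexOf('>', tail) < 0;
        nextOffset = offset + (partial ? tail : total);
        return entries;
    }

    // Wraps the entries in a temporary root element so they parse as a document.
    public static XmlDocument ToDocument(List<string> entries)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<Chunk>" + string.Join("", entries.ToArray()) + "</Chunk>");
        return doc;
    }
}
```

    Passing nextOffset back in as the offset of the following call walks forward through the file without losing the node that was cut off at the end of a window. Working in bytes rather than line numbers avoids having to count lines from the start of the file.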
     
    , Apr 3, 2008
