reading attributes with no quotes using XmlTextReader

Discussion in 'ASP .Net' started by apiringmvp, Nov 28, 2006.

  1. apiringmvp

    apiringmvp Guest

    All,

    I am writing a function that extracts a short blurb of HTML from a
    blog post. I would like to retain all HTML formatting and images. The
    code below works well, with one exception.

    My issue:
    ---------------------
    When a blog's HTML has attributes with no quotes, I get an exception.

    Here's an example from the blog I am dealing with:
    <p align=center>Some text from the blog.</p>

    Questions:
    ----------------------
    Is there a way to get the XmlTextReader to allow attributes without
    quotes?

    If not, would you use regular expressions to add the missing quotes?

    And if so, does anyone know a regular expression that could do this
    replacement?


    Code:
    ----------------------
    // Requires: using System; using System.IO; using System.Text;
    //           using System.Text.RegularExpressions; using System.Xml;
    public static string GetContentShortBlurb(string content, int len)
    {
        try
        {
            // Wrap bare text in a paragraph so the fragment has an element.
            if (!content.TrimStart(' ', '\r', '\n').StartsWith("<"))
                content = "<p>" + content + "</p>";

            StringBuilder sb = new StringBuilder();
            XmlWriterSettings xws = new XmlWriterSettings();
            xws.ConformanceLevel = ConformanceLevel.Fragment;
            xws.OmitXmlDeclaration = true;

            // A single <doc> root makes the fragment a well-formed document.
            using (XmlTextReader xr = new XmlTextReader(
                new StringReader("<doc>" + content + "</doc>")))
            using (XmlWriter xw = XmlWriter.Create(sb, xws))
            {
                xr.Read(); // consume the opening <doc>

                int strCount = 0;       // text characters written so far
                int nodesToEnd = 0;     // elements still open in the output
                bool truncated = false; // set once the text has been cut

                while (!truncated && strCount < len && xr.Read())
                {
                    if (xr.NodeType == XmlNodeType.EndElement)
                    {
                        if (xr.Name == "doc") break;

                        xw.WriteEndElement();
                        nodesToEnd--;
                    }
                    else if (xr.NodeType == XmlNodeType.Element)
                    {
                        // Capture this before moving onto the attributes.
                        bool isEmpty = xr.IsEmptyElement;
                        xw.WriteStartElement(xr.Name);

                        // write attributes
                        while (xr.MoveToNextAttribute())
                            xw.WriteAttributeString(xr.Name, xr.Value);

                        // Elements like <img ... /> produce no EndElement node.
                        if (isEmpty) xw.WriteEndElement();
                        else nodesToEnd++;
                    }
                    else if (xr.NodeType == XmlNodeType.Text)
                    {
                        string inner = xr.Value;
                        if (inner.Length + strCount > len)
                        {
                            // Cut at the last space within the budget, if any.
                            int cut = inner.LastIndexOf(' ', len - strCount);
                            if (cut < 0) cut = len - strCount;
                            inner = inner.Substring(0, cut) + " ...";
                            truncated = true;
                        }
                        xw.WriteString(inner);
                        strCount += inner.Length;
                    }
                }

                // Close whatever is still open so the blurb stays balanced.
                for (int i = 0; i < nodesToEnd; i++)
                    xw.WriteEndElement();
            }

            return sb.ToString();
        }
        catch (Exception)
        {
            // Not well-formed XML: strip the tags and trim the plain text.
            string stripHtmlEx = @"</?([A-Z][A-Z0-9]*)\b[^>]*>";
            string output = Regex.Replace(content, stripHtmlEx, "",
                RegexOptions.IgnoreCase);
            if (output.Length > len)
            {
                int cut = output.LastIndexOf(' ', len);
                if (cut < 0) cut = len;
                output = "<p>" + output.Substring(0, cut)
                    .Replace("\r\n", "</p>\r\n<p>") + " ...</p>";
            }
            return output;
        }
    }
     
    apiringmvp, Nov 28, 2006
    #1

  2. Karl Seguin

    Karl Seguin Guest

    Your problem, which you might already know, is that you are trying to use
    an XmlTextReader to read non-XML content. XML strictly requires every
    attribute value to be enclosed in quotes (single or double). HTML is based
    on SGML, which has no such requirement. XHTML, on the other hand, is based
    on XML, so you shouldn't have any problems with it.
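
    To make the quoting rule concrete, here is a minimal sketch that feeds
    three variants of the same tag to XmlTextReader. The QuoteCheck class and
    its IsReadable helper are names invented for this illustration, not part
    of the framework.

    ```csharp
    using System;
    using System.IO;
    using System.Xml;

    public static class QuoteCheck
    {
        // Returns true when XmlTextReader reads the whole fragment without error.
        public static bool IsReadable(string fragment)
        {
            try
            {
                using (XmlTextReader reader = new XmlTextReader(new StringReader(fragment)))
                {
                    while (reader.Read()) { }
                }
                return true;
            }
            catch (XmlException)
            {
                // The reader rejects anything that is not well-formed XML.
                return false;
            }
        }

        public static void Main()
        {
            Console.WriteLine(IsReadable("<p align=center>Some text</p>"));     // False
            Console.WriteLine(IsReadable("<p align=\"center\">Some text</p>")); // True
            Console.WriteLine(IsReadable("<p align='center'>Some text</p>"));   // True
        }
    }
    ```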

    All this to say that there probably isn't a way to make XmlTextReader work
    without quotes. If it did, it wouldn't be an XML reader. Unfortunately,
    the framework doesn't include an SgmlTextReader, which is really what you
    should be using.

    You could try to use regular expressions to turn your content into valid
    XML, but I think you'll keep running into new issues with this: first it'll
    be missing quotes, then missing closing tags...
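
    For the missing-quotes case alone, a regex pass might look like the sketch
    below. AttributeFixer and QuoteAttributes are made-up names, and the
    patterns are assumptions that cover simple tags only: a ">" inside a
    quoted value, unescaped "&" characters and unclosed tags are not handled,
    which is exactly the kind of follow-on issue mentioned above.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class AttributeFixer
    {
        // One opening tag, e.g. <p align=center>. Closing tags are skipped.
        private static readonly Regex OpenTag = new Regex(@"<[A-Za-z][^<>]*>");

        // name=value where the value is double-quoted, single-quoted or bare.
        // Quoted values are matched whole so their contents are never rewritten.
        private static readonly Regex Attribute =
            new Regex(@"([\w:-]+)\s*=\s*(""[^""]*""|'[^']*'|[^\s""'>]+)");

        public static string QuoteAttributes(string html)
        {
            return OpenTag.Replace(html, delegate(Match tag)
            {
                return Attribute.Replace(tag.Value, delegate(Match attr)
                {
                    string value = attr.Groups[2].Value;
                    // Already quoted: leave the attribute exactly as written.
                    if (value[0] == '"' || value[0] == '\'') return attr.Value;
                    return attr.Groups[1].Value + "=\"" + value + "\"";
                });
            });
        }

        public static void Main()
        {
            // Prints: <p align="center">Some text from the blog.</p>
            Console.WriteLine(QuoteAttributes("<p align=center>Some text from the blog.</p>"));
        }
    }
    ```

    Restricting the second pattern to text inside a tag keeps ordinary prose
    such as "a=b" in the blog body untouched.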

    Falling back to a regular expression or even plain string manipulation
    (IndexOf and Substring) on the raw text is probably the right way to go.

    Karl


    --
    http://www.openmymind.net/
    http://www.fuelindustries.com/


     
    Karl Seguin, Nov 28, 2006
    #2

  3. John Timney (MVP)

    You're stuck with string manipulation, and it's not likely to be the
    easiest task.

    I have to ask: if it's from a blog, why can't you syndicate the RSS feed
    and consume that?
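
    As a rough sketch of that route, the snippet below pulls the description
    of each item out of an RSS 2.0 document with XmlDocument. FeedBlurbs,
    GetDescriptions and the sample feed are invented for illustration; a real
    feed would be downloaded first. The feed itself is well-formed XML, but
    the description it carries can still be loose HTML, so this only helps
    when the feed's own summary is short enough to use directly.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    public static class FeedBlurbs
    {
        // Returns the <description> text of every <item> in an RSS 2.0 document.
        public static List<string> GetDescriptions(string rssXml)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(rssXml);

            List<string> descriptions = new List<string>();
            foreach (XmlNode item in doc.SelectNodes("/rss/channel/item"))
            {
                XmlNode description = item.SelectSingleNode("description");
                if (description != null)
                    descriptions.Add(description.InnerText); // entities are decoded
            }
            return descriptions;
        }

        public static void Main()
        {
            // Hypothetical feed content standing in for a downloaded document.
            string sample =
                "<rss version=\"2.0\"><channel><title>Example blog</title>" +
                "<item><title>First post</title>" +
                "<description>&lt;p align=center&gt;Some text from the blog.&lt;/p&gt;</description>" +
                "</item></channel></rss>";

            foreach (string d in GetDescriptions(sample))
                Console.WriteLine(d); // <p align=center>Some text from the blog.</p>
        }
    }
    ```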

    --
    Regards

    John Timney (MVP)
    VISIT MY WEBSITE:
    http://www.johntimney.com
    http://www.johntimney.com/blog


     
    John Timney (MVP), Nov 28, 2006
    #3
  4. Rad [Visual C# MVP]

    You are going to run into very serious problems using an XmlTextReader
    to operate on HTML. HTML is almost always NOT valid XML.

    You'd be better off using regular expressions to manipulate the text.
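
    One way that could look is sketched below: a regex splits the HTML into
    tags and text runs, text is counted against a budget, and any tags still
    open at the cut are closed. HtmlBlurb, Truncate and the short list of
    void tags are assumptions made for this example. Comments, scripts and
    stray "<" characters in the text are not handled.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    public static class HtmlBlurb
    {
        // Matches either one tag or one run of text between tags.
        private static readonly Regex Token = new Regex(@"<[^<>]+>|[^<]+");
        // Captures the element name of an opening or closing tag.
        private static readonly Regex TagName = new Regex(@"^</?\s*([A-Za-z][A-Za-z0-9]*)");
        // Elements that never take a closing tag in HTML (partial list).
        private static readonly string[] VoidTags = { "br", "hr", "img", "input", "meta", "link" };

        public static string Truncate(string html, int maxTextLength)
        {
            StringBuilder output = new StringBuilder();
            Stack<string> open = new Stack<string>();
            int textLength = 0;

            foreach (Match m in Token.Matches(html))
            {
                string token = m.Value;
                if (token[0] == '<')
                {
                    Match name = TagName.Match(token);
                    if (name.Success)
                    {
                        string tag = name.Groups[1].Value.ToLowerInvariant();
                        bool closing = token.StartsWith("</", StringComparison.Ordinal);
                        bool selfClosed = token.EndsWith("/>", StringComparison.Ordinal)
                            || Array.IndexOf(VoidTags, tag) >= 0;
                        if (closing) { if (open.Count > 0 && open.Peek() == tag) open.Pop(); }
                        else if (!selfClosed) open.Push(tag);
                    }
                    output.Append(token); // tags pass through unchanged
                    continue;
                }

                int remaining = maxTextLength - textLength;
                if (token.Length <= remaining)
                {
                    output.Append(token);
                    textLength += token.Length;
                    continue;
                }

                // Cut at the last space inside the budget, or hard-cut if none.
                int cut = token.LastIndexOf(' ', remaining);
                if (cut <= 0) cut = remaining;
                output.Append(token.Substring(0, cut)).Append(" ...");
                break;
            }

            // Close every element that was still open at the cut.
            while (open.Count > 0) output.Append("</").Append(open.Pop()).Append('>');
            return output.ToString();
        }

        public static void Main()
        {
            // Prints: <p align=center>Some text from ...</p>
            Console.WriteLine(Truncate("<p align=center>Some text from the <b>blog</b> today.</p>", 14));
        }
    }
    ```

    Because nothing here parses XML, the unquoted align=center attribute is
    simply copied through.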


    --

    Bits.Bytes.
    http://bytes.thinkersroom.com
     
    Rad [Visual C# MVP], Nov 28, 2006
    #4
