Special characters in attributes

Discussion in 'HTML' started by SDG, Sep 19, 2007.

  1. SDG

    SDG Guest

    Hi, I'm writing a web scraper to extract text from a web page, and I
    need to know what characters can be present inside an attribute of a
    tag.
    So far, in the code of my program, I've written that attributes can
    contain these characters: '!=@/ \[]#.:_()-&;?
    Did I forget something? I've looked for an official
    specification (like a regular expression for HTML or even only for
    attributes), but so far I haven't found anything.
    Thanks a lot
     
    SDG, Sep 19, 2007
    #1

  2. On Sep 19, 1:30 pm, SDG wrote:
    > Hi, I'm writing a web scraper to extract text from a web page, and I
    > need to know what characters can be present inside an attribute of a
    > tag.


    Any. Some attributes do limit what is allowed, although those limits
    usually aren't expressed by the DTD (e.g. the width attribute takes
    an integer, or an integer followed by a percent sign), and some
    characters (& for example) have special meaning.
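
    To illustrate the special meaning of &: in a well-formed attribute
    value a literal ampersand is written as the reference &amp;, so a
    scraper has to decode references after it has extracted the value. A
    minimal sketch in Python, using only the standard library's html
    module (the sample strings are made up for illustration):

```python
import html

# Attribute value as it appears in the page source (hypothetical sample).
raw_value = "page.php?a=1&amp;b=2"

# Decode character references to get the value the author meant.
decoded = html.unescape(raw_value)

# The reverse direction: make arbitrary text safe to place inside a
# double-quoted attribute value.
encoded = html.escape('Tom & "Jerry"', quote=True)

print(decoded)
print(encoded)
```

    Decoding after extraction, rather than before, keeps an encoded quote
    (&quot;) from being mistaken for the end of the attribute value.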

    --
    David Dorward
    http://dorward.me.uk/
    http://blog.dorward.me.uk/
     
    David Dorward, Sep 19, 2007
    #2

  3. Scripsit SDG:

    > Hi, I'm writing a web scraper to extract text from a web page,


    Sounds like reinventing the wheel. Do you intend to reinvent it from
    scratch, or are you using some software package for parsing HTML?

    > and I
    > need to know what characters can be present inside an attribute of a
    > tag.


    Apparently you are not using some software package for parsing HTML. Do you
    really think you are competent enough to consider SGML parsing, XML parsing,
    and tagsoup parsing, including their conflicts?

    > So far, in the code of my program, I've written that attributes can
    > contain these characters: '!=@/ \[]#.:_()-&;?


    What an interesting set of characters. I think it's probably the set you
    found lying on your keyboard, excluding - for some odd reason - letters and
    digits. And you didn't notice e.g. the poor lonesome "+" or the
    innocent-looking "$".

    > Did I forget something?


    Oh, just about 1,000,000 characters. (I'm not kidding. The character set of
    HTML is defined as UCS, commonly known as the Unicode character set, though
    more formally the ISO 10646 set. Currently only about 100,000 code points
    have been allocated, but can you disallow, in HTML parsing, the unassigned
    code points? Hardly.)
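
    A rough sketch of why a whitelist falls short, in Python. The pattern
    is the character list from the question with letters and digits added
    (an assumption, since the list itself omits them); the helper name
    passes_whitelist is made up for this example:

```python
import re
import sys

# Character list from the question, plus ASCII letters and digits
# (assumed; the posted list does not include them).
WHITELIST = re.compile(r"^[A-Za-z0-9'!=@/ \\\[\]#.:_()\-&;?]*$")

def passes_whitelist(value):
    """Return True if every character of value is in the whitelist."""
    return WHITELIST.match(value) is not None

# Size of the Unicode code space: U+0000 through U+10FFFF.
TOTAL_CODE_POINTS = sys.maxunicode + 1

for sample in ["index.html?a=1", "a+b", "$5", "Jyväskylä"]:
    print(repr(sample), passes_whitelist(sample))
print(TOTAL_CODE_POINTS)
```

    The last three samples are perfectly ordinary attribute values, yet
    the whitelist rejects them.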

    > I've looked for an official
    > specification (like a regular expression for HTML or even only for
    > attributes), but so far I haven't found anything.


    There are several official specifications for HTML. Didn't you know this?
    The character repertoire allowed inside an attribute value depends on the
    declaration of the attribute, but it can be CDATA, i.e. arbitrary character
    data, excluding just the string delimiter (" or ') and, with some variation
    between HTML versions, the ampersand character & as such in many or all
    contexts. So the question is what can and needs to be _excluded_ (or,
    better, treated as markup errors).
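
    For comparison, a minimal sketch of the "use an existing parser"
    route, here with html.parser from the Python standard library. It
    finds the string delimiters and decodes character references itself,
    so no list of allowed characters is needed. The class name
    AttributeCollector and the function collect_attributes are invented
    for this example:

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Collect (tag, attribute name, attribute value) triples."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already
        # removed and character references already decoded.
        for name, value in attrs:
            self.found.append((tag, name, value))

def collect_attributes(markup):
    """Parse markup and return every attribute found in start tags."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.found

sample = (
    '<a href="x.html?a=1&amp;b=2" '
    "title='He said \"hi\" + $5'>link</a>"
    '<img alt="Jyväskylä" width="50%">'
)
for triple in collect_attributes(sample):
    print(triple)
```

    This only covers the tag-soup case as html.parser happens to handle
    it; it does not settle the SGML versus XML differences raised earlier
    in the thread.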

    --
    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/
     
    Jukka K. Korpela, Sep 19, 2007
    #3
