xml data or other?

Discussion in 'Python' started by Artie Ziff, Nov 9, 2012.

  1. Artie Ziff

    Artie Ziff Guest

    Hello,

    I want to process XML-like data like this:

    <testname=ltpacpi.sh>
    <description>
    ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.

    <\description>
    <test_location>
    ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
    <\test_location>
    <\testname>


    After manually editing the data above, the python module
    xml.etree.ElementTree parses it without failing due to error in the data
    structure.

    Edits were substituting '/' for '\' on the end tags, and adding the
    following structure:

    <?xml version="1.0"?>
    <data>
    <testname name=ltpacpi.sh>
    ...
    <\testname>
    </data>


    Is there a name for the format above (perhaps xhtml)?
    I'd like to find a python module that can translate it to proper xml.
    Does one exist? etree?

    Many thanks!
    az
    Artie Ziff, Nov 9, 2012
    #1
    1. Advertising

  2. Artie Ziff

    rusi Guest

    On Nov 9, 5:54 pm, Artie Ziff <> wrote:
    > Hello,
    >
    > I want to process XML-like data like this:

    <snipped>
    > Edits were substituting '/' for '\' on the end tags, and adding the
    > following structure:


    If thats all you want, you can try the following:

    # obviously this should come from a file
    input= """<testname=ltpacpi.sh>
    <description>
    ACPI (Advanced Control Power & Integration) testscript
    for 2.5 kernels.

    <\description>
    <test_location>
    ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
    <\test_location>
    <\testname>"""

    prefix = """<?xml version="1.0"?>
    <data>
    """

    postfix = """</data>"""

    correctedInput = prefix + input.replace("\\", "/") + postfix
    # submit correctedinput to etree
    rusi, Nov 9, 2012
    #2
    1. Advertising

  3. Artie Ziff

    Guest

    On Friday, November 9, 2012 12:54:56 PM UTC, Artie Ziff wrote:
    > Hello,
    >
    >
    >
    > I want to process XML-like data like this:
    >
    >
    >
    > <testname=ltpacpi.sh>
    >
    > <description>
    >
    > ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.
    >
    >
    >
    > <\description>
    >
    > <test_location>
    >
    > ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
    >
    > <\test_location>
    >
    > <\testname>
    >
    >
    >
    >
    >
    > After manually editing the data above, the python module
    >
    > xml.etree.ElementTree parses it without failing due to error in the data
    >
    > structure.
    >
    >
    >
    > Edits were substituting '/' for '\' on the end tags, and adding the
    >
    > following structure:
    >
    >
    >
    > <?xml version="1.0"?>
    >
    > <data>
    >
    > <testname name=ltpacpi.sh>
    >
    > ...
    >
    > <\testname>
    >
    > </data>
    >
    >
    >
    >
    >
    > Is there a name for the format above (perhaps xhtml)?
    >
    > I'd like to find a python module that can translate it to proper xml.
    >
    > Does one exist? etree?
    >
    >
    >
    > Many thanks!
    >
    > az


    maybe an xml tool would be better, a good list of xml tools here; http://www.xml-data.info
    , Nov 13, 2012
    #3
  4. Artie Ziff

    Guest

    On Friday, November 9, 2012 12:54:56 PM UTC, Artie Ziff wrote:
    > Hello,
    >
    >
    >
    > I want to process XML-like data like this:
    >
    >
    >
    > <testname=ltpacpi.sh>
    >
    > <description>
    >
    > ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.
    >
    >
    >
    > <\description>
    >
    > <test_location>
    >
    > ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
    >
    > <\test_location>
    >
    > <\testname>
    >
    >
    >
    >
    >
    > After manually editing the data above, the python module
    >
    > xml.etree.ElementTree parses it without failing due to error in the data
    >
    > structure.
    >
    >
    >
    > Edits were substituting '/' for '\' on the end tags, and adding the
    >
    > following structure:
    >
    >
    >
    > <?xml version="1.0"?>
    >
    > <data>
    >
    > <testname name=ltpacpi.sh>
    >
    > ...
    >
    > <\testname>
    >
    > </data>
    >
    >
    >
    >
    >
    > Is there a name for the format above (perhaps xhtml)?
    >
    > I'd like to find a python module that can translate it to proper xml.
    >
    > Does one exist? etree?
    >
    >
    >
    > Many thanks!
    >
    > az


    maybe an xml tool would be better, a good list of xml tools here; http://www.xml-data.info
    , Nov 13, 2012
    #4
  5. Artie Ziff

    Artie Ziff Guest

    On 11/9/12 5:50 AM, rusi wrote:
    > On Nov 9, 5:54 pm, Artie Ziff <> wrote:
    > # submit correctedinput to etree

    I was very grateful to get the "leg up" on getting started down that
    right path with my coding. Many thanks to you, rusi. I took your
    excellent advices and have this working.

    class Converter():
    PREFIX = """<?xml version="1.0"?>
    <data>
    """
    POSTFIX = "</data>"
    def __init__(self, data):
    self.data = data
    self.writeXML()
    def writeXML(self):
    pattern = re.compile('<testname=(.*)>')
    replaceStr = r'<testname name="\1">'
    xmlData = re.sub(pattern, replaceStr, self.data)
    self.dataXML = self.PREFIX + xmlData.replace("\\", "/") +
    self.POSTFIX

    ### main
    # input to script is directory:
    # sanitize trailing slash
    testPkgDir = sys.argv[1].rstrip('/')
    # Within each test package directory is doc/testcase
    tcDocDir = "doc/testcases"
    # set input dir, containing broken files
    tcTxtDir = os.path.join(testPkgDir, tcDocDir)
    # set output dir, to write proper XML files
    tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
    if not os.path.exists(tcXmlDir):
    os.makedirs(tcXmlDir)
    # iterate through files in input dir
    for filename in os.listdir(tcTxtDir):
    # set filepaths
    filepathTXT = os.path.join(tcTxtDir, filename)
    base = os.path.splitext(filename)[0]
    fileXML = base + ".xml"
    filepathXML = os.path.join(tcXmlDir, fileXML)
    # read broken file, convert to proper XML
    with open(filepathTXT) as f:
    c = Converter(f.read())
    xmlFO = open(filepathXML, 'w') # xmlFileObject
    xmlFO.write(c.dataXML)
    xmlFO.close()

    ###

    Writing XML files so to see whats happening. My plan is to
    keep xml data in memory and parse with xml.etree.ElementTree.

    Unfortunately, xml parsing fails due to angle brackets inside
    description tags. In particular, xml.etree.ElementTree.parse()
    aborts on '<' inside xml data such as the following:

    <testname name="cron_test.sh">
    <description>
    This testcase tests if crontab <filename> installs the cronjob
    and cron schedules the job correctly.
    <\description>

    ##

    What is right way to handle the extra angle brackets?
    Substitute on line-by-line basis, if that works?
    Or learn to write a simple stack-style parser, or
    recursive descent, it may be called?

    I am open to comments to improve my code more to be more readable,
    pythonic, or better.

    Many thanks
    AZ
    Artie Ziff, Nov 18, 2012
    #5
  6. Artie Ziff

    rusi Guest

    On Nov 18, 6:32 pm, Artie Ziff <> wrote:
    > Unfortunately, xml parsing fails due to angle brackets inside
    > description tags. In particular, xml.etree.ElementTree.parse()
    > aborts on '<' inside xml data such as the following:
    >
    > <testname name="cron_test.sh">
    >      <description>
    >          This testcase tests if crontab <filename> installs thecronjob
    >          and cron schedules the job correctly.
    >      <\description>
    >
    > ##
    >
    > What is right way to handle the extra angle brackets?
    > Substitute on line-by-line basis, if that works?
    > Or learn to write a simple stack-style parser, or
    > recursive descent, it may be called?
    >
    > I am open to comments to improve my code more to be more readable,
    > pythonic, or better.
    >
    > Many thanks
    > AZ


    Start with cgi.escape perhaps?
    http://docs.python.org/2/library/cgi.html
    rusi, Nov 18, 2012
    #6
  7. Artie Ziff

    rusi Guest

    rusi, Nov 18, 2012
    #7
  8. Artie Ziff wrote:

    >
    > On 11/9/12 5:50 AM, rusi wrote:

    > > On Nov 9, 5:54 pm, Artie Ziff <> wrote:
    >> # submit correctedinput to etree

    > I was very grateful to get the "leg up" on getting started down that
    > right path with my coding. Many thanks to you, rusi. I took your
    > excellent advices and have this working.
    >
    > class Converter():
    > PREFIX = """<?xml version="1.0"?>
    > <data>
    > """
    > POSTFIX = "</data>"
    > def __init__(self, data):
    > self.data = data
    > self.writeXML()
    > def writeXML(self):
    > pattern = re.compile('<testname=(.*)>')
    > replaceStr = r'<testname name="\1">'
    > xmlData = re.sub(pattern, replaceStr, self.data)
    > self.dataXML = self.PREFIX + xmlData.replace("\\", "/") +
    > self.POSTFIX
    >
    > ### main
    > # input to script is directory:
    ># sanitize trailing slash
    > testPkgDir = sys.argv[1].rstrip('/')
    > # Within each test package directory is doc/testcase
    > tcDocDir = "doc/testcases"
    > # set input dir, containing broken files
    > tcTxtDir = os.path.join(testPkgDir, tcDocDir)
    > # set output dir, to write proper XML files
    > tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
    > if not os.path.exists(tcXmlDir):
    > os.makedirs(tcXmlDir)
    > # iterate through files in input dir
    > for filename in os.listdir(tcTxtDir):
    > # set filepaths
    > filepathTXT = os.path.join(tcTxtDir, filename)
    > base = os.path.splitext(filename)[0]
    > fileXML = base + ".xml"
    > filepathXML = os.path.join(tcXmlDir, fileXML)
    > # read broken file, convert to proper XML
    > with open(filepathTXT) as f:
    > c = Converter(f.read())
    > xmlFO = open(filepathXML, 'w') # xmlFileObject
    > xmlFO.write(c.dataXML)
    > xmlFO.close()
    >
    > ###
    >
    > Writing XML files so to see whats happening. My plan is to
    > keep xml data in memory and parse with xml.etree.ElementTree.
    >
    > Unfortunately, xml parsing fails due to angle brackets inside
    > description tags. In particular, xml.etree.ElementTree.parse()
    > aborts on '<' inside xml data such as the following:
    >
    > <testname name="cron_test.sh">
    > <description>
    > This testcase tests if crontab <filename> installs the cronjob
    > and cron schedules the job correctly.
    > <\description>
    >
    > ##
    >
    > What is right way to handle the extra angle brackets?
    > Substitute on line-by-line basis, if that works?
    > Or learn to write a simple stack-style parser, or
    > recursive descent, it may be called?


    I think your description text should be in a CDATA section.
    http://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML

    ~Ramit


    This email is confidential and subject to important disclaimers and
    conditions including on offers for the purchase or sale of
    securities, accuracy and completeness of information, viruses,
    confidentiality, legal privilege, and legal entity disclaimers,
    available at http://www.jpmorgan.com/pages/disclosures/email.
    Prasad, Ramit, Nov 19, 2012
    #8
  9. Prasad, Ramit, 19.11.2012 22:42:
    > Artie Ziff wrote:
    >> Writing XML files so to see whats happening. My plan is to
    >> keep xml data in memory and parse with xml.etree.ElementTree.
    >>
    >> Unfortunately, xml parsing fails due to angle brackets inside
    >> description tags. In particular, xml.etree.ElementTree.parse()
    >> aborts on '<' inside xml data such as the following:
    >>
    >> <testname name="cron_test.sh">
    >> <description>
    >> This testcase tests if crontab <filename> installs the cronjob
    >> and cron schedules the job correctly.
    >> <\description>
    >>
    >> ##
    >>
    >> What is right way to handle the extra angle brackets?
    >> Substitute on line-by-line basis, if that works?
    >> Or learn to write a simple stack-style parser, or
    >> recursive descent, it may be called?

    >
    > I think your description text should be in a CDATA section.
    > http://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML


    Ah, don't bother with CDATA. Just make sure the data gets properly escaped,
    any XML serialiser will do that for you. Just generate the XML using
    ElementTree and you'll be fine. Generating XML as literal text is not a
    good idea.

    Stefan
    Stefan Behnel, Nov 20, 2012
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.

Share This Page