xml data or other?

A

Artie Ziff

Hello,

I want to process XML-like data like this:

<testname=ltpacpi.sh>
<description>
ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.

<\description>
<test_location>
ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
<\test_location>
<\testname>


After manually editing the data above, the python module
xml.etree.ElementTree parses it without failing due to error in the data
structure.

Edits were substituting '/' for '\' on the end tags, and adding the
following structure:

<?xml version="1.0"?>
<data>
<testname name=ltpacpi.sh>
...
<\testname>
</data>


Is there a name for the format above (perhaps xhtml)?
I'd like to find a python module that can translate it to proper xml.
Does one exist? etree?

Many thanks!
az
 
R

rusi

Hello,

I want to process XML-like data like this:
Edits were substituting '/' for '\' on the end tags, and adding the
following structure:

If thats all you want, you can try the following:

# obviously this should come from a file
input= """<testname=ltpacpi.sh>
<description>
ACPI (Advanced Control Power & Integration) testscript
for 2.5 kernels.

<\description>
<test_location>
ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
<\test_location>
<\testname>"""

prefix = """<?xml version="1.0"?>
<data>
"""

postfix = """</data>"""

correctedInput = prefix + input.replace("\\", "/") + postfix
# submit correctedinput to etree
 
S

shivers.paul

Hello,



I want to process XML-like data like this:



<testname=ltpacpi.sh>

<description>

ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.



<\description>

<test_location>

ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh

<\test_location>

<\testname>





After manually editing the data above, the python module

xml.etree.ElementTree parses it without failing due to error in the data

structure.



Edits were substituting '/' for '\' on the end tags, and adding the

following structure:



<?xml version="1.0"?>

<data>

<testname name=ltpacpi.sh>

...

<\testname>

</data>





Is there a name for the format above (perhaps xhtml)?

I'd like to find a python module that can translate it to proper xml.

Does one exist? etree?



Many thanks!

az

maybe an xml tool would be better, a good list of xml tools here; http://www.xml-data.info
 
S

shivers.paul

Hello,



I want to process XML-like data like this:



<testname=ltpacpi.sh>

<description>

ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.



<\description>

<test_location>

ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh

<\test_location>

<\testname>





After manually editing the data above, the python module

xml.etree.ElementTree parses it without failing due to error in the data

structure.



Edits were substituting '/' for '\' on the end tags, and adding the

following structure:



<?xml version="1.0"?>

<data>

<testname name=ltpacpi.sh>

...

<\testname>

</data>





Is there a name for the format above (perhaps xhtml)?

I'd like to find a python module that can translate it to proper xml.

Does one exist? etree?



Many thanks!

az

maybe an xml tool would be better, a good list of xml tools here; http://www.xml-data.info
 
A

Artie Ziff

# submit correctedinput to etree
I was very grateful to get the "leg up" on getting started down that
right path with my coding. Many thanks to you, rusi. I took your
excellent advices and have this working.

class Converter():
PREFIX = """<?xml version="1.0"?>
<data>
"""
POSTFIX = "</data>"
def __init__(self, data):
self.data = data
self.writeXML()
def writeXML(self):
pattern = re.compile('<testname=(.*)>')
replaceStr = r'<testname name="\1">'
xmlData = re.sub(pattern, replaceStr, self.data)
self.dataXML = self.PREFIX + xmlData.replace("\\", "/") +
self.POSTFIX

### main
# input to script is directory:
# sanitize trailing slash
testPkgDir = sys.argv[1].rstrip('/')
# Within each test package directory is doc/testcase
tcDocDir = "doc/testcases"
# set input dir, containing broken files
tcTxtDir = os.path.join(testPkgDir, tcDocDir)
# set output dir, to write proper XML files
tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
if not os.path.exists(tcXmlDir):
os.makedirs(tcXmlDir)
# iterate through files in input dir
for filename in os.listdir(tcTxtDir):
# set filepaths
filepathTXT = os.path.join(tcTxtDir, filename)
base = os.path.splitext(filename)[0]
fileXML = base + ".xml"
filepathXML = os.path.join(tcXmlDir, fileXML)
# read broken file, convert to proper XML
with open(filepathTXT) as f:
c = Converter(f.read())
xmlFO = open(filepathXML, 'w') # xmlFileObject
xmlFO.write(c.dataXML)
xmlFO.close()

###

Writing XML files so to see whats happening. My plan is to
keep xml data in memory and parse with xml.etree.ElementTree.

Unfortunately, xml parsing fails due to angle brackets inside
description tags. In particular, xml.etree.ElementTree.parse()
aborts on '<' inside xml data such as the following:

<testname name="cron_test.sh">
<description>
This testcase tests if crontab <filename> installs the cronjob
and cron schedules the job correctly.
<\description>

##

What is right way to handle the extra angle brackets?
Substitute on line-by-line basis, if that works?
Or learn to write a simple stack-style parser, or
recursive descent, it may be called?

I am open to comments to improve my code more to be more readable,
pythonic, or better.

Many thanks
AZ
 
R

rusi

Unfortunately, xml parsing fails due to angle brackets inside
description tags. In particular, xml.etree.ElementTree.parse()
aborts on '<' inside xml data such as the following:

<testname name="cron_test.sh">
     <description>
         This testcase tests if crontab <filename> installs thecronjob
         and cron schedules the job correctly.
     <\description>

##

What is right way to handle the extra angle brackets?
Substitute on line-by-line basis, if that works?
Or learn to write a simple stack-style parser, or
recursive descent, it may be called?

I am open to comments to improve my code more to be more readable,
pythonic, or better.

Many thanks
AZ

Start with cgi.escape perhaps?
http://docs.python.org/2/library/cgi.html
 
P

Prasad, Ramit

Artie said:
# submit correctedinput to etree
I was very grateful to get the "leg up" on getting started down that
right path with my coding. Many thanks to you, rusi. I took your
excellent advices and have this working.

class Converter():
PREFIX = """<?xml version="1.0"?>
<data>
"""
POSTFIX = "</data>"
def __init__(self, data):
self.data = data
self.writeXML()
def writeXML(self):
pattern = re.compile('<testname=(.*)>')
replaceStr = r'<testname name="\1">'
xmlData = re.sub(pattern, replaceStr, self.data)
self.dataXML = self.PREFIX + xmlData.replace("\\", "/") +
self.POSTFIX

### main
# input to script is directory:
# sanitize trailing slash
testPkgDir = sys.argv[1].rstrip('/')
# Within each test package directory is doc/testcase
tcDocDir = "doc/testcases"
# set input dir, containing broken files
tcTxtDir = os.path.join(testPkgDir, tcDocDir)
# set output dir, to write proper XML files
tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
if not os.path.exists(tcXmlDir):
os.makedirs(tcXmlDir)
# iterate through files in input dir
for filename in os.listdir(tcTxtDir):
# set filepaths
filepathTXT = os.path.join(tcTxtDir, filename)
base = os.path.splitext(filename)[0]
fileXML = base + ".xml"
filepathXML = os.path.join(tcXmlDir, fileXML)
# read broken file, convert to proper XML
with open(filepathTXT) as f:
c = Converter(f.read())
xmlFO = open(filepathXML, 'w') # xmlFileObject
xmlFO.write(c.dataXML)
xmlFO.close()

###

Writing XML files so to see whats happening. My plan is to
keep xml data in memory and parse with xml.etree.ElementTree.

Unfortunately, xml parsing fails due to angle brackets inside
description tags. In particular, xml.etree.ElementTree.parse()
aborts on '<' inside xml data such as the following:

<testname name="cron_test.sh">
<description>
This testcase tests if crontab <filename> installs the cronjob
and cron schedules the job correctly.
<\description>

##

What is right way to handle the extra angle brackets?
Substitute on line-by-line basis, if that works?
Or learn to write a simple stack-style parser, or
recursive descent, it may be called?

I think your description text should be in a CDATA section.
http://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML

~Ramit


This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top