parsing to XML

steeve_dun · Oct 6, 2005

Hi everybody,
I have a document that includes definitions.
What I want is to parse the document and save these definitions in an
XML document.
Is there a simple way to do so?
Thank you!

Example:
#### beginning of document ####
\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}
#### end of document ####

What I want is something like:

#### beginning of xml output ####
<glossary>
<definition>
<word> HTML </word>
<meaning> HyperText Markup Language </meaning>
</definition>
<definition>
<word> WWW </word>
<meaning> World Wide Web </meaning>
</definition>
</glossary>
#### end of xml output ####

---steeve

Anno Siegel · Oct 6, 2005

Your example doesn't show the variability of the data. Examples never
do, they only ever give a lower bound. There can always be a variant
that doesn't happen to appear in the example.

Can a "definition" span lines? Assuming that it can, you can't process
the text line by line without major trickery. You'll need all of it in
memory. Here is a method that extracts the definitions from the text
and puts them in a hash:

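use strict;
use warnings;

# Sample input; in a single-quoted heredoc a backslash is literal.
my $text = <<'END_TEXT';
\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}
END_TEXT

# In list context, /g returns every (word, meaning) capture pair.
my %definition_for = $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g;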
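
For a real document the text has to come from a file rather than a heredoc. The following is only a sketch of one way to do that, assuming the whole document fits in memory: it slurps a filehandle into a single string and collects the definitions as ordered pairs, since a hash does not keep the order of appearance. The helper names slurp_handle and extract_definitions are made up for illustration, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole filehandle into one string ("slurp").
sub slurp_handle {
    my ($fh) = @_;
    local $/;    # undefined record separator: read everything at once
    my $content = <$fh>;
    return defined $content ? $content : '';
}

# Hypothetical helper: return the definitions as [word, meaning] pairs,
# in the order in which they appear in the text.
sub extract_definitions {
    my ($text) = @_;
    my @pairs;
    while ( $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

# An in-memory filehandle stands in for the real document here.
# Note that the first meaning spans a line break.
my $document = "\\glossary{HTML} {HyperText Markup\nLanguage} and \\glossary {WWW}{World Wide Web}\n";
open my $fh, '<', \$document or die "Cannot open in-memory handle: $!";
my @definitions = extract_definitions( slurp_handle($fh) );
close $fh;

printf "%s => %s\n", @$_ for @definitions;
```

To read an actual file, open it by name instead of passing a reference to a string. Like the pattern above, this sketch assumes that neither the word nor the meaning contains a closing brace.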

Generating XML from the hash is probably a job for one of the XML modules.
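
If no XML module is at hand, the output can also be built by hand for a structure this flat. What follows is a minimal sketch, not a replacement for a proper module such as XML::Writer from CPAN: the helpers xml_escape and glossary_xml are hypothetical names, the element names are taken from the question, and the keys are sorted only because a hash has no order of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must come first, or later entities get re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: render a word => meaning hash in the layout
# asked for in the question.
sub glossary_xml {
    my (%definition_for) = @_;
    my $xml = "<glossary>\n";
    for my $word ( sort keys %definition_for ) {
        $xml .= "<definition>\n";
        $xml .= "<word>" . xml_escape($word) . "</word>\n";
        $xml .= "<meaning>" . xml_escape( $definition_for{$word} ) . "</meaning>\n";
        $xml .= "</definition>\n";
    }
    $xml .= "</glossary>\n";
    return $xml;
}

my %definition_for = (
    HTML => 'HyperText Markup Language',
    WWW  => 'World Wide Web',
);
print glossary_xml(%definition_for);
```

Escaping matters as soon as a word or meaning contains "&" or "<"; without it the result would not be well-formed XML.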

Anno

steeve_dun · Oct 7, 2005

Thank you very much, that was very helpful!

robic0 · Oct 9, 2005

I don't understand.
If you have a "structure" in mind, you don't show it.
XML is purely "structure" driven:
input "to" a structure, output "from" a structure.
That's the definition of "simple" XML. If you want to
get into "complex" (nested) XML (and I don't know anybody who
does), that is beyond the scope of a question here, it seems.

So if you have a simple XML "structure" in mind (with attributes),
what would that be? You have to separate populating that
structure from generating the "simple" XML output.
A schema can then be generated once you know what you want to do.
The largest software houses, including Microsoft, use simple XML
because the XML is just a medium to transport structured data.
There should be none of the ambiguity that "nested" HTML could have.
You don't want to go down that road.

Also, I don't know what you mean by "definitions" in that (HTML?)
document. Just what is it you're trying to accomplish?
