parsing to XML


steeve_dun

Hi everybody,
I have a document that includes definitions.
I want to parse the document and save these definitions in an
XML document.
Is there a simple way to do so?
Thank you!

Example:
#### beginning of document ####
\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}
#### end of document ####

What I want is something like:

#### beginning of xml output ####
<glossary>
<definition>
<word> HTML </word>
<meaning> HyperText Markup Language </meaning>
</definition>
<definition>
<word> WWW </word>
<meaning> World Wide Web </meaning>
</definition>
</glossary>
#### end of xml output ####


---steeve
 

Anno Siegel


Your example doesn't show the variability of the data. Examples never
do, they only ever give a lower bound. There can always be a variant
that doesn't happen to appear in the example.

Can a "definition" span lines? Assuming that it can, you can't process
the text line-wise without major trickery. You'll need all of it in
memory. Here is a method that extracts the definitions from the text
and puts them in a hash:

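my $text = <<'END_TEXT';
\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}
END_TEXT

# Each match yields a (word, meaning) pair; in list context the pairs
# fill the hash. [^}]* also matches newlines, so a definition may span lines.
my %definition_for = $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g;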
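
If the definitions come from a file rather than a literal string, the same pattern can be applied once the whole input has been read into memory. The following is only a sketch under assumptions: the helper names slurp_handle and extract_definitions are made up for illustration, an in-memory filehandle stands in for the real document, and the braces in the pattern are escaped because newer Perl versions complain about unescaped ones. It collects pairs in an array instead of a hash so that document order is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle in one go
# ("slurp mode") so a definition that wraps across lines is still matched.
sub slurp_handle {
    my ($fh) = @_;
    local $/;                 # undefined record separator: read to end of file
    return scalar <$fh>;
}

# Hypothetical helper: return definitions as [word, meaning] pairs
# in the order they appear in the text.
sub extract_definitions {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g) {
        my ($word, $meaning) = ($1, $2);
        # Collapse line breaks and runs of whitespace inside each field.
        for ($word, $meaning) { s/\s+/ /g; s/^ //; s/ $//; }
        push @pairs, [$word, $meaning];
    }
    return @pairs;
}

# In-memory filehandle standing in for the real document (an assumption);
# the first definition deliberately wraps across two lines.
my $sample = "\\glossary{HTML} {HyperText Markup\nLanguage} is the lingua franca for\n"
           . "publishing hypertext on the \\glossary {WWW}{World Wide Web}\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @definitions = extract_definitions(slurp_handle($fh));
close $fh;

print "$_->[0] = $_->[1]\n" for @definitions;
```

A hash is simpler when order and repeated words do not matter; the array of pairs is the safer choice if the glossary should come out in the order it was written.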

Generating XML from the hash is probably a job for one of the XML modules.
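
For a glossary this small, the output can also be written by hand. The sketch below is not one of the XML modules mentioned above; it uses only core Perl, and the names xml_escape and glossary_xml are hypothetical. It assumes the definitions are already available as [word, meaning] pairs and that the output format is the one requested in the question. A CPAN module such as XML::Writer would be the more robust route for anything larger.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first so later entities are not re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Build the <glossary> document from a list of [word, meaning] pairs.
sub glossary_xml {
    my (@pairs) = @_;
    my $xml = "<glossary>\n";
    for my $pair (@pairs) {
        my ($word, $meaning) = map { xml_escape($_) } @$pair;
        $xml .= "  <definition>\n"
              . "    <word>$word</word>\n"
              . "    <meaning>$meaning</meaning>\n"
              . "  </definition>\n";
    }
    return $xml . "</glossary>\n";
}

my $xml = glossary_xml(
    [ 'HTML', 'HyperText Markup Language' ],
    [ 'WWW',  'World Wide Web' ],
);
print $xml;
```

Escaping matters as soon as a word or meaning contains &, < or >; without it the result would not be well-formed XML.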

Anno
 

steeve_dun

Thank you very much, that was very helpful !


 

robic0

I don't understand.
If you have a "structure" in mind, you don't show it.
XML is purely "structure" driven...
input "to" a structure, output "from" a structure.
That's the definition of "simple" XML. If you want to
get into "complex" (nested) XML (and I don't know anybody who
does), that is beyond the scope of a question here, it seems.

So if you have a simple XML "structure" in mind (with attributes),
then what would that be? You have to separate populating that
structure from generating the "simple" XML output.
A schema can then be generated once you know what you want to do.
The largest software houses, including Microsoft, use simple XML
because the XML is just a medium to transport structured data.
There should be none of the ambiguity that "nested" HTML could have.
You don't want to go down that road.

Also, I don't know what you mean by "definitions" in that (HTML?)
document. Just what is it you're trying to accomplish?
 
