Reading XML to relational tables

Ted Flethuseo

Hi everyone,

I need to build 3 relational tables from an XML text. In these tables, I
need to keep track of the words that appear inside <emph> and <strong>
tags, along with each word and its count within the <p> tag. This is
easier to illustrate with an example:

I need to take this text:

<p> My name is <strong>Ted</strong>, and I like <emph>coffee</emph>.
<strong>Ted</strong> does not like tea. </p>
<p> I have a brother who likes <emph>tea</emph> but does not like
<emph>coffee</emph> </p>

To 3 normalized tables like this:

...p_table...
p_id  desc
1     My name is....
2     I have a ....


...p_to_emph_table...
p_id  e_id  count
1     2     1
2     1     1
2     2     1


...emph_table...
e_id  emph_word
1     Tea
2     Coffee

I am not sure what the best approach would be to parse this XML with
Ruby, or what tool could help me do this efficiently.

Any ideas appreciated,

Ted.
 
Jesús Gabriel y Galán

What I'd do is parse the XML (with Nokogiri, for example) and get all
the p elements. For each p element, insert it into p_table if it is not
already present, and get its id. Then look at every emph inside the p
element, and for each of them:
- Check whether the word is already in emph_table and get its id, or
- Insert it into emph_table and get the new id

With that id, insert or update a row in p_to_emph_table with the p id
and the word id.
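
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses REXML from Ruby's standard distribution instead of Nokogiri so that it runs without extra gems, it keeps the three tables in plain hashes rather than a real database, it wraps the sample text in a made-up <doc> root so that it is well-formed XML, and it lowercases the emph words before comparing them. The names SAMPLE_XML, full_text and build_tables are hypothetical.

```ruby
require 'rexml/document'

# Sample input from the question, wrapped in an assumed <doc> root element
# so that it parses as a single well-formed XML document.
SAMPLE_XML = <<~XML
  <doc>
  <p> My name is <strong>Ted</strong>, and I like <emph>coffee</emph>.
  <strong>Ted</strong> does not like tea. </p>
  <p> I have a brother who likes <emph>tea</emph> but does not like
  <emph>coffee</emph> </p>
  </doc>
XML

# Concatenates all text below a node, descending into child elements.
def full_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then full_text(child)
    else ''
    end
  end.join
end

# Returns the three tables as plain Ruby hashes (stand-ins for DB tables).
def build_tables(xml_text)
  doc = REXML::Document.new(xml_text)
  p_table    = {}            # p_id => paragraph text
  emph_table = {}            # emph_word => e_id
  p_to_emph  = Hash.new(0)   # [p_id, e_id] => count

  REXML::XPath.match(doc, '//p').each_with_index do |p_el, index|
    p_id = index + 1
    p_table[p_id] = full_text(p_el).gsub(/\s+/, ' ').strip

    REXML::XPath.match(p_el, './/emph').each do |emph|
      word = full_text(emph).strip.downcase
      # Reuse the id if the word is already known, otherwise assign the next one.
      e_id = (emph_table[word] ||= emph_table.size + 1)
      p_to_emph[[p_id, e_id]] += 1
    end
  end

  { p_table: p_table, emph_table: emph_table, p_to_emph: p_to_emph }
end

tables = build_tables(SAMPLE_XML)
tables[:p_to_emph].each { |(p_id, e_id), count| puts "#{p_id} #{e_id} #{count}" }
```

Ids here are handed out in order of first appearance, so coffee gets e_id 1 and tea gets e_id 2, which differs from the numbering in the example tables. With a real database, the hash lookups would become a select followed by an insert or update.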

This is a straightforward approach that should work. Give it a try (ask
about anything that blocks you) and let us know how it goes.

Jesus.
 
Ted Flethuseo

Hi Jesus,

Thank you for your help. Right now I am stuck trying to traverse the
elements inside a single Nokogiri::XML::Element. I know I can use the
elements method to list the child elements, but I am not sure how I can
traverse them and get their contents individually.

require 'nokogiri'

xml = File.read('translateXML.xml')
doc = Nokogiri::XML(xml)

# split into sentences first
arr = doc.search('p')

# print the child elements of the first p
puts arr[0].elements
 
Jesús Gabriel y Galán

Try something like:

require 'nokogiri'

doc = Nokogiri::XML(File.read("p.xml"))

# visit every p element in the document
doc.search("p").each do |p_element|
  puts "---------"
  puts p_element.text
  # list the emph and strong elements inside this p
  p_element.css("emph,strong").each do |emph|
    puts "Highlighted: #{emph.text}"
  end
end

Jesus.
 
