nutsmuggler
Hello folks.
I am trying to build a simple XML parser to extract data from IBM
translation manager memories. Here is a sample of such a memory file:
<NTMMemoryDb>
<Description>
</Description>
<Segment>0000000001
<Control>
00012200000001178876638English(U.S.)ITALIANIBMIDDOCBB1CTMST.
000BB1CTmst.idd
</Control>
<Source><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
noteindent="no-noteindent"
brand="default-brand"></Source>
<Target><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
noteindent="no-noteindent"
brand="default-brand"></Target>
</Segment>
<Segment>0000000002
<Control>
00000300000001178876638English(U.S.)ITALIANIBMIDDOCCONFIGUR.
000Configuration_PDSG.IDE
</Control>
<Source><titleblk>
<title>Configuration information and guidelines</title>
</titleblk></Source>
<Target><titleblk>
<title>Informazioni e istruzioni per la configurazione</title>
</titleblk></Target>
etc...
These memory files look a lot like XML, but I suspect they actually
conform to another standard. In fact, they often contain unclosed
tags, because they store segments of a translation: when the
translated material is a website or an SGML document, the original
HTML or SGML markup may be split across two or more segments. So I
often run into faulty segments, and each unclosed tag makes REXML
raise a parse error.
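To show what I mean, here is a minimal sketch of the failure. The TextCollector class and the one-line broken segment are made up for illustration; they are not taken from the real memory file. As far as I can tell, the mismatched closing tag is what makes REXML raise REXML::ParseException:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that only records the text nodes it sees.
class TextCollector
  include REXML::StreamListener
  attr_reader :texts

  def initialize
    @texts = []
  end

  def text(text)
    @texts << text
  end
end

# A made-up segment whose <titleblk> tag is opened but never closed.
broken = "<Segment><Source><titleblk></Source></Segment>"

collector = TextCollector.new
begin
  REXML::Parsers::StreamParser.new(broken, collector).parse
  outcome = :parsed
rescue REXML::ParseException => e
  # The parser stops at </Source> because it expected </titleblk>.
  outcome = :failed
  puts "REXML gave up: #{e.message.lines.first}"
end
```

Once the exception is raised, the stream parser stops, so nothing after the faulty segment is ever delivered to the listener.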
My code is quite simple:

    require 'rexml/document'
    require 'rexml/streamlistener'
    include REXML

    class Listener
      include StreamListener

      $segment = ""
      $result = ""
      $is_there = false

      def tag_start(name, attributes)
        if name == "Source"
          $segment << "EN:"
        end
        if name == "Target"
          $segment << "IT:"
        end
      end

      def tag_end(name)
        if name == "Target"
          if $is_there
            $result << $segment
          end
          $segment = ""
          $is_there = false
        end
        if name == "NTMMemoryDb"
          puts $result
        end
      end

      def text(text)
        $segment << text
        if text =~ /blade/
          $is_there = true
        end
      end
    end

    listener = Listener.new
    parser = Parsers::StreamParser.new(File.new("bch01aad006_MEMORIA.EXP"), listener)
    parser.parse

I need to skip over these errors and tell StreamListener: "when you
encounter a faulty segment, don't bother!"
How do I achieve this?
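The only workaround I can think of is to cut the file into <Segment> chunks with a regular expression and parse each chunk on its own, rescuing the parse error for the faulty ones. Here is a rough sketch under that assumption; SegmentListener, parse_segments and the shortened sample data are names and content I made up for this example, and I have not tried it on a full memory file:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text inside <Source> and <Target>
# of a single segment.
class SegmentListener
  include REXML::StreamListener
  attr_reader :source, :target

  def initialize
    @source = String.new
    @target = String.new
    @current = nil
  end

  def tag_start(name, _attributes)
    @current = @source if name == "Source"
    @current = @target if name == "Target"
  end

  def tag_end(name)
    @current = nil if name == "Source" || name == "Target"
  end

  def text(text)
    @current << text if @current
  end
end

# Returns [source, target] pairs for every segment REXML can parse and
# skips the segments that raise a parse error.
def parse_segments(data)
  pairs = data.scan(%r{<Segment>.*?</Segment>}m).map do |chunk|
    listener = SegmentListener.new
    begin
      REXML::Parsers::StreamParser.new(chunk, listener).parse
      [listener.source.strip, listener.target.strip]
    rescue REXML::ParseException
      nil # faulty segment: don't bother
    end
  end
  pairs.compact
end

# Shortened, made-up sample: the first segment has an unclosed tag.
sample = <<~DATA
  <NTMMemoryDb>
  <Segment>0000000001
  <Source><ibmiddoc company="ibm"></Source>
  <Target><ibmiddoc company="ibm"></Target>
  </Segment>
  <Segment>0000000002
  <Source><titleblk>
  <title>Configuration information and guidelines</title>
  </titleblk></Source>
  <Target><titleblk>
  <title>Informazioni e istruzioni per la configurazione</title>
  </titleblk></Target>
  </Segment>
  </NTMMemoryDb>
DATA

pairs = parse_segments(sample)
pairs.each { |en, it| puts "EN: #{en}\nIT: #{it}" }
```

This gives up the single streaming pass over the whole file, and it assumes every segment ends with a literal </Segment>, so I would still prefer a way to make the stream parser itself recover.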
Thanks in advance,
Davide