Taking out text between symbols and joining together

Bob05 Dr

Hello,

I have some text files that I would like to extract text from, then join
them on one single line and save them to a text file.

Here is an example of the text I want to take out:

<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>

Here is how I would like the text to save as:

Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)GPM06600002310 None


So far I have this:

require 'rexml/document'
include REXML
file = File.new("1.xml")
doc = Document.new(file)
puts doc
aFile = File.new("1.txt", "w")
aFile.write(doc)
aFile.close

I was wondering, how can you split out text and join them on one line?
 
Jesús Gabriel y Galán

First of all, your document doesn't parse, because it has more than one
root element (an XML document must have exactly one).
After solving that, what you need is to get to each element and
extract its text child nodes.
Take a look at:

http://www.germane-software.com/software/rexml/docs/tutorial.html

And the methods:

elements
[]
text

of the API. Experiment a little in IRB:

irb(main):001:0> s = <<EOF
irb(main):002:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):003:0" (GPM06600002310)</Title>
irb(main):004:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):005:0" <ProtocolName>None</ProtocolName>
irb(main):006:0" EOF
=> "<Title>Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n"
irb(main):007:0>
irb(main):008:0*
irb(main):009:0* require 'rexml/document'
=> true
irb(main):010:0> include REXML
=> Object
irb(main):011:0> doc = Document.new s
REXML::ParseException: #<RuntimeError: attempted adding second root
element to document>

ooooops, two root elements. I'll add a fake one surrounding everything:

irb(main):012:0> s = <<EOF
irb(main):013:0" <ROOT>
irb(main):014:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):015:0" (GPM06600002310)</Title>
irb(main):016:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):017:0" <ProtocolName>None</ProtocolName>
irb(main):018:0" </ROOT>
irb(main):019:0" EOF
=> "<ROOT>\n<Title>Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n</ROOT>\n"
irb(main):020:0> doc = Document.new s
=> <UNDEFINED> ... </>
irb(main):025:0> doc.elements
=> #<REXML::Elements:0xb72907e0 @element=<UNDEFINED> ... </>>
irb(main):026:0> doc.elements.each {|el| p el}
<ROOT> ... </>
=> [<ROOT> ... </>]
irb(main):027:0> doc.to_a
=> [<ROOT> ... </>, "\n"]
irb(main):028:0> doc.elements.to_a
=> [<ROOT> ... </>]
irb(main):032:0> doc.elements["/Title"]
=> nil
irb(main):033:0> doc.elements["Title"]
=> nil
irb(main):034:0> root = doc.root
=> <ROOT> ... </>
irb(main):035:0> root.elements["Title"]
=> <Title> ... </>
irb(main):036:0> root.elements["Title"].to_s
=> "<Title>Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)</Title>"

Look, to_s returns the whole Title element as a string, tags included,
so the text I want is in there. Let's see if there's a better way:

irb(main):039:0> root.elements["Title"].methods.sort
=> ["<<", "==", "===", "=~", "[]", "[]=", "__id__", "__send__", "add",
"add_attribute", "add_attributes", "add_element", "add_namespace",
"add_text", "all?", "any?", "attribute", "attributes", "bytes",
"cdatas", "children", "class", "clone", "collect", "comments",
"context", "context=", "count", "cycle", "dclone", "deep_clone",
"delete", "delete_at", "delete_attribute", "delete_element",
"delete_if", "delete_namespace", "detect", "display", "document",
"drop", "drop_while", "dup", "each", "each_child", "each_cons",
"each_element", "each_element_with_attribute",
"each_element_with_text", "each_index", "each_recursive",
"each_slice", "each_with_index", "elements", "entries", "enum_cons",
"enum_for", "enum_slice", "enum_with_index", "eql?", "equal?",
"expanded_name", "extend", "find", "find_all", "find_first_recursive",
"find_index", "first", "freeze", "frozen?", "fully_expanded_name",
"get_elements", "get_text", "grep", "group_by", "has_attributes?",
"has_elements?", "has_name?", "has_text?", "hash", "id",
"ignore_whitespace_nodes", "include?", "indent", "index",
"index_in_parent", "inject", "insert_after", "insert_before",
"inspect", "instance_eval", "instance_exec", "instance_of?",
"instance_variable_defined?", "instance_variable_get",
"instance_variable_set", "instance_variables", "instructions",
"is_a?", "kind_of?", "length", "local_name", "map", "max", "max_by",
"member?", "method", "methods", "min", "min_by", "minmax",
"minmax_by", "name", "name=", "namespace", "namespaces",
"next_element", "next_sibling", "next_sibling=", "next_sibling_node",
"nil?", "node_type", "none?", "object_id", "one?", "parent",
"parent=", "parent?", "partition", "prefix", "prefix=", "prefixes",
"previous_element", "previous_sibling", "previous_sibling=",
"previous_sibling_node", "private_methods", "protected_methods",
"public_methods", "push", "raw", "reduce", "reject", "remove",
"replace_child", "replace_with", "respond_to?", "reverse_each",
"root", "root_node", "select", "send", "singleton_methods", "size",
"sort", "sort_by", "taint", "tainted?", "take", "take_while", "tap",
"text", "text=", "texts", "to_a", "to_enum", "to_s", "to_set", "type",
"unshift", "untaint", "whitespace", "write", "xpath", "zip"]

There's a text method in there, would that do what I expect?

irb(main):040:0> root.elements["Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"

Bingo!

Is there a way to access it directly from the doc, instead of having a
root variable?

irb(main):042:0> doc.elements["ROOT/Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"
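A non-interactive version of the lookups above, extended to all three elements, might look like the sketch below. It assumes the fake ROOT wrapper from the session; collapsing the newline inside Title and separating the fields with a single space are my guesses at the desired output, not something the original post pins down.

```ruby
require 'rexml/document'

# Sample input from the original post, wrapped in the fake ROOT element.
xml = <<~XML
  <ROOT>
  <Title>Protein complexes in Saccharomyces cerevisiae
  (GPM06600002310)</Title>
  <ShortLabel>GPM06600002310</ShortLabel>
  <ProtocolName>None</ProtocolName>
  </ROOT>
XML

doc = REXML::Document.new(xml)

# Look up each element by path, the same way as for Title.
names = %w[Title ShortLabel ProtocolName]
texts = names.map { |name| doc.elements["ROOT/#{name}"].text }

# Title contains a newline; squeeze every whitespace run to one space
# (an assumption about the wanted format), then join on a single line.
line = texts.map { |t| t.split.join(' ') }.join(' ')
puts line
```

If the fields should be glued together differently, only the final join needs to change.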


Now you can do the same for the other elements. I also recommend you
learn XPath and CSS selectors if you are going to be parsing markup,
and also look at other parsers like Nokogiri. This example was pretty
simple, but these things can get nasty.
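To tie this back to the files in the question, here is a rough sketch of the whole round trip. The helper name fragment_to_line is hypothetical, and the sketch assumes the input file holds only the bare elements shown in the post (no XML declaration and no root of its own); the file names 1.xml and 1.txt are the ones from the original script.

```ruby
require 'rexml/document'

# Hypothetical helper: wrap a fragment that has several top-level elements
# in one fake root, parse it, and return each element's text on one line.
def fragment_to_line(fragment)
  doc = REXML::Document.new("<ROOT>#{fragment}</ROOT>")
  doc.root.elements.to_a
     .map { |el| el.text.to_s.split.join(' ') } # squeeze newlines and spaces
     .join(' ')
end

# File names taken from the original post; skipped if the input is absent.
if File.exist?('1.xml')
  line = fragment_to_line(File.read('1.xml'))
  File.open('1.txt', 'w') { |out| out.puts(line) }
end
```

If the real files already have a single root element, drop the wrapper and iterate over that root's children instead.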

Jesus.
 
