Taking out text between symbols and joining together

Bob05 Dr · Jun 30, 2010

Hello,

I have some text files that I would like to extract text from, then join
them on one single line and save them to a text file.

Here is an example of the text I want to take out:

<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>

Here is how I would like the text to be saved:

Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)GPM06600002310 None

So far I have this:

require 'rexml/document'
include REXML
file = File.new("1.xml")
doc = Document.new(file)
puts doc
aFile = File.new("1.txt", "w")
aFile.write(doc)
aFile.close

I was wondering, how can I pull out the text and join it on one line?

Jesús Gabriel y Galán · Jun 30, 2010

First of all, your document doesn't parse, because it has more than one
root element. Once that is solved, you need to reach each element and
extract its child text nodes.
Take a look at:

http://www.germane-software.com/software/rexml/docs/tutorial.html

And the methods:

elements
[]
text

of the API. Experiment a little in IRB:

irb(main):001:0> s = <<EOF
irb(main):002:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):003:0" (GPM06600002310)</Title>
irb(main):004:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):005:0" <ProtocolName>None</ProtocolName>
irb(main):006:0" EOF
=> "<Title>Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n"
irb(main):007:0>
irb(main):008:0*
irb(main):009:0* require 'rexml/document'
=> true
irb(main):010:0> include REXML
=> Object
irb(main):011:0> doc = Document.new s
REXML::ParseException: #<RuntimeError: attempted adding second root
element to document>

Oops, two root elements. I'll add a fake one surrounding everything:

irb(main):012:0> s = <<EOF
irb(main):013:0" <ROOT>
irb(main):014:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):015:0" (GPM06600002310)</Title>
irb(main):016:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):017:0" <ProtocolName>None</ProtocolName>
irb(main):018:0" </ROOT>
irb(main):019:0" EOF
=> "<ROOT>\n<Title>Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n</ROOT>\n"
irb(main):020:0> doc = Document.new s
=> <UNDEFINED> ... </>
irb(main):025:0> doc.elements
=> #<REXML::Elements:0xb72907e0 @element=<UNDEFINED> ... </>>
irb(main):026:0> doc.elements.each {|el| p el}
<ROOT> ... </>
=> [<ROOT> ... </>]
irb(main):027:0> doc.to_a
=> [<ROOT> ... </>, "\n"]
irb(main):028:0> doc.elements.to_a
=> [<ROOT> ... </>]
irb(main):032:0> doc.elements["/Title"]
=> nil
irb(main):033:0> doc.elements["Title"]
=> nil
irb(main):034:0> root = doc.root
=> <ROOT> ... </>
irb(main):035:0> root.elements["Title"]
=> <Title> ... </>
irb(main):036:0> root.elements["Title"].to_s
=> "<Title>Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)</Title>"

So to_s gives me the whole Title element, tags included. Let's see if
there's a way to get just the text:

irb(main):039:0> root.elements["Title"].methods.sort
=> ["<<", "==", "===", "=~", "[]", "[]=", "__id__", "__send__", "add",
"add_attribute", "add_attributes", "add_element", "add_namespace",
"add_text", "all?", "any?", "attribute", "attributes", "bytes",
"cdatas", "children", "class", "clone", "collect", "comments",
"context", "context=", "count", "cycle", "dclone", "deep_clone",
"delete", "delete_at", "delete_attribute", "delete_element",
"delete_if", "delete_namespace", "detect", "display", "document",
"drop", "drop_while", "dup", "each", "each_child", "each_cons",
"each_element", "each_element_with_attribute",
"each_element_with_text", "each_index", "each_recursive",
"each_slice", "each_with_index", "elements", "entries", "enum_cons",
"enum_for", "enum_slice", "enum_with_index", "eql?", "equal?",
"expanded_name", "extend", "find", "find_all", "find_first_recursive",
"find_index", "first", "freeze", "frozen?", "fully_expanded_name",
"get_elements", "get_text", "grep", "group_by", "has_attributes?",
"has_elements?", "has_name?", "has_text?", "hash", "id",
"ignore_whitespace_nodes", "include?", "indent", "index",
"index_in_parent", "inject", "insert_after", "insert_before",
"inspect", "instance_eval", "instance_exec", "instance_of?",
"instance_variable_defined?", "instance_variable_get",
"instance_variable_set", "instance_variables", "instructions",
"is_a?", "kind_of?", "length", "local_name", "map", "max", "max_by",
"member?", "method", "methods", "min", "min_by", "minmax",
"minmax_by", "name", "name=", "namespace", "namespaces",
"next_element", "next_sibling", "next_sibling=", "next_sibling_node",
"nil?", "node_type", "none?", "object_id", "one?", "parent",
"parent=", "parent?", "partition", "prefix", "prefix=", "prefixes",
"previous_element", "previous_sibling", "previous_sibling=",
"previous_sibling_node", "private_methods", "protected_methods",
"public_methods", "push", "raw", "reduce", "reject", "remove",
"replace_child", "replace_with", "respond_to?", "reverse_each",
"root", "root_node", "select", "send", "singleton_methods", "size",
"sort", "sort_by", "taint", "tainted?", "take", "take_while", "tap",
"text", "text=", "texts", "to_a", "to_enum", "to_s", "to_set", "type",
"unshift", "untaint", "whitespace", "write", "xpath", "zip"]

There's a text method in there, would that do what I expect?

irb(main):040:0> root.elements["Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"

Bingo!

Is there a way to access it directly from the doc, instead of having a
root variable?

irb(main):042:0> doc.elements["ROOT/Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"
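
If you would rather not spell out the root name, a path starting with //
searches the whole document. A small sketch, assuming the same fake ROOT
wrapper as above (the variable names are only for illustration):

```ruby
require 'rexml/document'

xml = <<EOF
<ROOT>
<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>
</ROOT>
EOF
doc = REXML::Document.new(xml)

# "//Name" matches a Name element at any depth, so ROOT is not spelled out.
short_label = doc.elements["//ShortLabel"].text

# REXML::XPath.first runs the same kind of query through the XPath module.
protocol = REXML::XPath.first(doc, "//ProtocolName").text

puts short_label
puts protocol
```

Both lookups return the first match only, which is enough here because each
element appears once.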

Now you can do the same for the other elements. I also recommend you
learn XPath and CSS selectors if you are going to be parsing markup,
and also look at other parsers like Nokogiri. This example was pretty
simple, but these things can get nasty.
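
Putting the pieces together, something along these lines might do what you
asked. It is only a sketch: join_element_texts is a made-up helper name, and
I am assuming that you want the pieces separated by a single space, that
newlines inside an element should be collapsed, and that the input has no
XML declaration (wrapping it in a fake ROOT would break otherwise).

```ruby
require 'rexml/document'

# Hypothetical helper: wraps the fragment in a fake <ROOT> element so REXML
# accepts it, then returns the text of each named element on a single line.
def join_element_texts(xml_fragment, names)
  doc = REXML::Document.new("<ROOT>#{xml_fragment}</ROOT>")
  texts = names.map do |name|
    el = doc.elements["ROOT/#{name}"]
    # Collapse embedded newlines and runs of spaces into one space
    # (an assumption about the desired output format).
    el && el.text ? el.text.gsub(/\s+/, ' ').strip : nil
  end
  # Skip elements that are missing or empty, join the rest with a space.
  texts.compact.join(' ')
end

fragment = <<EOF
<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>
EOF

line = join_element_texts(fragment, %w[Title ShortLabel ProtocolName])
puts line
```

With your files you would pass File.read("1.xml") instead of the heredoc and
save the result with File.open("1.txt", "w") { |f| f.puts line }.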

Jesus.
