Parsing XML and storing attributes in MySQL using Perl

Discussion in 'Perl Misc' started by jkugler, Jun 12, 2006.

  1. jkugler

    jkugler Guest

    Hello,

    I am trying to store huge amounts of data from xml files and put them
    into a MySQL database. The xml files all are in this format:

    <?xml version="1.0" encoding="UTF-8"?>
    <snp_submission xsi:schemaLocation="http://www.hapmap.org
    http://hapmap.cshl.org/xml-schema/2003-1
    1-04/hapmap.xsd"
    lsid="urn:LSID:ncbi.nlm.nih.gov:dbSNP/DCC_batch:34_chrom11_DELETE:005"
    xmlns="http
    ://www.hapmap.org"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <batch_info>
    <class>DELETE</class>
    <group lsid="urn:lsid:dcc.hapmap.cshl.org:Labgroup:DBSNP:1"/>
    <date_created>2005-09-15</date_created>
    <contact>
    <name>Steve Sherry</name>
    <email></email>
    </contact>
    </batch_info>
    <snps>
    <snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461544:111">
    <snp_class>
    <bac-overlap/>
    </snp_class>
    <sequence>

    <flank_5>caaaggaatataaatcattctattataaagatacatgcacAGGgctgggtgcagtggctcacacctgtaatccc
    agcactttgggaggccaaggcgggtggatcacctgaggacaggagtttgagaccagcctagccaacatggggaaactccatctctactaaaaatacaaa
    aattagccaggtatagtggtgcacacctgtaataccagctactttggaggctgaggcaggagaatcgctggaacccaggaggcagaggtcaaagtgagc
    caagatcataccattgcactccagcctgggcaacaagagcaaaactccatcttaaaaaaatatatatatatacatatacatacatatatatacacatat
    atatacatatatacagatattatatatgtaaatgtatatatatgtgtatatatatacacatatatatacatattataactacatatatatacacacaca
    catacatatacatgcacacatatgtttattgcagcactatttacgatagaaaatacatggaatcctcccaaatgcccatcaatgatatattggataaag
    aaaatgtgatatatattcaccatggaatactatgcagccgttaaaataaatgagatcatgttctttgcagggacatggatgaagctggaagccatcacc
    ctcagcaaactaacacaggaaaagaaaaccaaacaccacatgttctcagtcgtaagagggagttgaacaatgagagcaaacacatggatacatggaggg
    gaacaacacacaccagggcctctcagcgggacaggggtaggagaCCATCAGGACAaacacgtggatacatggaggggaacaacacacaccagggcctct
    cagggggacagggggtaggagaccatcaagacaaacacgtggatacatggaggggaacaacacacaccagggcctctcagggggacagggggtaggaga
    ccatcaggacaaacacgtggatacatggaggggaa</flank_5>
    <variation>
    <allele base="C"/>
    <allele base="T"/>
    </variation>

    <flank_3>aacacacaccagggcctctcagggggacagggggtaggagaccatcaRgacaaacacgtggRtacatggagggg
    aacaacacacaccagggcctctcagggggacggggggtagRagaccatcaggacaaatagctaatgcatgcagggcctcatacctaggtgatgggttga
    tgggtgcagcaaaccaccatggcacacatttacctatgtatcaaacctaYactttctgcacgtgtatcccagaacataaaataaaatttaaaaaatata
    taCACTGATTCATGATCTCCTttctctccttctgaaacactctttaaaactttttagcatttccccctctgtcttccatgtctcctaactacatgtttc
    ttattttccatgtctttattcctgtgttcattttggatagccccttctgacctatattacagtttactagttcactcttcaactgcttctaacatacta
    atattctgttaaaaccattcatttgggtttaaatttcaattatgttattctctatggacattctatttgttttcttttaatcttcttggccattctcta
    gagtttcctgttccattatgatatttttaattttttgttttactttaaacatactaaatatagttattttattttattttctgtatctgatactttcaa
    taactgcagtctttgctagtcttttttctgtgctcttgctcatagtttttttcatttgttttCATGATTagaaaaacagagagagaagaaggagagtaa
    agggaggaggcggaggaggagaaaagaagaaagcagagaagaagggacagagaaaaaaaggaagTTGGTTCTAACGTTTCTCTAACAACTGGCTTCAGT
    GAAACACTCCCACCTTGTGGATTTTTAGGTTATTGAAATTAACCAGTCTTCtgggtgcagcacaccaacatggcacatgtatacatatgtaacaaacct
    gcactttgtgcacatgtaccctaaaacttaaagta</flank_3>
    <length_class value="full-length"/>
    </sequence>
    <genomic_locations>
    <genomic_location active="true" type="exact">
    <assembly_version>ncbi_build34</assembly_version>
    <coordinates chrom="11" start="60749" stop="60749"
    strand="forward"/>
    </genomic_location>
    </genomic_locations>
    <neighbour_snps>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461545:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461546:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461547:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2686858:100"/>
    </neighbour_snps>
    </snp>

    I am trying to extract the following information: lsid, allele base,
    cooordinates chrom, start, and strand. I would ideally like to use a
    perl script to pull this data out and put it in MySQL, but I have had
    no luck with XML::parser since the information I am looking for is
    embedded in the element's attributes.

    If anyone has any ideas on how to get this data out of xml and into
    MySQL, it would be much appreciated.

    Thanks so much,

    --James
     
    jkugler, Jun 12, 2006
    #1
    1. Advertisements

  2. jkugler

    robic0 Guest

    [xml snipped]

    Hi James.

    If you want you could use RXParse.pm (ver .91) posted on this group.
    It would send the same attribute info to a tag handler.
    Should be the same as the one you used.
    From there you could check the "name" for the "value" data.

    If you could hang on a while, I'm writing filter interfaces for
    tags, namespace, attributes, content, etc.. The filters can be
    either regexp or dos style. Each filter is specific, set by the
    user who provides the handler to recieve that specific info.

    Whats nice is that the regexp provided by the user is applied to
    just the sub-parsed global data, the category is guaranteed.
    Tag is tag, attrib is attrib, content is content.

    Its almost there. Yours is a basic search but the intent is to
    provide search/replace as well as multiple filter capability.
    At my stage of development with this its really pretty easy.
    The hold up is a base compliant parser, which is almost done.

    Anyway, just some info for you. No need to use RXParse. The difference
    between my module and all thats out there is its whole purpose is meant
    to be interactive. Funny how the simple things you would expect from
    high power software just aint there, and its so easy to do!

    robic0
    (god of porn)
     
    robic0, Jun 12, 2006
    #2
    1. Advertisements

  3. jkugler

    J. Gleixner Guest

    jkugler wrote:
    [...]
    Use XML::parser to get the element's value, then parse the value using
    regular expressions, split, or various other means.

    What have you tried? No one can point out coding errors, unless you post
    your code.

    Look at using DBI, and DBD::mysql for interacting with MySQL. It's all
    available on CPAN ( http://search.cpan.org/ ).
     
    J. Gleixner, Jun 12, 2006
    #3
  4. jkugler

    xhoster Guest

    What is huge? Is it the number of files, or the number of entries per
    file? If the former, I would probably use XML::Simple. If the later,
    maybe look into XML::Twig.


    lsid appears at more than one level.

    I haven't had any trouble getting element attributes from XML files with
    XML::parser. Can you show some of what you have tried so far?
    Here is something trivial I whipped up that just shows how to get the
    attributes for the allele:


    use strict;
    use Data::Dumper;
    use XML::parser;
    my $p1 = new XML::parser(Style =>'Subs');
    open my $fh, "test.xml" or die $!;
    $p1->parse($fh);

    sub allele {
    my ($expat,$ele,%att) = @_;
    print Dumper \%att
    };

    __END__
    $VAR1 = {
    'base' => 'C'
    };
    $VAR1 = {
    'base' => 'T'
    };

    no element found at line 73, column 0, byte 3851 at
    /tools/GNU/perl/5.8.7_64/lib/site_perl/5.8.7/x86_64-linux/XML/Parser.pm
    line 187


    (That error is only there because the sample file you provided was
    incomplete.)

    Xho
     
    xhoster, Jun 12, 2006
    #4
  5. jkugler

    robic0 Guest

    No no! Attribute data is "content", why invoke Simple when its just content?
    The attribute is known, any tag handler will do.
    Don't make it into a federal case, <gods name=jeezz/>.
     
    robic0, Jun 12, 2006
    #5
  6. jkugler

    John Bokma Guest

    [..]
    XML::parser can handle attributes perfectly. What exactly didn't work?
    See: http://johnbokma.com/perl/element-id-for-given-parent.html
    XML::Twig? Give me a book? :)
     
    John Bokma, Jun 13, 2006
    #6
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.