Using tidy to convert a Google Scholar page to XML

Discussion in 'Python' started by রুদ্র ব্যাণার্জী, Oct 8, 2012.

  1. Dear friends,
    I am trying to convert a google scholar page to xml.
    First, I am getting the page using the script:
    #!/usr/bin/python
    import urllib2

    # Fetch the search results page, sending a custom User-Agent header.
    url = "http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp="
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"})
    response = urllib2.urlopen(request)

    # Save the raw HTML; the with block closes the file so the data is flushed.
    with open('sch.html', 'w') as f:
        f.write(response.read())

    This gives sch.html, which starts with:
    <!doctype html><html><head><meta http-equiv="Content-Type"
    content="text/html;charset=UTF-8"><meta http-equiv="X-UA-Compatible"
    content="IE=Edge"><meta name="viewport"
    content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no">

    If I try to use tidy to convert this HTML page to XML, I get:
    $ tidy <sch.html |more
    line 3 column 40 - Warning: <style> isn't allowed in <div> elements
    line 3 column 23 - Info: <div> previously mentioned
    /**************************
    AND MANY MORE WARNINGS
    **************************/
    Info: Document content looks like HTML 4.01 Transitional
    Info: No system identifier in emitted doctype
    131 warnings, 0 errors were found!

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta name="generator" content=
    "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta name="viewport" content=
    "width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
    <meta name="format-detection" content="telephone=no">
    <title>albert einstein+1905 - Google Scholar</title>

    <script type="text/javascript">
    var gs_ts=Number(new Date());
    </script>
    <style type="text/css">
    html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:
    0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_
    top{position:relative;min-width:980px;_width:expression(document.documentElement
    .clientWidth<982?"980px":"auto");}.gs_el_ph #gs_top,.gs_el_ta
    #gs_top{min-width:
    300px;_width:expression(document.documentElement.clientWidth<302?"300px":"auto")
    ;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back


    So, this is still in html, not in xml. How can I convert the page to
    xml?
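
Run without options, tidy cleans up the markup but still writes HTML, which matches the output above. Its `-asxml` option asks it to write well-formed XHTML instead. A minimal sketch of calling it from Python follows, assuming a `tidy` binary is on the PATH. The helper name `tidy_to_xhtml` is hypothetical, and the code is written for Python 3, unlike the Python 2 script above.

```python
import shutil
import subprocess


def tidy_to_xhtml(html):
    """Run tidy with -asxml; return its output, or None if that is not possible.

    Hypothetical helper: returns None when no tidy binary is found or when
    tidy reports errors it could not repair.
    """
    if shutil.which("tidy") is None:
        return None
    # -asxml: write well-formed XHTML; -utf8: UTF-8 for input and output;
    # -quiet: suppress the non-essential summary output.
    proc = subprocess.run(
        ["tidy", "-asxml", "-utf8", "-quiet"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 when it only issued warnings and with 2 on errors.
    if proc.returncode > 1:
        return None
    return proc.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "<html><head><title>t</title></head><body><p>a<br>b</body></html>"
    print(tidy_to_xhtml(sample))
```

From the shell, the equivalent would be `tidy -asxml -utf8 sch.html > sch.xml`. The warnings go to standard error, so they do not end up in the output file.
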
    রুদ্র ব্যাণার্জী, Oct 8, 2012
    #1

  2. রুদ্র ব্যাণার্জী

    Dave Angel Guest

    On 10/08/2012 07:11 AM, রুদ্র ব্যাণার্জী wrote:
    > So, this is still in html, not in xml. How can I convert the page to
    > xml?
    >


    What makes you think it's possible? (Possible automatically, that is.)
    There is no exact mapping from HTML to XML, so a program that attempts
    the conversion is guessing in many places. Further, many web pages, if
    not most, are not even valid HTML, just good enough to work with most
    browsers. Now, if the page were valid XHTML, it would already be
    valid XML.
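
The difference can be checked with a strict XML parser. The sketch below is written for Python 3, and the two sample strings are invented for illustration rather than taken from the Scholar page. It feeds a small well-formed XHTML fragment and a typical browser-tolerated HTML fragment to `xml.etree.ElementTree`. Note that it only tests well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML: every element is closed and attribute values are quoted.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>demo</title></head>"
    "<body><p>first<br/>second</p></body></html>"
)

# Ordinary HTML that browsers accept: unclosed <br> and <p>, unquoted attribute.
SLOPPY_HTML = "<html><body><p class=x>first<br>second</body></html>"


def parses_as_xml(markup):
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("XHTML parses as XML:", parses_as_xml(XHTML))
    print("sloppy HTML parses as XML:", parses_as_xml(SLOPPY_HTML))
```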

    Do you have a license from Google? If not, better read their terms of
    service. While they probably won't pursue the occasional page scraping,
    you should consider the costs before spending too much effort. Besides,
    they have APIs for most of their services, and there might be one
    that'll be much easier to use than trying to scrape the html.

    Do you have a plan for what to do when the page layout changes?

    You should look into Beautiful Soup; it's designed for parsing sloppily
    written html. I've no direct experience with it, but it gets
    recommended a lot.
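
Beautiful Soup is a third-party package, so as a standard-library illustration of the same idea (parse leniently, then write the tree back out as XML), here is a rough sketch for Python 3. The class `HtmlToXml`, the function `html_to_xml` and the synthetic `document` root element are all hypothetical names chosen for this example. It makes no attempt to match what Beautiful Soup or a browser would do with badly nested markup, and attribute names that are not legal in XML are passed through unchanged.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Build an ElementTree from possibly sloppy HTML (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        # Synthetic root, so the result always has exactly one root element.
        self.builder.start("document", {})

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "checked") get an empty string.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> become empty elements.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def result(self):
        self.close()  # flush any text the parser is still buffering
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


def html_to_xml(html):
    """Return an XML serialisation of the leniently parsed HTML."""
    parser = HtmlToXml()
    parser.feed(html)
    return ET.tostring(parser.result(), encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<html><body><p class=x>first<br>second</body></html>"))
```

The output of `html_to_xml` can be read back with `ET.fromstring`, which is a quick way to confirm that the result is well-formed XML.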


    --

    DaveA
    Dave Angel, Oct 8, 2012
    #2
