Total Beginner - Extracting Data from a Database Online (Screenshot)

logan.c.graham

Hey guys,

I'm learning Python and I'm experimenting with different projects -- I like learning by doing. I'm wondering if you can help me here:

http://i.imgur.com/KgvSKWk.jpg

What this is is a publicly-accessible webpage that's a simple database of people who have used the website. Ideally what I'd like to end up with is an excel spreadsheet with data from the columns #fb, # vids, fb sent?, # email tm.

I'd like to use Python to do it -- crawl the page and extract the data in a usable way.

I'd love your input! I'm just a learner.
 
Dave Angel

Hey guys,

I'm learning Python
Welcome.

and I'm experimenting with different projects -- I like learning by doing. I'm wondering if you can help me here:

http://i.imgur.com/KgvSKWk.jpg

What this is is a publicly-accessible webpage

No, it's just a jpeg file, an image.
that's a simple database of people who have used the website. Ideally what I'd like to end up with is an excel spreadsheet with data from the columns #fb, # vids, fb sent?, # email tm.

I'd like to use Python to do it -- crawl the page and extract the data in a usable way.

But there's no page to crawl. You may have to start by finding an OCR
tool to interpret the image as characters. Or find some other source for
your data.
 
Carlos Nepomuceno

### table_data_extraction.py ###
# Usage: tables[id][row][column]
# tables[0]        : 1st table
# tables[1][2]     : 3rd row of 2nd table
# tables[3][4][5]  : cell content of 6th column of 5th row of 4th table
# len(tables)      : quantity of tables
# len(tables[6])   : quantity of rows of 7th table
# len(tables[7][8]): quantity of columns of 9th row of 8th table

import re
import urllib2

#to retrieve the contents of the page
page = urllib2.urlopen("http://example.com/page.html").read().strip()

#to create the tables list
tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]


Pretty simple. Good luck!
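
For anyone trying this on Python 3, where urllib2 no longer exists, here is a minimal sketch of the same regex idea run against an inline HTML string instead of a live page. The sample markup is made up for illustration, and the re.I flag is an addition (not in the script above) so that lowercase tags match too. On a real page the text would come from something like urllib.request.urlopen(url).read().decode().

```python
import re

# Hypothetical sample page, invented for this example.
page = """
<table>
<tr><td>name</td><td># fb</td></tr>
<tr><td>alice</td><td>3</td></tr>
</table>
<TABLE>
<TR><TD>x</TD></TR>
</TABLE>
"""

# re.S lets '.' span newlines; re.I (an addition) ignores tag case.
FLAGS = re.S | re.I

tables = [
    [re.findall(r'<td>(.*?)</td>', r, FLAGS)
     for r in re.findall(r'<tr>(.*?)</tr>', t, FLAGS)]
    for t in re.findall(r'<table>(.*?)</table>', page, FLAGS)
]

print(len(tables))      # quantity of tables
print(tables[0][1][1])  # 2nd cell of 2nd row of 1st table
```

Note that these patterns only match bare tags. A cell written as <td class="x"> would be missed, so a real page may well need a proper HTML parser instead.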

----------------------------------------
 
Dave Angel

<SNIP>
page = urllib2.urlopen("http://example.com/page.html").read().strip()

#to create the tables list
tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]


Pretty simple. Good luck!

Only if the page is html, which the OP's was not. It was an image. Try
parsing that with regex.
 
Chris Angelico

http://i.imgur.com/KgvSKWk.jpg

What this is is a publicly-accessible webpage...

If that's a screenshot of something that we'd be able to access
directly, then why not just post a link to the actual thing? More
likely I'm thinking it's NOT publicly accessible, which is why it's
been censored.

ChrisA
 
logan.c.graham

Sorry to be unclear -- it's a screenshot of the webpage, which is publicly accessible, but it contains sensitive information. A bad combination, admittedly, and something that'll be soon fixed.
 
John Ladasky

#to create the tables list
tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]


Pretty simple.

Two nested list comprehensions, with regex pattern matching?

Logan did say he was a "total beginner." :^)
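
For a beginner, the one-liner may be easier to follow when unrolled into plain for loops. This is only a sketch of what the nested comprehensions do, run on a tiny made-up HTML string; the variable names are chosen for illustration.

```python
import re

# Hypothetical one-table page, invented for this example.
page = ("<TABLE><TR><TD>a</TD><TD>b</TD></TR>"
        "<TR><TD>c</TD><TD>d</TD></TR></TABLE>")

tables = []
# Outer loop: one pass per <TABLE>...</TABLE> block found in the page.
for table_html in re.findall('<TABLE>(.*?)</TABLE>', page, re.S):
    rows = []
    # Middle loop: one pass per <TR>...</TR> block inside that table.
    for row_html in re.findall('<TR>(.*?)</TR>', table_html, re.S):
        # Innermost step: collect the text of every <TD> cell in the row.
        cells = re.findall('<TD>(.*?)</TD>', row_html, re.S)
        rows.append(cells)
    tables.append(rows)

print(tables)
```

The result is a list of tables, each a list of rows, each a list of cell strings, which is the same shape the comprehension builds.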
 
logan.c.graham

#to create the tables list
tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
Pretty simple.



Two nested list comprehensions, with regex pattern matching?



Logan did say he was a "total beginner." :^)



Oh goodness, yes, I have no clue.
 
Carlos Nepomuceno

----------------------------------------
Date: Mon, 27 May 2013 17:58:00 -0700
Subject: Re: Total Beginner - Extracting Data from a Database Online (Screenshot)
From: (e-mail address removed)
To: (e-mail address removed) [...]

Oh goodness, yes, I have no clue.

For example:

# to retrieve the contents of all column '# fb' (11th column from the image you sent)

c11 = [tables[0][r][10] for r in range(len(tables[0]))]
#      ----------------                -------------
#      this is the content             this is the quantity
#      of the 11th cell                of rows in tables[0]
#      of row 'r'
 
Phil Connell


Oh goodness, yes, I have no clue.

For example:

# to retrieve the contents of all column '# fb' (11th column from the image you sent)

c11 = [tables[0][r][10] for r in range(len(tables[0]))]

Or rather:

c11 = [row[10] for row in tables[0]]

In most cases, range(len(x)) is a sign that you're doing it wrong :)
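
Since the original goal was a spreadsheet, here is a rough sketch of iterating over rows directly to pick out a few columns and write them as CSV, which Excel can open. The table contents and the column positions in wanted are made up for illustration; on the real page the positions would have to be checked against the actual table.

```python
import csv
import io

# Hypothetical table in the same list-of-rows shape as tables[0].
table = [
    ["name", "# fb", "# vids"],
    ["alice", "3", "7"],
    ["bob", "5", "2"],
]

# Assumed column positions to keep (0-based).
wanted = [1, 2]

# Iterate over the rows themselves; no range(len(...)) needed.
selected = [[row[i] for i in wanted] for row in table]

# Write to an in-memory buffer here; a real script would more likely use
# open("out.csv", "w", newline="") instead of io.StringIO().
buffer = io.StringIO()
csv.writer(buffer).writerows(selected)
csv_text = buffer.getvalue()
print(csv_text)
```

If the row number is needed as well, enumerate(table) gives index and row together without falling back to range(len(table)).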
 
