Mining strings from an HTML document.

Derick van Niekerk

Hi,

I am new to Python and have been doing most of my work with PHP until
now. I find Python to be *much* nicer for the development of local apps
(running on my machine) but I am very new to the Python way of thinking
and I don't really know where to start other than just by doing it...so
far I'm just through the tut :)

My problem is as follows:
I have an html file with a list of records from a database. The list of
records is delimited with a comment and the format is as follows:

<!-- comment first-->
<a href="slfdhksah kkshdfksahdf">Record 1</a>
<b>Field1</b>Data data data<br><b>Field2</b>Data data
data<br><b>Field3</b>Data data data<br><b>Field4</b>Data data data<br>

<a href="slfdhksah kkshdfksahdf">Record 2</a>
<b>Field1</b>Data data data<br><b>Field2</b>Data data
data<br><b>Field3</b>Data data data<br><b>Field4</b>Data data data<br>

<a href="slfdhksah kkshdfksahdf">Record 3</a>
<b>Field1</b>Data data data<br><b>Field2</b>Data data
data<br><b>Field3</b>Data data data<br><b>Field4</b>Data data data<br>
<!-- comment last-->

The data fields could be up to 2 or 3 paragraphs each. The number and
names of fields may differ between records (some info in one, but not
the other - ie null values do not show up in the html)

What are the string functions I would use and how would I use them? I
saw something about html parsing in python, but that might be overkill.
Babysteps.

Thanks
 
Derick van Niekerk

Thanks, Jay!

I'll try this out today. Trying to write my own parser is such a pain.
This BeautifulSoup script is very nice! I'll give it a try.

If you can help me out with an example of how to do what I explained, I
would appreciate it. I actually finished doing an import last night,
but there is no way I'm creating another parser from scratch!

I tried figuring out what to do by going through the code, but I am
still waay too fresh to understand generators and some of the coding
conventions.

Thanks again
 
Derick van Niekerk

I'm battling to understand this. I am switching to python while in a
production environment so I am tossed into the deep end. Python seems
easier to learn than other languages, but some of the conventions still
trip me up. Thanks for the link - I'll have to go through all the
previous chapters to understand this one though...

I suppose very few books on python start off with HTML processing
instead of 'hello world' :p

Could you give me an example of how to use it to extract basic
information from the web page? I need a bit of a hit-the-ground-running
approach to python. You'll see that the data in my example isn't
encapsulated in tags - is there still an easy way to extract it using
the parser module?

Thanks
 
Derick van Niekerk

Runsun Pan helped me out with the following:

You can also try the following very primitive solution that I sometimes
use to extract simple information in a quick and dirty way:

def extract(text,s1,s2):
    ''' Extract strings wrapped between s1 and s2.
    ['test', 'extract()', 'does multiple extract']

    '''
    beg = [1,0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
    return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]


This will help out a *lot*! Thank you. This is a better bet than the
parser in this particular implementation because the data I need is not
encapsulated in tags! Field names are within <b></b> tags followed by
plain text data and ended with a <br> tag. This was my main problem
with a parser, but your extract function solves it beautifully!
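
As a rough sketch of how that might look on a single record: the sample string and the `fields` dict below are made up for illustration, and the function body is copied from the post above with only the indentation restored.

```python
def extract(text, s1, s2):
    """Return the substrings of text that sit between s1 and s2."""
    beg = [1, 0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    end = [len(tmp), len(tmp) + 1][text.endswith(s2)]
    return [x.split(s2)[0] for x in tmp if len(x.split(s2)) > 1][:end]


# Hypothetical record in the layout described in the original question.
record = ('<a href="slfdhksah kkshdfksahdf">Record 1</a>\n'
          '<b>Field1</b>Data one<br><b>Field2</b>Data two<br>')

# Field names sit between <b> and </b>.
names = extract(record, '<b>', '</b>')

# Field data sits between </b> and the next <br>.
values = extract(record, '</b>', '<br>')

# Pair them up; this assumes every field name has exactly one data block.
fields = dict(zip(names, values))
print(fields)
```

This only holds up while the data itself contains no `<b>` or `<br>` tags; multi-paragraph fields with embedded markup would need something sturdier.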

I'm posting back to the NG just in case it is of value to anyone
else.

Could you/anyone explain the 4 lines of code to me though? A crash
course in Python shorthand? What does it mean when you use two sets of
brackets as in : beg = [1,0][text.startswith(s1)] ?

Thanks for the help!
-d-
 
Cameron Laird

.
.
.
I suppose very few books on python start off with HTML processing
instead of 'hello world' :p
.
.
.
.... very few, perhaps, but how many do you need when the
one example is so strong? In any case, you'll want to look
into *Text Processing in Python* <URL: http://gnosis.cx/TPiP/ >.
 
Magnus Lycka

Derick said:
Could you/anyone explain the 4 lines of code to me though? A crash
course in Python shorthand? What does it mean when you use two sets of
brackets as in : beg = [1,0][text.startswith(s1)] ?

It's not as strange as it looks. [1,0] is a list. If you put []
after a list, it's for indexing, right? (Unless there's one or
two ':' somewhere, in which case it's slicing.)

text.startswith(s1) evaluates to True or False, which is equivalent
to 1 or 0 in a numerical context. [1,0][0] is 1, and [1,0][1] is
0, so you could say that it's a somewhat contrived way of writing
"beg = int(not text.startswith(s1))" or "beg = 1 - text.startswith(s1)"
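
A small sketch of that equivalence, in case it helps to try it at the interpreter. The sample strings are made up, and the last form assumes a Python version that has conditional expressions:

```python
starts = "AderickB".startswith("A")       # True
no_start = "--AderickB".startswith("A")   # False

# A bool used as a list index behaves like the integer 1 or 0.
a = [1, 0][starts]      # index 1 picks the second item
b = [1, 0][no_start]    # index 0 picks the first item

# The alternative spellings mentioned above give the same numbers.
c = int(not starts)
d = 1 - no_start

# A more explicit way to write the same switch.
e = 0 if starts else 1

print(a, b, c, d, e)
```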
 
Runsun Pan

def extract(text,s1,s2):
    ''' Extract strings wrapped between s1 and s2.
    ['test', 'extract()', 'does multiple extract']

    '''
    beg = [1,0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
    return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]

Could you/anyone explain the 4 lines of code to me though? A crash
course in Python shorthand? What does it mean when you use two sets of
brackets as in : beg = [1,0][text.startswith(s1)] ?

The idea is using .split( ) to cut the string in different manners.
For a string:

-----AderickB----ArunsunB------

first cut at A gives you [-----, derickB----, runsunB------] (line-1,2)
2nd cut at B gives you [ derick, runsun ] (line-3,4)

The function uses list comprehension heavily. As Magnus already explained,
line-1 is just a switch. Same as line-3. These two lines exist to solve the
difference between

-----AderickB----ArunsunB------
AderickB----ArunsunB------

or

-----AderickB----ArunsunB------
-----AderickB----ArunsunB

That is, if the original raw string starts with s1 or ends with s2, special
consideration should be taken.

Line-2 and -4 are just ordinary list slicing, which you should be
able to find in any python tutorial.
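
The four lines can be stepped through one at a time on the example string. This is only a sketch; the variable `pieces` and the name `result` are added here for illustration and are not in the original function:

```python
text = "-----AderickB----ArunsunB------"
s1, s2 = "A", "B"

# Line 1: the switch. text does not start with s1, so beg is 1.
beg = [1, 0][text.startswith(s1)]

# Line 2: first cut at s1, then drop the leading piece before the first s1.
pieces = text.split(s1)   # ['-----', 'derickB----', 'runsunB------']
tmp = pieces[beg:]        # ['derickB----', 'runsunB------']

# Line 3: the second switch. text does not end with s2, so end is len(tmp).
end = [len(tmp), len(tmp) + 1][text.endswith(s2)]

# Line 4: second cut at s2, keeping only pieces that actually contain s2.
result = [x.split(s2)[0] for x in tmp if len(x.split(s2)) > 1][:end]

print(result)   # ['derick', 'runsun']
```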

Let us know if it's still not clear.

--
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
Runsun Pan, PhD
(e-mail address removed)
Nat'l Center for Macromolecular Imaging
http://ncmi.bcm.tmc.edu/ncmi/
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
 
Derick van Niekerk

Thanks Guys!

I wrote several functions yesterday to import from different types
of raw data including html and different text formats. In the end I
never used the extract function or the parser module, but your advice
put me on the right track. All these functions are now in a single
object and the inner workings are abstracted (as much as python
allows). So a single object can now import from any file without me
having to worry about what file it is!

Might not sound like much, but the whole OOP thing is new to me too, so
I am very happy with what python could do for me.

Now just to get this stuff into MySQL...new topic :)

Thanks for all your help!
-d-
 
