Rio
Hi, my goal is to extract data from table cells on certain pages
(sportsbook odds).
I'm using Java to achieve this. I get the source of the page, place it in a
string, and pass that string (basically the HTML source of the page) to
methods that cut it up step by step.
First they find whatever is between
<table.....any text including data, attributes and other tags up
until...../table>
Whatever is in there belongs to table 1. That substring is cut out and
passed to another method that finds
<tr............anything..................../tr>
That's row 1. That substring is cut out and passed to another method that
finds
<td...............anything............../td> and finally strips the other tags
and extracts the data.
In the end, every cell has its data, table number, row number and cell number.
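To make the description above concrete, here is a minimal sketch of that cutting approach. It is not my actual program: the names CellExtractor, Cell, cut and extract are made up for illustration, and it assumes every opening tag has a matching closing tag and that tables are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the "cut out table, then row, then cell" approach.
class CellExtractor {
    /** One extracted cell with its position; all numbers start at 1. */
    static final class Cell {
        final int table;
        final int row;
        final int cell;
        final String text;

        Cell(int table, int row, int cell, String text) {
            this.table = table;
            this.row = row;
            this.cell = cell;
            this.text = text;
        }
    }

    /** Cuts out every span from "<tag" up to the FIRST following "</tag>", ignoring case. */
    static List<String> cut(String html, String tag) {
        Pattern p = Pattern.compile("<" + tag + "\\b.*?</" + tag + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> parts = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    /** Walks tables, then rows, then cells, and strips any remaining tags from the cell text. */
    static List<Cell> extract(String html) {
        List<Cell> cells = new ArrayList<>();
        int t = 0;
        for (String table : cut(html, "table")) {
            t++;
            int r = 0;
            for (String row : cut(table, "tr")) {
                r++;
                int c = 0;
                for (String cell : cut(row, "td")) {
                    c++;
                    String text = cell.replaceAll("<[^>]*>", "").trim();
                    cells.add(new Cell(t, r, c, text));
                }
            }
        }
        return cells;
    }
}
```

Calling CellExtractor.extract(pageSource) on a well-formed page gives one Cell per <td, each carrying its table, row and cell number.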
The program works for the great majority of pages I'm trying to extract data
from. It obviously fails when it encounters a table within a table:
<table (1).....
<table (2 within table 1)....
..........
.../table (2)>
/table (1)>
because it cuts from the first <table opening tag to the first /table> closing
tag. That's a problem I can deal with, though.
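One way to deal with the nested case is to keep a depth counter instead of stopping at the first closing tag. This is only a sketch under assumptions: NestedTableCutter and firstOuterTable are hypothetical names, and it assumes no attribute value contains a ">" character and that the table tags themselves are balanced.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: find the closing tag that matches an outer table.
class NestedTableCutter {
    // Matches an opening "<table ...>" or a closing "</table>" tag, ignoring case.
    private static final Pattern TABLE_TAG =
            Pattern.compile("<(/?)table\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the first outermost table, including any tables nested inside it,
     * or null if no complete table is found.
     */
    static String firstOuterTable(String html) {
        Matcher m = TABLE_TAG.matcher(html);
        int depth = 0;   // how many tables are currently open
        int start = -1;  // index where the outermost table begins
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) {
                    start = m.start();
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    return html.substring(start, m.end());
                }
            }
        }
        return null;
    }
}
```

The inner table stays inside the returned string, so the same method can be applied again to the content of a cell to pull out table 2.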
NOW, THE PROBLEM!
But one particular page is giving me a headache. I noticed my program wrongly
counts cells and rows, misplaces data, etc. I wrote a method TO COUNT EACH
OCCURRENCE of opening and closing <table, <tr and <td tags and found out that
THE NUMBER OF OPENING AND CLOSING TAGS IS NOT THE SAME, so I can't
design a program that correctly finds what I want.
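For reference, the counting check looks roughly like the sketch below. TagCounter and count are hypothetical names for illustration, not my exact code; the match is case-insensitive and looks only at the start of each tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of counting opening and closing tags.
class TagCounter {
    /** Returns {opening, closing} counts for the given tag name, ignoring case. */
    static int[] count(String html, String tag) {
        // "\\b" keeps "<tr" from also matching longer names that merely start with "tr".
        Pattern p = Pattern.compile("<(/?)" + Pattern.quote(tag) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        int opening = 0;
        int closing = 0;
        while (m.find()) {
            if (m.group(1).isEmpty()) {
                opening++;
            } else {
                closing++;
            }
        }
        return new int[] {opening, closing};
    }
}
```

Running count(pageSource, "table"), count(pageSource, "tr") and count(pageSource, "td") and comparing the two numbers in each result is how the mismatch shows up.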
THE QUESTION IS: How is that possible, and how does IE know where one table
(or table row, or cell) starts and where it ends? Is it possible that some
<table, <tr or <td tags actually only serve to describe attributes of that
table or row? If so, how can I recognize them?
Big thanks to anyone who just reads this!