Search multiple pages

Joined
Sep 4, 2023
Messages
5
Reaction score
0
Hi, I'm completely new to scripting / coding etc, but looking for some pointers.

What I'd like to do is to search for text in a series of web pages. For example: https://www.karaoke-version.co.uk/suggestion_7.html

What I'd like to do is pull up each page in sequence (suggestion_7, suggestion_8 etc.) and search for text within it, to find out which page that text appears on.

I learned BASIC as a kid and would have done this with a FOR / NEXT loop, but I don't know how to go about it in this context.

Any help would be appreciated!
 
Joined
Jul 4, 2023
Messages
376
Reaction score
42
For this case I suggest using Python with the "requests" package to fetch the web pages and the "beautifulsoup4" package to parse the HTML content, e.g.
Python:
import requests
from bs4 import BeautifulSoup

# Define the URL pattern and search string
base_url = "https://www.karaoke-version.co.uk/suggestion_{}.html"
search_string = "Dirty Love"

# Define the range of pages to scrape
start_page = 9
end_page = 11  # Change this to the last page you want to scrape

# Loop through the pages
for page_num in range(start_page, end_page + 1):
    url = base_url.format(page_num)
    print("\n", url, sep="")

    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Search for the desired string in the HTML
        print(f"String: {search_string}")
        if search_string not in soup.get_text():
            print("not ", end="")
        print(f"found on page {page_num}")
    else:
        print(f"Failed to retrieve page {page_num} (status code {response.status_code})")
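
For anyone coming from BASIC, the loop above is the Python counterpart of FOR / NEXT. The short sketch below only builds and prints the page URLs, without fetching anything, so the loop itself can be tried in isolation. The page range 9 to 11 is just the example range used above, and the name `urls` is made up for illustration.

```python
# Example values taken from the script above; change them as needed.
base_url = "https://www.karaoke-version.co.uk/suggestion_{}.html"
start_page = 9
end_page = 11

# BASIC equivalent:  FOR N = 9 TO 11 ... NEXT N
# range() stops one short of its second argument, hence the "+ 1".
urls = [base_url.format(n) for n in range(start_page, end_page + 1)]

for url in urls:
    print(url)
```

Once this prints the expected addresses, the fetching and searching steps can be added inside the same loop.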
 
Joined
Jul 4, 2023
Messages
376
Reaction score
42
Here is the full code:

Python:
import requests
from bs4 import BeautifulSoup

# Define the URL pattern and search string
base_url = "https://www.karaoke-version.co.uk/suggestion_{}.html"
search_string = "Dirty Love"

# Define the range of pages to scrape
start_page = 9
end_page = 11  # Change this to the last page you want to scrape

# Loop through the pages
for page_num in range(start_page, end_page + 1):
    url = base_url.format(page_num)
    print("\n", url, sep="")

    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Search for the desired string in the HTML
        print(f"String: {search_string}")
        if search_string not in soup.get_text():
            print("not ", end="")

        print(f"found on page {page_num}")
    else:
        print(f"Failed to retrieve page {page_num} (status code {response.status_code})")

Here is an example of how the result looks if you run the code from the IDLE shell that comes with Python:

search.png


Make sure you have the requests and BeautifulSoup modules installed in your Python environment before running this code. You can install them using pip: pip install requests beautifulsoup4

search2.png
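
If you would rather get a list of matching page numbers than a line of output per page, something like the following might work. It is only a sketch: `find_pages` and `fetch_with_requests` are names invented here, the one-second pause and ten-second timeout are arbitrary choices, and the comparison ignores upper/lower case, which the script above does not.

```python
import time


def find_pages(fetch_text, page_numbers, needle, delay=1.0):
    """Return the page numbers whose text contains needle (case-insensitive)."""
    hits = []
    for page_num in page_numbers:
        text = fetch_text(page_num)  # None means the page could not be fetched
        if text is not None and needle.casefold() in text.casefold():
            hits.append(page_num)
        time.sleep(delay)  # short pause between requests
    return hits


def fetch_with_requests(page_num):
    """Fetch one suggestion page and return its text, or None on failure."""
    # Imported here so find_pages() can be tried without these packages installed.
    import requests
    from bs4 import BeautifulSoup

    url = f"https://www.karaoke-version.co.uk/suggestion_{page_num}.html"
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    return BeautifulSoup(response.content, "html.parser").get_text()
```

It could then be called as `find_pages(fetch_with_requests, range(9, 12), "Dirty Love")`. Passing the fetch function in as an argument keeps the search logic separate from the network code, so it can be checked against a few hand-written strings first.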
 
Joined
Sep 4, 2023
Messages
5
Reaction score
0
Hi, I've finally got this running – sort of. The problem I'm encountering is that it works for 'normal' text on the page, but not for the hyperlinked entries – it just returns 'not found' for each page. I've tested this by using "song" as a string, and it displays "found on page", but using text relating to a song or artist doesn't work. Is this a limitation of the get_text function?
 
Joined
Sep 4, 2022
Messages
129
Reaction score
16
When you query the HTML objects, maybe the Python script only looks at the pure plain text, while attributes, URLs, and tags stay in the HTML elements themselves.
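
As far as I know, get_text() also returns the text inside <a> links, so link text on its own should not be the limitation. Other possible causes, none of which I can confirm for this site, are a difference in capitalisation, non-breaking spaces or line breaks inside the title, or the server sending a script a different page than it sends a browser. The sketch below covers only the first two; `normalize`, `contains`, and the sample string are made up for illustration.

```python
import unicodedata


def normalize(text):
    """Fold case and collapse all whitespace (including non-breaking spaces)."""
    text = unicodedata.normalize("NFKC", text)  # turns U+00A0 into a plain space
    return " ".join(text.split()).casefold()


def contains(page_text, needle):
    """Return True if needle occurs in page_text after normalization."""
    return normalize(needle) in normalize(page_text)


# Made-up sample with a line break and a non-breaking space in the title
sample = "Some Artist\n   Dirty\u00a0Love"

print("Dirty Love" in sample)          # False: the exact match fails
print(contains(sample, "dirty love"))  # True: the normalized match succeeds
```

To use it with the earlier script, replace the `in soup.get_text()` test with `contains(soup.get_text(), search_string)`. If that still reports "not found", printing part of `soup.get_text()` would show what the script actually received.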
 
Joined
Sep 4, 2023
Messages
5
Reaction score
0
This is over my head, I'm afraid! Is there a way to make beautiful soup search the whole displayed page?
 
Joined
Jul 4, 2023
Messages
376
Reaction score
42
beautiful soup search the whole displayed page?
Yes, with Beautiful Soup you can easily inspect the DOM of a web page. Check this example:
Python:
import requests
from bs4 import BeautifulSoup

# URL of the page to analyze
url = 'https://www.karaoke-version.co.uk/suggestion_9.html'

# Fetch the page content
response = requests.get(url)

if response.status_code == 200:
    page_content = response.text
    # Parse the page content using Beautiful Soup
    soup = BeautifulSoup(page_content, 'html.parser')  # You can choose a different parser, e.g., 'lxml', if you prefer

    # Find all <a> links inside <ul> elements with the class "suggestions-list" and with an href attribute containing "instrumental-mp3"
    links = soup.select('ul.suggestions-list a[href*="instrumental-mp3"]')

    # Now, the variable 'links' contains only the links that meet the criteria

    for link in links:
        print(link)  # Print the found links
        print(link['href'])
        print('-' * 50)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

This is similar to
JavaScript:
const a = document.querySelectorAll('ul[class*=suggestions-list] a[href*="instrumental-mp3"]');
console.log(a);
or
JavaScript:
const a = document.querySelectorAll('ul.suggestions-list a[href*="instrumental-mp3"]');
console.log(a);

for the HTML code from: https://www.karaoke-version.co.uk/suggestion_9.html
1698694781624.png


[ CSS Selector Reference ]
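
To tie this back to the original question, the selected links can be reduced to those whose visible text contains the search string. This is a sketch under assumptions: `filter_links` and `extract_links` are invented names, and the selector is the one from the example above, which may stop matching if the site changes its markup.

```python
def filter_links(pairs, needle):
    """Keep the (text, href) pairs whose text contains needle, ignoring case."""
    needle = needle.casefold()
    return [(text, href) for text, href in pairs if needle in text.casefold()]


def extract_links(html, selector="a[href]"):
    """Return (link text, href) pairs for every element matching selector."""
    # Imported here so filter_links() can be tried without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.select(selector)]
```

With `response` fetched as in the example above, a call could look like `filter_links(extract_links(response.text, 'ul.suggestions-list a[href*="instrumental-mp3"]'), "Dirty Love")`. An empty list means no link text on that page contained the string.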
 
