Search multiple pages

antonuzzo · Sep 4, 2023

Hi, I'm completely new to scripting / coding etc, but looking for some pointers.

What I'd like to do is to search for text in a series of web pages. For example: https://www.karaoke-version.co.uk/suggestion_7.html

I want to pull up each page in sequence (suggestion_7, suggestion_8, etc.) and search for text within it, to find out which page that text appears on.

I learned BASIC as a kid and would have done this with a FOR / NEXT loop, but I don't know how to go about it in this context.

Any help would be appreciated!

VBService · Sep 4, 2023

For this case, I suggest using Python with the "requests" package to fetch the web pages and the "beautifulsoup4" package to parse the HTML content, e.g.

Python:

# Define the URL pattern and search string
base_url = "https://www.karaoke-version.co.uk/suggestion_{}.html"
search_string = "Dirty Love"

# Define the range of pages to scrape
start_page = 9
end_page = 11  # Change this to the last page you want to scrape

# Loop through the pages
for page_num in range(start_page, end_page + 1):
    url = base_url.format(page_num)
    print("\n", url, sep="")

    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Search for the desired string in the HTML
        print(f"String: {search_string}")
        if search_string not in soup.get_text():
            print("not ", end="")
        print(f"found on page {page_num}")
    else:
        print(f"Failed to retrieve page {page_num} (status code {response.status_code})")
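
Since you mentioned FOR / NEXT: the loop above is roughly the Python version of FOR N = 9 TO 11 ... NEXT N. Here is a minimal sketch of just that part, with no web requests, so you can see which URLs the loop would visit. The `urls` list is only there for illustration; the page numbers are the same example values as above.

```python
# Rough Python equivalent of BASIC's  FOR N = 9 TO 11 ... NEXT N
base_url = "https://www.karaoke-version.co.uk/suggestion_{}.html"
start_page = 9
end_page = 11

# range() stops one short of its second argument, hence the "+ 1"
urls = [base_url.format(n) for n in range(start_page, end_page + 1)]

for url in urls:
    print(url)
```

The main thing to watch is that `range(9, 11)` on its own would stop at page 10, which is why the code adds 1 to `end_page`.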

antonuzzo · Sep 5, 2023

Thank you so much – now I just need to figure out how to use Python!

Really appreciated

VBService · Sep 5, 2023

Here is the full code, including the imports:

Python:

import requests
from bs4 import BeautifulSoup

# Define the URL pattern and search string
base_url = "https://www.karaoke-version.co.uk/suggestion_{}.html"
search_string = "Dirty Love"

# Define the range of pages to scrape
start_page = 9
end_page = 11  # Change this to the last page you want to scrape

# Loop through the pages
for page_num in range(start_page, end_page + 1):
    url = base_url.format(page_num)
    print("\n", url, sep="")

    # Send an HTTP GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Search for the desired string in the HTML
        print(f"String: {search_string}")
        if search_string not in soup.get_text():
            print("not ", end="")
        print(f"found on page {page_num}")
    else:
        print(f"Failed to retrieve page {page_num} (status code {response.status_code})")

You can run the code from the IDLE Shell that comes built in with Python to see the result.

Make sure you have the requests and BeautifulSoup modules installed in your Python environment before running this code. You can install them using pip: pip install requests beautifulsoup4
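
If you are not sure whether the install worked, a quick check along these lines may help. It is only a sketch: `missing_modules` is a helper name made up for this example, and it relies on the standard library's `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# "bs4" is the import name of the beautifulsoup4 package
missing = missing_modules(["requests", "bs4"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("requests and bs4 are both available")
```

Note that you install `beautifulsoup4` but import `bs4`, which is an easy thing to trip over.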

antonuzzo · Oct 27, 2023

Hi, I've finally got this running – sort of. The problem I'm encountering is that it works for 'normal' text on the page, but not for the hyperlinked entries – it just returns 'not found' for each page. I've tested this by using "song" as a string, and it displays "found on page", but using text relating to a song or artist doesn't work. Is this a limitation of the get_text function?

FResher · Oct 27, 2023

antonuzzo said:
Hi, I've finally got this running – sort of. The problem I'm encountering is that it works for 'normal' text on the page, but not for the hyperlinked entries – it just returns 'not found' for each page. I've tested this by using "song" as a string, and it displays "found on page", but using text relating to a song or artist doesn't work. Is this a limitation of the get_text function?

Possibly. When the script queries the HTML objects, it may be looking only at the pure plain text, while attributes, URLs, and tags are kept separately as HTML elements.
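
One way to test that guess without touching the website is to run `get_text()` on a small piece of HTML you control. The sketch below uses made-up markup (not the real page) and two hypothetical helpers, `normalize` and `page_contains`. In this sample the link text is included in `get_text()`, but an exact match still fails because the text is split across lines, so differences in case or whitespace are one possible cause worth ruling out. Whether that is what happens on the real page is an assumption; the server may also send a script different HTML than it sends a browser.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only loosely modelled on a suggestions page
sample_html = """
<p>Vote for a song</p>
<ul>
  <li><a href="/song-1.html">Dirty
      Love</a> by <a href="/artist-1.html">Some Artist</a></li>
</ul>
"""

def normalize(text):
    """Collapse runs of whitespace to one space and ignore case."""
    return " ".join(text.split()).casefold()

def page_contains(html, search_string):
    """Case- and whitespace-insensitive search of a page's text."""
    page_text = BeautifulSoup(html, "html.parser").get_text(" ")
    return normalize(search_string) in normalize(page_text)

text = BeautifulSoup(sample_html, "html.parser").get_text()

# Exact match: fails here because the link text contains a line break
print("Dirty Love" in text)

# Normalized match: finds the same link text
print(page_contains(sample_html, "dirty love"))
```

If the normalized search also fails on the real page, printing `soup.get_text()` for one page and reading through it should show what the script actually received.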

antonuzzo · Oct 30, 2023

This is over my head, I'm afraid! Is there a way to make beautiful soup search the whole displayed page?

VBService · Oct 30, 2023

antonuzzo said:
beautiful soup search the whole displayed page?

Yes, with Beautiful Soup you can easily inspect the DOM of a web page. Check this example:

Python:

import requests
from bs4 import BeautifulSoup

# URL of the page to analyze
url = 'https://www.karaoke-version.co.uk/suggestion_9.html'

# Fetch the page content
response = requests.get(url)

if response.status_code == 200:
    page_content = response.text
    # Parse the page content using Beautiful Soup
    soup = BeautifulSoup(page_content, 'html.parser')  # You can choose a different parser, e.g., 'lxml', if you prefer

    # Find all <a> links inside <ul> elements with the class "suggestions-list" and with an href attribute containing "instrumental-mp3"
    links = soup.select('ul.suggestions-list a[href*="instrumental-mp3"]')

    # Now, the variable 'links' contains only the links that meet the criteria

    for link in links:
        print(link)  # Print the found links
        print(link['href'])
        print('-' * 50)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

This is similar to

JavaScript:

const a = document.querySelectorAll('ul[class*=suggestions-list] a[href*="instrumental-mp3"]');
console.log(a);

or

JavaScript:

const a = document.querySelectorAll('ul.suggestions-list a[href*="instrumental-mp3"]');
console.log(a);

for the HTML code from: https://www.karaoke-version.co.uk/suggestion_9.html

[ CSS Selector Reference ]
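
To tie this back to the original question, the same selector can be used to search only the text of the song links. The following is a sketch under assumptions: the markup is invented to mimic the structure the selector expects, and `find_matching_links` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure the selector above expects
sample_html = """
<ul class="suggestions-list">
  <li><a href="/instrumental-mp3/some-artist/dirty-love.html">Dirty Love</a></li>
  <li><a href="/instrumental-mp3/other-artist/another-song.html">Another Song</a></li>
  <li><a href="/help.html">Dirty Love FAQ</a></li>
</ul>
"""

def find_matching_links(html, search_string):
    """Return (text, href) pairs for suggestion links whose text contains search_string."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select('ul.suggestions-list a[href*="instrumental-mp3"]')
    wanted = search_string.casefold()  # compare without regard to case
    return [
        (link.get_text(strip=True), link["href"])
        for link in links
        if wanted in link.get_text().casefold()
    ]

for text, href in find_matching_links(sample_html, "dirty love"):
    print(text, "->", href)
```

Inside the page loop from the earlier post, you could pass `response.text` to this function in place of `sample_html` and print the page number whenever the returned list is not empty.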
