What is Web Scraping
Web scraping is a data extraction technique where automated scripts or tools retrieve and parse the HTML or other markup language of web pages to extract specific information. This process involves sending HTTP requests to target websites, retrieving the page content, and utilizing parsing techniques to isolate and collect relevant data elements such as text, images, or links. Web scraping is employed for various purposes, including data analysis, research, or aggregating information from multiple sources. However, it should be conducted ethically and in accordance with the terms of service of the targeted websites to avoid legal and ethical concerns.
In a nutshell, the meaning of web scraping is extracting data from a website.
How Python can be Used for Web Scraping
Python is widely used for web scraping due to its simplicity, rich libraries, and powerful tools. Developers often leverage libraries such as BeautifulSoup and lxml to parse HTML content easily, while the requests library is employed for making HTTP requests to websites. Additionally, frameworks like Scrapy provide a more structured approach for building web scrapers. The combination of these tools allows developers to send requests to web pages, retrieve HTML content, and extract desired data efficiently. Python's readability and ease of use make it a popular choice for web scraping projects, enabling developers to create effective scripts for data extraction, analysis, and automation. However, it's essential to be mindful of ethical considerations and comply with the terms of service of the targeted websites to ensure responsible and legal use of web scraping with Python.
Tools and Libraries Available for Web Scraping
The most useful libraries and frameworks available in Python for web scraping are:
- requests: an HTTP library for sending requests to websites and reading their responses.
- BeautifulSoup: a parsing library for navigating and searching HTML and XML documents.
- lxml: a fast XML and HTML parser with XPath support.
- Scrapy: a full web crawling and scraping framework for building spiders at scale.
- Selenium: a browser automation tool that renders JavaScript, useful for dynamic pages.
- Requests-HTML: an HTML parsing library built on top of requests, with CSS selector, XPath, and JavaScript rendering support.
- MechanicalSoup: a library that automates form submission and website interaction, built on requests and BeautifulSoup.
- PyQuery: a jQuery-like API for querying HTML documents.
- Urllib: Python's built-in module for working with URLs and making HTTP requests.
How Web Scraping Works
Web scraping works by sending an HTTP request to a targeted website, retrieving the HTML content of the webpage, parsing the HTML to identify and extract specific data elements, and then storing or using the extracted information. Tools and libraries like requests, BeautifulSoup, and lxml in Python are commonly employed for automating this process. Ethical considerations and adherence to a website's terms of service are crucial to ensure responsible and legal web scraping.
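As a minimal sketch of that workflow, assuming the crawler-test.com demo site used throughout this article, the four steps look like this in Python:

# 1. Send an HTTP request and 2. retrieve the HTML
import requests
from bs4 import BeautifulSoup

response = requests.get('https://crawler-test.com/')

# 3. Parse the HTML to identify data elements
soup = BeautifulSoup(response.text, 'html.parser')

# 4. Extract and store the information (here, all link URLs)
links = [a.get('href') for a in soup.find_all('a')]
print(links[:5])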
Install Python Web Scraping Libraries
To install Python web scraping libraries, open a terminal or command prompt and use pip, the Python package installer. Ensure Python and pip are installed on your system, and if you are working within a virtual environment, activate it before running the commands to keep your project dependencies isolated.
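Collected as shell commands, these install the libraries used in this article:

$ pip install requests          # making HTTP requests
$ pip install beautifulsoup4    # HTML parsing with BeautifulSoup
$ pip install lxml              # high-performance XML and HTML processing
$ pip install scrapy            # comprehensive web scraping framework
$ pip install selenium          # browser automation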
Python Web Scraping with Requests
To perform web scraping in Python, use an HTTP request library (e.g. requests) along with a parser library (e.g. BeautifulSoup), or use a browser-based library (e.g. Selenium).
I have a video tutorial to teach you how to do web scraping with the Python requests library (an alternative is the requests-HTML library).
Essentially, all you need to do is:

- Install Python
- Install the requests library

$ pip install requests

Run this code to fetch a web page.
# Import the requests library
import requests

# Define your URL
url = 'https://crawler-test.com/'

# Fetch the page
response = requests.get(url)

# Show the response
response.text
The output is the HTML of the web page in Unicode (text) format.
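It is often worth confirming that the request succeeded before parsing anything. A small sketch using standard attributes of the requests Response object:

import requests

response = requests.get('https://crawler-test.com/')
if response.ok:                   # True when the status code is below 400
    print(response.status_code)   # e.g. 200
    print(response.text[:200])    # first 200 characters of the HTML
else:
    print('Request failed with status', response.status_code)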
Python Web Scraping with BeautifulSoup
To parse the HTML of a page using BeautifulSoup in Python, install the library, fetch the HTML with an HTTP request library, and parse it using the BeautifulSoup() class.
First, install the BeautifulSoup library.

$ pip3 install beautifulsoup4
Then, import bs4, parse the page with it, and use the BeautifulSoup methods to get various HTML tags from the page.
# Fetch the HTML
import requests

url = 'https://crawler-test.com/'
response = requests.get(url)

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract any HTML tag
soup.find('title')
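find() returns only the first match; find_all() and CSS selectors via select() are also part of the BeautifulSoup API. A short sketch building on the soup object above:

soup.find('title').get_text()                  # text of the <title> tag
soup.find_all('a')                             # list of every <a> element
[a.get('href') for a in soup.find_all('a')]    # all link URLs on the page
soup.select('h1')                              # CSS selector syntax also works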
Python Web Scraping with lxml
In Python, the lxml library is a parsing library that can be used to extract information from HTML or XML. It can serve as an alternative to the BeautifulSoup library.
To use the lxml library in Python, fetch the page with the requests library and parse the HTML with the html.fromstring() method.
Install the lxml library with pip.
$ pip install lxml
Parse the textual representation of the HTML with the fromstring() method.
import requests
from lxml import html

# Fetch the page
r = requests.get('https://crawler-test.com/')

# Parse the HTML into an element tree
tree = html.fromstring(r.text)
The advantage of lxml over BeautifulSoup is that it is possible to use XPath expressions to extract data.
# Parse with XPath in lxml
tree.xpath('//*[@class="panel-header"]')
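The XPath call returns lxml element objects; to pull out their text, text_content() can be used. A short sketch reusing the tree from above (the class name is the one from the example):

# Text of each element matching the class
[el.text_content() for el in tree.xpath('//*[@class="panel-header"]')]

# XPath can also return strings directly, for example the page title
tree.xpath('//title/text()')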
Python Web Scraping with Requests-HTML
The requests-HTML library is an HTML parser that lets you use CSS selectors and XPath expressions to extract the information that you want from a web page. It also offers the capacity to perform JavaScript rendering.
To scrape a web page in Python with the requests-HTML library, first install the required Python libraries, then use the HTMLSession() class to initialize the session object. Then, perform a GET request using the .get() method.
Install the libraries:

$ pip install requests requests-html urlparse4
Scrape the website using HTMLSession() in Python.
# Scrape website with Requests-HTML
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://crawler-test.com/')
r.html.find('title')[0].text
You may get the following error if you are using a Jupyter Notebook; in that case, use the AsyncHTMLSession.

RuntimeError: Cannot use HTMLSession within an existing event loop.
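A minimal sketch of the async variant, assuming you are inside a notebook where an event loop is already running (AsyncHTMLSession is part of requests-html):

from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def get_title():
    r = await asession.get('https://crawler-test.com/')
    return r.html.find('title')[0].text

# In a Jupyter cell, await the coroutine directly:
# title = await get_title()
# In a plain script, run it through the session instead:
# results = asession.run(get_title)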
Python Web Scraping with Selenium
To use Selenium to perform web scraping in Python, install the required library and the web driver, instantiate the webdriver module, and use the get() method to open the web browser to the web page that you want to extract information from.
Selenium works by opening a browser and loading a web page, just like you would when you browse the web. This way, it renders the page so that you can scrape anything on it. Selenium is incredibly useful for automating any browser activity.
To use Selenium in Python web scraping, you need to install:

$ pip3 install webdriver-manager selenium
And then run this code to show the H1 of the page, for example.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

# Install the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Fetch the page
driver.get('https://crawler-test.com/')

try:
    # Get the h1 element
    elem = driver.find_element(By.TAG_NAME, 'h1')
    # Show scraped content
    print(elem.text)
except Exception as e:
    print('There was an error', e)

# Quit browser when done
driver.quit()
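When scraping on a server, you will often want the browser to run without opening a visible window. A sketch using Chrome options (headless mode is a standard Chrome flag; the rest follows the example above):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless=new')   # run Chrome without opening a window

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)
driver.get('https://crawler-test.com/')
print(driver.title)   # page title, no visible browser needed
driver.quit()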
Python Web Scraping with Scrapy
To scrape web pages with Scrapy, create your own custom web crawler in Python. You will need to install the required library and use either the scrapy shell or the Selector() object.
First, install the required Python libraries.

$ pip3 install scrapy pyOpenSSL lxml
For a quick overview of what you can do with Scrapy, open the Terminal and run the scrapy shell command.

$ scrapy shell
And try these few commands:

fetch('https://crawler-test.com/')
view(response)
response.status
print(response.text)
response.xpath('//title/text()').get()
You can also use the Scrapy Selector object with requests to scrape a web page.
from scrapy import Selector
import requests

url = 'https://crawler-test.com/'
response = requests.get(url)
html = response.text

# Build a Selector from the HTML string
sel = Selector(text=html)

# Extract the href attribute of every link
sel.xpath('//a/@href').extract()
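Scrapy's real strength is the full crawling framework. A minimal sketch of a standalone spider (the spider name and output field names are illustrative):

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'title_spider'                      # illustrative spider name
    start_urls = ['https://crawler-test.com/']

    def parse(self, response):
        # Yield one item with the page title and every link found
        yield {
            'title': response.xpath('//title/text()').get(),
            'links': response.xpath('//a/@href').getall(),
        }

# Run it without a full Scrapy project:
# $ scrapy runspider title_spider.py -o output.json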
Read other Python-related articles:
Popular Python Modules in 2024
Building an ATM Simulator in Python