Web Scraping (Beautiful Soup) - part 1

 

Web Scraping is the process of extracting useful information from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

Web scraping works much like a person browsing through a website's pages and copy-pasting the contents. When you run the code, it sends a request to the server, and the response you get contains the source code of the page. You then parse the response data and extract the parts you want.

Unlike a web browser, our web scraping code won’t interpret the page’s source code and display the page visually. Instead, we’ll write some custom code that filters through the page’s source code looking for specific elements we’ve specified, and extracting whatever content we’ve instructed it to extract.


Web Scraping vs Crawling

While web scraping is the process of parsing a website's source code to extract useful data, the term "Web Crawling" refers to the process of discovering websites by iteratively following links from a set of pages to find more target pages to visit.
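As a minimal sketch of that crawling loop (the function names `extract_links` and `crawl`, and the start URL, are made up for illustration; this assumes requests and bs4 are installed):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links(html, base_url):
    # Parse a page and return the absolute URLs of all links on it
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl(start_url, max_pages=5):
    # Breadth-first crawl: fetch pages and queue the links found on each
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = requests.get(url)
        queue.extend(extract_links(response.text, url))
    return seen

# crawl("https://example.com")  # hypothetical start URL
```

A real crawler would also need politeness (robots.txt, rate limiting) and error handling, which are left out here.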

Below are some popular web scraping tools in Python :

  • BeautifulSoup4 - It is used for parsing HTML and XML documents. It creates a parse tree for HTML/XML docs and extracts information from them.
  • Scrapy - An advanced framework for large-scale web scraping, e.g. scraping thousands of web pages.

----------------------------------------------------------------------------------------------------------------

BeautifulSoup4

Beautiful Soup is a Python library for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

We can use various parsers with BeautifulSoup, such as "lxml", "lxml-xml", "html.parser", or "html5lib".

NOTE : The parser you name must already be installed as part of your Python packages. For instance, html.parser is a built-in, 'with-batteries' package shipped with Python. You can install other parsers such as lxml or html5lib separately.
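A minimal sketch of picking a parser (only html.parser is assumed to be available here; swap in "lxml" or "html5lib" if you have them installed):

```python
from bs4 import BeautifulSoup

# A deliberately sloppy snippet: the <b> tag is never closed
html = "<p>Hello <b>world"

# html.parser ships with Python; "lxml" or "html5lib" could be used instead
soup = BeautifulSoup(html, "html.parser")
print(soup.b.text)  # world
```

Different parsers repair malformed HTML differently, so the same document can produce slightly different trees depending on the parser you pick.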

The typical sequence for web scraping with beautifulsoup is as follows :

  • Get the HTML source code for the requested web document.
  • Parse the HTML using the BeautifulSoup library; this returns a BeautifulSoup object which represents the whole document as a tree structure.
  • Extract the required data from tags using the find() or find_all() functions to search through the tree.


import requests
from bs4 import BeautifulSoup

# Parse a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")
print(page.prettify())

# Request an HTML page over the network
response = requests.get("https://en.wikipedia.org/wiki/Tesla,_Inc.")
page = BeautifulSoup(response.content, "html.parser")
print(page.prettify())

Finding an element by class name inside a document and extracting its data.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find <p> with the given class name
description = page.find("p", attrs={"class": "description"})

# Strip the HTML tags and extract the text
description = description.text.strip()
print(description)

Typically, the .find and .find_all methods are used to search the tree, given suitable input arguments: the tag name being sought, attribute names, and other related arguments. These arguments can be presented as a string, a regular expression, a list, or even a function. Common uses of the BeautifulSoup object include :

  • Search by CSS class
  • Search by Hyperlink address
  • Search by Element Id, tag
  • Search by attribute name and value.

If you need to filter the tree with a combination of the above criteria, you can also write a function that evaluates to True or False, and search by that function.
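A small sketch covering each of these searches on an inline snippet (the class name, id, and URL here are made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html = """
<div>
  <p class="intro" id="first">Hello</p>
  <a href="https://example.com/page">link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("p", class_="intro"))         # search by CSS class
print(soup.find_all(href=re.compile("example")))  # search by hyperlink address
print(soup.find(id="first").text)                 # search by element id: Hello
```

Each of these filters is explored in more detail in the sections below.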

----------------------------------------------------------------------------------------------------------------


Searching the Document Tree

The BeautifulSoup4 library provides two main functions to search for elements inside the parsed HTML document, which are as follows :

1] find() - It finds the first tag with the specified name, class or id, and returns it as a bs4 Tag object.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find <p> with the given class name
description = page.find("p", attrs={"class": "description"})

# Strip the HTML tags and extract the text
description = description.text.strip()
print(description)

You can also find HTML tags by their id value.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find <p> with the given id
description = page.find("p", attrs={"id": "description"})

# Extract the text from the <p> found
description = description.text.strip()
print(description)


2] find_all() - It finds all tags with the specified name, class or id, and returns them as a list (a bs4 ResultSet).


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find all <p> tags
descriptions = page.find_all("p")

# Extract text from the first <p> found
description = descriptions[0].text.strip()
print(description)

We can also extract multiple kinds of elements by passing a list of HTML tag names to the function.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find all <p> and <h1> tags
descriptions = page.find_all(["p", "h1"])

# Extract text from the first tag found
description = descriptions[0].text.strip()
print(description)


Find Tags with Id/Class

We can also find tags based on their class names and ids. The word "class" is reserved in Python, hence we use "class_" as the argument name.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find all tags with the given id
output = page.find_all(id="description")
print(output)



from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find all tags with the given class
output = page.find_all(class_="description")
print(output)


Access Tag Attributes

Once you have the required HTML tag, you can also access its attribute values.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find the first <a> tag
link = page.find("a")

# Extract the 'href' attribute
url = link["href"]
print(url)

Or you can use the get() method on the Tag object, which returns None instead of raising an error in case the attribute doesn't exist.


import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Elon_Musk")

page = BeautifulSoup(page.content,"html.parser")

# Find all <a> tags and extract links
links = page.find_all("a")

for link in links:
    print(link.get('href'))


Search for Strings

In BeautifulSoup we can also search for tags based on the string values they hold.


import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Elon_Musk")

page = BeautifulSoup(page.content,"html.parser")

# List of tags containing the given string
outputs = page.find_all(string="Bill Gates")

print(outputs[0].parent)
# <a href="/wiki/Bill_Gates" title="Bill Gates">Bill Gates</a>

NOTE : This is not a full-text search feature; it only returns tags whose value is exactly the given string, nothing more or less. If instead you want to extract tags which have the given string anywhere inside their value, we can use regular expressions.


import requests
import re
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Elon_Musk")

page = BeautifulSoup(page.content,"html.parser")

# List of all tags containing 'Tesla' keyword
outputs = page.find_all(string=re.compile("Tesla"))

print(outputs[5].parent)
# <a href="/wiki/Elon_Musk%27s_Tesla_Roadster">Tesla Roadster in space</a>


Search by Regex Expressions

In BeautifulSoup4 we can extract tags by applying regular expressions to class names, ids, strings etc.


import requests
import re
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Elon_Musk")

page = BeautifulSoup(page.content,"html.parser")

# Find all tags with keyword 'Tesla' inside
outputs = page.find_all(string=re.compile("Tesla"))
# ['Tesla stock has risen significantly','Criticism of Tesla']

# Find all tags which start with letter 'b'
outputs = page.find_all(re.compile("^b"))

# Find all tags whose names contain the letter ‘t’
outputs = page.find_all(re.compile("t"))


----------------------------------------------------------------------------------------------------------------


Tag Objects

When we pass an HTML document or string to the BeautifulSoup constructor, BeautifulSoup converts the complex HTML page into different Python objects. A Tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document.

From the BeautifulSoup object which we get by parsing the HTML document, we can extract specific tags from the parsed tree; this returns a Tag object for the corresponding HTML tag.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Get a <p> Tag object. This gets the first <p> in the document
p1 = page.p
print(p1)
# <p id="description"> Did you know there are many types of anime? </p>

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head. Using a tag name as an attribute will give you only the first tag by that name.


soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:


soup.body.b
# <b>The Dormouse's story</b>


----------------------------------------------------------------------------------------------------------------


Filter Functions

As we saw in the above examples, we can filter the HTML document based on class, id, regex etc. to find the required tags. But if none of these work for you, we can define a "Filter Function" to create a complex filter.

We can pass a filter function to the find_all() function. This function takes a single Tag as its argument and should return a Boolean value.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Filter function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# Find all tags with a class but no id attribute
soup = page.find_all(has_class_but_no_id)
print(soup)

We can also apply filters to specific attributes rather than the entire HTML tag.


import requests
import re
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Elon_Musk")

page = BeautifulSoup(page.content ,"html.parser")

# Filter function
def elon_link(href):
    return href and re.compile("Elon").search(href)

# Find all <a> whose href contains the keyword "Elon"
soup = page.find_all(href=elon_link)

for link in soup:
    print(link)



----------------------------------------------------------------------------------------------------------------


Navigating the HTML Tree

When we parse an HTML document with BeautifulSoup, we create a tree structure of the entire document where each tag represents a node. We can traverse this tree to find the children, siblings, parents etc. of specific nodes.


1] Accessing Children Tags

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

A tag’s children are available in a list called ".contents". A string does not have ".contents" because it can’t contain anything. Instead of getting them as a list, you can iterate over a tag’s children using the ".children" generator.

NOTE : .contents and .children only give us access to the direct children of a tag, not the children of those children.


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find an unordered list of drinks
drinks = page.find(id="drinks")

# Get all children of the list
drinks = drinks.contents

print(drinks)
# [ <li>Coffee</li>, <li>Tea</li>, <li>Milk</li> ]

#-------------using the generator-------------------

# Find an unordered list of drinks
drinks = page.find(id="drinks")

# Iterate through the list
for drink in drinks.children:
    print(drink)

# <li>Coffee</li>
# <li>Tea</li>
# <li>Milk</li>

If a tag has only one child, and that child is a string value, the child is made available with ".string".


from bs4 import BeautifulSoup

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find an unordered list of drinks
drinks = page.find(id="drinks")

# Get all children of the list
drinks = drinks.contents
# [ <li>Coffee</li>, <li>Tea</li>, <li>Milk</li> ]

# Get the string value of each list item
for drink in drinks:
    print(drink.string)

# Coffee
# Tea
# Milk


2] Accessing Parent Tags

You can access an element’s parent with the .parent attribute. For example, the <head> tag is the parent of the <title> tag.

NOTE : The parent of a top-level tag like <html> is the BeautifulSoup object itself. And the .parent of a BeautifulSoup object is defined as None.
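A quick check of that note on an inline snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head></head></html>", "html.parser")

# The parent of the top-level <html> tag is the BeautifulSoup object itself
print(soup.html.parent is soup)  # True

# The BeautifulSoup object itself has no parent
print(soup.parent)  # None
```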


from bs4 import BeautifulSoup

# <h1> Click here : <a id="link" href="www.tesla.com"> Click me 1</a> </h1>

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Find the <a> with the given id
a_tag = page.find(id="link")

# Get the parent of the <a>
h1_tag = a_tag.parent
print(h1_tag.contents)
# [' Click here : ', <a href="www.tesla.com" id="link"> Click me 1</a>, ' ']


3] Accessing Sibling Tags

In an HTML document, two tags which are on the same level and are direct children of the same parent tag are called siblings. We can access an HTML tag's siblings using ".next_sibling" and ".previous_sibling".

NOTE : If an element doesn't have a next or previous sibling, then the result is None.

NOTE : In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace; in such cases you'll need to chain ".next_sibling" multiple times.


from bs4 import BeautifulSoup

"""
<div id="page_links">
<a id="page1" href="www.example.com/1"> page 1 </a>
<a id="page2" href="www.example.com/2"> page 2 </a>
<a id="page3" href="www.example.com/3"> page 3 </a>
</div>
"""

# Open a local HTML file
with open("index.html") as file:
    page = BeautifulSoup(file, "html.parser")

# Get the element with the given id
p2 = page.find(id="page2")
print(p2) # <a href="www.example.com/2" id="page2"> page 2 </a>

# Get its next sibling (skipping the whitespace string in between)
p3 = p2.next_sibling.next_sibling
print(p3) # <a href="www.example.com/3" id="page3"> page 3 </a>

# Get its previous sibling
p1 = p2.previous_sibling.previous_sibling
print(p1) # <a href="www.example.com/1" id="page1"> page 1 </a>



----------------------------------------------------------------------------------------------------------------



