Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
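As a rough illustration of what "sits atop a parser" means, the sketch below passes the parser name explicitly and then navigates the resulting tree. It assumes Beautiful Soup is already installed and uses Python's bundled `html.parser` purely as an example choice; other parsers may build a slightly different tree from the same markup. The variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: none of the tags are closed.
markup = "<p>Some<b>bad<i>HTML"

# The second argument names the underlying parser. "html.parser" ships with
# Python, so this sketch does not depend on any extra parser being installed.
soup = BeautifulSoup(markup, "html.parser")

# The parser builds the tree; Beautiful Soup provides the navigation idioms.
innermost = soup.i.get_text()                     # text inside the <i> tag
parent_names = [p.name for p in soup.i.parents]   # ancestors, innermost first

print(innermost)
print(parent_names)
```

Naming the parser keeps the result from depending on whichever parser happens to be available in the environment.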
~/> cd beautifulsoup
# python=3.10 is required
:~/ cd beautifulsoup
beautifulsoup:~/ conda create -n bs4 python=3.10
beautifulsoup:~/ conda activate bs4
# python=3.10 is required
:~/ cd beautifulsoup
beautifulsoup:~/ python3 -m venv venv
beautifulsoup:~/ source venv/bin/activate # On macOS/Linux
beautifulsoup:~/ pip install -e .
beautifulsoup:~/ pip install pytest
# If you are using the conda env
cd beautifulsoup
beautifulsoup:~/ pytest
# If you are using the venv env
cd beautifulsoup/bs4/tests
beautifulsoup/bs4/tests:~/ pytest
- Install GitHub CLI and configure it with your GitHub account.
- See the GitHub guide "Adding locally hosted code to GitHub" for background.
- The following approach uses GitHub CLI.
:~/ cd beautifulsoup
beautifulsoup:~/ git init
beautifulsoup:~/ git add .
beautifulsoup:~/ git commit -m "Initialize Repository"
beautifulsoup:~/ gh repo create beautifulsoup --source=. --public --push
- In your GitHub web account, create a repository named beautifulsoup.
- It will generate a repository link: https://github.com/YOUR-USERNAME/beautiful.git
- Now open a terminal inside the downloaded beautifulsoup folder and set up the repository.
- Replace YOUR-USERNAME with your GitHub username.
:~/ cd beautifulsoup
beautifulsoup:~/ git init
beautifulsoup:~/ git add .
beautifulsoup:~/ git commit -m "repository initialization"
beautifulsoup:~/ git branch -M main
beautifulsoup:~/ git remote add origin https://github.com/YOUR-USERNAME/beautiful.git
beautifulsoup:~/ git push -u origin main
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(string="bad")
'bad'
>>> soup.i
<i>HTML</i>
#
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
#
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>
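The session above covers parsing and simple lookups. Searching by attribute, modifying a tag in place, and iterating over matches might look roughly like the following sketch. The markup and variable names are made up for illustration, and `html.parser` is assumed only so that no additional parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this sketch.
html = '<div><a href="/a">first</a><a href="/b">second</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Searching: find the first <a> tag whose href attribute equals "/b".
link = soup.find("a", href="/b")

# Modifying: replace the tag's text and update one of its attributes in place.
link.string = "renamed"
link["href"] = "/c"

# Iterating: walk every <a> tag and collect (href, text) pairs.
pairs = [(a["href"], a.get_text()) for a in soup.find_all("a")]

print(pairs)
print(soup)
```

Because the tree is modified in place, printing `soup` afterwards reflects the updated link.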
To go beyond the basics, comprehensive documentation is available.