Article
· Jul 27, 2022

Introduction to Web Scraping with Embedded Python - Let’s Extract Python Jobs

What is Web Scraping:

In simple terms, web scraping, web harvesting, or web data extraction is an automated process of collecting large amounts of (unstructured) data from websites. The user can extract all the data on a particular site or only the specific data required. The collected data can be stored in a structured format for further analysis.

What is Web Scraping? — James Le

Steps involved in web scraping:

  1. Find the URL of the webpage that you want to scrape
  2. Select the particular elements by inspecting
  3. Write the code to get the content of the selected elements
  4. Store the data in the required format

It’s that simple!

The popular libraries/tools used for web scraping are:

  • Selenium – a framework for testing web applications
  • BeautifulSoup – Python library for getting data out of HTML, XML, and other markup languages
  • Pandas – Python library for data manipulation and analysis


What is Beautiful Soup?

Beautiful Soup is a pure Python library for extracting structured data from a website. It allows you to parse data from HTML and XML files. It acts as a helper module and lets you interact with HTML in a similar, and often better, way than you would when interacting with a web page using other available developer tools.

  • It usually saves programmers hours or days of work, since it works with your favorite parsers like lxml and html5lib to provide natural, Pythonic ways of navigating, searching, and modifying the parse tree.
  • Another powerful and useful feature of Beautiful Soup is that it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. As a developer, you do not have to take care of that unless the document doesn't specify an encoding and Beautiful Soup is unable to detect one.
  • It is also considered to be faster when compared to other general parsing or scraping techniques.
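To get a feel for the API before we wire it into IRIS, here is a minimal sketch using Embedded Python (the HTML fragment is made up for illustration):

set bs4 = ##class(%SYS.Python).Import("bs4")
// parse a small, hard-coded HTML fragment
set html = "<html><body><h1>Hello, IRIS</h1></body></html>"
set doc = bs4.BeautifulSoup(html, "html.parser")
// navigate the parse tree by tag name
write doc.h1.text  // prints: Hello, IRIS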

 

In today's article, we will use Embedded Python with ObjectScript to scrape Python vacancies and companies on ae.indeed.com.


Step 1: Find the URL of the webpage that you want to scrape.
 

URL = https://ae.indeed.com/jobs?q=python&l=Dubai&start=0

The webpage that we are going to scrape data from looks like this:

 

For simplicity and learning purposes, we will extract the "Job Title" and the "Company"; the output will look similar to the screenshot below.

 

  

We will be using two Python libraries:

  • requests – an HTTP library for the Python programming language. The goal of the project is to make HTTP requests simpler and more human-friendly.
  • bs4 (Beautiful Soup) – a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

Let's install these Python packages (Windows):

irispip install --target C:\InterSystems\IRISHealth\mgr\python bs4

irispip install --target C:\InterSystems\IRISHealth\mgr\python requests
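If the installation succeeded, you should be able to import the packages from an IRIS terminal (a quick sanity check; the target path above assumes a default IRISHealth install directory):

set bs4 = ##class(%SYS.Python).Import("bs4")
// module attributes containing underscores must be quoted in ObjectScript
write bs4."__name__"  // should print: bs4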

Let's import the Python libraries into ObjectScript:
 

Class PythonTesting.WebScraper Extends %Persistent
{

// pUrl = https://ae.indeed.com/jobs?q=python&l=Dubai&start=
// pPage = 0
ClassMethod ScrapeWebPage(pUrl, pPage)
{
    // import the requests python library
    set requests = ##class(%SYS.Python).Import("requests")
    // import the bs4 python library
    set soup = ##class(%SYS.Python).Import("bs4")
    // import the builtins package, which contains all of the python built-in identifiers
    set builtins = ##class(%SYS.Python).Import("builtins")
}

Let's collect the HTML data using requests.

Note: the user agent is taken from googling "my user agent".
The URL is "https://ae.indeed.com/jobs?q=python&l=Dubai&start=", and pPage is the page number.

We will do an HTTP GET request to the URL using requests and store the response in "req". The headers must reach requests as a Python keyword argument, so we build them as a Python dict and pass them using the args... syntax (the same pattern used with find_all below):

    // build the headers as a real python dict so requests can consume it
    set headers = builtins.dict()
    do headers."__setitem__"("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36")
    set url = "https://ae.indeed.com/jobs?q=python&l=Dubai&start="_pPage

    // pass the dict as the python keyword argument headers=...
    set args = {"headers": (headers)}
    set req = requests.get(url, args...)

The req object will contain the HTML that was returned from the webpage.
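Before parsing, it can be worth confirming that the request actually succeeded. This optional check is not part of the original walkthrough; note that status_code contains an underscore, so it must be quoted (see the identifier-names note further below):

    // optional: confirm the request succeeded before parsing
    if (req."status_code" '= 200) {
        write !,"Request failed with status ",req."status_code"
        quit
    }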

Let's run this through the BeautifulSoup html parser, so that we can extract the job data.

set soupData = soup.BeautifulSoup(req.content, "html.parser")
set title = soupData.title.text
W !,title

The title looks as follows

 


Step 2: Select the required elements by inspecting.

In this scenario, we are interested in the list of jobs, which usually sits in a <div> tag. In your browser, you can inspect the element to find the div class.


In our case, the required information is stored under <div class="cardOutline tapItem ... </div>

 


Step 3: Write the code to get the content of the selected elements

We will use the find_all function in BeautifulSoup to look for all the <div> tags that contain the class name "cardOutline".

// keyword arguments are passed to Python using a dynamic object and the args... syntax
set divClass = {"class":"cardOutline"}
set divsArr = soupData."find_all"("div", divClass...)

This will return a list, which we can loop through to extract the job titles and companies.



Step 4: Store/Display the data in the required format.

In the following example, we will write the data to the terminal.

set len = builtins.len(divsArr)

W !,"Job Title",$C(9)_" --- "_$C(9),"Company"
for i = 1:1:len {
    // python lists are zero-based, so fetch element i - 1
    set item = divsArr."__getitem__"(i - 1)
    // the job title is the text of the first <a> tag in the card
    set title = $ZSTRIP(item.find("a").text,"<>W")
    // the company name sits in a <span class="companyName"> element
    set companyClass = {"class_":"companyName"}
    set company = $ZSTRIP(item.find("span", companyClass...).text,"<>W")
    W !,title,$C(9)," --- ",$C(9),company
}

Note that we are using builtins.len() to get the length of the divsArr list, since ObjectScript cannot call Python's built-in len() directly.
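builtins gives ObjectScript access to any Python built-in, not just len(). A minimal sketch (the list contents are made up):

set builtins = ##class(%SYS.Python).Import("builtins")
set pyList = builtins.list()
do pyList.append("python")
do pyList.append("jobs")
write builtins.len(pyList)  // 2
do builtins.print(pyList)   // ['python', 'jobs']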

Identifier Names:
The rules for naming identifiers are different between ObjectScript and Python. For example, the underscore (_) is allowed in Python method names, and in fact is widely used for the so-called “dunder” methods and attributes (“dunder” is short for “double underscore”), such as __getitem__ or __class__. To use such identifiers from ObjectScript, enclose them in double quotes:

InterSystems Documentation on Identifier Names
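For example, using the divsArr list from Step 3:

// divsArr[0] in Python becomes a quoted "__getitem__" call in ObjectScript
set item = divsArr."__getitem__"(0)
// quoted names also work for attribute access
set cls = item."__class__"
write cls."__name__"  // e.g. Tag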

 

Example Class Method
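Putting the snippets above together, the complete class looks like this (assembled from the code shown in Steps 1-4):

Class PythonTesting.WebScraper Extends %Persistent
{

// pUrl = https://ae.indeed.com/jobs?q=python&l=Dubai&start=
// pPage = 0
ClassMethod ScrapeWebPage(pUrl, pPage)
{
    // import the python libraries
    set requests = ##class(%SYS.Python).Import("requests")
    set soup = ##class(%SYS.Python).Import("bs4")
    set builtins = ##class(%SYS.Python).Import("builtins")

    // build the headers as a python dict and fetch the page
    set headers = builtins.dict()
    do headers."__setitem__"("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36")
    set args = {"headers": (headers)}
    set req = requests.get(pUrl_pPage, args...)

    // parse the returned HTML
    set soupData = soup.BeautifulSoup(req.content, "html.parser")
    W !,soupData.title.text

    // find all the job cards
    set divClass = {"class":"cardOutline"}
    set divsArr = soupData."find_all"("div", divClass...)

    // print the job title and company for each card
    set len = builtins.len(divsArr)
    W !,"Job Title",$C(9)_" --- "_$C(9),"Company"
    for i = 1:1:len {
        set item = divsArr."__getitem__"(i - 1)
        set title = $ZSTRIP(item.find("a").text,"<>W")
        set companyClass = {"class_":"companyName"}
        set company = $ZSTRIP(item.find("span", companyClass...).text,"<>W")
        W !,title,$C(9)," --- ",$C(9),company
    }
}

}

You can then run it from an IRIS terminal:

do ##class(PythonTesting.WebScraper).ScrapeWebPage("https://ae.indeed.com/jobs?q=python&l=Dubai&start=", 0)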

 

Next Steps

Using ObjectScript and Embedded Python, with a few lines of code we can easily scrape data from our favorite job websites and collect the job name, company, salary, job description, and emails/links.

  • If there are multiple pages, you can traverse through them easily using the page parameter.
  • The data can be added to a pandas dataframe to remove duplicates, and filters can be applied based on specific keywords that you are interested in (see the sketch below).
  • Run this data through numpy and get some line plots.
  • Or perform one-hot encoding on the data, create/train your ML models, and if there are specific vacancies that you are interested in, send a notification to yourself. 😉
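As a rough sketch of the pandas idea (this assumes pandas has been installed with irispip like the packages above; the sample rows are made up):

set pd = ##class(%SYS.Python).Import("pandas")
set builtins = ##class(%SYS.Python).Import("builtins")

// build two parallel python lists of scraped values (sample data)
set titles = builtins.list()
do titles.append("Python Developer")
do titles.append("Python Developer")
set companies = builtins.list()
do companies.append("Acme FZ-LLC")
do companies.append("Acme FZ-LLC")

// assemble a python dict and load it into a DataFrame
set data = builtins.dict()
do data."__setitem__"("Job Title", titles)
do data."__setitem__"("Company", companies)
set df = pd.DataFrame(data)

// drop_duplicates contains an underscore, so it must be quoted
set df = df."drop_duplicates"()
do builtins.print(df)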

Happy Coding !!!

and don’t forget to hit the like button 😃

Discussion (12)

Cool stuff! For this age-old iKnow demo I used import.io to scrape a bunch of hotel reviews from the web, but it was pretty crude and server-based, so impossible to include in the demo repo. I've heard of beautifulsoup before, but never considered revisiting that data sourcing piece of my demo now that we can use it directly through Embedded Python.

Out of curiosity:
For my review analysis, I try to read the STARS in OpenExchange review pages.
The display is generated by Drupal, based on some frames running JS scripts in the browser, filled with data from a DB in the background that I have no access to.
Is there a chance of analyzing this dynamic content by using BeautifulSoup?
 

@Robert Cemper, I believe this is possible:
For example, this URL is for the Open Exchange ObjectScript Package Manager reviews:
https://openexchange.intersystems.com/package/ObjectScript-Package-Manag...


Each review is enclosed within <oex-review-card ... />.
Although I have not tested this, we could do a find_all to get the cards and extract the stars within them.
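An untested sketch, following the same pattern as the article (note: if the cards are rendered by JavaScript in the browser, they may not be present in the raw HTML that requests fetches, so a browser-driving tool like Selenium might be needed instead):

// look for the custom <oex-review-card> elements by tag name
set cards = soupData."find_all"("oex-review-card")
write builtins.len(cards)  // number of review cards found (0 if rendered client-side)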

What a great demo! Thank you for writing it up - I hope to be able to experiment with this at some point. Years ago I wrote my own web scraper in ObjectScript to watch the classified section of my local newspaper for cars going up for sale, so I could find something undervalued and jump on it quickly - I purchased my favorite used car that way, thanks to my ObjectScript web scraper :) But this library looks like a much easier approach ;)