Scrape the Best-Selling Products on Amazon Using Python and Beautiful Soup


In this article, we're going to learn how to scrape Amazon bestselling product details using Python, requests, and Beautiful Soup. Before diving into the details, we must first understand what web scraping is. Web scraping is a method of extracting large amounts of information from web pages so that it can be stored in a specific format.

Why web scraping?

Imagine you are trying to gather information about something from various web pages and articles, and that information needs to be stored in a suitable format, for instance, an Excel file. One way is to go through all those websites and copy the useful information into the spreadsheet manually. But programmers tend to do it the easy way, which is web scraping. Web scraping is useful whenever you want to extract a large amount of information within a short period of time. It is used to extract data for data analysis, building machine learning models, price monitoring, news tracking, and more.

Scraping Amazon bestsellers

Amazon is one of the largest online business enterprises, selling millions of varieties of goods all over the world through its platform, Amazon.com. It offers many categories of products, including Fashion, Books, Electronics, Toys, Jewellery, and more. The most popular products are listed in Amazon's bestseller category, which helps sellers find the best-selling products and helps customers find quality products to buy. Here we are going to scrape those bestselling products in different categories using Python and Beautiful Soup.

Steps involved:
  • Install and import the required libraries
  • Parse a specific bestseller category page using requests and Beautiful Soup
  • Fetch information about each item, including name, price, reviews, rating, and URL
  • Store all the information in a CSV file using pandas

Installing and importing libraries

Install libraries directly using pip:

$ pip install requests lxml bs4 pandas

The requests library sends HTTP requests from Python, lxml is the parser that converts the page into an HTML/XML tree, the bs4 module provides Beautiful Soup for navigating and searching that tree, and pandas stores the data and writes it to a CSV file.

Importing libraries

import requests 
from bs4 import BeautifulSoup
import pandas as pd

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

We are sending a User-Agent header string so the server can identify the application, operating system, and browser version of the client making the request.
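As a quick sanity check (the URL below is just an example; any bestseller category page can be substituted), you can send a request with these headers and confirm the server returns HTTP 200 before parsing anything:

# Example bestseller category URL (illustrative, not from the article)
resp = requests.get("https://www.amazon.in/gp/bestsellers/books/", headers=HEADERS)
print(resp.status_code)  # 200 means the page was fetched successfully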

Now we are going to define a function that takes a URL as a parameter and scrapes the page at that URL.

def scrape_product_details(url):
    # Fetch the page and parse it with the lxml parser
    resp = requests.get(url, headers=HEADERS)
    content = BeautifulSoup(resp.content, 'lxml')
    item_details = []
    # Each bestseller entry lives in a div with the class 'zg-item-immersion'
    for item in content.select('.zg-item-immersion'):
        try:
            data = {
                "item": item.select('.p13n-sc-truncate')[0].get_text().strip(),
                "price": item.select('.p13n-sc-price')[0].get_text().strip(),
                "rating": item.select('.a-icon-row i')[0].get_text().strip(),
                'reviews': item.select('.a-icon-row a')[1].get_text().strip(),
                'url': "https://www.amazon.in/" + item.select('.a-icon-row a')[1]['href']
            }
        except IndexError:
            # Skip items that are missing one of the fields above
            continue
        item_details.append(data)
    # Collect the per-item dictionaries into a single DataFrame
    dataframe = pd.DataFrame(item_details)
    return dataframe

Here, requests.get() returns a response containing the entire web page. Next, we parse the received data into an HTML tree using Beautiful Soup with the lxml parser. Now let's use CSS selectors to fetch the data we want.
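If CSS selectors are new to you, here is a minimal sketch of how Beautiful Soup's select() works, using a toy snippet rather than Amazon's real markup:

# Toy HTML, not Amazon's markup: select() returns a list of matching tags
html = '<div class="item"><span class="p13n-sc-price">$10</span></div>'
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.p13n-sc-price')[0].get_text())  # prints: $10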


Inspect the page (for example, with your browser's developer tools) to find the CSS selectors we need: the div tag that wraps each item, plus the item name, price, reviews, rating, and URL. The selectors used in the code match the page at the time of writing, but Amazon may change its markup, so verify them before running the scraper.
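One quick way to check whether the selectors still match (assuming content is the parsed page from inside the function above) is to count how many items the outer selector finds; zero matches usually means the markup has changed:

# If this prints 0, Amazon's markup has likely changed and the
# selectors in scrape_product_details need to be updated.
print(len(content.select('.zg-item-immersion')))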

After getting all the required selectors, we loop through each product on the web page to find its name, price, reviews, and so on, and store those details in a Python dictionary named data. Each dictionary of product details is appended to a list named item_details. Finally, item_details is converted into a pandas DataFrame and returned.

Now let's call the function with a bestseller products URL as the parameter:
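The category URL below is only an example; any Amazon bestseller category page should work the same way:

# Example call (illustrative URL; substitute any bestseller category page)
df = scrape_product_details("https://www.amazon.in/gp/bestsellers/books/")
print(df.head())  # preview the first few scraped products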


You'll get the products listed with their names, price, rating, reviews, and URL as a pandas DataFrame. Now you can simply save it as a CSV file using df.to_csv('amazon_scrape.csv').
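If you don't want the numeric row index written into the file, pass index=False:

# index=False keeps the pandas row index out of the CSV file
df.to_csv('amazon_scrape.csv', index=False)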