I tried Python for the first time a couple of weeks ago. I can’t say it was love at first line of code, but after I replaced PyCharm’s standard Darcula theme with Atom One Dark, and after I tried to write a web scraper, I started enjoying it quite a lot.
So I wrote a scraper that grabs the product data (product name, description, images, price) from one e-commerce website and saves it into a CSV file. If the website owner could hear me, I would say sorry and hope I did not cause any issues for their site.
I think the code speaks better than any words, so here is my code (the URL is a dummy, just in case :) ):
There are four functions (a stripped-down sketch follows the list):
get_soup - sends a request to the given URL and returns the parsed HTML response
get_category_urls - returns a list of category URLs from the homepage’s menu
get_product_urls - returns a list of product URLs for the given category, taking pagination on the category pages into account
get_product_details - returns a product dictionary
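To make the list concrete, here is roughly what those functions look like. The base URL, the CSS selectors and the pagination scheme below are placeholders (the real ones depend on the site’s markup), so treat it as an illustration rather than the exact script:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example-shop.com"  # placeholder, not the real site


def get_soup(url):
    # Send a GET request and parse the HTML response.
    response = requests.get(url)
    return BeautifulSoup(response.text, "html.parser")


def get_category_urls(homepage_url):
    # Collect category links from the homepage menu.
    # "nav.menu a" is a placeholder selector.
    soup = get_soup(homepage_url)
    return [a["href"] for a in soup.select("nav.menu a")]


def get_product_urls(category_url):
    # Walk through a category page by page and collect product links.
    # The "?page=N" pagination scheme is an assumption.
    product_urls = []
    page = 1
    while True:
        soup = get_soup(f"{category_url}?page={page}")
        links = [a["href"] for a in soup.select("a.product-link")]
        if not links:
            break
        product_urls.extend(links)
        page += 1
    return product_urls


def get_product_details(product_url):
    # Return one product as a dictionary (selectors are placeholders).
    soup = get_soup(product_url)
    return {
        "url": product_url,
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "description": soup.select_one("div.description").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
        "images": [img["src"] for img in soup.select("div.gallery img")],
    }
```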
I tried multiprocessing in order to speed up the scraping. I still can’t quite understand the difference between multiprocessing and threading, and I don’t really understand its syntax either (I found the snippet somewhere on the internet).
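From what I’ve read, the gist is that multiprocessing runs separate Python processes (each with its own interpreter and memory), while threading runs several threads inside one process; for scraping, which mostly waits on network responses, either approach can speed things up. The multiprocessing part boils down to something like this (assuming the functions from the sketch above, with a placeholder category URL):

```python
from multiprocessing import Pool

if __name__ == "__main__":  # required so worker processes don't re-run the script body
    product_urls = get_product_urls("https://www.example-shop.com/some-category")

    # Pool starts separate worker processes; map() splits the URL list
    # between them and collects the returned dictionaries in order.
    with Pool(processes=8) as pool:
        products = pool.map(get_product_details, product_urls)
```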
The most surprising thing for me was the Pandas drop_duplicates method - without Pandas I might have had to spend ages figuring out how to remove duplicates from a list of dictionaries (when the value of one specific key is duplicated).
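For example, if the same product shows up under several categories, one line is enough to de-duplicate the list by a chosen key (the "url" key and the sample data here are just for illustration):

```python
import pandas as pd

products = [
    {"url": "/product-1", "name": "Mug", "price": "9.99"},
    {"url": "/product-1", "name": "Mug", "price": "9.99"},  # duplicate
    {"url": "/product-2", "name": "Plate", "price": "4.50"},
]

df = pd.DataFrame(products)
# Keep the first row for each unique value of the chosen key.
df = df.drop_duplicates(subset="url")
df.to_csv("products.csv", index=False)
```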
I am thinking of improving the code further, e.g. in get_soup I should probably check whether the response status code is 200. If you have any suggestions for improvements, please write them in the comments, it would be greatly appreciated. That’s it for now, this was my first Python code and hopefully not the last.
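P.S. For the record, that status-code check could be as simple as this (just a sketch of the idea):

```python
import requests
from bs4 import BeautifulSoup


def get_soup(url):
    response = requests.get(url)
    # Raise an exception for 4xx/5xx responses instead of silently
    # parsing an error page; checking response.status_code == 200 works too.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```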