Scraper payslips with Python | Selenium
Scenario:
I work in a company and my paylips are downloadable in an aspx portal. One by one, not in block.
I needed them all, for burocaracy reasons and in order to archive them.
How: py, selenium. I tried with beutifulsoup but didn't work.
Explenation and Code
Web Driver
I used webdriver Chrome with some options in order to save the pdf-files when browser opens it. Look pref in code below.
Creating a class
class PaylipsScaper: # Init def __init__(self, username, password): self.username = username self.password = password # Options chrome_options = webdriver.ChromeOptions() prefs = { "plugins.always_open_pdf_externally": True, "download.default_directory": "C:\\tmp", # folder save files "download.prompt_for_download": False, "download.directory_upgrade": True, "safebrowsing.enabled": True } chrome_options.add_experimental_option("prefs",prefs) chrome_options.headless = True # If True hide browser self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)
Login
The login phase is quite easy, just select by id the area and insert the value. I just kept attention to iframe, because in that case you have to use switch_to.frame before.
# Manage login page def login(self, url): driver = self.driver driver.get(url) driver.switch_to.frame("FunArea") username = driver.find_element_by_id("login") password = driver.find_element_by_id("pwd") username.send_keys(self.username) time.sleep(1) password.send_keys(self.password) driver.find_element_by_id("CmdInvia").click()
Loop table using XPATH
I create a class that wrap the selenium driver in order to keep all cleans.
I just reproduce the clicks done by myself.
At the beginning I tried with CSS selector but for the structure of the pages was a better solution use XPATH
(to get the XPATH in Chrome see here)
By the way I don't like the time.sleep but was usefull to avoid navigations problems during the process.
# Inside PaylipsScaper classdef get_num_rows(self, num_rows = 1): driver = self.driver self.click_to_payslips_area() num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr")) return num_rows[...other stuff...]try: bot = PaylipsScaper(username, password) bot.login(url_website) wait = WebDriverWait(bot.driver, 10) num_rows = bot.get_num_rows() for row in range(1,num_rows+1): paylip_year = bot.get_val_in_cedolino_row(row, 4) paylip_month = bot.get_val_in_cedolino_row(row, 5) paylip_type = bot.get_val_in_cedolino_row(row, 7) bot.driver.execute_script("arguments[0].click();", WebDriverWait(bot.driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr["+str(row)+"]/td[10]/img")))) time.sleep(2) filepdf= dirpath + "\\*.pdf" list_of_files = glob.glob(filepdf) file_name = max(list_of_files, key=os.path.getctime) current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name) bot.rename_and_move (current_paylip) print("Downloaded:") print(current_paylip)
Save pdf file in folder and rename it
It's quite a brute solution anyway I get the last pdf saved in a folder and rename it with the informations from the website.
Then I move the file in sub-folders by year.
def rename_and_move(self, urrent_paylip): if current_paylip.paylip_month == "" : new_file_name='Cud_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")+'.pdf' elif "TREDICESIMA" in current_paylip.paylip_type: new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'_Tredicesima.pdf' else: new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'.pdf' print(new_file_name) new_file_name = os.path.join(dirpath, new_file_name) # Rename file and move it in the year-directory os.rename(current_paylip.file_name, new_file_name) current_paylip.file_name = new_file_name # Check if path with year directory exist otherwise create it dirin=os.path.split(new_file_name) newdir=dirin[0]+'\\'+current_paylip.paylip_year if os.path.exists(newdir)==False: # Create directory os.mkdir(newdir) # Move file in the year-directory if os.path.exists(newdir+"\\"+dirin[1]): # If file already exist, delete it os.remove(newdir+"\\"+dirin[1]) shutil.move (current_paylip.file_name,newdir+"\\"+dirin[1]) return
Final situation
Got it. I have a folder with subfolders by year and in each one all the paylips with a standard name format.
What I learned:
- Use of Selenium in py.
- Simple automation can save a lot of time and avoid manual boring tasks.
- How to write my first article here.(it's a personal task so not so useful for you but better than nothing after all)
Future improvements:
- input parameters
- (re)try to use css selector instead of xpath selector
- (re)try to use BeautifulSoup
- save last paylips saved in order, next run, to save only the not already saved paylips
- read pdf and report data in file(eg google sheets)
Of course the code are useful just for me and my colleagues. Anyway I hope that the idea and process can be a good idea to someone else.
Original Link: https://dev.to/alanstocco/scraper-payslips-with-python-selenium-1h7
Dev To
An online community for sharing and discovering great ideas, having debates, and making friendsMore About this Source Visit Dev To