February 4, 2021 01:27 pm GMT

Scraper payslips with Python | Selenium

Scenario:

I work at a company whose payslips can be downloaded from an ASPX portal, but only one at a time, not in bulk.
I needed all of them, both for bureaucratic reasons and to archive them.

How: Python and Selenium. I tried BeautifulSoup first, but it didn't work.

Explanation and Code

Web Driver

I used the Chrome webdriver with some options so that PDF files are saved to disk instead of being opened in the browser. See the prefs in the code below.

Creating a class

from selenium import webdriver

class PaylipsScaper:
    # Init
    def __init__(self, username, password):
        self.username = username
        self.password = password
        # Options
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "plugins.always_open_pdf_externally": True,  # download PDFs instead of opening them
            "download.default_directory": "C:\\tmp",  # folder where files are saved
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
        }
        chrome_options.add_experimental_option("prefs", prefs)
        chrome_options.headless = True  # If True, hide the browser
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)
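As a rough sketch of how the download options could be made reusable, the prefs dictionary can be built by a small helper that takes the download folder as a parameter. The helper name `build_download_prefs` is an assumption made for illustration and is not part of the original script; the preference keys are the ones used in the class above.

```python
from pathlib import Path


def build_download_prefs(download_dir):
    """Return Chrome prefs that save PDFs to download_dir without prompting.

    Hypothetical helper: the keys are copied from the class above,
    only the folder is turned into a parameter.
    """
    return {
        # Download PDFs instead of opening them in the built-in viewer
        "plugins.always_open_pdf_externally": True,
        # Folder where the files are saved
        "download.default_directory": str(Path(download_dir)),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    }


if __name__ == "__main__":
    prefs = build_download_prefs("C:\\tmp")
    print(prefs["download.default_directory"])
```

The resulting dictionary would then be passed to `chrome_options.add_experimental_option("prefs", prefs)`, exactly as in the class.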

Login

The login phase is quite easy: select each field by id and insert the value. The only thing to watch out for is the iframe: the form lives inside one, so you have to call switch_to.frame first.

    # Manage login page (needs `import time` at the top of the file)
    def login(self, url):
        driver = self.driver
        driver.get(url)
        # The login form is inside an iframe, so switch to it first
        driver.switch_to.frame("FunArea")
        username = driver.find_element_by_id("login")
        password = driver.find_element_by_id("pwd")
        username.send_keys(self.username)
        time.sleep(1)
        password.send_keys(self.password)
        driver.find_element_by_id("CmdInvia").click()

Loop table using XPATH

I created a class that wraps the Selenium driver in order to keep everything clean.
The code just reproduces the clicks I would do by hand.
At first I tried CSS selectors, but given the structure of the pages XPath was a better solution
(to get the XPath in Chrome see here).
By the way, I don't like time.sleep, but it was useful to avoid navigation problems during the process.

# Imports used by the code below (at the top of the file)
import glob
import os
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Inside PaylipsScaper class
    def get_num_rows(self, num_rows = 1):
        driver = self.driver
        self.click_to_payslips_area()
        num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))
        return num_rows

[...other stuff...]

try:
    bot = PaylipsScaper(username, password)
    bot.login(url_website)
    wait = WebDriverWait(bot.driver, 10)
    num_rows = bot.get_num_rows()
    for row in range(1, num_rows+1):
        paylip_year  = bot.get_val_in_cedolino_row(row, 4)
        paylip_month = bot.get_val_in_cedolino_row(row, 5)
        paylip_type  = bot.get_val_in_cedolino_row(row, 7)
        # Click the download icon in column 10 via JavaScript once it is clickable
        bot.driver.execute_script("arguments[0].click();", WebDriverWait(bot.driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr["+str(row)+"]/td[10]/img"))))
        time.sleep(2)
        # The newest PDF in the download folder is the one just downloaded
        filepdf = dirpath + "\\*.pdf"
        list_of_files = glob.glob(filepdf)
        file_name = max(list_of_files, key=os.path.getctime)
        current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
        bot.rename_and_move(current_paylip)
        print("Downloaded:")
        print(current_paylip)
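One possible way to drop the fixed `time.sleep(2)` is to poll the download folder until a PDF appears that was not there before the click. The sketch below uses only the standard library; the function names `snapshot_pdfs` and `wait_for_new_pdf`, the timeout and the polling interval are assumptions for illustration, not part of the original script.

```python
import time
from pathlib import Path


def snapshot_pdfs(directory):
    """Return the set of PDF file names currently in `directory`."""
    return {p.name for p in Path(directory).glob("*.pdf")}


def wait_for_new_pdf(directory, known_files, timeout=20.0, poll=0.5):
    """Poll `directory` until a PDF that is not in `known_files` appears.

    Returns the Path of the new file, or raises TimeoutError.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        # Chrome normally writes to a temporary *.crdownload file first,
        # so a *.pdf name should only show up once the download is done.
        new_files = [p for p in directory.glob("*.pdf")
                     if p.name not in known_files]
        if new_files:
            # If several new files appeared, take the most recent one
            return max(new_files, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new PDF in {directory} after {timeout} s")
        time.sleep(poll)
```

In the loop, `before = snapshot_pdfs(dirpath)` would be taken just before the click, and `file_name = wait_for_new_pdf(dirpath, before)` would replace the sleep and the `glob` lookup. Unlike picking the newest file by creation time, this cannot return a PDF left over from an earlier run.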

Save pdf file in folder and rename it

It's quite a brute-force solution: I take the most recently saved PDF in the download folder and rename it using the information from the website.
Then I move the file into a subfolder for its year.

    # Inside PaylipsScaper class (needs `import os` and `import shutil` at the top of the file)
    def rename_and_move(self, current_paylip):
        if current_paylip.paylip_month == "":
            # Entries without a month get the Cud_ prefix
            new_file_name = 'Cud_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")+'.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'.pdf'
        print(new_file_name)
        new_file_name = os.path.join(dirpath, new_file_name)
        # Rename the file in the download folder
        os.rename(current_paylip.file_name, new_file_name)
        current_paylip.file_name = new_file_name
        # Check if the year directory exists, otherwise create it
        dirin = os.path.split(new_file_name)
        newdir = dirin[0]+'\\'+current_paylip.paylip_year
        if not os.path.exists(newdir):
            # Create directory
            os.mkdir(newdir)
        # Move the file into the year directory
        if os.path.exists(newdir+"\\"+dirin[1]):
            # If the file already exists, delete it
            os.remove(newdir+"\\"+dirin[1])
        shutil.move(current_paylip.file_name, newdir+"\\"+dirin[1])
        return
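The same rename-and-move logic could also be split into two small functions built on `pathlib`, which avoids hard-coded `\\` separators. This is only a sketch under assumptions: the function names `build_file_name` and `move_to_year_folder` are hypothetical, and the naming rules are copied from the method above.

```python
import shutil
from pathlib import Path


def build_file_name(year, month, payslip_type):
    """Build the target file name with the same rules as rename_and_move."""
    if month == "":
        cleaned = (payslip_type.replace(" ", "_")
                   .replace("Completo", "")
                   .replace("NORMALE", ""))
        return f"Cud_{year}_{cleaned}.pdf"
    if "TREDICESIMA" in payslip_type:
        return f"Cedolino_{year}_{month}_Tredicesima.pdf"
    return f"Cedolino_{year}_{month}.pdf"


def move_to_year_folder(source, base_dir, year, new_name):
    """Move `source` to base_dir/<year>/<new_name>, replacing an existing file."""
    target_dir = Path(base_dir) / str(year)
    # Create the year directory if it does not exist yet
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_name
    if target.exists():
        # Same behaviour as the original: an existing file is deleted first
        target.unlink()
    shutil.move(str(source), str(target))
    return target
```

Keeping the name-building step free of file system access makes it easy to check the naming rules on their own before touching any downloaded file.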

Final situation

Got it. I now have a folder with one subfolder per year, each containing all the payslips with a standard name format.

What I learned:

  • How to use Selenium in Python.
  • Simple automation can save a lot of time and avoid boring manual tasks.
  • How to write my first article here (it's a personal task, so not so useful for you, but better than nothing after all).

Future improvements:

  • input parameters
  • (re)try to use CSS selectors instead of XPath selectors
  • (re)try to use BeautifulSoup
  • keep track of the payslips already saved so that the next run downloads only the new ones
  • read the PDFs and report the data in a file (e.g. Google Sheets)
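
The improvement about saving only the payslips not already saved could be sketched with a small JSON state file. Everything here is an assumption for illustration: the function names, the key format and the state file are hypothetical and do not exist in the original script.

```python
import json
from pathlib import Path


def payslip_key(year, month, payslip_type):
    """Build a key that identifies one row of the payslip table."""
    return f"{year}|{month}|{payslip_type}"


def load_saved(state_file):
    """Return the set of payslip keys recorded by previous runs."""
    path = Path(state_file)
    if not path.exists():
        # First run: nothing has been saved yet
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def record_saved(state_file, saved):
    """Write the set of payslip keys back to disk as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(saved)), encoding="utf-8")
```

In the download loop, a row would be skipped when its `payslip_key(...)` is already in the set returned by `load_saved`, and the key would be added and written back with `record_saved` after each successful download.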

Of course the code is useful only for me and my colleagues. Anyway, I hope the idea and the process can be useful to someone else.


Original Link: https://dev.to/alanstocco/scraper-payslips-with-python-selenium-1h7
