February 4, 2021 01:27 pm GMT

Scraper payslips with Python | Selenium

Scenario:

I work at a company whose payslips can be downloaded from an ASPX portal, but only one at a time, not in bulk.
I needed all of them, both for bureaucratic reasons and to archive them.

How: Python and Selenium. I tried BeautifulSoup first, but it didn't work.

Explanation and Code

Web Driver

I used the Chrome WebDriver with a few options so that PDF files are saved to disk instead of being opened in the browser. See the prefs dictionary in the code below.

Creating a class

from selenium import webdriver


class PaylipsScaper:
    # Init
    def __init__(self, username, password):
        self.username = username
        self.password = password
        # Options
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "plugins.always_open_pdf_externally": True,  # download PDFs instead of opening the viewer
            "download.default_directory": "C:\\tmp",  # folder where files are saved
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True,
        }
        chrome_options.add_experimental_option("prefs", prefs)
        chrome_options.headless = True  # if True, hide the browser
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)


Login

The login phase is quite easy: select the fields by id and fill in the values. The only thing to watch out for is the iframe: the form sits inside one, so you have to call switch_to.frame first.

    # Manage login page (requires "import time" at the top of the module)
    def login(self, url):
        driver = self.driver
        driver.get(url)
        # The login form sits inside an iframe, so switch to it first
        driver.switch_to.frame("FunArea")
        username = driver.find_element_by_id("login")
        password = driver.find_element_by_id("pwd")
        username.send_keys(self.username)
        time.sleep(1)
        password.send_keys(self.password)
        driver.find_element_by_id("CmdInvia").click()


Loop table using XPATH

I created a class that wraps the Selenium driver to keep everything clean.
The code simply reproduces the clicks I would make by hand.
At first I tried CSS selectors, but given the structure of the pages, XPath turned out to be the better solution
(in Chrome DevTools you can get an element's XPath by right-clicking it in the Elements panel and choosing Copy > Copy XPath).
I don't like time.sleep, but it was useful for avoiding navigation problems during the process.

# Inside PaylipsScaper class
    def get_num_rows(self, num_rows = 1):
        driver = self.driver
        self.click_to_payslips_area()
        # Count the rows of the payslip table
        num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))
        return num_rows

[...other stuff...]

try:
    bot = PaylipsScaper(username, password)
    bot.login(url_website)
    wait = WebDriverWait(bot.driver, 10)
    num_rows = bot.get_num_rows()
    for row in range(1, num_rows+1):
        paylip_year  = bot.get_val_in_cedolino_row(row, 4)
        paylip_month = bot.get_val_in_cedolino_row(row, 5)
        paylip_type  = bot.get_val_in_cedolino_row(row, 7)
        # Wait until the download icon of this row is clickable, then click it via JavaScript
        icon = WebDriverWait(bot.driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr["+str(row)+"]/td[10]/img")))
        bot.driver.execute_script("arguments[0].click();", icon)
        time.sleep(2)
        # The most recently created PDF in the download folder is the one just downloaded
        filepdf = dirpath + "\\*.pdf"
        list_of_files = glob.glob(filepdf)
        file_name = max(list_of_files, key=os.path.getctime)
        current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
        bot.rename_and_move(current_paylip)
        print("Downloaded:")
        print(current_paylip)
except Exception as exc:
    # The original excerpt ends inside the try block; this handler is an assumption
    print("Download failed:", exc)
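The loop above builds a Paylip object and prints it, but the class itself is among the parts not shown. As a minimal sketch of what such a container could look like: the four field names are taken from how rename_and_move uses them, while the dataclass form and the __str__ format are my assumptions, not the author's code.

```python
from dataclasses import dataclass


@dataclass
class Paylip:
    """One row of the payslip table plus the path of its downloaded PDF."""

    paylip_year: str
    paylip_month: str  # empty string for yearly documents
    paylip_type: str
    file_name: str  # updated after the file is renamed

    def __str__(self) -> str:
        # Hypothetical summary format, only used for progress output
        if self.paylip_month:
            period = f"{self.paylip_year}-{self.paylip_month}"
        else:
            period = self.paylip_year
        return f"{period} {self.paylip_type} -> {self.file_name}"
```

The dataclass is left mutable on purpose, because the rename step assigns a new value to file_name after moving the file.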


Save pdf file in folder and rename it

It's a rather brute-force solution: I take the most recently saved PDF in the download folder and rename it using the information from the website.
Then I move the file into a subfolder for its year.

    def rename_and_move(self, current_paylip):
        # Build the new file name from the values read on the website
        if current_paylip.paylip_month == "":
            new_file_name = 'Cud_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")+'.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'.pdf'
        print(new_file_name)
        new_file_name = os.path.join(dirpath, new_file_name)
        # Rename the file in the download folder
        os.rename(current_paylip.file_name, new_file_name)
        current_paylip.file_name = new_file_name
        # Check whether the year directory exists, otherwise create it
        dirin = os.path.split(new_file_name)
        newdir = dirin[0]+'\\'+str(current_paylip.paylip_year)
        if not os.path.exists(newdir):
            # Create directory
            os.mkdir(newdir)
        # Move the file into the year directory
        if os.path.exists(newdir+"\\"+dirin[1]):
            # If the file already exists, delete it
            os.remove(newdir+"\\"+dirin[1])
        shutil.move(current_paylip.file_name, newdir+"\\"+dirin[1])
        return
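Picking the newest PDF after a fixed sleep can grab the wrong file if a download takes longer than expected. One possible alternative, sketched here under my own assumptions, is to remember which PDFs were in the folder before the click and then poll until a new one appears and Chrome's temporary .crdownload file is gone. The function names, the timeout and the polling interval are all hypothetical.

```python
import time
from pathlib import Path


def snapshot_pdfs(folder):
    """Names of the PDFs currently in `folder`, to compare against later."""
    return {p.name for p in Path(folder).glob("*.pdf")}


def wait_for_new_pdf(folder, known_files, timeout=30.0, poll=0.5):
    """Return the path of a PDF that was not in `known_files`.

    Raises TimeoutError if no finished download shows up in time.
    """
    folder = Path(folder)
    deadline = time.monotonic() + timeout
    while True:
        # Chrome writes to a *.crdownload file until the download completes
        in_progress = any(folder.glob("*.crdownload"))
        new_pdfs = [p for p in folder.glob("*.pdf") if p.name not in known_files]
        if new_pdfs and not in_progress:
            # If several appeared, take the most recently modified one
            return max(new_pdfs, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new PDF in {folder} after {timeout} s")
        time.sleep(poll)
```

In the download loop this would mean calling snapshot_pdfs(dirpath) just before clicking the icon and wait_for_new_pdf(dirpath, known) right after, in place of the fixed sleep and the glob lookup.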


Final situation

Got it. I have a folder with one subfolder per year, each containing all of that year's payslips in a standard name format.

What I learned:

How to use Selenium in Python.
Simple automation can save a lot of time and spare you boring manual tasks.
How to write my first article here. (It's a personal task, so not so useful for you, but better than nothing after all.)

Future improvements:

Accept input parameters
(Re)try CSS selectors instead of XPath selectors
(Re)try BeautifulSoup
Remember which payslips were already saved, so that the next run downloads only the new ones
Read the PDFs and export the data to a file (e.g. Google Sheets)
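For the idea of downloading only the payslips that are not saved yet, one simple approach could be a small JSON file listing the rows already handled. This is only a sketch of that idea: the helper names, the key format and the state file are my assumptions and are not part of the current scraper.

```python
import json
from pathlib import Path


def payslip_key(year, month, kind):
    """Build a key identifying one table row; the format is an arbitrary choice."""
    return f"{year}|{month}|{kind}"


def load_seen(state_file):
    """Return the set of keys recorded by earlier runs (empty on the first run)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_seen(state_file, seen):
    """Persist the keys as a sorted JSON list so the file stays stable between runs."""
    Path(state_file).write_text(json.dumps(sorted(seen), indent=2), encoding="utf-8")
```

The download loop could then skip any row whose key is already in the loaded set, add the key after a successful download, and call save_seen once at the end.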

Of course, the code is useful only to me and my colleagues. Still, I hope the idea and the process can be helpful to someone else.

Original Link: https://dev.to/alanstocco/scraper-payslips-with-python-selenium-1h7
