SageMaker Data Ingestion using Kaggle

Recently, I have been focusing on learning AI/ML. After overcoming many roadblocks and mistakes, I can now confidently share a successful solution.

I intend to explain one specific foundation of ML: data ingestion. I will demonstrate how you can import a Kaggle dataset into a SageMaker Studio notebook. Amazon SageMaker is a cloud machine-learning service that enables developers to rapidly create, train, and deploy ML models. Kaggle is an online community platform that hosts numerous datasets and ML challenges for data scientists and machine learning enthusiasts. You can see how the two tools complement each other when integrated properly.

Prerequisites

  • A SageMaker Studio notebook. If you don't have one already, you can follow this guide to create one.
  • A Kaggle account. You can register for one here.

Let's build!!

First, we will import the Python packages that will be used in the notebook.

Import Packages

import pandas as pd
import time

Install Kaggle CLI

!pip install -q kaggle

To use the Kaggle API, you must have an account and an API token. You can follow this guide to generate your API token; it is completely free. The commands below create a JSON file to store your Kaggle credentials. Insert your own username and API key in the code block.

!mkdir -p ~/.kaggle # Creates the .kaggle directory if it does not already exist
!touch ~/.kaggle/kaggle.json # Creates the JSON file that stores the Kaggle API credentials

kaggle_api_token = {"username":"<username>","key":"<api_key>"}  # Insert your own username and API key here

We then write our Kaggle credentials to the JSON file we created.

import json

# Writes the API credentials to the Kaggle file
with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(kaggle_api_token, file)

For security reasons, we must ensure that other users do not have read access to our Kaggle credentials.

!chmod 600 ~/.kaggle/kaggle.json

Since our access token is now configured, we can list the available datasets.

!kaggle datasets list # List available datasets

If the above command was successful, you will see a list of available datasets.

(Screenshot: output of the command showing the list of available Kaggle datasets)
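If you already know roughly which dataset you want, the Kaggle CLI also accepts a search term. The sketch below is optional, and the search string is just an example.

!kaggle datasets list -s "game of thrones" # List only datasets matching the search term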

The command below downloads the specified dataset. You can change the dataset name to any of the names returned in the list above. Downloading the dataset might take some time, depending on your network connection.

%%time
!kaggle datasets download -d iamsouravbanerjee/game-of-thrones-dataset --unzip # Downloads & unzips the dataset

Now that the dataset is downloaded, let us visualize what the CSV file looks like. We will use pandas to load and display the data.

data = pd.read_csv("Game_of_Thrones.csv", header=0)
df = data.copy()
df.head()

(Screenshot: df.head() output showing the first rows of the Game of Thrones dataset)

GitHub Repo

You can find the complete SageMaker Studio Notebook on my GitHub.

Additional Features

This was a simple demonstration of data ingestion. You can build on this solution by extracting insights from the data using pandas, or perhaps by training an ML model, as in the sketch below. If you do, please feel free to share your project with me.
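As a starting point, here is a minimal sketch of exploring the dataset with pandas. The column names used in the aggregation ("Season" and "IMDb Rating") are assumptions about the CSV; check df.columns for the actual names.

import pandas as pd

df = pd.read_csv("Game_of_Thrones.csv", header=0)

print(df.shape)             # Number of rows and columns
print(df.columns.tolist())  # Inspect the actual column names
print(df.describe())        # Summary statistics for the numeric columns

# Example aggregation; "Season" and "IMDb Rating" are assumed column names
if {"Season", "IMDb Rating"}.issubset(df.columns):
    print(df.groupby("Season")["IMDb Rating"].mean())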

Stay curious, keep learning and keep building!!!


Original Link: https://dev.to/aws-builders/sagemaker-data-ingestion-using-kaggle-3cba
