January 27, 2023 12:55 am GMT

How to install and work with Pig in notebooks

Basic commands to work with Pig in Notebooks

Related content

You can find the related post in:

Google Colab

You can find the related repo in:

GitHub

You can connect with me on:

LinkedIn

Resume

I will install Hadoop together with Pig and, from a Python notebook, write a job that answers the question: how many rows exist for each rating?

First I install Hadoop using the same commands I have used before, this time without numbering the steps.

Install Hadoop

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

If you need another version, you can find it at https://downloads.apache.org/hadoop/common/ and replace the version in the previous command.
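If you change versions often, the download URL can be built from the version string. This is a minimal sketch, assuming the mirror keeps the `hadoop-<version>/hadoop-<version>.tar.gz` layout seen in the command above; `hadoop_tarball_url` is a hypothetical helper name, not part of any library.

```python
BASE_URL = "https://downloads.apache.org/hadoop/common"

def hadoop_tarball_url(version: str) -> str:
    """Build the tarball URL for a Hadoop version (layout assumed from the wget command)."""
    name = f"hadoop-{version}"
    return f"{BASE_URL}/{name}/{name}.tar.gz"

# Reproduces the URL used in the wget command for version 3.3.4.
print(hadoop_tarball_url("3.3.4"))
```

The result can then be passed to `!wget` in a notebook cell.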

Unzip and copy

I use the following command:

!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/

Set up Hadoop's Java

I use the following commands:

# Find the default Java path and add the export to hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
java_home_text_command = f"$ {JAVA_HOME[0]} "
!echo export JAVA_HOME=$java_home_text >>/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
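The same lookup can be done in plain Python, which may be easier to reuse outside a notebook. This is a sketch in the spirit of the `readlink`/`sed` pipeline above: `java_home_from_binary` is a hypothetical helper, and it strips only a trailing `bin/java`, whereas `sed` removes the first occurrence anywhere in the path.

```python
import os

def java_home_from_binary(java_binary: str) -> str:
    """Drop a trailing 'bin/java' from a resolved java path, keeping the final slash."""
    suffix = "bin/java"
    if java_binary.endswith(suffix):
        return java_binary[: -len(suffix)]
    return java_binary

# os.path.realpath follows symlinks, much like `readlink -f`.
resolved = os.path.realpath("/usr/bin/java")
print(java_home_from_binary(resolved))
```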

Set Hadoop home variables

I use the following commands:

# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text

1st - Install Pig

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

If you need another version, you can find it at https://downloads.apache.org/pig/ and replace the version in the previous command.

2nd - Unzip and copy

I use the following command:

!tar -xzvf pig-0.17.0.tar.gz

3rd - Set Pig home variables

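I use the following commands:

# Set environment variables
import os
os.environ['PIG_HOME']="/content/pig-0.17.0"
# Point Pig at the configuration of the Hadoop version installed above (3.3.4);
# Hadoop 3.x keeps its configuration under etc/hadoop
os.environ['PIG_CLASSPATH']="/usr/local/hadoop-3.3.4/etc/hadoop"
os.environ["PATH"] += os.pathsep + "/content/pig-0.17.0/bin"

We can validate the installation with:

!pig -version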
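If `pig -version` fails, a quick check of the environment can help narrow it down. This is a minimal sketch, assuming the three variables set in the steps above; `missing_dirs` is a hypothetical helper name.

```python
import os
import shutil

def missing_dirs(env: dict, names: list) -> list:
    """Return the variable names that are unset or do not point to an existing directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Report variables that look wrong, then check whether the pig launcher is on PATH.
problems = missing_dirs(dict(os.environ), ["HADOOP_HOME", "JAVA_HOME", "PIG_HOME"])
print("Missing or invalid:", problems)
print("pig on PATH:", shutil.which("pig") is not None)
```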

4th - Create a folder with HDFS

I use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///content/data_pig

4.1 - Remove the folder with HDFS

You may need to remove it later. To do that, use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r file:///content/data_pig
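Because both commands use a `file://` URI, they operate on the local filesystem, so the same effect can be had from Python. This is a sketch on a throwaway temporary directory, not the `/content` path; `make_dir` and `remove_dir` are hypothetical helper names. Unlike `hadoop fs -mkdir`, this version does not complain if the folder already exists.

```python
import os
import shutil
import tempfile

def make_dir(path: str) -> None:
    """Create the folder (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)

def remove_dir(path: str) -> None:
    """Remove the folder and everything inside it; ignore a missing folder."""
    shutil.rmtree(path, ignore_errors=True)

# Demo in a temporary location so nothing real is touched.
base = tempfile.mkdtemp()
target = os.path.join(base, "data_pig")
make_dir(target)
print(os.path.isdir(target))
remove_dir(target)
print(os.path.isdir(target))
```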

5th - Getting a dataset to analyze with Pig

I use a dataset from GroupLens. You can find others at:
http://files.grouplens.org/datasets/

This time I use MovieLens, and you can download it using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use the data, extract the files. They are extracted into the path given after -d. Note that unzip expects a plain directory path here, not a file:// URI:

!unzip "/content/ml-100k.zip" -d "/content/data_pig"
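The extraction can also be done from Python with the standard `zipfile` module, which takes a plain directory path as the destination. This is a minimal sketch that runs against a tiny throwaway archive; `extract_zip` is a hypothetical helper and the sample row inside the archive is invented.

```python
import os
import tempfile
import zipfile

def extract_zip(zip_path: str, dest_dir: str) -> list:
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Build a small demo archive so the example runs without any download.
work = tempfile.mkdtemp()
demo_zip = os.path.join(work, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("demo/u.data", "1\t10\t5\t100\n")  # invented sample row

names = extract_zip(demo_zip, os.path.join(work, "out"))
print(names)
```

For the real dataset, the call would be `extract_zip("/content/ml-100k.zip", "/content/data_pig")`.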

To list them:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /content/data_pig/ml-100k

6th - Creating a process with Pig syntax

To create a job in Pig, you first need to look at the structure of the dataset so you can configure the job.
In this case we print the first rows of the dataset with the following command:

!head /content/data_pig/ml-100k/u.data

I get the following information about the dataset:

  • The first column is the userID.
  • The second column is the movieID.
  • The third column is the rating.
  • The fourth column is the timestamp.

Create the Pig script. The %%writefile magic must be the first line of the cell, and u.data is tab-separated, so the delimiter is '\t':

%%writefile id.pig
/* id.pig */
student = LOAD 'file:///content/data_pig/ml-100k/u.data' USING PigStorage('\t')
    AS (userId:int, movieId:int, rating:int, timestamp:int);
student_order = ORDER student BY rating DESC;
Dump student_order;
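As a cross-check of the four-column layout, the same rows can be parsed in Python and counted per rating, which is the question posed at the start. This is a minimal sketch, assuming u.data is tab-separated as the MovieLens README describes; the helper names and the three sample rows are invented for illustration.

```python
from collections import Counter

FIELDS = ("userId", "movieId", "rating", "timestamp")

def parse_row(line: str) -> dict:
    """Split one tab-separated line into the four integer fields."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, (int(value) for value in values)))

def rows_per_rating(lines) -> Counter:
    """Count how many rows exist for each rating value, skipping blank lines."""
    return Counter(parse_row(line)["rating"] for line in lines if line.strip())

# Invented sample rows; with the real file, pass open("/content/data_pig/ml-100k/u.data").
sample = ["1\t10\t5\t100\n", "2\t10\t3\t101\n", "3\t11\t5\t102\n"]
print(sorted(rows_per_rating(sample).items()))  # [(3, 1), (5, 2)]
```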

7th - Running the process

Here we run the process, specifying some parameters:

  • The Pig program file is id.pig
  • The dataset is in file:///content/data_pig/ml-100k/u.data

Running the process may take a few minutes.

You can run the script with:

!pig -x local id.pig

Here, though, we run the script and save the results to a .txt file:

!pig -x local id.pig > results.txt
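Pig's DUMP prints each tuple as a parenthesized, comma-separated line, so the saved file can be read back into Python. This is a minimal sketch, assuming every field is an integer (true for this script, not for scripts that output titles); `parse_dump_line` and `read_results` are hypothetical helper names and the sample lines are invented.

```python
def parse_dump_line(line: str):
    """Turn '(1,10,5,100)' into (1, 10, 5, 100); return None for anything else."""
    text = line.strip()
    if not (text.startswith("(") and text.endswith(")")):
        return None
    try:
        return tuple(int(part) for part in text[1:-1].split(","))
    except ValueError:
        # Non-integer or empty fields: treat the line as not parseable.
        return None

def read_results(lines) -> list:
    """Keep only the lines that parse as integer tuples."""
    return [row for row in map(parse_dump_line, lines) if row is not None]

# Invented sample; with the real output, pass open("results.txt").
sample = ["(1,10,5,100)\n", "some log line\n", "(2,10,3,101)\n"]
print(read_results(sample))  # [(1, 10, 5, 100), (2, 10, 3, 101)]
```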

8th - Advancing in the logic of the scripts

Now we will extend the logic of the script to answer the following questions:

  • What are the oldest 5-star movies?
  • What are the worst movies?

8.1 - Find the oldest 5-star movies

%%writefile fiveStarMovies.pig
ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
ratingsByMovie = GROUP ratings BY movieID;
avgRatings = FOREACH ratingsByMovie GENERATE group as movieID, AVG(ratings.rating) as avgRating;
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;
DUMP oldestFiveStarMovies;

Run the script and save the results to a .txt file:

!pig -x local fiveStarMovies.pig > fiveStarMovies.txt
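To make the group, average, filter, join, and order steps easier to follow, here is the same logic sketched in plain Python on a few in-memory rows. It is an illustration, not a replacement for the Pig job: `oldest_top_rated` is a hypothetical function and the movie titles, dates, and ratings are invented.

```python
from collections import defaultdict
from datetime import datetime

def oldest_top_rated(ratings, titles, threshold=4.0):
    """ratings: (movie_id, rating) pairs; titles: {movie_id: (title, 'dd-Mon-yyyy')}."""
    by_movie = defaultdict(list)
    for movie_id, rating in ratings:
        by_movie[movie_id].append(rating)              # GROUP ratings BY movieID
    rows = []
    for movie_id, values in by_movie.items():
        avg = sum(values) / len(values)                # AVG(ratings.rating)
        if avg > threshold and movie_id in titles:     # FILTER, then inner JOIN
            title, release = titles[movie_id]
            released = datetime.strptime(release, "%d-%b-%Y")
            rows.append((released, title, avg))
    rows.sort(key=lambda row: row[0])                  # ORDER BY release time
    return [(title, avg) for _, title, avg in rows]

# Invented sample data
titles = {1: ("Movie A", "01-Jan-1995"), 2: ("Movie B", "01-Jan-1960"), 3: ("Movie C", "01-Jan-1970")}
ratings = [(1, 5), (1, 4), (2, 5), (3, 2)]
print(oldest_top_rated(ratings, titles))  # [('Movie B', 5.0), ('Movie A', 4.5)]
```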

8.2 - Find the most-rated bad movies

%%writefile BadPopularMovies.pig
ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle;
groupedRating = GROUP ratings by movieID;
avgRatings = FOREACH groupedRating GENERATE group as movieID, AVG(ratings.rating) as avgRating, COUNT(ratings.rating) AS numRatings;
badMovies = FILTER avgRatings BY avgRating < 2.0;
namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;
results = FOREACH namedBadMovies GENERATE nameLookup::movieTitle as movieName,
    badMovies::avgRating as avgRating, badMovies::numRatings as numRatings;
finalResults = ORDER results BY numRatings DESC;
DUMP finalResults;

Run the script and save the results to a .txt file:

!pig -x local BadPopularMovies.pig > BadPopularMovies.txt
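The difference from the previous script is the extra COUNT and the descending sort on it. A short Python sketch of that part, again on invented data and with a hypothetical function name, keeps a running sum and count per movie instead of storing every rating:

```python
from collections import defaultdict

def most_rated_bad_movies(ratings, titles, threshold=2.0):
    """Return (title, average, count) for movies averaging below threshold, most rated first."""
    totals = defaultdict(lambda: [0, 0])               # movie_id -> [sum of ratings, count]
    for movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    results = [
        (titles[movie_id], total / count, count)
        for movie_id, (total, count) in totals.items()
        if total / count < threshold and movie_id in titles
    ]
    return sorted(results, key=lambda row: row[2], reverse=True)  # ORDER BY numRatings DESC

# Invented sample data
bad_titles = {1: "Movie A", 2: "Movie B", 3: "Movie C"}
bad_ratings = [(1, 1), (1, 2), (1, 1), (2, 1), (3, 5)]
for name, avg, count in most_rated_bad_movies(bad_ratings, bad_titles):
    print(name, round(avg, 2), count)
```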

9th - Say thanks, like, and share if this has been helpful or interesting


Original Link: https://dev.to/xlmriosx/how-workinginstall-pig-with-notebooks-54km
