
How to install and work with Spark in notebooks

Basic commands to work with Spark in notebooks as a standalone cluster.

Related content

You can find the related post on:

Google Colab

You can find the related video on:

YouTube

You can find the related repo on:

GitHub

You can connect with me on:

LinkedIn

Resume

I will install Spark and use a Python library to write a job that answers the question: how many rows exist for each rating?

Before starting, we set up the environment to run a Spark standalone cluster.

1st - Mount Google Drive

We will mount Google Drive so we can use its files.

I use the following script:

from google.colab import drive
drive.mount('/content/gdrive')

2nd - Install Spark

Once you have a Colab notebook up, to get Spark running you have to run the following script (I apologize for how ugly it is):

%%bash
apt-get install openjdk-8-jdk-headless -qq > /dev/null
if [[ ! -d spark-3.3.1-bin-hadoop3 ]]; then
  echo "Spark hasn't been installed, downloading and installing!"
  wget -q https://downloads.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
  tar xf spark-3.3.1-bin-hadoop3.tgz
  rm -f spark-3.3.1-bin-hadoop3.tgz
fi
pip install -q findspark

You can get a different version if you need one from https://downloads.apache.org/spark/ and then substitute it into the command above.

3rd - Setting environment variables

I use the following command:

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"

import findspark
findspark.init()
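
As a quick sanity check (my addition, not in the original post), pyspark should now be importable from plain Python and report the version installed above:

import pyspark

print(pyspark.__version__)  # should print 3.3.1, matching the install in step 2nd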

4th - Configuring a SparkSession

I use the following command:

from pyspark.sql import SparkSession

# master("local[*]") sets up a local master using all (*) available threads;
# appName is just a generic name for the job.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("BLOG_XLMRIOSX") \
    .getOrCreate()

sc = spark.sparkContext
sc

After executing this, we can use several data structures to manage data, such as RDDs and DataFrames (both Spark and pandas). One more, the Dataset, also exists, but it is only available in Scala and Java.
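
As a minimal sketch of moving between those structures (my addition; it assumes the spark session created in step 4th, and the example data is hypothetical):

# Spark DataFrame built from an in-memory list of hypothetical rows
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

rdd = df.rdd                      # Spark DataFrame -> RDD of Row objects
pdf = df.toPandas()               # Spark DataFrame -> pandas DataFrame
df2 = spark.createDataFrame(pdf)  # pandas DataFrame -> Spark DataFrame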

5th - Getting a dataset to analyze with Spark

I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/

This time I use MovieLens, and you can download it using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use the data, extract the files. I extract them to the path given after -d in the command:

!unzip "/content/ml-100k.zip" -d "/content/ml-100k_folder"

6th - Configuring data to analyze

We create a variable called data that holds the path to the data file:

data = '/content/ml-100k_folder/ml-100k/u.data'

Then we define a variable called df_spark that holds the contents of the data:

df_spark = spark.read.csv(data, inferSchema=True, header=True)

We can inspect the type of the variable df_spark like this:

print(type(df_spark))

We can inspect the data frame held by df_spark like this:

df_spark.show()

We can see the format is incorrect, so we fix it by telling Spark how the data is laid out; u.data has no header row and is tab-separated:

df_spark = spark.read.csv(data, inferSchema=True, header=False, sep="\t")

7th - Making a query

To do this we need to know the format of the data, so I infer the following structure:

  • The first column refers to userID.
  • The second column refers to movieID.
  • The third column refers to rating.
  • The fourth column refers to timestamp.

I will answer the question: how many rows exist for each rating?
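
Optionally (my addition, not in the original post), you can give the columns descriptive names based on the structure above, so later queries can say rating instead of _c2:

df_named = df_spark.toDF("userID", "movieID", "rating", "timestamp")
df_named.printSchema()

The rest of the post keeps the default _c0 ... _c3 names.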

8th - Making a query with SQL syntax

  • First, create a temporary view from the DataFrame.

df_spark.createOrReplaceTempView("table")

  • Second, make a query to answer the question.

sql = spark.sql("SELECT _c2, COUNT(*) FROM table GROUP BY _c2")

To see the results:

sql.show()
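
A small variation of the same query (my addition) that aliases the columns and sorts by rating, purely for readability:

spark.sql("""
    SELECT _c2 AS rating, COUNT(*) AS n_rows
    FROM table
    GROUP BY _c2
    ORDER BY _c2
""").show()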

9th - Making a query with the DataFrame API

  • It's easy to make this query with the DataFrame API.

df_spark.groupBy("_c2").count().show()
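
As with the SQL version, you can sort the output by rating (my addition):

df_spark.groupBy("_c2").count().orderBy("_c2").show()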

10th - Making a query with RDD

  • First, we convert the DataFrame to the RDD type.

rdd = df_spark.rdd

  • Second, we make the query with RDD functions.

rdd \
    .groupBy(lambda x: x[2]) \
    .mapValues(lambda values: len(set(values))) \
    .collect()
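
A more idiomatic RDD counting pattern (my alternative, not from the original post) maps each row to a (rating, 1) pair and sums the pairs, which avoids materializing every row for a key the way groupBy does:

rdd.map(lambda row: (row[2], 1)) \
   .reduceByKey(lambda a, b: a + b) \
   .collect()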

11th - Say thanks, give a like, and share if this has been of help or interest


