SparkSession is the new entry point from Spark 2.0. Prior to 2.0, we had only SparkContext and SQLContext, and we would also create a StreamingContext (if using streaming). SparkSession appears to be part of Spark's plan to unify the APIs from 2.0 onwards.
start spark shell
Run the following commands from your spark base folder.
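Assuming a standard Spark 2.x binary distribution (the path below is the usual layout; adjust it to your install):

```shell
# from the spark base folder
./bin/spark-shell
```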
create spark session
A SparkSession object is available by default in the spark shell as "spark". But when you build your Spark project outside the shell, you can create a session as follows:
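A minimal sketch of building a session in a standalone application (the app name and local master are placeholders; point the master at your cluster in practice):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SparkSessionExample") // placeholder app name
  .master("local[*]")             // run locally with all cores; use your cluster master instead
  .getOrCreate()
```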
If you run the above command in spark shell, you will see this warning
This is because there is already a SparkSession instance in scope, which is also evident from the builder's getOrCreate() method.
The getOrCreate method of the SparkSession builder does the following:
Create a SparkConf
Get a SparkContext (using SparkContext.getOrCreate(sparkConf))
Get a SparkSession (using SQLContext.getOrCreate(sparkContext).sparkSession)
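The three steps above can be sketched roughly as follows (a simplified pseudo-flow, not the actual Spark source):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Simplified sketch of what SparkSession.builder.getOrCreate() roughly does:
val sparkConf = new SparkConf()                        // 1. create a SparkConf
val sc = SparkContext.getOrCreate(sparkConf)           // 2. reuse or create the SparkContext
val session = SQLContext.getOrCreate(sc).sparkSession  // 3. reuse or create the SparkSession
```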
Once the SparkSession is created, it can be used to read data from various sources.
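For example, assuming the shell's predefined `spark` session and a hypothetical CSV file path:

```scala
// Read a CSV file into a DataFrame (path and options are illustrative)
val df = spark.read
  .option("header", "true")
  .csv("/tmp/people.csv")

// Other sources follow the same pattern:
// spark.read.json("...") / spark.read.parquet("...") / spark.read.text("...")
```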
Note: All the commands used in this blog post can be found here
Let us now register this DataFrame as a temp table.
It looks like the registerTempTable method is deprecated. Let's check Dataset.scala to figure out which alternative method to use.
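The deprecation note points to createOrReplaceTempView as the Spark 2.0 replacement. Assuming the DataFrame from earlier is named `df`:

```scala
// Deprecated since Spark 2.0:
// df.registerTempTable("people")

// Preferred replacement:
df.createOrReplaceTempView("people")
```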
You can also save the DataFrame as a table in the Hive metastore.
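A sketch, assuming the same `df` and an illustrative table name:

```scala
// Persist the DataFrame as a managed table in the Hive metastore
df.write.saveAsTable("people_table")
```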
You can access the registered table via
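Assuming a temp view registered as "people", either of these works:

```scala
// Query the view with SQL ...
val result = spark.sql("SELECT * FROM people")

// ... or load it directly as a DataFrame
val sameTable = spark.table("people")
```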
We can register a UDF (User Defined Function) using the SparkSession.
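A sketch with an illustrative UDF name, assuming the shell's `spark` session and a registered "people" view:

```scala
// Register a simple UDF that upper-cases a string
spark.udf.register("toUpper", (s: String) => s.toUpperCase)

// It can then be used in SQL against a registered view
spark.sql("SELECT toUpper(name) FROM people")
```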
This API is similar to how we create an RDD using SparkContext.
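A sketch of the parallel between the two entry points (the values here are placeholders):

```scala
// With SparkContext we parallelize a collection into an RDD
val rddInt = spark.sparkContext.parallelize(1 to 10) // RDD[Int]

// SparkSession offers a similar call that produces a typed Dataset
import spark.implicits._
val ds = spark.createDataset(1 to 10)                // Dataset[Int]
```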
createDataFrame is used for creating DataFrames. We cannot create a DataFrame from our earlier RDD[Int] because createDataFrame requires an RDD[A <: Product] - i.e., a type that is a subclass of Product. So we will create a DataFrame from an RDD of a case class.
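A sketch with an illustrative case class (case classes extend Product, so they satisfy the bound):

```scala
// A case class extends Product, so RDD[Person] satisfies RDD[A <: Product]
case class Person(name: String, age: Int)

val personRDD = spark.sparkContext.parallelize(
  Seq(Person("Alice", 29), Person("Bob", 31)))

val personDF = spark.createDataFrame(personRDD)
personDF.show()
```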
Let us look at one more way of creating a DataFrame: using a Row RDD and a schema.
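A sketch with illustrative column names and data:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// An RDD of generic Rows, plus an explicit schema describing them
val rowRDD = spark.sparkContext.parallelize(
  Seq(Row("Alice", 29), Row("Bob", 31)))

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val peopleDF = spark.createDataFrame(rowRDD, schema)
```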
DataFrame to RDD / DataSet to RDD
A DataFrame or a Dataset can be converted to an RDD by calling .rdd.
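A self-contained sketch (the sample data is illustrative):

```scala
import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")

val rowRDD = df.rdd             // DataFrame -> RDD[Row]
val longRDD = spark.range(5).rdd // Dataset[java.lang.Long] -> RDD
```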
The Catalog provides information about the databases and tables in the session, along with actions such as dropTempView, cacheTable, clearCache, etc.
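A few of these calls, assuming the shell's `spark` session and a temp view named "people" from earlier:

```scala
// Inspect databases and tables known to the session
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()

// Cache / uncache by table name
spark.catalog.cacheTable("people")
spark.catalog.clearCache()

// Drop a temporary view
spark.catalog.dropTempView("people")
```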
This concludes my experiments with SparkSession for now. I will try to explore more of the new features in Spark 2.0 and share them with you in later posts!