SparkContext

SparkContext is the low-level core of Spark. It handles:

  • connecting to the cluster (Standalone/YARN/Mesos/K8s),
  • task scheduling,
  • creating RDDs,
  • managing executors.

Without SparkContext, Spark cannot run at all.

Example:

from pyspark import SparkContext

sc = SparkContext(appName="MyApp")   # connect to the cluster
rdd = sc.parallelize([1, 2, 3])      # create an RDD from a local collection

SparkSession

Starting with Spark 2.0, SparkSession became the main interface.

SparkSession:

  • unifies SQLContext, HiveContext, and SparkContext,
  • provides DataFrame and Dataset APIs,
  • is used for SQL, Hive tables, Delta Lake, and reading/writing data,
  • contains a SparkContext internally.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.json("data.json")   # read JSON into a DataFrame

Accessing SparkContext from SparkSession:

sc = spark.sparkContext
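
The object returned here is an ordinary SparkContext, so the low-level RDD API stays available even if the application only ever creates a SparkSession. A minimal, self-contained sketch (the app name and data are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RddFromSession").getOrCreate()
sc = spark.sparkContext                 # the context the session created internally
rdd = sc.parallelize([1, 2, 3])         # low-level RDD API through the session's context
print(rdd.reduce(lambda a, b: a + b))   # 6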

🧠 Why SparkSession was introduced

Before Spark 2.0:

  • RDDs → SparkContext
  • DataFrames → SQLContext
  • Hive tables → HiveContext

This was messy. SparkSession unified all of these:

  ✔ a single entry point
  ✔ unified configuration
  ✔ easier to use in Python/JVM
  ✔ a consistent API
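
As a small illustration of that single entry point, one builder call can set the application name, apply configuration, and enable Hive support in one place. A sketch only: the config value is illustrative, and enableHiveSupport assumes Hive support is available on the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("UnifiedEntryPoint")
    .config("spark.sql.shuffle.partitions", "8")   # unified configuration
    .enableHiveSupport()                           # replaces the old HiveContext
    .getOrCreate()
)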

🆚 Main differences

Feature                              SparkContext       SparkSession
Entry point                          Yes, low-level     Yes, unified and high-level
RDD API                              ✔                  via spark.sparkContext
DataFrame/SQL/Dataset                ✘                  ✔
Read/write data (DataSource API)     Limited            ✔
Hive support                         ✘                  ✔
Internally used by SparkSession      ✔                  —
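
To make the table concrete, here is a short sketch contrasting the two entry points (the column names, sample rows, and output path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CompareEntryPoints").getOrCreate()

# SparkSession: DataFrame, SQL, and the DataSource API
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.createOrReplaceTempView("t")
spark.sql("SELECT COUNT(*) AS n FROM t").show()
df.write.mode("overwrite").parquet("/tmp/letters.parquet")

# SparkContext: only the low-level RDD API, reached through the session
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])
print(rdd.count())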

📝 Summary

SparkSession is the modern, unified way to work with Spark. It is a high-level API that:

  • offers DataFrames, SQL, and Datasets,
  • handles reading/writing data,
  • gives access to Hive,
  • internally wraps SparkContext.

SparkContext is the engine. SparkSession is the steering wheel.