
Note: If you run into any problems, please comment below.
LAB 1: Welcome to PySpark - Hands-on 1 - Create a DataFrame
Solution: Welcome to PySpark - Hands-on 1 - Create a DataFrame
#!/bin/python3
# Put your code here
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create (or reuse) the Spark session for this lab
spark = SparkSession.builder.appName("Data Frame PASSENGER").config("spark.some.config.option", "some-value").getOrCreate()
# Row factory that defines the column names
passenger = Row("Name","age","source","destination")
data1 = passenger("David", "22", "London", "Paris")
data2 = passenger("Steve", "22", "New York", "Sydney")
passengerData=[data1,data2]
# Build the DataFrame from the list of Row objects
df = spark.createDataFrame(passengerData)
df.show()
# Don't Remove this line
df.coalesce(1).write.parquet("PassengerData")
LAB 2: Welcome to PySpark - Hands-on 2 - PySpark Final Hands-on: DataFrame operations using a JSON file
Solution: PySpark Final Hands-on: DataFrame operations using a JSON file
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PySpark DataFrame From RDD').getOrCreate()
df = spark.read.json("emp.json")
df.show()
df.coalesce(1).write.parquet("/projects/challenge/Employees")
# Read the data back from the same location it was written to
df2=spark.read.parquet("/projects/challenge/Employees")
df1=df2.filter(df2.stream=='JAVA')
df1.show()
df1.coalesce(1).write.parquet("/projects/challenge/JavaEmployees")
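To try the Lab 2 code outside the lab environment, a sample emp.json is needed. By default spark.read.json expects one JSON object per line (JSON Lines). The sketch below writes such a file to a temporary directory and repeats the JAVA filter in plain Python so the result can be checked by hand. Only the stream field comes from the lab code; the name field and all the records are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records; only the "stream" field is taken from the lab code.
employees = [
    {"name": "Asha", "stream": "JAVA"},
    {"name": "Ben", "stream": "PYTHON"},
    {"name": "Chen", "stream": "JAVA"},
]

# Write one JSON object per line (JSON Lines), the default layout for spark.read.json.
sample_path = os.path.join(tempfile.mkdtemp(), "emp.json")
with open(sample_path, "w") as f:
    for record in employees:
        f.write(json.dumps(record) + "\n")

# Plain-Python counterpart of df2.filter(df2.stream == 'JAVA').
with open(sample_path) as f:
    rows = [json.loads(line) for line in f]
java_rows = [row for row in rows if row["stream"] == "JAVA"]
print(java_rows)
```

The comparison is case-sensitive, both here and in the Spark filter, so a record with stream set to 'Java' would not be selected.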
LAB 3: Welcome to PySpark - Hands-on 3 - Statistical and mathematical functions with DataFrames in Apache Spark
Solution: Statistical and mathematical functions with DataFrames in Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand
from pyspark.sql import Row
spark = SparkSession.builder.appName('PySpark DataFrame').getOrCreate()
df = spark.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))
# df.show()
# Sample covariance between the two random columns
CoV=df.stat.cov('rand1', 'rand2')
# Pearson correlation coefficient between the two random columns
CoR=df.stat.corr('rand1', 'rand2')
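Student = Row("Stats", "Value")
# Use the computed statistics instead of hard-coding them
# (values recorded in the original run: 0.01580184435383226 and 0.16622388738558816)
s1 = Student('Co-variance', CoV)
s2 = Student('Correlation', CoR)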
StudentData=[s1,s2]
df=spark.createDataFrame(StudentData)
# df.show()
df.write.option("header",True).csv("Result")
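For readers who want to see what the two statistics in Lab 3 measure, here is a minimal plain-Python sketch of sample covariance and the Pearson correlation coefficient, which is what df.stat.cov and df.stat.corr report. The helper names and the input lists are illustrative and not part of the lab.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance: summed products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Illustrative data, not taken from the lab
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]

print(sample_cov(xs, ys))
print(pearson_corr(xs, ys))
```

Correlation is covariance rescaled to the range -1 to 1, which is why the lab's correlation value is larger than its covariance even though both describe the same pair of columns.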
LAB 4: Welcome to PySpark - Hands-on 4 - More Operations in PySpark
Solution: Welcome to PySpark - Hands-on 4 - More Operations in PySpark
#!/bin/python3
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName('handson2').getOrCreate()
# Row factory that defines the column names
student = Row("ID","Name","Age","Area of Interest")
data1 = student("1", "Jack", 22, "Data Science")
data2 = student("2", "Luke", 21, "Data Analytics")
data3 = student("3", "Leo", 24, "Micro Services")
data4 = student("4", "Mark", 21, "Data Analytics")
studentData=[data1,data2,data3,data4]
df = spark.createDataFrame(studentData)
df_parquet= df.describe(['Age'])
df_parquet.show()
df_parquet.coalesce(1).write.parquet("/projects/challenge/Age")
df2=df.select(['ID', 'Name', 'Age'] ).sort(['Name'], ascending=False)
df2.coalesce(1).write.parquet("/projects/challenge/NameSorted")
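As a cross-check for the describe(['Age']) step in Lab 4, the same five summary statistics (count, mean, stddev, min, max) can be reproduced with the Python standard library. This is a rough sketch using the four Age values from the lab rows; the summary dictionary is an illustrative name, and Spark returns the describe values as strings, not numbers.

```python
import statistics

# Age values from the four rows created in Lab 4
ages = [22, 21, 24, 21]

# The same statistics that describe() reports for a numeric column
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),  # sample standard deviation (n - 1)
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(name, value)
```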