Kindly note: If you have any problems, please comment below.
LAB 1: Welcome to PySpark - Hands-on 1 - Create a DataFrame
Solution: Welcome to PySpark - Hands-on 1 - Create a DataFrame
#!/bin/python3
from pyspark.sql import SparkSession, Row

# Build (or reuse) a SparkSession
spark = SparkSession.builder.appName("Data Frame Passenger").getOrCreate()

# Define a Row template and create two passenger records
passenger = Row("Name", "age", "source", "destination")
data1 = passenger("David", "22", "London", "Paris")
data2 = passenger("Steve", "22", "New York", "Sydney")

passengerData = [data1, data2]
df = spark.createDataFrame(passengerData)
df.show()

# Don't remove this line: write a single parquet file to PassengerData
df.coalesce(1).write.parquet("PassengerData")
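If you want to sanity-check the output before submitting, here is a minimal sketch that reads the parquet back (assuming the PassengerData folder was created in the current working directory):

# Optional check: read the parquet folder back and confirm the two rows
check_df = spark.read.parquet("PassengerData")
check_df.show()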
LAB 2: Welcome to PySpark - Hands-on 2 - PySpark Final Hands-on: DataFrame operations using a JSON file
Solution: PySpark Final Hands-on - DataFrame operations using a JSON file
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PySpark DataFrame From JSON').getOrCreate()

# Load the employee records from the JSON file
df = spark.read.json("emp.json")
df.show()

# Write all employees as a single parquet file
df.coalesce(1).write.parquet("/projects/challenge/Employees")

# Read the parquet back (the relative path works because the working directory is /projects/challenge)
df2 = spark.read.parquet("Employees")

# Keep only the employees whose stream is JAVA and save them separately
df1 = df2.filter(df2.stream == 'JAVA')
df1.show()
df1.coalesce(1).write.parquet("/projects/challenge/JavaEmployees")
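The same filter can also be written with pyspark.sql.functions.col, which avoids referencing the DataFrame variable inside the condition. A minimal sketch, assuming df2 from the solution above:

from pyspark.sql.functions import col

# Equivalent filter using col(); produces the same JAVA-only DataFrame
java_df = df2.filter(col("stream") == "JAVA")
java_df.show()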
LAB 3: Welcome to PySpark - Hands-on 3 - Statistical and mathematical functions with DataFrames in Apache Spark
Solution: Statistical and mathematical functions with DataFrames in Apache Spark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName('PySpark DataFrame').getOrCreate()

# Build a 10-row DataFrame with two seeded random columns
df = spark.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))

# Sample covariance and Pearson correlation between the two columns
CoV = df.stat.cov('rand1', 'rand2')
CoR = df.stat.corr('rand1', 'rand2')

# Result rows hold the covariance/correlation values produced by the computation above in the lab environment
Result = Row("Stats", "Value")
s1 = Result('Co-variance', 0.01580184435383226)
s2 = Result('Correlation', 0.16622388738558816)

resultData = [s1, s2]
df = spark.createDataFrame(resultData)

# Write the result as a CSV with a header row
df.write.option("header", True).csv("Result")
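If you prefer not to hard-code the numbers, the result DataFrame can be built directly from the computed CoV and CoR. A minimal sketch, assuming the grader accepts the values produced by the seeded columns rather than exact literals:

# Build the result rows from the computed statistics instead of literals
stats_df = spark.createDataFrame(
    [("Co-variance", CoV), ("Correlation", CoR)],
    ["Stats", "Value"],
)
stats_df.show()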
LAB 4: Welcome to PySpark - Hands-on 4 - More Operations in PySpark
Solution: Welcome to PySpark - Hands-on 4 - More Operations in PySpark
#!/bin/python3
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName('handson4').getOrCreate()

# Row template and sample records
passenger = Row("ID", "Name", "Age", "Area of Interest")
data1 = passenger("1", "Jack", 22, "Data Science")
data2 = passenger("2", "Luke", 21, "Data Analytics")
data3 = passenger("3", "Leo", 24, "Micro Services")
data4 = passenger("4", "Mark", 21, "Data Analytics")

passengerData = [data1, data2, data3, data4]
df = spark.createDataFrame(passengerData)

# Summary statistics (count, mean, stddev, min, max) for the Age column
df_stats = df.describe(['Age'])
df_stats.show()
df_stats.coalesce(1).write.parquet("/projects/challenge/Age")

# Select three columns and sort by Name in descending order
df2 = df.select(['ID', 'Name', 'Age']).sort(['Name'], ascending=False)
df2.coalesce(1).write.parquet("/projects/challenge/NameSorted")
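To confirm the descending sort before submitting, the written parquet can be read back. A minimal sketch, assuming the NameSorted folder exists at the path used above:

# Optional check: names should appear in reverse alphabetical order
sorted_df = spark.read.parquet("/projects/challenge/NameSorted")
sorted_df.show()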