Python Pandas Tutorial: Essential DataFrame Tips from Beginner to Expert Solution

Unlock Python Pandas mastery with our guide: essential DataFrame skills, from beginner basics to advanced techniques and practical solutions examples
Python Pandas Tutorial: Essential DataFrame Tips from Beginner to Expert Solution - www.pdfcup.com

#1. How to Create a DataFrame in Python Using Pandas

There are three commonly used methods in data analysis are:

Option 1:


import pandas as pd
data = {
"price_aprox_usd": [115910.26, 48718.17, 28977],
"surface_covered_in_m2": [128.0, 210.0, 58.0],
"rooms": [4.0, 3.0, 2.0],
}

df_houses = pd.DataFrame(data)

Option 2:


height_series = pd.Series([161.5, 156.2, 172.8, 140.9])
weight_series = pd.Series([58.6, 65.7, 58.8, 86.1])

frame = {'height': height_series,
         'weight': weight_series}
df = pd.DataFrame(frame)


Option 3:


h = pd.Series([161.5, 156.2, 172.8, 140.9], name="Marks")
w = pd.Series([58.6, 65.7, 58.8, 86.1], name="Average")
pd.concat([h,w], axis=1)

#df = pd.concat([df, new_col], axis=1)   # using this, you can add a new column

#2. How to Drop or Delete NaN Values from the DataFrame.

The dropna() function helps to remove rows from the DataFrame if they contain any NaN values.


# Drop entire row if any NaN values available.
df_cleaned = df.dropna()

#If you want to remove rows where all elements are NaN, use this method:
df_cleaned = df.dropna(how='all')

#To remove columns with NaN values:
df_cleaned = df.dropna(axis=1)

#To remove rows with NaN values only in specific columns:
df_cleaned = df.dropna(subset=['colA', 'colB'])

#3. How to Delete or Remove Duplicate Values from the DataFrame.

The drop_duplicates() function helps to remove duplicate values from the entire DataFrame.


#To remove duplicate rows based on all columns:
df_cleaned = df.drop_duplicates()

#To remove duplicates based on specific columns:
df_cleaned = df.drop_duplicates(subset=['colA', 'colB'])

#To keep the last occurrence of the duplicates instead of the first:
df_cleaned = df.drop_duplicates(keep='last')

#4. How to Remove or Delete Special Characters or Symbols from the DataFrame Column Values.

The str.replace() function helps to remove any word or character from the DataFrame.


df1["price"].str.replace("$","Dollar")

#5. How to Change or Cast the Column Data Type in a Python DataFrame.

The astype() function provides the functionality to change the data type of an existing column in a DataFrame.


df1["price"].astype(float)

#6. How to Drop a Column in a DataFrame in Python

The drop() function provides functionality for deleting rows or columns.


df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [15, 6, 17, 8]
}, index=['a', 'b', 'c', 'd'])

# Drop row 'y' in place
df.drop(d, inplace = True)

# Drop rows with index 'b' and 'd'
df_dropped_rows = df.drop(['b', 'd'])

# Drop (Multiple) columns 'B' and 'C'
df_dropped_column = df.drop(['B','C'], axis=1)

# Drop (Multiple) rows 'b' and 'd'
df_dropped_multiple = df.drop(['b', 'd'], axis=0)

#7. How to save the DataFrame to a CSV file.

We can save the DataFrame to a CSV file using various methods, with the most popular being conversion through the pandas library.


your_dataframe.to_csv('output.csv') # default convertion

# Another way is to use manual parameters, providing more control over the conversion.
your_dataframe.to_csv(
    path_or_buf='new_titanic.csv',
    sep=',',
    na_rep='',
    float_format=None,
    columns=[],
    header=True,
    index=True,
    index_label=None,
    mode='w',
    encoding='utf-8',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator=os.linesep,
    storage_options={},
    compression='infer',
    quotechar='',
    chunksize=None,
    date_format=None,
    doublequote=True,
    escapechar=None,
    decimal='.',
    errors='strict'
)

How to add a cloumn in dataframe in pytho

The Assignment '=' operator can help to assign a new column in a dataframe.


df["new_column_1"] = [6, 16, 26, 36, 46, 56]

df["new_column_2"] = df["price"]*10

how to split a column into two or multiple columns in pandas dataframe.

The str.split() function usefull to split the columns.


df3[["lat", "lon"]] = df3["lat-lon"].str.split(",", expand=True)

how to concat or append or merge two dataframes in python using Pandas.

The concat() function helps to merge the two different dataframes.


import pandas as pd
df_list = ["df1", "df2"]
new_df = pd.concat( df_list ) // by-default merged the dataframe horizontally.

how to merge two dataframes horizontally in python using Pandas.

The concat() function is capable to merge the dataframes horizontally with parameter axis=0.


import pandas as pd
df_list = ["df1", "df2"]
new_df = pd.concat( df_list, axis = 0 ) 

how to merge or append two dataframes vertically in python using Pandas

The concat() function is capable to merge the dataframes horizontally with parameter axis=1


import pandas as pd
df_list = ["df1", "df2"]
new_df = pd.concat( df_list, axis = 1 ) 

how to save or export dataframe as csv in python

The to_csv() can be use to export the data frame in csv file.


df.to_csv("Dataset_output.csv")

how to save or export dataframe as csv in python without indexes.

The index parameter can be use in the to_csv() function to remove the indexes from the dataframe.


df.to_csv("Dataset_output.csv" , index = False)

how to save or export dataframe as excel file in python.

The to_excel() Function can be use to export the data frame in excel file


df.to_excel("Dataset_output.xlsx" )

how to get all unique values from dataframe column

The unique() function can be apply with the respected column to get the list of unique values.


df["col"].unique()

how to get total number of unique value from the dataframe column

The nunique() function can be used to get the count of unique value in each column.


df["col"].nunique()

how to get top N most frequent values from the dataframe column using pandas in python

The values_counts() function can be used to count the occurance of the values in the dataframe.


arrange_by_rank = df["salary"].value_counts()
top_3_values = arrange_by_rank.head(3)

hows to calculate the min, max, Average-Mean, standard deviation, and quartiles of dataframe using pandas in Python.

The function "describe()" helps to calculate all the usefull statics numbers.


df.describe() # Calculate for entire dataframe
df[["area_m2", "price_usd"]].describe() # Calculate for Specific dataframe's columns.


how to group a dataframe based on different column and calculate its mean value in python Dataframe.

The mean() function along with groupby() function can be used to achive the goal.


df.groupby(["state"], sort=False).mean()

how to group a repetataive column values based on a another column in python Dataframe

The groupby() function can also useful, see below-


t = df[["state", "price_usd"]].groupby(["state"], sort=False).mean()

how to group multiple column in python Dataframe

The groupby() function can be used to group the multiple columns by name under the square bracket.


t = df[["state","area_m2", "price_usd"]].groupby(["state","area_m2"], sort=False).mean()

how to sort a dataframe based on a column in python

The sort_values() function is helpful to sort the columns values.


t.sort_values(by="price_usd",  ascending=False)

how to group the dataframe and return only specifics or particular columns

The groupby() function can be used to group the columns by name and provide the column name in square bracket to return the specific column.


df.groupby(["state"], sort=False)["price_usd"].mean().sort_values(ascending=False)

how to plot a dataframe using pandas library

The pandas's plot() function can also use to dispaly the plots.


df.plot(kind="bar",xlabel = "",ylabel= "",title= ""
);  # there should be two coloumns for batter visibility
chaining method

(

    df
    .groupby("state")
    ["price_per_m2"].mean()
    .sort_values(ascending=False)
    .plot(
        kind="bar",
        xlabel="State",
        ylabel="Mean Price per M^2 [USD]",
        title="Mean House Price per M^2 by State"

)

);

how to check if value exists in dataframe column

The str.contains() function helps to check the value avaiable in column or not.


# Returns True if the value is exists otherwise return False.

bool  = df["place_with_parent_names"].quantile([0.1, 0.9]).str.contains("Capital") 

print(df[bool]).head() #return rows only for True values.

how to calculate quartiles in python.

The QUANTILE() function provides an estimate of what range the data value actually lies in based on the given function parameter. For example, here we have added two parameters to the QUANTILE function. The first value '0.1' will return the exact number that lies at 0.1 or 10% of the data. Similarly, '0.9' will return the exact value that lies at 90% of the data.
Note: This function only works with numeric data, if you implement it with string data it will return the number of errors.


df["price_usd"].quantile([0.1, 0.9])  # Note: 0.1 equals 10% and 0.9 equals 90%

About the author

D Shwari
I'm a professor at National University's Department of Computer Science. My main streams are data science and data analysis. Project management for many computer science-related sectors. Next working project on Al with deep Learning.....

Post a Comment