Mini-Project for Wings - Data Classification | Turing Machine Data Scientist Program: Use-case 5 Fresco Play Hands-on Solution

Machine Learning Case Study - Binary Classification

Instructions - Business Use Case

You are a Data Scientist working in a Public Policy team. Your team needs a prediction model that determines, from a person's demographic data, whether that person will earn $50,000 or more. This prediction will help the team make policy decisions about financial assistance for the low-income group. You are given sample population data along with each person's annual income, and you can use that data to train your machine learning model.
You can build your model on your own hardware (PC or laptop) and upload only the predictions, in the format shown below.

Additional instructions for the case study are provided below.

Build a Machine Learning Model, which is capable of predicting if an individual's income is greater than 50k or not.
The prediction must be done based on various data attributes provided below.
Use 'TrainData' Dataset below for building the model. The train data has 43957 records.
Use 'TestData' Dataset below for testing your predictions. The test data has 898 records.

Data Attribute	Description
age	continuous
workclass	Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt	continuous
education	Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num	continuous
marital-status	Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation	Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship	Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race	White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex	Female, Male
capital-gain	continuous
capital-loss	continuous
hours-per-week	continuous
native-country	United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
income >50K	binary (Target that needs to be predicted)

A few sample rows from Test.csv

age	workclass	fnlwgt	education	educational-num	marital-status	occupation	relationship	race	gender	capital-gain	capital-loss	hours-per-week	native-country
39	Self-emp-not-inc	327120	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40	Portugal
32	Private	123253	Assoc-acdm	12	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	42	United-States

A few sample rows from Train.csv

age	workclass	fnlwgt	education	educational-num	marital-status	occupation	relationship	race	gender	capital-gain	hours-per-week	native-country	income_>50K
67	Private	366425	Doctorate	16	Divorced	Exec-managerial	Not-in-family	White	Male	99999	60	United-States	1
17	Private	244602	12th	8	Never-married	Other-service	Own-child	White	Male	0	15	United-States	0
31	Private	174201	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	40	United-States	1
58	State-gov	110199	7th-8th	4	Married-civ-spouse	Transport-moving	Husband	White	Male	0	40	United-States	0

Final Instruction:

Use the train data to build and train your model, then run your predictions on the test data.
Once the predictions are ready, paste them into the IDE in the format below.
id,outcome
0,1
1,0
2,1
3,1
4,0
.
.
.
.
900,1
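
The format above can be produced with a few lines of plain Python. This is only a minimal sketch: `format_submission` is a hypothetical helper, the example predictions are made up, and whether the IDE expects a space after the comma is an assumption (here none is written).

```python
def format_submission(predictions):
    """Return predictions as 'id,outcome' CSV text, one row per prediction."""
    lines = ["id,outcome"]  # header row, as in the required format
    for row_id, outcome in enumerate(predictions):
        # ids start at 0 and follow the order of the test rows
        lines.append(f"{row_id},{int(outcome)}")
    return "\n".join(lines)


# Hypothetical predictions, for illustration only
example = format_submission([1, 0, 1, 1, 0])
print(example)
```

The ids simply count the test rows in order, so the predictions must stay in the same order as the rows of the test file.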

Solution: Machine Learning Case Study - Binary Classification

Step 1: Import the necessary Packages and the dataset.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the data
test_data = pd.read_csv("/content/sample_data/test.csv")
train_data = pd.read_csv("/content/sample_data/train.csv")
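
Before modelling, it can help to check how many values are missing and how the target is distributed. The sketch below uses a tiny made-up DataFrame (`toy`) instead of the real train.csv, and it assumes missing entries are read as NaN; the same two calls could be run on `train_data`.

```python
import pandas as pd

# Tiny hypothetical stand-in for train.csv; the real file has more columns and rows
toy = pd.DataFrame({
    "age": [67, 17, 31, 58],
    "workclass": ["Private", None, "Private", "State-gov"],
    "occupation": ["Exec-managerial", "Other-service", None, None],
    "income_>50K": [1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of each target class (0 = not above 50K, 1 = above 50K)
class_share = toy["income_>50K"].value_counts(normalize=True)
print(class_share)
```

Columns with a non-zero missing count are the ones that need a fill value before encoding.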

Step 2: Build the Classification ML Model.

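# Fill missing categorical values in both datasets with a placeholder category
missing_fill = {"workclass": "Unknown", "occupation": "Unknown", "native-country": "Unknown"}
train_data.fillna(missing_fill, inplace=True)
test_data.fillna(missing_fill, inplace=True)
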
# Drop unnecessary columns if needed (e.g., 'fnlwgt' can often be ignored)
train_data.drop(columns=['fnlwgt'], inplace=True)
test_data.drop(columns=['fnlwgt'], inplace=True)

# Convert categorical variables to numerical using Label Encoding
categorical_features = train_data.select_dtypes(include=['object']).columns.tolist()

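label_encoders = {}
for col in categorical_features:
    # Fit each encoder on the training column only
    label_encoders[col] = LabelEncoder()
    train_data[col] = label_encoders[col].fit_transform(train_data[col])
    if col in test_data.columns:
        # Note: transform raises a ValueError if the test column contains
        # a category that never appeared in the training column
        test_data[col] = label_encoders[col].transform(test_data[col])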

# Define input (X) and target (y)
X = train_data.drop(columns=["income_>50K"])
y = train_data["income_>50K"]

# Split the training data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

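# Feature Scaling (fit on the training split only, then reuse for validation and test)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
# Select the test columns in the same order as the training features
X_test_scaled = scaler.transform(test_data[X.columns])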

# Train a Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate model performance
y_pred = model.predict(X_valid_scaled)
print("Validation Accuracy:", accuracy_score(y_valid, y_pred))

# Make predictions on test data
test_predictions = model.predict(X_test_scaled)

# Format output for submission
submission = pd.DataFrame({"id": range(len(test_predictions)), "outcome": test_predictions})
submission.to_csv("answer.csv", index=False)

print("Predictions saved to answer.csv")
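
Accuracy alone can hide how well the positive class is predicted, especially if one income class is much rarer than the other (the source does not state the class balance). As a rough sketch, precision, recall and a confusion matrix can be computed on the validation split; the labels below are made up for illustration, and in the script above `y_valid` and `y_pred` would take their place.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical validation labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Precision: of the rows predicted as 1, how many are really 1
precision = precision_score(y_true, y_hat)
# Recall: of the rows that are really 1, how many were predicted as 1
recall = recall_score(y_true, y_hat)
print("Precision:", precision, "Recall:", recall)
```

In this toy case the accuracy is 0.7, yet only half of the true positives are found, which accuracy by itself would not reveal.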

# Now copy all lines of answer.csv (the header plus one row per test record) into the IDE and run the Final Test case.
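
If the label-encoding step fails because the test data contains a category that never appears in the training data, one possible workaround (not part of the original solution) is scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"`, which maps unseen categories to a fixed code. The toy frames and the choice of -1 as the unknown code are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "Laos" appears only in the test frame
train_toy = pd.DataFrame({"native-country": ["United-States", "Portugal", "United-States"]})
test_toy = pd.DataFrame({"native-country": ["Portugal", "Laos"]})

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_toy[["native-country"]])
test_codes = encoder.transform(test_toy[["native-country"]])

print(train_codes.ravel())
print(test_codes.ravel())
```

The encoder is fitted on the training column only, so known categories get the same code in both frames, while anything new in the test data falls back to -1.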
