Machine Learning Case Study - Binary Classification
Instructions - Business Use Case
You are a Data Scientist working on a Public Policy team. Your team needs a prediction model that determines, from a person's demographic data, whether that person earns more than $50,000 a year. This prediction will help the team make policy decisions about providing financial assistance to the low-income group. You are given sample data from the population, including annual income, which you can use to train your machine learning model.
You can build your model on your own hardware (PC or laptop) and upload only the predictions, in the format shown below.
Additional instructions for the case study are provided below.
- Build a machine learning model that predicts whether an individual's income is greater than 50K.
- The prediction must be based on the data attributes listed below.
- Use the 'TrainData' dataset below to build the model. The train data has 43957 records.
- Use the 'TestData' dataset below to test your predictions. The test data has 898 records.
Data Attribute | Description |
age | continuous |
workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
fnlwgt | continuous |
education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
education-num | continuous |
marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
sex | Female, Male |
capital-gain | continuous |
capital-loss | continuous |
hours-per-week | continuous |
native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
income >50K | binary (Target that needs to be predicted) |
Sample rows from Test.csv
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
39 | Self-emp-not-inc | 327120 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | Portugal |
32 | Private | 123253 | Assoc-acdm | 12 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 42 | United-States |
Sample rows from Train.csv
age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income_>50K |
67 | Private | 366425 | Doctorate | 16 | Divorced | Exec-managerial | Not-in-family | White | Male | 99999 | 0 | 60 | United-States | 1 |
17 | Private | 244602 | 12th | 8 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 15 | United-States | 0 |
31 | Private | 174201 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | 1 |
58 | State-gov | 110199 | 7th-8th | 4 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States | 0 |
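The attribute table and the sample CSV headers do not use identical names: the table lists education-num, sex, and income >50K, while the sample rows show educational-num, gender, and income_>50K. A minimal sketch of reconciling the two, assuming the sample headers are the names actually present in the files. The mapping `DOC_TO_CSV` and the helper `to_csv_name` are illustrative names, not part of the case study.

```python
# Hypothetical mapping from names in the attribute table to the header
# names shown in the Train.csv / Test.csv samples.
DOC_TO_CSV = {
    "education-num": "educational-num",
    "sex": "gender",
    "income >50K": "income_>50K",
}


def to_csv_name(doc_name: str) -> str:
    """Return the CSV header for a documented attribute name.

    Names that are the same in both places are returned unchanged.
    """
    return DOC_TO_CSV.get(doc_name, doc_name)


documented = ["age", "education-num", "sex", "income >50K"]
print([to_csv_name(name) for name in documented])
```

Checking the real headers with `train_data.columns` after loading the file is the safest way to confirm which spelling applies.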
Final Instruction:
- Use the train data to build and train your model, then run your predictions on the test data.
- Once the predictions are ready, paste them into the IDE in the format below.
id, outcome
0, 1
1, 0
2, 1
3, 1
4, 0
...
900, 1
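The format above can be produced from a list of predictions with a few lines of code. This is a minimal sketch, assuming ids start at 0 and follow the order of the test rows, and that a space after the comma is acceptable (the sample mixes both spacings, so this is an assumption). The helper name `format_submission` is hypothetical.

```python
def format_submission(predictions):
    """Build 'id, outcome' lines with ids starting at 0."""
    lines = ["id, outcome"]
    for row_id, outcome in enumerate(predictions):
        # The target is binary, so anything other than 0 or 1 is an error.
        if outcome not in (0, 1):
            raise ValueError(f"outcome must be 0 or 1, got {outcome!r}")
        lines.append(f"{row_id}, {int(outcome)}")
    return "\n".join(lines)


# Toy predictions for illustration only.
print(format_submission([1, 0, 1, 1, 0]))
```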
Solution: Machine Learning Case Study - Binary Classification
Step 1: Import the necessary packages and load the dataset.
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load the data
test_data = pd.read_csv("/content/sample_data/test.csv")
train_data = pd.read_csv("/content/sample_data/train.csv")
Step 2: Build the Classification ML Model.
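# Fill missing values in both datasets so train and test are encoded the same way
fill_values = {"workclass": "Unknown", "occupation": "Unknown", "native-country": "Unknown"}
train_data.fillna(fill_values, inplace=True)
test_data.fillna(fill_values, inplace=True)
# Drop unnecessary columns if needed (e.g., 'fnlwgt' can often be ignored)
train_data.drop(columns=['fnlwgt'], inplace=True)
test_data.drop(columns=['fnlwgt'], inplace=True)
# Convert categorical variables to numerical using Label Encoding
categorical_features = train_data.select_dtypes(include=['object']).columns.tolist()
label_encoders = {}
for col in categorical_features:
    label_encoders[col] = LabelEncoder()
    # Fit on train and test values together so a category that appears only
    # in the test set does not raise an error in transform()
    values = pd.concat([train_data[col], test_data[col]]) if col in test_data.columns else train_data[col]
    label_encoders[col].fit(values)
    train_data[col] = label_encoders[col].transform(train_data[col])
    if col in test_data.columns:
        test_data[col] = label_encoders[col].transform(test_data[col])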
# Define input (X) and target (y)
X = train_data.drop(columns=["income_>50K"])
y = train_data["income_>50K"]
# Split the training data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(test_data)
# Train a Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate model performance
y_pred = model.predict(X_valid_scaled)
print("Validation Accuracy:", accuracy_score(y_valid, y_pred))
# Make predictions on test data
test_predictions = model.predict(X_test_scaled)
# Format output for submission
submission = pd.DataFrame({"id": range(len(test_predictions)), "outcome": test_predictions})
submission.to_csv("answer.csv", index=False)
print("Predictions saved to answer.csv")
# Now copy every line of answer.csv (the header plus one line per test record) into the IDE and run the Final Test case.
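Accuracy can look healthy while the model still misses many people earning more than 50K if the two classes are unevenly represented. The class balance of this dataset is not stated here, so this is a precaution and not a finding. A minimal sketch, in plain Python, of precision and recall for the positive class, which could be computed from `y_valid` and `y_pred` alongside accuracy. The helper name `precision_recall` is illustrative.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels for illustration only, not results from the case study data.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

scikit-learn provides `precision_score`, `recall_score`, and `classification_report` in `sklearn.metrics` for the same purpose.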
Step 3: Copy all the rows from the answer.csv file, paste them into the IDE, and run the Final Test Case.
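Before pasting, it may help to confirm that the file holds one prediction per test record. The instructions state that the test data has 898 records, so a check along these lines could catch a truncated or misaligned file. This is a sketch that assumes an `id, outcome` header and 0-based consecutive ids; the helper name `check_submission` is hypothetical.

```python
import csv
import io


def check_submission(text, expected_rows):
    """Return a list of problems found in 'id, outcome' CSV text (empty if none)."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    if not rows or [cell.strip() for cell in rows[0]] != ["id", "outcome"]:
        problems.append("header should be: id, outcome")
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    for i, row in enumerate(body):
        # Each row needs the running id and a binary outcome.
        if len(row) != 2 or row[0].strip() != str(i) or row[1].strip() not in ("0", "1"):
            problems.append(f"row {i} is malformed: {row}")
    return problems


# Toy text for illustration; for the real file, read answer.csv and pass 898.
sample = "id,outcome\n0,1\n1,0\n2,1\n"
print(check_submission(sample, expected_rows=3))
```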
Wings - Machine First AI - Binary Classification