Mini-Project for Wings - Data Classification | Turing Machine Data Scientist Program: Use-case 5 Fresco Play Hands-on Solution

Machine Learning Case Study - Binary Classification

Instructions - Business Use Case

You are a Data Scientist working in a Public Policy team. Your team needs a prediction model that tells, from a person's demographic data, whether that person earns more than $50,000 a year. This prediction will help the team make policy decisions about financial assistance for the low-income group. You are given sample data of the population along with their annual income, which you can use to train your machine learning model.
You can build your model on your own hardware (PC or laptop) and upload only the predictions, in the format shown below.

Additional instructions for the case study are provided below.

  • Build a machine learning model capable of predicting whether an individual's income is greater than 50K.
  • The prediction must be done based on various data attributes provided below.
  • Use 'TrainData' Dataset below for building the model. The train data has 43957 records.
  • Use 'TestData' Dataset below for testing your predictions. The test data has 898 records.

Data Attribute Description

age: continuous
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt: continuous
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num: continuous
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
income >50K: binary (target that needs to be predicted)
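
The attribute list can be turned into a quick header check before any modelling. The sketch below is only an illustration: the names `CONTINUOUS`, `CATEGORICAL`, `TARGET`, and `check_header` are made up here, and the column spellings follow the sample rows (`educational-num`, `gender`, `income_>50K`) rather than the description list (`education-num`, `sex`), on the assumption that the real CSV headers match the samples.

```python
# Column groups taken from the attribute description. The spellings follow the
# sample rows, which is an assumption about the real CSV headers.
CONTINUOUS = ["age", "fnlwgt", "educational-num", "capital-gain",
              "capital-loss", "hours-per-week"]
CATEGORICAL = ["workclass", "education", "marital-status", "occupation",
               "relationship", "race", "gender", "native-country"]
TARGET = "income_>50K"


def check_header(columns, expect_target):
    """Return (missing, unexpected) column names for a CSV header."""
    expected = CONTINUOUS + CATEGORICAL + ([TARGET] if expect_target else [])
    missing = [name for name in expected if name not in columns]
    unexpected = [name for name in columns if name not in expected]
    return missing, unexpected


# Header line copied from the Train.csv sample.
train_header = ("age workclass fnlwgt education educational-num marital-status "
                "occupation relationship race gender capital-gain capital-loss "
                "hours-per-week native-country income_>50K").split()

missing, unexpected = check_header(train_header, expect_target=True)
print("missing:", missing, "unexpected:", unexpected)
```

With pandas, the same check could be run as `check_header(list(train_data.columns), expect_target=True)` right after loading the file.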

A few sample rows from Test.csv

age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country
39 Self-emp-not-inc 327120 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 Portugal
32 Private 123253 Assoc-acdm 12 Married-civ-spouse Craft-repair Husband White Male 0 0 42 United-States

A few sample rows from Train.csv

age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income_>50K
67 Private 366425 Doctorate 16 Divorced Exec-managerial Not-in-family White Male 99999 0 60 United-States 1
17 Private 244602 12th 8 Never-married Other-service Own-child White Male 0 0 15 United-States 0
31 Private 174201 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States 1
58 State-gov 110199 7th-8th 4 Married-civ-spouse Transport-moving Husband White Male 0 0 40 United-States 0

Final Instructions:

  1. You can use the train data to build and train your model and perform your prediction using the test data.
  2. Once you have the predictions ready, paste them into the IDE in the format below.
    id,outcome
    0,1
    1,0
    2,1
    3,1
    4,0
    .
    .
    .
    900,1
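
The submission format above can be produced with a few lines of plain Python. This is a minimal sketch: `format_submission` is a hypothetical helper, ids are assumed to start at 0 as in the example, and rows are written as `id,outcome` without spaces, which is an assumption about what the grader accepts (it matches what `pandas.DataFrame.to_csv` writes).

```python
def format_submission(predictions):
    """Build the text to paste: a header line, then one 'id,outcome' row per prediction."""
    lines = ["id,outcome"]
    for row_id, outcome in enumerate(predictions):
        lines.append(f"{row_id},{int(outcome)}")
    return "\n".join(lines)


# Made-up predictions, for illustration only.
example = format_submission([1, 0, 1, 1, 0])
print(example)
```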

Solution: Machine Learning Case Study - Binary Classification

Step 1: Import the necessary packages and load the dataset.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the data
test_data = pd.read_csv("/content/sample_data/test.csv")
train_data = pd.read_csv("/content/sample_data/train.csv")

Step 2: Build the Classification ML Model.

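# Fill missing values in the categorical columns of both datasets
missing_fill = {"workclass": "Unknown", "occupation": "Unknown", "native-country": "Unknown"}
train_data.fillna(missing_fill, inplace=True)
test_data.fillna(missing_fill, inplace=True)

# Drop unnecessary columns if needed (e.g., 'fnlwgt' can often be ignored)
train_data.drop(columns=['fnlwgt'], inplace=True)
test_data.drop(columns=['fnlwgt'], inplace=True)

# Convert categorical variables to numerical using Label Encoding
categorical_features = train_data.select_dtypes(include=['object']).columns.tolist()

label_encoders = {}
for col in categorical_features:
    label_encoders[col] = LabelEncoder()
    if col in test_data.columns:
        # Fit on the labels of both datasets so that a category present only
        # in the test data does not make transform() fail
        label_encoders[col].fit(pd.concat([train_data[col], test_data[col]]))
        test_data[col] = label_encoders[col].transform(test_data[col])
    else:
        label_encoders[col].fit(train_data[col])
    train_data[col] = label_encoders[col].transform(train_data[col])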

# Define input (X) and target (y)
X = train_data.drop(columns=["income_>50K"])
y = train_data["income_>50K"]

# Split the training data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(test_data)

# Train a Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate model performance
y_pred = model.predict(X_valid_scaled)
print("Validation Accuracy:", accuracy_score(y_valid, y_pred))

# Make predictions on test data
test_predictions = model.predict(X_test_scaled)

# Format output for submission
submission = pd.DataFrame({"id": range(len(test_predictions)), "outcome": test_predictions})
submission.to_csv("answer.csv", index=False)

print("Predictions saved to answer.csv")

# Now copy all the lines of answer.csv into the IDE and run the Final Test Case.
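
A label encoder fitted only on the training data raises an error when the test data contains a category, or a missing value, that it has not seen. One possible alternative, sketched here on tiny made-up frames rather than the real CSV files, is to put imputation and encoding inside a scikit-learn `Pipeline` and let `OrdinalEncoder` map unseen categories to a placeholder code. This is not the method used in the solution above; it assumes scikit-learn 0.24 or later (for `handle_unknown="use_encoded_value"`), and the placeholder `-1` and the three-column frames are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Tiny made-up frames standing in for train.csv and test.csv (illustration only).
train = pd.DataFrame({
    "age": [67, 17, 31, 58],
    "workclass": ["Private", "Private", "Private", "State-gov"],
    "hours-per-week": [60, 15, 40, 40],
    "income_>50K": [1, 0, 1, 0],
})
test = pd.DataFrame({
    "age": [39, 32],
    "workclass": ["Self-emp-not-inc", np.nan],  # an unseen label and a missing value
    "hours-per-week": [40, 42],
})

categorical = ["workclass"]

# Fill missing categories, then encode; categories unseen during fit become -1.
encode = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocess = ColumnTransformer([("cat", encode, categorical)], remainder="passthrough")
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

X = train.drop(columns=["income_>50K"])
y = train["income_>50K"]
model.fit(X, y)

predictions = model.predict(test)
print(predictions)
```

Because the encoder lives inside the pipeline, the same fitted object handles the training, validation, and test frames, so no encoding step has to be repeated by hand.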

Step 3: Copy all the rows from the answer.csv file, paste them into the IDE, and run the Final Test Case.


About the author

D Shwari
I'm a professor in National University's Department of Computer Science. My main areas are data science and data analysis, along with project management for many computer science-related sectors. My next project is on AI with deep learning.
