Machine Learning

Evaluation of Regression Model
Evaluation of Classification Model
Telco Customer Churn Prediction
House Price Prediction Model
Talent Hunting Classification with Machine Learning (Scoutium)
Customer Segmentation with Unsupervised Learning (FLO)

Evaluation of Regression Model

Project Tasks

Task:

Employees' years of experience and salary information are given.

Years of Experience (x)	Salary (y)
5	600
7	900
3	550
3	500
2	400
7	950
3	540
10	1200
6	900
4	550
8	1100
1	460
1	400
9	1000
1	380

Step 1: Create the linear regression model equation according to the given bias and weight.

Bias: 275, Weight: 90 (y' = b+wx)

Step 2: Estimate the salary for all years of experience in the table according to the model equation you have created.
Step 3: Calculate MSE, RMSE, MAE scores to measure the success of the model.
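
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's reference solution: the names `data`, `predict`, and `errors` are chosen here for illustration, and only the standard library is used so the arithmetic stays visible.

```python
import math

# Data copied from the table above: (years of experience, salary).
data = [
    (5, 600), (7, 900), (3, 550), (3, 500), (2, 400),
    (7, 950), (3, 540), (10, 1200), (6, 900), (4, 550),
    (8, 1100), (1, 460), (1, 400), (9, 1000), (1, 380),
]

BIAS, WEIGHT = 275, 90  # given in Step 1


def predict(x):
    """Model equation from Step 1: y' = b + w * x."""
    return BIAS + WEIGHT * x


# Step 2: residual (actual minus predicted salary) for every row.
errors = [y - predict(x) for x, y in data]

# Step 3: error metrics computed from the residuals.
n = len(errors)
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error
mae = sum(abs(e) for e in errors) / n   # mean absolute error

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
```

RMSE and MAE are expressed in the same unit as the salary, which makes them easier to interpret than MSE. The same numbers could be obtained with `mean_squared_error` and `mean_absolute_error` from scikit-learn.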

Requirements

matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1

Evaluation of Classification Model

Project Tasks

Task 1:

A classification model has been built to predict whether a customer will churn. The actual values of 10 test observations and the probabilities predicted by the model are given.

Create a confusion matrix using a threshold value of 0.5.
Calculate Accuracy, Recall, Precision, F1 Scores.

	Actual Value	Model Probability Estimation (Probability of belonging to class 1)
1	1	0.7
2	1	0.8
3	1	0.65
4	1	0.9
5	1	0.45
6	1	0.5
7	0	0.55
8	0	0.35
9	0	0.4
10	0	0.25

		Model Prediction
		Non-Churn (0)	Churn (1)
Actual Value	Non-Churn (0)	?	?	?
Actual Value	Churn (1)	?	?	?
		?	?
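
A possible way to fill in the matrix is sketched below in plain Python. One assumption is made that the task does not state: a probability exactly equal to the threshold (observation 6, with 0.5) is counted as churn, i.e. the comparison is `>=`. With a strict `>` the counts would differ. Variable names are illustrative.

```python
# Actual labels and predicted probabilities copied from the table above.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.7, 0.8, 0.65, 0.9, 0.45, 0.5, 0.55, 0.35, 0.4, 0.25]

THRESHOLD = 0.5
# Assumption: a probability equal to the threshold is classified as churn (>=).
predicted = [1 if p >= THRESHOLD else 0 for p in proba]

# Confusion matrix cells.
pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # churn predicted as churn
fn = sum(a == 1 and p == 0 for a, p in pairs)  # churn predicted as non-churn
fp = sum(a == 0 and p == 1 for a, p in pairs)  # non-churn predicted as churn
tn = sum(a == 0 and p == 0 for a, p in pairs)  # non-churn predicted as non-churn

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} f1={f1:.2f}")
```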

Task 2:

A classification model was built to detect fraudulent transactions made through a bank. Its 90.5% accuracy was judged sufficient, and the model was deployed to production. After deployment, however, it did not perform as expected, and the business unit reported that the model was unsuccessful. The confusion matrix of the model's predictions is given below. Based on it:

Calculate Accuracy, Recall, Precision, F1 Scores.
Comment on what the 'Data Science' team may have overlooked.

		Model Prediction
		Non-Fraud (0)	Fraud (1)
Actual Value	Non-Fraud (0)	900	90	990
Actual Value	Fraud (1)	5	5	10
		905	95

True Negative (TN): 900
False Positive (FP): 90
True Positive (TP): 5
False Negative (FN): 5
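
The metrics follow directly from the four counts above. The sketch below also computes a baseline that is not part of the task statement and is added here only as an aid to the comment question: the accuracy of a trivial model that always predicts "Non-Fraud". It suggests one plausible reading, namely that accuracy alone is misleading when only 10 of 1,000 transactions are fraudulent.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 900, 90, 5, 5
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of actual fraud that was caught
precision = tp / (tp + fp)     # share of fraud alerts that were correct
f1 = 2 * precision * recall / (precision + recall)

# Illustrative baseline (an assumption for comparison, not part of the task):
# a model that labels every transaction as non-fraud.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} f1={f1:.3f}")
print(f"always-non-fraud baseline accuracy={baseline_accuracy:.3f}")
```

The trivial baseline reaches a higher accuracy than the deployed model while catching no fraud at all, which is why recall, precision, and F1 are the more informative scores on a dataset this imbalanced.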

Requirements

matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1

Telco Customer Churn Prediction

Business Problem
Develop a machine learning model that can predict customers who will leave the company.

Perform the necessary data analysis and feature engineering steps before developing the model.

Dataset Story
The Telco churn data describes a fictitious telecom company that provided home phone and internet services to 7,043 customers in California in the third quarter. It shows which customers left, stayed, or signed up for the service.

Variables

CustomerId: Customer ID
Gender: Gender
SeniorCitizen: Whether the customer is a senior citizen (1, 0)
Partner: Whether the customer has a partner, i.e. is married (Yes, No)
Dependents: Whether the customer has dependents such as a child, mother, father, or grandmother (Yes, No)
tenure: The number of months the customer has stayed with the company
PhoneService: Whether the customer has phone service (Yes, No)
MultipleLines: Whether the customer has more than one line (Yes, No, No phone service)
InternetService: Customer's internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security (Yes, No, No internet service)
OnlineBackup: Whether the customer has an online backup (Yes, No, No internet service)
DeviceProtection: Whether the customer has device protection (Yes, No, No internet service)
TechSupport: Whether the customer has technical support (Yes, No, No internet service)
StreamingTV: Whether the customer has streaming TV (Yes, No, No internet service). Indicates whether the customer uses the internet service to stream television programs from a third-party provider
StreamingMovies: Whether the customer has streaming movies (Yes, No, No internet service). Indicates whether the customer uses the internet service to stream movies from a third-party provider
Contract: Customer's contract duration (Month-to-month, One year, Two year)
PaperlessBilling: Whether the customer has a paperless invoice (Yes, No)
PaymentMethod: Customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
Churn: Whether the customer churned (Yes, No), i.e. left within the last month or quarter

📝 Notes

Each row represents a unique customer.
Variables include information about customer service, account, and demographics.
- Services customers sign up for: Phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
- Customer account information: How long they have been a customer, contract, payment method, paperless billing, monthly fees, and total fees.
- Demographic information about customers: Gender, age range, and whether they have partners and dependents.
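
As one example of the feature engineering step, the service columns listed above could be condensed into a single count per customer, and the target could be encoded as 0/1. The sketch below is only an illustration under stated assumptions: the helper names `count_services` and `encode_churn` and the sample row are hypothetical, and the row is represented as a plain dictionary rather than a pandas DataFrame.

```python
# Yes/No service columns taken from the variable list above.
# InternetService is left out because its values are provider types, not Yes/No.
SERVICE_COLUMNS = [
    "PhoneService", "MultipleLines", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
]


def count_services(customer):
    """Number of services the customer has (only the value 'Yes' counts)."""
    return sum(customer.get(col) == "Yes" for col in SERVICE_COLUMNS)


def encode_churn(value):
    """Map the Churn column to a binary target: Yes -> 1, anything else -> 0."""
    return 1 if value == "Yes" else 0


# Hypothetical customer row, invented for illustration only.
customer = {
    "PhoneService": "Yes", "MultipleLines": "No", "InternetService": "DSL",
    "OnlineSecurity": "Yes", "OnlineBackup": "No", "DeviceProtection": "No",
    "TechSupport": "No", "StreamingTV": "Yes", "StreamingMovies": "No",
    "Churn": "Yes",
}

print(count_services(customer), encode_churn(customer["Churn"]))
```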

Requirements

catboost==1.2
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
xgboost==1.7.5

House Price Prediction Model

Business Problem
Build a machine learning model that predicts the prices of different types of houses, using a dataset that contains the features and sale price of each house.

Dataset Story
This dataset of residential homes in Ames, Iowa contains 79 explanatory variables. It comes from a Kaggle competition, whose page is linked below. Because the data belongs to a competition, there are two separate CSV files: train and test. House prices are left blank in the test dataset, and you are expected to predict these values.

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation
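
The linked evaluation page scores submissions by the RMSE between the logarithm of the predicted price and the logarithm of the observed sale price. A minimal sketch of that metric is shown below. The function name `rmse_log` and the sample prices are hypothetical and serve only to illustrate the calculation.

```python
import math


def rmse_log(actual, predicted):
    """RMSE between log(predicted) and log(actual); prices must be positive."""
    squared = [
        (math.log(p) - math.log(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical sale prices in dollars, invented for illustration only.
actual_prices = [200_000, 150_000, 320_000]
predicted_prices = [210_000, 140_000, 300_000]

score = rmse_log(actual_prices, predicted_prices)
print(f"RMSE on log prices: {score:.4f}")
```

Taking logarithms means that a given percentage error counts the same for a cheap house as for an expensive one, so a common choice is to train the model on the log of SalePrice as well.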

Total Observations: 1,460
Numeric Variables: 38
Categorical Variables: 43

Variables

SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: Dollar value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

Requirements

lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
tabulate==0.9.0
xgboost==1.7.5

Talent Hunting Classification with Machine Learning using SCOUTIUM's Dataset

Business Problem
Predict which class (average, highlighted) a football player belongs to, based on the scores that scouts give to the player's attributes.

Dataset Story
The dataset comes from Scoutium and contains the attributes of football players observed in matches, together with the scores that scouts gave to those attributes.

Variables

scoutium_attributes.csv

task_response_id: The set of a scout's evaluations of all players on a team's roster in a match
match_id: The id of the relevant match
evaluator_id: The id of the evaluator (scout)
player_id: The id of the relevant player
position_id: The id of the position played by the relevant player in that match
1. Goalkeeper
2. Stopper
3. Right-back
4. Left-back
5. Defensive midfielder
6. Central midfielder
7. Right wing
8. Left wing
9. Attacking midfielder
10. Striker
analysis_id: Set of attribute evaluations of a scout for a player in a match
attribute_id: The id of each attribute that the players were evaluated for
attribute_value: The value (points) a scout gives to a player's attribute

scoutium_potential_labels.csv

task_response_id: The set of a scout's evaluations of all players on a team's roster in a match
match_id: The id of the corresponding match
evaluator_id: The id of the evaluator (scout)
player_id: The id of the respective player
potential_label: A label that indicates a scout's final decision regarding a player in a match (target variable)
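
Because the attribute file holds one row per attribute while the label file holds one row per player evaluation, a typical first step is to reshape the attributes into one record per evaluation and attach the label. The sketch below shows this with plain dictionaries. The sample rows, ids, and attribute values are hypothetical, and the four shared columns are assumed to identify an evaluation uniquely.

```python
from collections import defaultdict

# Columns that appear in both files, assumed here to form the join key.
KEY = ("task_response_id", "match_id", "evaluator_id", "player_id")


def key_of(row):
    """Tuple that identifies one scout's evaluation of one player in one match."""
    return tuple(row[k] for k in KEY)


# Hypothetical rows in the layout of scoutium_attributes.csv (long format).
attribute_rows = [
    {"task_response_id": 1, "match_id": 10, "evaluator_id": 100, "player_id": 7,
     "position_id": 2, "attribute_id": 1, "attribute_value": 56.0},
    {"task_response_id": 1, "match_id": 10, "evaluator_id": 100, "player_id": 7,
     "position_id": 2, "attribute_id": 2, "attribute_value": 67.0},
    {"task_response_id": 1, "match_id": 10, "evaluator_id": 100, "player_id": 8,
     "position_id": 10, "attribute_id": 1, "attribute_value": 78.0},
]

# Hypothetical rows in the layout of scoutium_potential_labels.csv.
label_rows = [
    {"task_response_id": 1, "match_id": 10, "evaluator_id": 100, "player_id": 7,
     "potential_label": "average"},
    {"task_response_id": 1, "match_id": 10, "evaluator_id": 100, "player_id": 8,
     "potential_label": "highlighted"},
]

# Pivot: one dictionary of {attribute_id: attribute_value} per evaluation.
features = defaultdict(dict)
for row in attribute_rows:
    features[key_of(row)][row["attribute_id"]] = row["attribute_value"]

# Join: attach the target label to each pivoted record.
dataset = [
    {"key": key_of(lbl), "attributes": features[key_of(lbl)],
     "potential_label": lbl["potential_label"]}
    for lbl in label_rows
    if key_of(lbl) in features
]

for record in dataset:
    print(record)
```

With pandas, the same reshaping would usually be written as a merge on the shared columns followed by a pivot table over attribute_id.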

Requirements

catboost==1.2
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
tabulate==0.9.0
xgboost==1.7.5

Customer Segmentation with Unsupervised Learning using FLO's Dataset

Business Problem
FLO wants to divide its customers into segments and determine marketing strategies according to these segments. To this end, customers' behaviors will be defined and groups will be created based on clusters in these behaviors.

Dataset Story
The dataset contains information on the past shopping behavior of omnichannel customers (those who shop both online and offline) whose most recent purchases from FLO were made in 2020-2021.

Variables

master_id: Customer ID
order_channel: Shopping platform (Android, ios, Desktop, Mobile, Offline)
last_order_channel: The channel where the most recent purchase was made
first_order_date: Customer's first order date
last_order_date: Customer's last order date
last_order_date_online: Customer's last online order date
last_order_date_offline: Customer's last offline order date
order_num_total_ever_online: The total number of orders made by the customer online
order_num_total_ever_offline: The total number of orders made by the customer offline
customer_value_total_ever_offline: The total price paid by the customer for offline orders
customer_value_total_ever_online: The total price paid by the customer for online orders
interested_in_categories_12: List of categories the customer has shopped in during the last 12 months
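
Before clustering, the online and offline columns are often combined into per-customer totals, and the dates are turned into numeric features. The sketch below shows one possible derivation. The function name `derive_features`, the output keys, the analysis date, and the sample customer are all hypothetical choices made for illustration.

```python
from datetime import date


def derive_features(customer, analysis_date):
    """Combine channel-level columns and convert dates to day counts."""
    return {
        "order_num_total": (
            customer["order_num_total_ever_online"]
            + customer["order_num_total_ever_offline"]
        ),
        "customer_value_total": (
            customer["customer_value_total_ever_online"]
            + customer["customer_value_total_ever_offline"]
        ),
        # Days since the most recent order, relative to the analysis date.
        "recency_days": (analysis_date - customer["last_order_date"]).days,
        # Days between the first and the most recent order.
        "tenure_days": (
            customer["last_order_date"] - customer["first_order_date"]
        ).days,
    }


# Hypothetical customer, invented for illustration only.
customer = {
    "first_order_date": date(2020, 1, 1),
    "last_order_date": date(2021, 1, 1),
    "order_num_total_ever_online": 3,
    "order_num_total_ever_offline": 2,
    "customer_value_total_ever_online": 450.0,
    "customer_value_total_ever_offline": 120.5,
}

features = derive_features(customer, analysis_date=date(2021, 1, 11))
print(features)
```

Features on such different scales are usually standardized before a distance-based method such as K-Means is applied.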

Requirements

matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
scipy==1.11.4
seaborn==0.12.1
scikit-learn==1.3.1
yellowbrick==1.5