Evaluation of Regression Model
GitHub Repository:
Project Tasks
Task
Employees' years of experience and salary information are given.
| Years of Experience (x) | Salary (y) |
| --- | --- |
| 5 | 600 |
| 7 | 900 |
| 3 | 550 |
| 3 | 500 |
| 2 | 400 |
| 7 | 950 |
| 3 | 540 |
| 10 | 1200 |
| 6 | 900 |
| 4 | 550 |
| 8 | 1100 |
| 1 | 460 |
| 1 | 400 |
| 9 | 1000 |
| 1 | 380 |
Step 1: Create the linear regression model equation according
to the given bias and weight.
- Bias: 275, Weight: 90 (y' = b + wx)
Step 2: Estimate the salary for every years-of-experience value in
the table using the model equation you created.
Step 3: Calculate the MSE, RMSE, and MAE scores to measure the
success of the model.
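The three steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference solution: it uses only the standard library (the listed requirements would also allow numpy or scikit-learn), and the names `predict`, `BIAS`, and `WEIGHT` are chosen here for readability.

```python
import math

# Years of experience (x) and observed salaries (y) from the task table.
x = [5, 7, 3, 3, 2, 7, 3, 10, 6, 4, 8, 1, 1, 9, 1]
y = [600, 900, 550, 500, 400, 950, 540, 1200, 900, 550, 1100, 460, 400, 1000, 380]

BIAS, WEIGHT = 275, 90  # values given in Step 1


def predict(years):
    """Step 1: linear model y' = b + w * x."""
    return BIAS + WEIGHT * years


# Step 2: predicted salary for every row of the table.
y_pred = [predict(xi) for xi in x]

# Step 3: error metrics computed from the residuals y - y'.
errors = [yi - pi for yi, pi in zip(y, y_pred)]
n = len(errors)
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error
mae = sum(abs(e) for e in errors) / n   # mean absolute error

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  MAE: {mae:.2f}")
# MSE: 4438.33  RMSE: 66.62  MAE: 54.33
```

RMSE and MAE are in the same unit as the salary, which makes them easier to interpret than MSE.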
Requirements
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
Evaluation of Classification Model
GitHub Repository:
Project Tasks
Task 1:
A classification model has been created that predicts whether a
customer will churn. The actual values of 10 test observations and
the probabilities predicted by the model are given.
- Create a confusion matrix using a threshold value of 0.5.
- Calculate the Accuracy, Recall, Precision, and F1 scores.
| Observation | Actual Value | Model Probability Estimation (Probability of belonging to class 1) |
| --- | --- | --- |
| 1 | 1 | 0.7 |
| 2 | 1 | 0.8 |
| 3 | 1 | 0.65 |
| 4 | 1 | 0.9 |
| 5 | 1 | 0.45 |
| 6 | 1 | 0.5 |
| 7 | 0 | 0.55 |
| 8 | 0 | 0.35 |
| 9 | 0 | 0.4 |
| 10 | 0 | 0.25 |
| Actual Value / Model Prediction | Non-Churn (0) | Churn (1) | Total |
| --- | --- | --- | --- |
| Non-Churn (0) | ? | ? | ? |
| Churn (1) | ? | ? | ? |
| Total | ? | ? | |
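One way to approach Task 1 is sketched below in plain Python. The task does not say how to treat a probability exactly equal to the threshold; this sketch assumes such a value counts as class 1 (`>=`), which affects observation 6. The variable names are illustrative.

```python
# Actual labels and predicted class-1 probabilities for the 10 observations.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.7, 0.8, 0.65, 0.9, 0.45, 0.5, 0.55, 0.35, 0.4, 0.25]

THRESHOLD = 0.5

# Assumption: a probability equal to the threshold is labelled class 1.
predicted = [1 if p >= THRESHOLD else 0 for p in proba]

# Count the four cells of the confusion matrix.
pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # churn predicted as churn
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # churn predicted as non-churn
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # non-churn predicted as churn
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-churn predicted as non-churn

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} f1={f1:.2f}")
```

With a strict `>` comparison instead, observation 6 would be predicted as class 0 and the counts would change accordingly.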
Task 2:
A classification model has been created to detect fraudulent
transactions made through the bank. The model's 90.5% accuracy
was considered sufficient, and the model was put into production.
After going live, however, its output was not as expected, and the
business unit reported that the model was unsuccessful. The
confusion matrix of the model's predictions is given below. Based
on this matrix:
- Calculate the Accuracy, Recall, Precision, and F1 scores.
- Comment on what the 'Data Science' team may have overlooked.
| Actual Value / Model Prediction | Non-Churn (0) | Churn (1) | Total |
| --- | --- | --- | --- |
| Non-Churn (0) | 900 | 90 | 990 |
| Churn (1) | 5 | 5 | 10 |
| Total | 905 | 95 | |
- True Negative (TN): 900
- False Positive (FP): 90
- True Positive (TP): 5
- False Negative (FN): 5
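The scores for Task 2 follow directly from the four counts listed above. The sketch below is a minimal plain-Python illustration; the helper name `classification_scores` is hypothetical. It also computes the accuracy of a trivial "always predict class 0" baseline, which is one possible starting point for the requested comment.

```python
def classification_scores(tp, fp, tn, fn):
    """Return accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


scores = classification_scores(tp=5, fp=90, tn=900, fn=5)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")

# Baseline for comparison: always predicting class 0 is right for all
# 990 actual negatives and wrong for all 10 actual positives.
baseline_accuracy = 990 / (990 + 10)
print(f"always-0 baseline accuracy: {baseline_accuracy:.3f}")
```

Accuracy is 0.905, but recall is 0.50, precision is about 0.05, and F1 is about 0.10. Because only 10 of the 1,000 observations are positive, accuracy is dominated by the majority class: the trivial baseline reaches 0.99 without detecting a single positive case.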
Requirements
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
Telco Customer Churn Prediction
GitHub Repository:
Business Problem
Develop a machine learning model that can predict customers who will
leave the company.
Perform the necessary data analysis and feature engineering steps
before developing the model.
Dataset Story
Telco churn data includes information about a fictitious
telecom company that provided home phone and internet services to
7,043 California customers in the third quarter. It shows which
customers have left, stayed, or signed up for their service.
Variables
- CustomerId: Customer ID
- Gender: Gender
- SeniorCitizen: Whether the customer is a senior citizen (1, 0)
- Partner: Whether the customer has a partner, i.e. is married
  (Yes, No)
- Dependents: Whether the customer has dependents such as a child,
  mother, father, or grandmother (Yes, No)
- tenure: The number of months the customer has stayed with
  the company
- PhoneService: Whether the customer has phone service
  (Yes, No)
- MultipleLines: Whether the customer has more than one line
  (Yes, No, No phone service)
- InternetService: Customer's internet service provider
  (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security
  (Yes, No, No internet service)
- OnlineBackup: Whether the customer has an online backup
  (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device
  protection (Yes, No, No internet service)
- TechSupport: Whether the customer has technical support
  (Yes, No, No internet service)
- StreamingTV: Whether the customer streams TV
  (Yes, No, No internet service). Indicates whether the
  customer uses the internet service to stream television programs
  from a third-party provider
- StreamingMovies: Whether the customer streams movies
  (Yes, No, No internet service). Indicates whether the
  customer uses the internet service to stream movies from a
  third-party provider
- Contract: Customer's contract duration
  (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless
  billing (Yes, No)
- PaymentMethod: Customer's payment method
  (Electronic check, Mailed check, Bank transfer (automatic),
  Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned (Yes, No).
  Customers who left in the last month or quarter
📝 Notes
- Each row represents a unique customer.
- Variables include information about customer service, account, and
  demographics.
- Services customers sign up for: Phone, multiple lines,
  internet, online security, online backup, device protection,
  tech support, and streaming TV and movies.
- Customer account information: How long they have been a
  customer, contract, payment method, paperless billing, monthly
  fees, and total fees.
- Demographic information about customers: Gender, age
  range, and whether they have partners and dependents.
Requirements
catboost==1.2
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
xgboost==1.7.5
House Price Prediction Model
GitHub Repository:
Business Problem
The goal is to build a machine learning model that predicts the
prices of different types of houses, using a dataset that contains
the features and sale price of each house.
Dataset Story
This dataset of residential homes in Ames, Iowa contains 79
explanatory variables. The project is also a Kaggle competition; the
dataset and the competition page are available at the link below.
Because the dataset belongs to a Kaggle competition, there are two
CSV files: train and test. House prices are left blank in the test
dataset, and you are expected to predict these values.
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation
- Total Observations: 1,460
- Numeric Variable: 38
- Categorical Variable: 43
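The linked Kaggle evaluation page scores submissions by the RMSE between the logarithm of the predicted price and the logarithm of the observed sale price. The sketch below illustrates that metric in plain Python; the function name `rmse_log` and the example prices are made up for illustration and do not come from the dataset.

```python
import math


def rmse_log(actual_prices, predicted_prices):
    """RMSE between log(predicted) and log(actual); prices must be positive."""
    squared_errors = [
        (math.log(pred) - math.log(act)) ** 2
        for act, pred in zip(actual_prices, predicted_prices)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only, not real rows from train.csv.
example_actual = [200000, 150000, 320000]
example_predicted = [210000, 140000, 300000]

print(round(rmse_log(example_actual, example_predicted), 4))
```

Taking logs means that an error on an expensive house and the same relative error on a cheap house affect the score equally.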
Variables
- SalePrice: The property's sale price in dollars. This is
  the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad
  (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house
  (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the
  exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area
  (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet
  (all floors)
- GrLivArea: Above grade (ground) living area square
  feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade
  (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other
  categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
Requirements
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
tabulate==0.9.0
xgboost==1.7.5
Talent Hunting Classification with Machine Learning using SCOUTIUM's Dataset
GitHub Repository:
Business Problem
Predict which class (average or highlighted) a football player
belongs to, based on the scores that scouts gave to the attributes
of the players they watched.
Dataset Story
The dataset, from Scoutium, contains the attributes of football
players observed in matches and the scores that scouts assigned to
those attributes.
Variables
scoutium_attributes.csv
- task_response_id: The set of a scout's evaluations of all
  players on a team's roster in a match
- match_id: The id of the relevant match
- evaluator_id: The id of the evaluator (scout)
- player_id: The id of the relevant player
- position_id: The id of the position played by the relevant
  player in that match
  1. Goalkeeper
  2. Stopper
  3. Right-back
  4. Left-back
  5. Defensive midfielder
  6. Central midfielder
  7. Right wing
  8. Left wing
  9. Attacking midfielder
  10. Striker
- analysis_id: The set of a scout's attribute evaluations for a
  player in a match
- attribute_id: The id of each attribute that the players
  were evaluated for
- attribute_value: The value (points) a scout gives to
  a player's attribute
scoutium_potential_labels.csv
- task_response_id: The set of a scout's evaluations of all
  players on a team's roster in a match
- match_id: The id of the corresponding match
- evaluator_id: The id of the evaluator (scout)
- player_id: The id of the respective player
- potential_label: A label that indicates a scout's final
  decision regarding a player in a match (target variable)
Requirements
catboost==1.2
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
tabulate==0.9.0
xgboost==1.7.5
Customer Segmentation with Unsupervised Learning using FLO's Dataset
GitHub Repository:
Business Problem
FLO wants to divide its customers into segments and determine
marketing strategies according to these segments. To this end,
customer behaviors will be defined and groups will be formed by
clustering customers with similar behaviors.
Dataset Story
The dataset consists of information obtained from the past
shopping behavior of customers who made their most recent purchases
from FLO in 2020-2021 as OmniChannel shoppers (shopping both online
and offline).
Variables
- master_id: Customer ID
- order_channel: The channel of the shopping platform used
  (Android, ios, Desktop, Mobile, Offline)
- last_order_channel: The channel where the most recent
  purchase was made
- first_order_date: Customer's first order date
- last_order_date: Customer's last order date
- last_order_date_online: Customer's last online order date
- last_order_date_offline: Customer's last offline order date
- order_num_total_ever_online: The total number of orders
  made by the customer online
- order_num_total_ever_offline: The total number of orders
  made by the customer offline
- customer_value_total_ever_offline: The total price paid by
  the customer for offline orders
- customer_value_total_ever_online: The total price paid by
  the customer for online orders
- interested_in_categories_12: List of categories the
  customer has shopped in over the last 12 months
Requirements
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
scipy==1.11.4
seaborn==0.12.1
scikit-learn==1.3.1
yellowbrick==1.5