Machine Learning
Evaluation of Regression Model
GitHub Repository:
Project Tasks
Task:
Employees' years of experience and salary information are given.
Years of Experience (x) | Salary (y) |
5 | 600 |
7 | 900 |
3 | 550 |
3 | 500 |
2 | 400 |
7 | 950 |
3 | 540 |
10 | 1200 |
6 | 900 |
4 | 550 |
8 | 1100 |
1 | 460 |
1 | 400 |
9 | 1000 |
1 | 380 |
Step 1: Create the linear regression model equation using the given bias and weight.
- Bias: 275, Weight: 90 (y' = b + wx)
Step 2: Using the model equation, estimate the salary for every years-of-experience value in the table.
Step 3: Calculate the MSE, RMSE, and MAE scores to measure the model's performance.
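The three steps can be sketched in plain Python as follows. This is a minimal illustration that uses only the standard library rather than the packages listed under Requirements; the helper name `predict` and the variable names are assumptions made for this example, and the data is copied from the table above.

```python
import math

# Data copied from the task table: years of experience (x) and salary (y).
years = [5, 7, 3, 3, 2, 7, 3, 10, 6, 4, 8, 1, 1, 9, 1]
salary = [600, 900, 550, 500, 400, 950, 540, 1200, 900, 550, 1100, 460, 400, 1000, 380]

BIAS, WEIGHT = 275, 90  # values given in Step 1


def predict(x):
    """Step 1: the model equation y' = b + w * x."""
    return BIAS + WEIGHT * x


# Step 2: estimated salary for every row of the table.
predicted = [predict(x) for x in years]

# Step 3: error metrics computed from the residuals y - y'.
errors = [y - y_hat for y, y_hat in zip(salary, predicted)]
n = len(errors)
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
```

RMSE and MAE are expressed in the same unit as the salary, which makes them easier to interpret than MSE.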
Requirements
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
Evaluation of Classification Model
GitHub Repository:
Project Tasks
Task 1:
A classification model has been built to predict whether a customer will churn. The actual values
of 10 test observations and the probabilities predicted by the model are given.
- Create a confusion matrix using a threshold value of 0.5.
- Calculate the Accuracy, Recall, Precision, and F1 scores.
Observation | Actual Value | Model Probability Estimation (Probability of belonging to class 1) |
1 | 1 | 0.7 |
2 | 1 | 0.8 |
3 | 1 | 0.65 |
4 | 1 | 0.9 |
5 | 1 | 0.45 |
6 | 1 | 0.5 |
7 | 0 | 0.55 |
8 | 0 | 0.35 |
9 | 0 | 0.4 |
10 | 0 | 0.25 |
Actual Value \ Model Prediction | Non-Churn (0) | Churn (1) | Total |
Non-Churn (0) | ? | ? | ? |
Churn (1) | ? | ? | ? |
Total | ? | ? | |
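One way to fill in the confusion matrix for Task 1 is sketched below in plain Python. The task does not say how a probability exactly equal to the threshold should be handled (observation 6 has 0.5), so this sketch assumes that a probability greater than or equal to 0.5 is predicted as class 1. The variable names are illustrative.

```python
# Actual labels and class-1 probabilities copied from the Task 1 table.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
probability = [0.7, 0.8, 0.65, 0.9, 0.45, 0.5, 0.55, 0.35, 0.4, 0.25]

THRESHOLD = 0.5
# Assumption: a probability equal to the threshold counts as class 1 (Churn).
predicted = [1 if p >= THRESHOLD else 0 for p in probability]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # churn predicted as churn
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # churn predicted as non-churn
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # non-churn predicted as churn
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-churn predicted as non-churn

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy={accuracy:.2f} Recall={recall:.2f} Precision={precision:.2f} F1={f1:.2f}")
```

If the rule were a strict "greater than 0.5" instead, observation 6 would move from true positive to false negative and all four scores would change, so the convention is worth stating explicitly.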
Task 2:
A classification model has been built to detect fraudulent bank transactions. With an accuracy of
90.5%, the model was judged good enough and was deployed to production. After deployment, however,
its output was not as expected, and the business unit reported that the model was unsuccessful.
The confusion matrix of the model's predictions is given below. Based on it:
- Calculate the Accuracy, Recall, Precision, and F1 scores.
- Comment on what the 'Data Science' team may have overlooked.
Actual Value \ Model Prediction | Non-Fraud (0) | Fraud (1) | Total |
Non-Fraud (0) | 900 | 90 | 990 |
Fraud (1) | 5 | 5 | 10 |
Total | 905 | 95 | |
- True Negative (TN): 900
- False Positive (FP): 90
- True Positive (TP): 5
- False Negative (FN): 5
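The scores for Task 2 follow directly from the four counts above. The sketch below is a plain-Python illustration with assumed variable names; it also computes the share of fraud cases and the accuracy of a trivial model that always predicts Non-Fraud, which may help with the second question.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 900, 90, 5, 5
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)      # share of actual fraud cases that were caught
precision = tp / (tp + fp)   # share of fraud alerts that were really fraud
f1 = 2 * precision * recall / (precision + recall)

# Class balance: how rare is fraud in this data?
fraud_rate = (tp + fn) / total
# Accuracy of a trivial model that labels every transaction as Non-Fraud.
always_non_fraud_accuracy = (tn + fp) / total

print(f"Accuracy={accuracy:.3f} Recall={recall:.3f} Precision={precision:.3f} F1={f1:.3f}")
print(f"Fraud rate={fraud_rate:.3f} Always-Non-Fraud accuracy={always_non_fraud_accuracy:.3f}")
```

Only 1% of the transactions are fraudulent, so a model that never flags fraud would reach 99% accuracy. The low precision, recall, and F1 values suggest that accuracy alone was a misleading success criterion for such an imbalanced dataset.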
Requirements
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
Telco Customer Churn Prediction
GitHub Repository:
Business Problem
Develop a machine learning model that can predict customers who will leave the company.
Perform the necessary data analysis and feature engineering steps before developing the model.
Dataset Story
Telco churn data contains information about a fictitious telecom company that provided home
phone and internet services to 7,043 California customers in the third quarter. It shows which
customers have left, stayed, or signed up for its service.
Variables
- CustomerId: Customer ID
- Gender: Gender
- SeniorCitizen: Whether the customer is a senior citizen (1, 0)
- Partner: Whether the customer has a partner, i.e. is married (Yes, No)
- Dependents: Whether the customer has dependents such as a child, mother, father, or
grandmother (Yes, No)
- tenure: The number of months the customer has stayed with the company
- PhoneService: Whether the customer has phone service (Yes, No)
- MultipleLines: Whether the customer has more than one line (Yes, No, No phone service)
- InternetService: Customer's internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security (Yes, No, No internet service)
- OnlineBackup: Whether the customer has an online backup (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection (Yes, No, No internet service)
- TechSupport: Whether the customer has technical support (Yes, No, No internet service)
- StreamingTV: Whether the customer streams TV (Yes, No, No internet service). Indicates
whether the customer uses the internet service to stream television programs from a
third-party provider
- StreamingMovies: Whether the customer streams movies (Yes, No, No internet service).
Indicates whether the customer uses the internet service to stream movies from a
third-party provider
- Contract: Customer's contract term (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer uses paperless billing (Yes, No)
- PaymentMethod: Customer's payment method (Electronic check, Mailed check, Bank
transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- Churn: Whether the customer churned (Yes, No). Yes marks customers who left within the
last month or quarter
📝 Notes
- Each row represents a unique customer.
- Variables include information about customer service, account, and demographics.
- Services customers sign up for: Phone, multiple lines, internet, online
security, online backup, device protection, tech support, and streaming TV and
movies.
- Customer account information: How long they have been a customer, contract,
payment method, paperless billing, monthly fees, and total fees.
- Demographic information about customers: Gender, age range, and whether they
have partners and dependents.
Requirements
catboost==1.2
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
xgboost==1.7.5
House Price Prediction Model
GitHub Repository:
Business Problem
The goal is to build a machine learning model that predicts the prices of different types of
houses, using a dataset that contains the features and sale price of each house.
Dataset Story
This dataset of residential homes in Ames, Iowa contains 79 explanatory variables. The dataset
and the related Kaggle competition page are available at the link below. Because the dataset
belongs to a Kaggle competition, it comes as two csv files: train and test. House prices are
left blank in the test dataset, and you are expected to predict these values.
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation
- Total Observations: 1,460
- Numeric Variables: 38
- Categorical Variables: 43
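The Kaggle evaluation page linked above scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed sale price. A minimal sketch of that metric in plain Python is shown below; the function name `rmse_of_logs` and the sample prices are hypothetical and are not taken from the dataset.

```python
import math


def rmse_of_logs(actual_prices, predicted_prices):
    """RMSE between log(predicted) and log(actual); prices must be positive."""
    squared_diffs = [
        (math.log(pred) - math.log(act)) ** 2
        for act, pred in zip(actual_prices, predicted_prices)
    ]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Hypothetical sale prices in dollars, for illustration only.
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

score = rmse_of_logs(actual, predicted)
print(f"RMSE of log prices: {score:.4f}")
```

Taking logarithms means that an error on an expensive house and the same relative error on a cheap house affect the score equally.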
Variables
- SalePrice: The property's sale price in dollars. This is the target variable that you're trying
to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
Requirements
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
tabulate==0.9.0
xgboost==1.7.5
Talent Hunting Classification with Machine Learning using SCOUTIUM's Dataset
GitHub Repository:
Business Problem
Predict which class (average, highlighted) each football player belongs to, based on the scores
that scouts give to the characteristics of the players they watch.
Dataset Story
The dataset, provided by Scoutium, contains the characteristics of football players observed in
matches and the scores that scouts gave to those characteristics.
Variables
scoutium_attributes.csv
- task_response_id: The set of a scout's evaluations of all players on a team's
roster in a match
- match_id: The id of the relevant match
- evaluator_id: The id of the evaluator (scout)
- player_id: The id of the relevant player
- position_id: The id of the position played by the relevant player in that
match
1. Goalkeeper
2. Stopper
3. Right-back
4. Left-back
5. Defensive midfielder
6. Central midfielder
7. Right wing
8. Left wing
9. Attacking midfielder
10. Striker
- analysis_id: Set of attribute evaluations of a scout for a player in a match
- attribute_id: The id of each attribute that the players were evaluated for
- attribute_value: The value (points) a scout gives to a player's
attribute
scoutium_potential_labels.csv
- task_response_id: The set of a scout's evaluations of all players on a team's
roster in a match
- match_id: The id of the corresponding match
- evaluator_id: The id of the evaluator (scout)
- player_id: The id of the respective player
- potential_label: A label that indicates a scout's final decision regarding a
player in a match (target variable)
Requirements
catboost==1.2
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
seaborn==0.12.1
scikit-learn==1.3.1
tabulate==0.9.0
xgboost==1.7.5
Customer Segmentation with Unsupervised Learning using FLO's Dataset
GitHub Repository:
Business Problem
FLO wants to segment its customers and shape its marketing strategies around these segments.
To this end, customer behaviors will be characterized and groups will be formed from the
clusters found in those behaviors.
Dataset Story
The dataset is built from the past shopping behavior of customers who made their last purchases
from FLO in 2020-2021 as OmniChannel shoppers (shopping both online and offline).
Variables
- master_id: Customer ID
- order_channel: Shopping platform (Android, ios, Desktop, Mobile,
Offline)
- last_order_channel: The channel where the most recent purchase was made
- first_order_date: Customer's first order date
- last_order_date: Customer's last order date
- last_order_date_online: Customer's last online order date
- last_order_date_offline: Customer's last offline order date
- order_num_total_ever_online: The total number of orders made by the customer
online
- order_num_total_ever_offline: The total number of orders made by the customer
offline
- customer_value_total_ever_offline: The total price paid by the customer for
offline orders
- customer_value_total_ever_online: The total price paid by the customer for
online orders
- interested_in_categories_12: List of categories the customer has shopped in the
last 12 months
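Before clustering, the online and offline columns above are often combined into per-customer behavior features. The sketch below shows one possible way to do this in plain Python; the sample records, the `ANALYSIS_DATE` reference date, and the `behavior_features` helper are hypothetical and do not necessarily reflect the approach used in the repository.

```python
from datetime import date

# Hypothetical customer records; field names follow the variable list above.
customers = [
    {
        "master_id": "c1",
        "first_order_date": date(2019, 3, 1),
        "last_order_date": date(2021, 5, 20),
        "order_num_total_ever_online": 4,
        "order_num_total_ever_offline": 1,
        "customer_value_total_ever_online": 800.0,
        "customer_value_total_ever_offline": 150.0,
    },
    {
        "master_id": "c2",
        "first_order_date": date(2020, 11, 10),
        "last_order_date": date(2021, 1, 15),
        "order_num_total_ever_online": 1,
        "order_num_total_ever_offline": 2,
        "customer_value_total_ever_online": 120.0,
        "customer_value_total_ever_offline": 310.0,
    },
]

ANALYSIS_DATE = date(2021, 6, 1)  # assumed reference date for recency and tenure


def behavior_features(customer):
    """Combine online and offline activity into totals, recency, and tenure."""
    return {
        "master_id": customer["master_id"],
        "order_num_total": customer["order_num_total_ever_online"]
        + customer["order_num_total_ever_offline"],
        "customer_value_total": customer["customer_value_total_ever_online"]
        + customer["customer_value_total_ever_offline"],
        "recency_days": (ANALYSIS_DATE - customer["last_order_date"]).days,
        "tenure_days": (ANALYSIS_DATE - customer["first_order_date"]).days,
    }


features = [behavior_features(c) for c in customers]
for row in features:
    print(row)
```

Numeric features such as these are usually scaled before being passed to a distance-based clustering algorithm, because order counts, monetary totals, and day counts live on very different ranges.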
Requirements
matplotlib==3.7.1
numpy==1.24.3
pandas==1.5.1
scipy==1.11.4
seaborn==0.12.1
scikit-learn==1.3.1
yellowbrick==1.5