Machine Learning


Evaluation of Regression Model

GitHub Repository:


Project Tasks

Task

Employees' years of experience and salary information are given.

Years of Experience (x)    Salary (y)
5                          600
7                          900
3                          550
3                          500
2                          400
7                          950
3                          540
10                         1200
6                          900
4                          550
8                          1100
1                          460
1                          400
9                          1000
1                          380

Step 1: Create the linear regression model equation according to the given bias and weight.

  • Bias: 275, Weight: 90 (y' = b+wx)

Step 2: Estimate the salary for all years of experience in the table according to the model equation you have created.
Step 3: Calculate MSE, RMSE, MAE scores to measure the success of the model.
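
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference solution; the names `predict`, `experience`, and `salary` are chosen here for readability and do not come from the task.

```python
import math

# Years of experience (x) and actual salary (y) from the table above.
experience = [5, 7, 3, 3, 2, 7, 3, 10, 6, 4, 8, 1, 1, 9, 1]
salary = [600, 900, 550, 500, 400, 950, 540, 1200, 900, 550, 1100, 460, 400, 1000, 380]

BIAS = 275
WEIGHT = 90


def predict(x):
    """Step 1: the model equation y' = b + w * x."""
    return BIAS + WEIGHT * x


# Step 2: predicted salary for every row of the table.
predicted = [predict(x) for x in experience]

# Step 3: error metrics computed from the residuals y - y'.
errors = [y - y_hat for y, y_hat in zip(salary, predicted)]
n = len(errors)
mse = sum(e ** 2 for e in errors) / n    # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error
mae = sum(abs(e) for e in errors) / n    # mean absolute error

print(f"MSE:  {mse:.2f}")   # 4438.33
print(f"RMSE: {rmse:.2f}")  # 66.62
print(f"MAE:  {mae:.2f}")   # 54.33
```

If scikit-learn is installed, `sklearn.metrics.mean_squared_error` and `sklearn.metrics.mean_absolute_error` should return the same MSE and MAE for these inputs.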

Requirements
  • matplotlib==3.7.1
  • numpy==1.24.3
  • pandas==1.5.1
  • seaborn==0.12.1
  • scikit-learn==1.3.1

Evaluation of Classification Model

GitHub Repository:


Project Tasks

Task 1:

A classification model has been built to predict whether a customer will churn. The actual values of 10 test observations and the probabilities predicted by the model are given.

  • Create a confusion matrix using a threshold value of 0.5.
  • Calculate Accuracy, Recall, Precision, F1 Scores.
  Observation   Actual Value   Model Probability Estimation
                               (probability of belonging to class 1)
  1             1              0.7
  2             1              0.8
  3             1              0.65
  4             1              0.9
  5             1              0.45
  6             1              0.5
  7             0              0.55
  8             0              0.35
  9             0              0.4
  10            0              0.25

                                   Model Prediction
                                   Non-Churn (0)   Churn (1)   Total
  Actual Value   Non-Churn (0)     ?               ?           ?
                 Churn (1)         ?               ?           ?
                 Total             ?               ?
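
One way to fill in the confusion matrix and compute the four scores is sketched below in plain Python. It assumes that a probability exactly equal to the threshold counts as churn (`>=`); the task does not say how ties are handled, so this is a convention chosen for the example.

```python
# Actual labels and predicted probabilities of class 1 (churn) from the table above.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
probability = [0.7, 0.8, 0.65, 0.9, 0.45, 0.5, 0.55, 0.35, 0.4, 0.25]

THRESHOLD = 0.5

# Assumption: probability >= threshold is predicted as churn (1).
predicted = [1 if p >= THRESHOLD else 0 for p in probability]

# Count the four cells of the confusion matrix.
pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # churn predicted as churn
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # churn predicted as non-churn
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # non-churn predicted as churn
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-churn predicted as non-churn

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=1 TP=5
print(f"Accuracy:  {accuracy:.2f}")        # 0.80
print(f"Recall:    {recall:.2f}")          # 0.83
print(f"Precision: {precision:.2f}")       # 0.83
print(f"F1:        {f1:.2f}")              # 0.83
```

With a strict comparison (`>`), observation 6 (probability 0.5) would instead be predicted as non-churn, giving TP=4 and FN=2.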

Task 2:

A classification model has been built to detect fraudulent transactions made through a bank. With an accuracy of 90.5%, the model was judged sufficient and deployed to production. After deployment, however, its output was not as expected, and the business unit reported that the model was unsuccessful. The confusion matrix of the model's predictions is given below. Based on this matrix:

  • Calculate Accuracy, Recall, Precision, F1 Scores.
  • Comment on what the 'Data Science' team may have overlooked.
                                   Model Prediction
                                   Non-Fraud (0)   Fraud (1)   Total
  Actual Value   Non-Fraud (0)     900             90          990
                 Fraud (1)         5               5           10
                 Total             905             95
  • True Negative (TN): 900
  • False Positive (FP): 90
  • True Positive (TP): 5
  • False Negative (FN): 5
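
The scores can be computed directly from the four counts listed above. The sketch below uses plain Python. The `baseline_accuracy` line is an addition for comparison and not part of the task: it shows the accuracy of a hypothetical model that labels every transaction as class 0.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 900, 90, 5, 5
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of actual positives that were caught
precision = tp / (tp + fp)     # share of predicted positives that were correct
f1 = 2 * precision * recall / (precision + recall)

# For comparison: a model that predicts class 0 for every transaction.
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.3f}")   # 0.905
print(f"Recall:    {recall:.3f}")     # 0.500
print(f"Precision: {precision:.3f}")  # 0.053
print(f"F1:        {f1:.3f}")         # 0.095
print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # 0.990
```

Only 10 of the 1,000 transactions belong to the positive class, so accuracy is dominated by the majority class, while recall, precision, and F1 expose how poorly the positive class is predicted.
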
Requirements
  • matplotlib==3.7.1
  • numpy==1.24.3
  • pandas==1.5.1
  • seaborn==0.12.1
  • scikit-learn==1.3.1

Telco Customer Churn Prediction

GitHub Repository:


Business Problem

Develop a machine learning model that can predict customers who will leave the company.

Perform the necessary data analysis and feature engineering steps before developing the model.

Dataset Story

Telco churn data includes information about a fictitious telecom company that provided home phone and internet services to 7,043 California customers in the third quarter. It shows which customers have left, stayed, or signed up for their service.

Variables
  • CustomerId: Customer ID
  • Gender: Gender
  • SeniorCitizen: Whether the customer is a senior citizen (1, 0)
  • Partner: Whether the customer has a partner, i.e. is married or not (Yes, No)
  • Dependents: Whether the customer has dependents such as a child, mother, father, or grandmother (Yes, No)
  • tenure: The number of months the customer has stayed with the company
  • PhoneService: Whether the customer has phone service (Yes, No)
  • MultipleLines: Whether the customer has more than one line (Yes, No, No phone service)
  • InternetService: Customer's internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity: Whether the customer has online security (Yes, No, No internet service)
  • OnlineBackup: Whether the customer has an online backup (Yes, No, No internet service)
  • DeviceProtection: Whether the customer has device protection (Yes, No, No internet service)
  • TechSupport: Whether the customer has technical support (Yes, No, No internet service)
  • StreamingTV: Whether the customer has streaming TV (Yes, No, No internet service). Indicates whether the customer uses the internet service to stream television programs from a third-party provider
  • StreamingMovies: Whether the customer has streaming movies (Yes, No, No internet service). Indicates whether the customer uses the internet service to stream movies from a third-party provider
  • Contract: Customer's contract duration (Month-to-month, One year, Two year)
  • PaperlessBilling: Whether the customer has a paperless invoice (Yes, No)
  • PaymentMethod: Customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  • MonthlyCharges: The amount charged to the customer monthly
  • TotalCharges: The total amount charged to the customer
  • Churn: Whether the customer churned (Yes, No). Customers who left within the last month or quarter

📝 Notes

  • Each row represents a unique customer.
  • Variables include information about customer service, account, and demographics.
    • Services customers sign up for: Phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
    • Customer account information: How long they have been a customer, contract, payment method, paperless billing, monthly fees, and total fees.
    • Demographic information about customers: Gender, age range, and whether they have partners and dependents.
Requirements
  • catboost==1.2
  • lightgbm==3.3.5
  • matplotlib==3.7.1
  • numpy==1.24.3
  • pandas==1.5.1
  • seaborn==0.12.1
  • scikit-learn==1.3.1
  • xgboost==1.7.5

House Price Prediction Model

GitHub Repository:


Business Problem

Build a machine learning model that predicts the prices of different types of houses, using a dataset that contains the features and sale price of each house.

Dataset Story

This dataset of residential homes in Ames, Iowa contains 79 explanatory variables. The project is also a Kaggle competition, and the dataset and competition page are available at the link below. Because the dataset belongs to a Kaggle competition, there are two CSV files: train and test. House prices are left blank in the test dataset, and you are expected to predict these values.

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation
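
The linked evaluation page describes scoring as the root mean squared error between the logarithm of the predicted price and the logarithm of the observed sale price. A minimal sketch of that metric follows; the helper name `rmse_of_logs` is hypothetical, and the prices are made up purely for illustration.

```python
import math


def rmse_of_logs(actual, predicted):
    """RMSE between log(predicted) and log(actual); all prices must be positive."""
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up prices for illustration only; they are not taken from the dataset.
actual_prices = [200000, 150000, 320000]
predicted_prices = [210000, 140000, 320000]

print(round(rmse_of_logs(actual_prices, predicted_prices), 4))
```

Because the error is measured on the log scale, a 10% miss on a cheap house and a 10% miss on an expensive house contribute about equally to the score.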

  • Total Observations: 1,460
  • Numeric Variables: 38
  • Categorical Variables: 43
Variables
  • SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • BedroomAbvGr: Number of bedrooms above basement level
  • KitchenAbvGr: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: Dollar value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale
Requirements
  • lightgbm==3.3.5
  • matplotlib==3.7.1
  • numpy==1.24.3
  • pandas==1.5.1
  • seaborn==0.12.1
  • scikit-learn==1.3.1
  • tabulate==0.9.0
  • xgboost==1.7.5

Talent Hunting Classification with Machine Learning using SCOUTIUM's Dataset

GitHub Repository:


Business Problem

Predict which class (average, highlighted) a player belongs to, based on the scores that scouts give to the attributes of the football players they watch.

Dataset Story

The dataset, provided by Scoutium, contains the attributes of football players observed in matches and the scores that scouts gave to those attributes.

Variables
scoutium_attributes.csv
  • task_response_id: The set of a scout's evaluations of all players on a team's roster in a match
  • match_id: The id of the relevant match
  • evaluator_id: The id of the evaluator (scout)
  • player_id: The id of the relevant player
  • position_id: The id of the position played by the relevant player in that match
    1. Goalkeeper
    2. Stopper
    3. Right-back
    4. Left-back
    5. Defensive midfielder
    6. Central midfielder
    7. Right wing
    8. Left wing
    9. Attacking midfielder
    10. Striker
  • analysis_id: Set of attribute evaluations of a scout for a player in a match
  • attribute_id: The id of each attribute that the players were evaluated for
  • attribute_value: The value (points) a scout gives to a player's attribute
scoutium_potential_labels.csv
  • task_response_id: The set of a scout's evaluations of all players on a team's roster in a match
  • match_id: The id of the corresponding match
  • evaluator_id: The id of the evaluator (scout)
  • player_id: The id of the respective player
  • potential_label: A label that indicates a scout's final decision regarding a player in a match (target variable)
Requirements
  • catboost==1.2
  • lightgbm==3.3.5
  • matplotlib==3.7.1
  • numpy==1.24.3
  • pandas==1.5.1
  • seaborn==0.12.1
  • scikit-learn==1.3.1
  • tabulate==0.9.0
  • xgboost==1.7.5

Customer Segmentation with Unsupervised Learning using FLO's Dataset

GitHub Repository:


Business Problem

FLO wants to divide its customers into segments and determine marketing strategies according to these segments. To this end, customer behaviors will be characterized and groups will be formed by clustering those behaviors.

Dataset Story

The dataset contains information derived from the past shopping behavior of customers who made their last purchases from FLO in 2020-2021 as OmniChannel shoppers (shopping both online and offline).

Variables
  • master_id: Customer ID
  • order_channel: Shopping platform (Android, ios, Desktop, Mobile, Offline)
  • last_order_channel: The channel where the most recent purchase was made
  • first_order_date: Customer's first order date
  • last_order_date: Customer's last order date
  • last_order_date_online: Customer's last online order date
  • last_order_date_offline: Customer's last offline order date
  • order_num_total_ever_online: The total number of orders made by the customer online
  • order_num_total_ever_offline: The total number of orders made by the customer offline
  • customer_value_total_ever_offline: The total price paid by the customer for offline orders
  • customer_value_total_ever_online: The total price paid by the customer for online orders
  • interested_in_categories_12: List of categories the customer has shopped in during the last 12 months
Requirements
  • matplotlib==3.7.1
  • numpy==1.24.3
  • pandas==1.5.1
  • scipy==1.11.4
  • seaborn==0.12.1
  • scikit-learn==1.3.1
  • yellowbrick==1.5