Machine Learning A-Z: Part 2 – Regression (Multiple Linear Regression)
Dummy Variable Trap
Dummy variables for one categorical variable are never independent: with two categories,
D2 = 1 – D1
so including every dummy (together with the intercept) creates perfect multicollinearity. Always omit one dummy variable: use only n – 1 dummies for n categories, and apply this rule separately to each categorical variable in the model.
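As a quick illustration, pandas can create the dummy columns and drop one level in a single step (a minimal sketch, not course code; the State values mirror the 50_Startups.csv dataset used below):

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

# drop_first=True keeps n - 1 dummies: the dropped level becomes the
# baseline, so the remaining columns are no longer linearly dependent
dummies = pd.get_dummies(df['State'], drop_first=True)
print(dummies)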
Building a Model
1. All-in
=> 2. Backward Elimination (used in this section; a sketch follows the AIC notes below)
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison
Akaike information criterion (AIC)
– A measure of the relative quality of statistical models for a given set of data.
– Given a collection of models for the data, it estimates the quality of each model relative to each of the other models.
– Hence, it provides a means for model selection.
– Defined as AIC = 2k – 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood; lower AIC means a better trade-off between fit and complexity.
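The Python listing below performs Backward Elimination by hand, refitting after each removal. The same loop can be automated; here is a minimal sketch (the backward_elimination helper and the 0.05 significance level are my own illustration, not course code):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl = 0.05):
    # Drop the predictor with the highest p-value until all are <= sl.
    # X is assumed to already contain an intercept column; note that this
    # naive loop can also drop the intercept if it becomes insignificant.
    X = np.asarray(X, dtype = float)
    while X.shape[1] > 0:
        model = sm.OLS(endog = y, exog = X).fit()
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] <= sl:
            return model, X
        X = np.delete(X, worst, axis = 1)  # remove the weakest predictor
    return None, X

# Usage, after X has been built with an intercept column as in the listing below:
# model, X_opt = backward_elimination(X, y)
# print(model.summary())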
Multiple Linear Regression
Python
# Multiple Linear Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values   # independent variables
y = dataset.iloc[:, 4].values     # dependent variable (Profit)

# Encoding the categorical independent variable (State).
# OneHotEncoder's old `categorical_features` argument was removed from
# scikit-learn; ColumnTransformer is the current idiom, and it handles
# string categories directly (no LabelEncoder step needed).
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X), dtype = float)

# Avoiding the Dummy Variable Trap: drop the first dummy column
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
# (`sklearn.cross_validation` was renamed to `sklearn.model_selection`)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling (not needed here; the library takes care of it)
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)"""

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

# Building the optimal model using Backward Elimination
# (OLS lives in statsmodels.api, not statsmodels.formula.api)
import statsmodels.api as sm
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis = 1)  # intercept column

X_opt = X[:, [0, 1, 2, 3, 4, 5]]       # start with all predictors
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 4, 5]]          # remove one State dummy (highest p-value)
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 3, 4, 5]]             # remove the other State dummy
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 3, 5]]                # remove Administration
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 3]]                   # keep only R&D Spend
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())
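As a quick sanity check (not part of the course code), the test-set predictions can be scored with scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

# R^2 on the held-out test set; closer to 1 means a better fit
print(r2_score(y_test, y_pred))

# Predicted and actual profits side by side
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis = 1))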
R
# Multiple Linear Regression

# Importing the dataset
dataset = read.csv('50_Startups.csv')

# Encoding categorical data
dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)

# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = Profit ~ .,
               data = training_set)

# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)

# Building the optimal model using Backward Elimination
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = dataset)
summary(regressor)

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = dataset)
summary(regressor)

regressor = lm(formula = Profit ~ R.D.Spend,
               data = dataset)
summary(regressor)
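Note that lm() expands the State factor into dummy variables automatically and omits one level as the baseline, so R avoids the dummy variable trap without any manual column dropping. Backward Elimination then just removes whole terms from the formula; here only R.D.Spend survives in the final model.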
Clearing the RStudio environment:
rm(list = ls())