For this problem, we will use a subset of the Wisconsin Breast Cancer dataset. Note that this dataset has some missing values.

1.1 Data Munging (5 Marks)

Cleaning the data is essential when dealing with real-world problems. The training and testing data are stored in the "data/wisconsin_data" folder. You have to perform the following:

Read the training and testing data. Print the number of features in the dataset.

For the data label, print the total number of 1's and 0's in the training and testing data. Comment on the class distribution. Is it balanced or unbalanced?

Print the number of features with missing entries.

Fill the missing entries. For each feature, you can use either the mean or the median of that feature's observed entries.

Normalize the training and testing data.
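The munging steps above can be sketched as follows. A tiny synthetic DataFrame stands in for the real CSVs, since the exact file and column names in "data/wisconsin_data" are not given in the assignment; substitute `pd.read_csv(...)` calls and the actual label column name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; in the assignment you would read
# the CSVs from data/wisconsin_data (exact file names are an assumption).
train = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 4.0],
                      "f2": [10.0, np.nan, 30.0, 40.0],
                      "label": [0, 1, 0, 1]})
test = pd.DataFrame({"f1": [3.0, np.nan],
                     "f2": [20.0, 25.0],
                     "label": [1, 0]})

print("Features:", train.shape[1] - 1)            # label column excluded
print(train["label"].value_counts())              # class distribution
print("Features with missing values:",
      int(train.drop(columns="label").isna().any().sum()))

# Fill missing entries with training-set medians (the mean is also acceptable)
medians = train.drop(columns="label").median()
train[medians.index] = train[medians.index].fillna(medians)
test[medians.index] = test[medians.index].fillna(medians)

# Normalize with z-scores computed from the training data only,
# so no test-set information leaks into the preprocessing
mu, sigma = train[medians.index].mean(), train[medians.index].std()
train[medians.index] = (train[medians.index] - mu) / sigma
test[medians.index] = (test[medians.index] - mu) / sigma
```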

1.2 Logistic Regression (5 Marks)

Train logistic regression models with L1 regularization and L2 regularization using alpha = 0.1 and lambda = 0.1. Report accuracy, precision, recall, and F1-score, and print the confusion matrix.
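A minimal sketch of this step with scikit-learn is shown below, on synthetic stand-in data. Note that scikit-learn's `LogisticRegression` expresses regularization strength as `C = 1/lambda`; the mapping of the assignment's alpha/lambda values onto `C` is an assumption you should verify against your own implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic stand-in; use the cleaned Wisconsin features and labels instead.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# lambda = 0.1 corresponds to C = 10 under the C = 1/lambda convention
# (an assumed mapping). The liblinear solver supports both penalties.
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, C=10.0, solver="liblinear")
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(penalty,
          "acc=%.3f" % accuracy_score(y_te, pred),
          "prec=%.3f" % precision_score(y_te, pred),
          "rec=%.3f" % recall_score(y_te, pred),
          "f1=%.3f" % f1_score(y_te, pred))
    print(confusion_matrix(y_te, pred))
```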

1.3 Choosing the best hyper-parameter (5 Marks)

For the L1 model, choose the best alpha value from the following set: {0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333}.

For the L2 model, choose the best lambda value from the following set: {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 33}.

To choose the best hyperparameter (alpha/lambda) value, you have to do the following:

For each value of hyperparameter, perform 100 random splits of training data into training and validation data.

For each hyperparameter value, compute the validation accuracy averaged over its 100 train/validation pairs. The best hyperparameter is the one with the maximum average validation accuracy. Use the best alpha and lambda to re-train your final L1- and L2-regularized models. Evaluate the prediction performance on the test data and report the following:

Precision

Accuracy

The top 5 features selected in decreasing order of feature weights.

Confusion matrix

Finally, discuss if there is any sign of underfitting or overfitting with appropriate reasoning.
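The selection procedure in 1.3 can be sketched as below, again on synthetic stand-in data. The `C = 1/lambda` mapping and the 80/20 validation split size are assumptions; the split count is reduced from 100 for speed and should be restored for the actual assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in; in the assignment, X/y are the cleaned training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

lambdas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 33]
n_splits = 10  # the assignment asks for 100; reduced here for speed

def mean_val_acc(lam):
    """Average validation accuracy over repeated random splits."""
    accs = []
    for seed in range(n_splits):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        # C = 1/lambda is an assumed mapping onto scikit-learn's API
        clf = LogisticRegression(penalty="l2", C=1.0/lam, solver="liblinear")
        clf.fit(X_tr, y_tr)
        accs.append(clf.score(X_val, y_val))
    return np.mean(accs)

scores = {lam: mean_val_acc(lam) for lam in lambdas}
best_lam = max(scores, key=scores.get)
print("best lambda:", best_lam)

# Re-train on all training data with the best value, then rank features
final = LogisticRegression(penalty="l2", C=1.0/best_lam, solver="liblinear")
final.fit(X, y)
top5 = np.argsort(-np.abs(final.coef_[0]))[:5]
print("top 5 feature indices by |weight|:", top5)
```

The same loop, run over the alpha grid with `penalty="l1"`, handles the L1 model.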

Part-2 (Multiclass Classification):

For this experiment, we will use a small subset of the MNIST dataset of handwritten digits. This dataset has no missing data. You will have to implement a one-versus-rest scheme to perform multiclass classification using a binary classifier based on L1-regularized logistic regression.

2.1 Read and understand the data, create a default One-vs-Rest Classifier (5 Marks)

1- Use the data from the file reduced_mnist.csv in the data directory. Begin by reading the data. Print the following information:

The solution uses Python to perform the multiclass classification with predefined packages and to plot the results. The following packages are required:

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
# sklearn.cross_validation was removed in modern scikit-learn;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
import plotly.graph_objs as go
import plotly.plotly as py
from plotly.graph_objs import *
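A one-vs-rest scheme built from binary L1-regularized logistic regressions can be sketched as follows. Here scikit-learn's bundled `load_digits` data stands in for `reduced_mnist.csv` (an assumption; substitute `pd.read_csv("data/reduced_mnist.csv")` and the file's actual label column for the real assignment).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in digit data; replace with reduced_mnist.csv for the assignment.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classes = np.unique(y_tr)
models = {}
for c in classes:
    # One binary L1-regularized classifier per class: class c vs. the rest
    clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
    clf.fit(X_tr, (y_tr == c).astype(int))
    models[c] = clf

# Predict the class whose binary model assigns the highest probability
proba = np.column_stack(
    [models[c].predict_proba(X_te)[:, 1] for c in classes])
pred = classes[np.argmax(proba, axis=1)]
print("OvR accuracy: %.3f" % accuracy_score(y_te, pred))
```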
