Week 7 CST 383

This week in CST 383 we covered the last few concepts of the course. First, we learned how to encode categorical data so it can be fed to scikit-learn models and related methods. Then we went over logistic regression, which despite its name is actually a way to solve classification problems, in our case with a binary target. Lastly, we discussed the different ways a machine learning model can overfit and how to fix it. In this case, overfitting means the model performs extremely well on the training data but fails to generalize to new, unseen data such as a test set or future observations.

The homework this week pulled together everything we had learned in this course, though mainly the machine learning concepts from the previous three weeks. It was interesting to challenge myself to figure out how to pre-process data by encoding categorical variables, decide when to split and scale the data, test different hyperparameter values to find the best ones, and explore the differences between classification models by assessing their predictions. It was great to connect all of these separate lessons in one free-for-all assignment and turn them into an actual machine learning workflow.

Something that I had trouble with this week was deciding whether one particular predictor variable was nominal or ordinal so that I could encode it properly in the homework. Within the homework assignment we had to encode the predictor and target variables as numbers, since scikit-learn expects numeric features before it can fit a model. I understood how to encode ordinal variables, which are those with a natural order, by creating a mapping that follows that order starting from 0. Nominal variables require dummy variables, which create extra columns holding true or false values for each value of the original variable. The confusing part was deciding whether ‘Contract’ should be considered nominal or ordinal, and which encoding would work better. From observing the data, the ‘Contract’ variable contained the values ‘Month-to-month’, ‘one-year’, and ‘two-year’, which to me seems like a natural order from shortest to longest contract length. Still, I did not want to make the mistake of using an ordinal encoding on a nominal predictor. I started thinking that it may come down to how the data is interpreted, since I am not sure the gaps between the values are equal, and an encoding of 0, 1, 2 treats them as evenly spaced. Additionally, I was going to feed it to a KNN classifier, which relies on distance calculations, so I decided to use dummy variables for it. I didn’t have time to try it, but I do wonder if encoding it as ordinal would make any difference in the modeling. Another question I had was whether dummy variables are usually the safer choice for categorical predictors when using KNN, since KNN depends on distance and ordinal encoding might create artificial distances between categories.
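To make the two encodings concrete, here is a minimal sketch using pandas. The tiny DataFrame is made up for illustration and only mirrors the ‘Contract’ values described above; it is not the homework data, and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values mirror the 'Contract' categories from the homework.
df = pd.DataFrame({"Contract": ["Month-to-month", "one-year", "two-year", "one-year"]})

# Ordinal encoding: integers that follow the natural order, starting from 0.
contract_order = {"Month-to-month": 0, "one-year": 1, "two-year": 2}
ordinal = df["Contract"].map(contract_order)

# Dummy variables: one indicator column per category.
dummies = pd.get_dummies(df["Contract"], prefix="Contract")

# Distance between 'Month-to-month' (row 0) and 'two-year' (row 2) under each encoding.
ordinal_dist = abs(ordinal[0] - ordinal[2])
dummy_rows = dummies.to_numpy(dtype=float)
dummy_dist = np.linalg.norm(dummy_rows[0] - dummy_rows[2])

print(ordinal.tolist())
print(dummies.columns.tolist())
print(ordinal_dist, dummy_dist)
```

With the ordinal mapping, ‘Month-to-month’ ends up twice as far from ‘two-year’ as from ‘one-year’, while with dummy variables every pair of different categories is the same distance apart. This only shows how the distances differ; it does not say which encoding would give better predictions on the real data.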

Something else we took a look at this week was another classifier with a distinct twist. The classifier is called logistic regression, and despite the name it is mainly used for classification problems rather than regular regression problems. The idea is that it predicts probabilities using the sigmoid function. In a way, it reminds me of scaling because it squashes a wide range of numbers into a fixed window between 0 and 1. From my understanding, it combines a linear model with what is known as a squasher, which forces the output into the range between 0 and 1 so it can be interpreted as a probability. I generally understood the concept of maximum likelihood, which is the idea of choosing the model parameters that make the observed data most probable. What confuses me is how this idea relates to the log loss function. One of the homework reading questions asked whether the logistic regression model seeks to minimize the mean squared error, but according to the book it actually minimizes the log loss. The text explains that the cost becomes large when the model assigns a probability close to 0 to a positive instance or close to 1 to a negative instance. I am still trying to intuitively understand why taking the negative log of the predicted probability is the correct way to measure the error. What I noticed when applying this model to the telecom homework was that I did not need to tune any hyperparameters, and how simple it was to train the model and make predictions. With the KNN classifier, by comparison, I had to search for hyperparameters using GridSearchCV, which took some time to run, and I had to properly scale the data. Despite the differences, both models seemed to produce very similar results based on the predictions I assessed. Something I do wonder is how much of a difference scaling makes for this model, since the slides mention that, as with linear regression models, the training data does not need to be scaled.
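To get a feel for the sigmoid and the log loss, here is a small sketch in plain Python. The function names are my own, and this is only the per-instance formula for a binary target, not how scikit-learn implements training.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss_single(y, p):
    """Log loss for one instance: y is the true label (0 or 1),
    p is the predicted probability of the positive class."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The linear model's output z can be any number; sigmoid turns it into a probability.
for z in (-4, 0, 4):
    print(f"z={z}: probability={sigmoid(z):.3f}")

# For a positive instance (y = 1), the loss grows as the predicted probability drops.
for p in (0.99, 0.5, 0.01):
    print(f"p={p}: loss={log_loss_single(1, p):.3f}")
```

For a positive instance, a confident correct prediction costs almost nothing, while a confident wrong prediction is penalized heavily, which matches the book's description of the cost becoming large in those cases.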

Lastly, I wanted to quickly touch upon overfitting, because this week made the idea feel more connected than in previous weeks. Reflecting on overfitting brought together many of the modeling choices we made in the homework. It finally made sense to me as a situation where the model becomes too sensitive to the tiny details of the training data, so it performs very well on what it has already seen but fails to generalize to new observations. The slides emphasized that the biggest sign is a large gap between training and test performance, and I can now see how this connects to model hyperparameters and even preprocessing decisions. Overall, this week has been quite insightful. I thoroughly enjoyed learning how to encode categorical variables and going back to the KNN classifier to compare it with the recently taught logistic regression. I thought this week’s homework was quite challenging, but it taught me a lot about organizing the workflow and understanding the judgment calls that machine learning requires.
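The gap between training and test performance mentioned above can be checked with a few lines of scikit-learn. This is a rough sketch on synthetic data from make_classification, which is an assumption standing in for the telecom data; the choice of k values is also just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the telecom homework data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Compare training and test accuracy for a very flexible model (k=1)
# and a smoother one (k=15).
scores = {}
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)
    test_acc = knn.score(X_test, y_test)
    scores[k] = (train_acc, test_acc)
    print(f"k={k}: train={train_acc:.3f}, test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

With k=1 each training point is its own nearest neighbor, so the training accuracy is perfect, and any drop on the test set shows up directly as the gap. Comparing that gap across hyperparameter values is one simple way to spot overfitting.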
