Week 5 CST 383
This week in CST 383 we finally took a look at some core concepts of machine learning. To understand machine learning more practically, we first shifted our focus toward data preprocessing, such as handling missing values and scaling features before applying models. After that we looked at classifiers, with a large focus on KNN classification, and learned how to make train and test splits, run cross validation, and assess classifiers in different ways. Something that stood out to me was realizing that machine learning is not just about choosing a model and training it right away. The preprocessing steps and evaluation choices matter just as much, and if they’re handled incorrectly, the model’s results can be completely misleading.
Working through data cleaning and preprocessing in our homework assignments made me realize that some datasets can be quite messy. Recently we’ve been working with small datasets, where it’s quite easy to spot the missing values, though in large datasets this seems to become much harder. To help with this, pandas offers the isna() method to find missing values, dropna() to remove them, and fillna() to replace them. These methods handle missing values that are stored as NaN or None, but some missing values are hidden behind placeholder values. This is something that I am still thinking about and practicing with the datasets we are given, such as the campaign and college datasets. I want to be able to easily identify and handle values that are really placeholders, like 0, -1, the string ‘N/A’, or even empty strings. With large datasets, it makes me wonder how often data scientists accidentally treat placeholder values as actual values without realizing it, and how much that can distort a model’s performance.
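To make the placeholder idea concrete, here is a minimal sketch in pandas. The toy DataFrame, its column names, and the choice of which values count as placeholders are all my own assumptions for illustration; in a real dataset, whether 0 or -1 means "missing" depends on the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the columns and placeholder values are assumptions.
df = pd.DataFrame({
    "age": [34, -1, 51, 0, 28],
    "state": ["CA", "N/A", "", "NY", "CA"],
})

# isna() sees nothing here, because the placeholders are ordinary values.
missing_before = df.isna().sum()

# Convert the placeholders to NaN, column by column, so that a value
# that is legitimate in one column is not wiped out in another.
cleaned = df.copy()
cleaned["age"] = cleaned["age"].replace([-1, 0], np.nan)
cleaned["state"] = cleaned["state"].replace(["N/A", ""], np.nan)

# Now isna(), dropna(), and fillna() can see the missing values.
missing_after = cleaned.isna().sum()
print(missing_before)
print(missing_after)
```

Once the placeholders are real NaN values, the usual isna(), dropna(), and fillna() methods work on them like any other missing value.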
While in some cases we have to remove the data, sometimes we have to consider replacing it with a reasonable value. When deciding how to replace, or impute, the data, I’ve noticed that the first step is figuring out the type of each variable, such as whether it is continuous or categorical. I am still having a bit of trouble with that, since I usually end up looking back at the older lectures to double check. I would say I now understand that missing values in continuous data, or numerical data as the “Geron” book calls it, are often replaced with the variable’s median or mean. If the variable is categorical or discrete, we replace missing values with the variable’s mode instead, or in some cases with a new category string. I sometimes find myself questioning the whole logic behind imputing missing values, especially when using the mode. It feels a bit strange to fill in the same value repeatedly, and I’m still trying to understand why this doesn’t distort the dataset. The lecture emphasized that imputation is a necessary practice, but I’m still unclear about the boundary between an acceptable imputation and a situation where the feature should simply be dropped. Another question I still have is why we use 60% missing values as the threshold for removing a feature. Maybe I glossed over this in the book, but I am curious whether that number is based on previous research, common practice, or just a general guideline that depends heavily on the specific dataset. I may have to go back and re-read that part of this week’s material.
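The imputation rules above can be sketched in pandas as follows. The data and column names are made up for illustration, and the 0.60 cutoff is simply the threshold mentioned in our course material, not a universal rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical column and one categorical column.
df = pd.DataFrame({
    "income": [40000.0, np.nan, 52000.0, 61000.0, np.nan],
    "major": ["CS", "Math", None, "CS", "CS"],
})

# Fraction of missing values per column (isna() gives booleans, mean() averages them).
missing_fraction = df.isna().mean()

# Columns above the course's 60% threshold would be candidates for dropping.
to_drop = missing_fraction[missing_fraction > 0.60].index.tolist()

imputed = df.drop(columns=to_drop)
# Numerical column: fill with the median of the observed values.
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
# Categorical column: fill with the most frequent category (the mode).
imputed["major"] = imputed["major"].fillna(imputed["major"].mode()[0])

print(missing_fraction)
print(imputed)
```

An alternative for the categorical column is fillna("Unknown"), which keeps the fact that the value was missing visible instead of folding it into the most common category.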
On a different note, something that was quite easy for me to understand was the scaling of data. The idea of putting variables on a similar scale made sense, especially after seeing how KNN relies on distance. If one variable has values in the thousands and another has values between 0 and 1, the larger variable could have too much influence on the model even if it is not actually more important. I found z-score normalization easier to understand because it tells us how far a value is from the mean in terms of standard deviations. Unit interval scaling also makes sense because it puts the values between 0 and 1. One way I distinguish between the two is that z-score scaling is a good fit when a feature is roughly normally distributed and tends to cope better with outliers, while unit interval scaling works best when there are no extreme outliers, since a single extreme value squeezes all the other values toward one end of the range.
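Both scaling methods are short enough to write out by hand. This is a minimal NumPy sketch with made-up numbers, where the value 100 is deliberately included as an outlier to show how it affects unit interval scaling.

```python
import numpy as np

# Hypothetical feature values; 100.0 is an intentional outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Z-score normalization: distance from the mean in standard deviations.
z = (x - x.mean()) / x.std()

# Unit interval (min-max) scaling: map the minimum to 0 and the maximum to 1.
unit = (x - x.min()) / (x.max() - x.min())

print(z)
print(unit)
```

After unit interval scaling, the four ordinary values all land below 0.05 because the outlier defines the top of the range. The z-scores are also pulled by the outlier, but they are not forced into a fixed interval.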
KNN classification was quite easy to understand, as the idea is to "look at the closest points and follow the majority". If most of the nearby examples belong to one class, then the new point probably belongs to that class as well. What I found tricky wasn't the algorithm itself but rather the different accuracy evaluations. Training accuracy, test accuracy, and cross validation all measure how well the model performs, but each tells a different story. From my understanding, training accuracy measures the model's performance on the data it has already learned from. Test accuracy seems more important because it evaluates the model on data it has not seen before, which gives a better idea of how it will perform on future data. Cross validation evaluates the model on different parts of the training data before the final test set is used. As the lecture mentioned, this gives a way to evaluate the model without touching the test data.
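To tie the three accuracy measures together, here is a minimal sketch that writes KNN by hand in NumPy instead of using a library classifier. The synthetic two-cluster data, the helper names knn_predict and accuracy, the choice of k=3, and the 80/20 split with 4 folds are all assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Label each row of X_new by majority vote among its k nearest training points."""
    preds = []
    for row in X_new:
        dists = np.sqrt(((X_train - row) ** 2).sum(axis=1))  # Euclidean distance
        nearest = np.argsort(dists)[:k]                      # indices of the k closest
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])              # majority class
    return np.array(preds)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Synthetic data: two well-separated clusters, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then hold out the last 20% as the test set.
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

# Training accuracy: scored on points the model has already seen.
train_acc = accuracy(y_train, knn_predict(X_train, y_train, X_train))
# Test accuracy: scored on held-out points.
test_acc = accuracy(y_test, knn_predict(X_train, y_train, X_test))

# 4-fold cross validation using only the training data.
cv_scores = []
for fold in np.array_split(np.arange(len(X_train)), 4):
    keep = np.ones(len(X_train), dtype=bool)
    keep[fold] = False  # hold this fold out, train on the rest
    preds = knn_predict(X_train[keep], y_train[keep], X_train[fold])
    cv_scores.append(accuracy(y_train[fold], preds))
cv_acc = float(np.mean(cv_scores))

print(train_acc, test_acc, cv_acc)
```

Note that the cross validation loop never touches X_test, so the test set stays untouched for the final check. On real data, training accuracy is usually the most optimistic of the three because each training point counts itself among its own neighbors.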
Overall, this week helped me dive a bit into the intricacies of machine learning. A model can only be good if the data was properly preprocessed, which means checking for missing values and dropping or replacing them, scaling values when necessary, and choosing a sound evaluation procedure. I feel like I understand the general flow better now, but I still want more practice deciding when to impute or drop missing values, and I want to get better at interpreting accuracy scores without jumping to conclusions too quickly.

