Week 3 CST 383

This week in CST 383 we kept exploring the different kinds of variables in data science, continuous and discrete, by looking at ways to explore and visualize data. The first lectures focused on analyzing a single continuous variable with density plots, histograms, and boxplots, which give a visual sense of where the data is located and how it is spread out. The lectures also walked us through creating the plots in Python using a notebook‑style workflow. I found this especially helpful because a notebook blends code cells, text, and visual output in a way that makes the analysis feel more intuitive. Since each plot appears right underneath the cell that produced it, I can try different parameters and see right away which ones give the visualization I want.

After getting some practice with plotting through the homework and labs, I began to understand the intricacies of the different plots. Density plots give a smooth sense of shape, but that shape can change dramatically with the bandwidth. I found the bw_method parameter a bit confusing at first, because changing it from 0.15 to 0.5 completely smoothed out the distribution curve. That made me question when we would want to change it, and I found that the choice matters when we look at a dataset to uncover its skew and peaks. A small bandwidth captures the details of those peaks and skews but can make the curve noisy, while a larger bandwidth makes it easier to see the overall shape, like the shapes we observed when looking at the PDF and CDF in the prior week.

As for histograms, the visualization depends on how the bins are arranged. We may want to see the interval where the data clusters, and the bin boundaries can highlight or hide that pattern. Something that confused me came up in college lab #6, where we had to create superimposed histograms with transparency to see how two variables overlapped. At first I used plt.hist(), as the initial hint had mentioned, but the plots ended up with oddly shaped bars, and I had to set a specific bin range to match the model output. I then realized that all I had to do was call dataframe.plot.hist(), which produced output matching the model right away. It made me wonder how often small differences between plotting functions or methods end up affecting how the data is displayed, and whether there are general guidelines for when one plotting approach is more dependable than another.
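Both effects described above can be checked with numbers before drawing anything. The sketch below is only an illustration on made-up data, not the lab's dataset. It assumes, as far as I understand the pandas documentation, that the bw_method value of a pandas density plot is handed to scipy.stats.gaussian_kde, so it calls gaussian_kde directly. The second half shows one likely reason separate plt.hist() calls gave odd bars: each call picks its own bin edges from its own data, so the bar widths differ unless both calls share one set of edges.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# --- Bandwidth: same sample, two values of bw_method ---
# Synthetic two-cluster sample, invented for illustration (not the lab data).
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)])
grid = np.linspace(sample.min() - 3, sample.max() + 3, 400)

def kde_curve(values, bw_method):
    """Evaluate a Gaussian KDE of `values` on the shared grid."""
    return gaussian_kde(values, bw_method=bw_method)(grid)

def roughness(curve):
    """Sum of squared second differences: larger means a wigglier curve."""
    return float(np.sum(np.diff(curve, n=2) ** 2))

narrow = kde_curve(sample, 0.15)  # keeps detail: taller, sharper peaks
wide = kde_curve(sample, 0.5)     # smooths detail: lower, flatter curve

print("peak height:", narrow.max(), "vs", wide.max())
print("roughness:  ", roughness(narrow), "vs", roughness(wide))

# --- Histograms: separate bin edges vs. shared bin edges ---
# Two hypothetical variables with different spreads.
a = rng.normal(50, 5, 300)
b = rng.normal(70, 15, 300)

# Binned separately, each variable gets 10 bins over its OWN range,
# so the bars of the two histograms have different widths.
edges_a = np.histogram_bin_edges(a, bins=10)
edges_b = np.histogram_bin_edges(b, bins=10)
print("separate bin widths:", np.diff(edges_a)[0], np.diff(edges_b)[0])

# Binned with shared edges over the combined range, the bars line up.
shared_edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=10)
counts_a, _ = np.histogram(a, bins=shared_edges)
counts_b, _ = np.histogram(b, bins=shared_edges)
print("shared-edge counts:", counts_a, counts_b)
```

Passing the same array of edges as bins to both plt.hist() calls, together with an alpha value for transparency, should give same-width overlapping bars in a plot.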
Lastly, I wanted to reflect on boxplots and what I noticed when drawing several boxes in a single plot rather than creating subplots. The issue with plotting multiple variables on one set of axes is that they all share a single value axis. The plot becomes nearly useless because the variables with smaller values get flattened. Rather than creating subplots to fix this, is there a way to rescale the different variables so they fit in the same multi-box plot?
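One possible answer to the scaling question, offered as a sketch rather than the recommended approach, is to standardize each variable to z-scores before plotting, so every column is centered at 0 with a standard deviation of 1. The column names and values below are hypothetical, made up only to show two variables on very different scales.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical columns on very different scales (not the lab data).
df = pd.DataFrame({
    "tuition": rng.normal(30000, 8000, 200),
    "admit_rate": rng.uniform(0.1, 0.9, 200),
})

# z-score: subtract each column's mean, divide by its standard deviation.
z_scores = (df - df.mean()) / df.std()

# Every column now has mean ~0 and standard deviation ~1.
print(z_scores.describe().loc[["mean", "std"]])
```

Calling z_scores.plot.box() would then draw both boxes on one comparable axis. The tradeoff is that the axis no longer shows the original units, so subplots are still the better choice when the actual values matter.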

We also examined some well-known distributions of a continuous variable, in particular the continuous uniform distribution and the normal distribution. I thought the lecture covered this concept well. I had encountered Gaussian distributions before in a statistics class many years ago, but that mostly meant quickly going over the 68‑95‑99.7 rule. What surprised me was how simple the parameterization of the normal distribution really is: just the mean and the standard deviation. I hadn’t fully realized that these two parameters are enough because the bell‑curve shape is fixed, so once you know the center and the spread, everything else follows automatically. It also made sense why the mean and median coincide for a normal distribution: the curve is symmetric about its center. This was something last week’s lectures had hinted at when discussing the shapes of PDFs and CDFs. On a similar note, the continuous uniform distribution was easy to understand after reviewing the lecture slides, especially given the idea that “values from a to b all have the same density”. With that, the only parameters we need are a and b, where a < b, to work out the different statistics. After the lecture, I feel like I can now create and use these distributions in SciPy and get the key statistics directly from the distribution object.
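As a small sketch of what working with these distribution objects can look like, the example below builds a normal and a continuous uniform distribution in SciPy. The specific numbers (mean 100, standard deviation 15, a = 2, b = 8) are arbitrary choices for illustration. One detail worth noting is that scipy.stats.uniform takes loc and scale, which cover the interval from loc to loc + scale, so a and b translate to loc=a and scale=b - a.

```python
from scipy import stats

# Normal distribution: fully described by its mean and standard deviation.
mu, sigma = 100, 15  # arbitrary example values
normal = stats.norm(loc=mu, scale=sigma)

# Mean and median coincide because the curve is symmetric.
print(normal.mean(), normal.median(), normal.std())

# Probability of landing within one standard deviation of the mean,
# the "68" part of the 68-95-99.7 rule.
within_one_sd = normal.cdf(mu + sigma) - normal.cdf(mu - sigma)
print(round(within_one_sd, 4))

# Continuous uniform on [a, b]: SciPy uses loc=a and scale=b - a.
a, b = 2, 8  # arbitrary example values with a < b
uniform = stats.uniform(loc=a, scale=b - a)

# Mean is (a + b) / 2 and variance is (b - a)**2 / 12.
print(uniform.mean(), uniform.var())

# The density is the same everywhere inside [a, b]: 1 / (b - a).
print(uniform.pdf(3), uniform.pdf(7))
```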

The other material this week covered how to calculate and analyze joint and conditional probabilities, along with interpreting the relationship between two variables through covariance and the Pearson correlation coefficient. Covariance was a little confusing at first because the formula looked very similar to variance. After looking through the slides, I noticed that variance is basically the covariance of a variable with itself. Covariance extends the idea to two variables: rather than comparing one variable to itself, we measure how two variables move together after each has been centered around its own mean. The Pearson correlation coefficient then takes the covariance and normalizes it by dividing by the product of the two standard deviations, which gives a unitless value between -1 and 1.
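The relationship between variance, covariance, and the Pearson coefficient can be checked by hand with NumPy. This is a minimal sketch on synthetic data. The helper sample_cov is a name introduced here for illustration, and it uses the sample (n - 1) denominator, which matches the defaults of np.cov.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: y tends to rise with x, plus some noise.
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 1, 500)

def sample_cov(u, v):
    """Average product of the centered values, with an n - 1 denominator."""
    return float(np.sum((u - u.mean()) * (v - v.mean())) / (len(u) - 1))

cov_xy = sample_cov(x, y)
var_x = sample_cov(x, x)  # covariance of x with itself is its variance

# Pearson r: covariance divided by the product of the standard deviations.
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print("variance of x:", var_x, "vs np.var:", np.var(x, ddof=1))
print("covariance:   ", cov_xy, "vs np.cov:", np.cov(x, y)[0, 1])
print("Pearson r:    ", r, "vs np.corrcoef:", np.corrcoef(x, y)[0, 1])
```

Rescaling y (say, changing its units) would change the covariance but leave r unchanged, which is the point of the normalization.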

There was a lot of material this week, but I wanted to end by discussing something about discrete variables that caught my attention. The most significant part for me was learning that the mean of a discrete distribution is a weighted average, not just the plain average of the possible values. I used to think that if the possible values were 0, 1, 2, and 3, then the mean would simply be the average of those numbers, but the PMF example showed that some values are more likely than others, so the probabilities act as weights. As for well-known discrete distributions, we looked at the discrete uniform distribution and the binomial distribution. I found the SciPy examples for these useful because creating the distribution objects let me sample values and check statistics like the mean and variance directly. Something I need to keep in mind for the future is how the parameters work for each distribution. For a discrete uniform distribution, I need to pay attention to the range of possible values, especially in SciPy, where the upper endpoint is not included. For a binomial distribution, I need to remember that the important parameters are the number of trials and the probability of success.
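To make the points above concrete, here is a short sketch. The PMF probabilities are made up for illustration (they are not the lecture's example), and the binomial parameters n = 10 and p = 0.3 are arbitrary.

```python
import numpy as np
from scipy import stats

# Weighted average: possible values 0..3 with a made-up PMF.
values = np.array([0, 1, 2, 3])
probs = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical, sums to 1

plain_average = values.mean()                  # ignores the probabilities
weighted_mean = float(np.sum(values * probs))  # probabilities act as weights
print(plain_average, weighted_mean)            # 1.5 versus about 2.0

# Discrete uniform over 0, 1, 2, 3: the upper endpoint is EXCLUDED in SciPy,
# so the arguments are low=0 and high=4.
disc_uniform = stats.randint(0, 4)
print(disc_uniform.pmf(3), disc_uniform.pmf(4))  # 3 is possible, 4 is not
print(disc_uniform.mean(), disc_uniform.var())

# Binomial: number of trials n and probability of success p.
binomial = stats.binom(n=10, p=0.3)
print(binomial.mean(), binomial.var())  # n * p and n * p * (1 - p)

# Draw a few samples from the distribution object.
draws = binomial.rvs(size=5, random_state=0)
print(draws)
```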
