Week 4 CST 383
This week in CST 383, the lectures focused on working with two or more discrete/categorical variables and on how to analyze and visualize the relationships between them. The core concepts were crosstabs, joint and conditional probabilities, normalizing distributions to answer questions about the variables, and visualizing the results with grouped/stacked bar plots and heatmaps. We also looked at visualizing more than two variables at once through hue, style, size, and faceting.
The concept that really made me slow down and think this week was the role of normalization in crosstabs. I already understood that a crosstab, or contingency table, shows the counts for each combination of two variables. What confused me was when to use normalize='index', normalize='columns', or normalize=True. Most of my confusion came up while creating crosstabs and looking at the index and column labels: given a probability question, I kept overthinking whether I should be prioritizing the rows or the columns. What finally helped was paying close attention to the wording of the question. From my understanding, if the question asks about “and”, I should think of joint probability, where each cell represents one combination and the whole table sums to 1 (100%). In those cases I use normalize=True. If the question says “given”, or fixes one variable and asks about the other, that is conditional probability. If the “given” variable is in the rows, I use normalize='index' so that each row sums to 1; if it is in the columns, I use normalize='columns' so that each column sums to 1. It also helps to write the question in the form P(result | given variable): the variable after the bar is the given one, and it is the one to normalize over, whether it sits in the rows or the columns.
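To make the three normalization options concrete for myself, here is a minimal sketch on a tiny made-up table. The column names (occupation, party) and the values are hypothetical and only there for illustration; the only real API used is pd.crosstab with its normalize parameter.

```python
import pandas as pd

# Hypothetical toy data: names and values are made up for illustration.
df = pd.DataFrame({
    "occupation": ["teacher", "teacher", "engineer", "engineer", "engineer", "nurse"],
    "party":      ["A", "B", "A", "A", "B", "B"],
})

# Raw counts for every (occupation, party) combination.
counts = pd.crosstab(df["occupation"], df["party"])

# "and" questions: joint probability, the whole table sums to 1.
joint = pd.crosstab(df["occupation"], df["party"], normalize=True)

# "given occupation" (occupation is the index): each row sums to 1.
given_row = pd.crosstab(df["occupation"], df["party"], normalize="index")

# "given party" (party is in the columns): each column sums to 1.
given_col = pd.crosstab(df["occupation"], df["party"], normalize="columns")

print(joint, given_row, given_col, sep="\n\n")
```

Reading one cell three ways: the (engineer, A) cell is P(engineer and A) in joint, P(A | engineer) in given_row, and P(engineer | A) in given_col.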
Another useful way for me to think about this is that the “given” variable is the group I am starting inside. Using some context from the campaign homework, if the question asks, “For each occupation, what percent of contributions fall into each bin?” then occupation is the given variable. If occupation is the index of the crosstab, each row should sum to 1, so normalize='index' is the right choice. But if the question asks, “Of all contributions in the 0–25 bin, what percent came from each occupation?” then the bin is the given variable. If the bins are the columns, each column should sum to 1, so normalize='columns' is the right choice. This distinction was confusing at first because both tables are built from the same counts, but they answer different questions.
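Here is a rough sketch of the two questions side by side. This is not the actual campaign dataset: the occupations, amounts, and bin edges below are made up, and the column names (occupation, amount, bin) are my own placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the campaign data (values are invented).
df = pd.DataFrame({
    "occupation": ["retired", "retired", "retired",
                   "teacher", "teacher", "teacher", "engineer"],
    "amount":     [10, 20, 50, 15, 5, 200, 500],
})

# Bin the contribution amounts; the bin edges are an assumption for this sketch.
df["bin"] = pd.cut(df["amount"], bins=[0, 25, 100, 1000],
                   labels=["0-25", "25-100", "100-1000"])

# "For each occupation, what fraction of contributions fall into each bin?"
# Occupation is the given variable and sits in the index, so rows sum to 1.
by_occupation = pd.crosstab(df["occupation"], df["bin"], normalize="index")

# "Of all contributions in a bin, what fraction came from each occupation?"
# The bin is the given variable and sits in the columns, so columns sum to 1.
by_bin = pd.crosstab(df["occupation"], df["bin"], normalize="columns")

# Same cell, different question, different number.
print(by_occupation.loc["retired", "0-25"])  # share of retired contributions that are 0-25
print(by_bin.loc["retired", "0-25"])         # share of 0-25 contributions that are from retired
```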
One question that still lingers for me this week is how best to handle “or” relationships visually. Crosstabs are very useful for “and” relationships because each cell is one specific combination of two variables, and they also work well for “given” relationships when normalized. An “or” relationship, however, does not fit into a single cell: it usually spans multiple cells across a row, a column, or both. My current understanding is that it may be better to create a new boolean column for the “or” condition and then visualize that column separately with a bar plot. This feels intuitive because it collapses the grouping logic into a single variable that can be analyzed directly. I’m not entirely sure whether this is the standard or most effective method, or whether there are established ways to represent “or” relationships directly within a crosstab.
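As a way of checking my own thinking, here is a small sketch that computes an “or” probability two ways: with the boolean-column idea, and by adding up cells of a joint crosstab using P(A or B) = P(A) + P(B) - P(A and B). The data and column names (occupation, size_bin) are hypothetical, and I am not claiming either approach is the standard one.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    "occupation": ["retired", "retired", "teacher", "teacher", "engineer", "engineer"],
    "size_bin":   ["small", "large", "small", "large", "small", "large"],
})

# Approach 1: collapse the "or" condition into one boolean column.
df["retired_or_small"] = (df["occupation"] == "retired") | (df["size_bin"] == "small")
p_bool = df["retired_or_small"].mean()  # fraction of rows where the condition is True

# Approach 2: read it off the joint crosstab.
joint = pd.crosstab(df["occupation"], df["size_bin"], normalize=True)
p_retired = joint.loc["retired"].sum()      # whole "retired" row
p_small = joint["small"].sum()              # whole "small" column
p_both = joint.loc["retired", "small"]      # the one cell counted twice
p_table = p_retired + p_small - p_both

print(p_bool, p_table)
```

Both numbers should agree. For the plot, df["retired_or_small"].value_counts(normalize=True).plot.bar() would draw the boolean column as a two-bar chart.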
The visualization lecture helped me see that tables and plots are not just different formats, but different ways of answering questions. A grouped bar plot can show counts, but if the question asks about percentages within groups, normalizing the crosstab before plotting makes the answer easier to read. I also noticed that stacked bar plots are useful for showing how a group is divided into different categories. Another helpful resource for understanding this week's concepts was the campaign homework, which let me practice several ways of visualizing the variables in that dataset. For example, I plotted differently grouped crosstabs as horizontal and vertical bar plots, both for fractions of contributions and for contributions by group. The first few problems, like #2 and #3, caught my attention. The histogram of contribution amounts was heavily skewed right, so to get a better look at the region where most of the data sat, we created a density plot restricted to a limited range, which let us zoom in on the smaller contributions. It was also interesting to go back to using groupby to aggregate the data into categories for a bar plot.
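The idea of normalizing before plotting can be sketched like this. The data is invented and the column names (occupation, size_bin) are placeholders, not the homework's actual columns. I set the Agg backend only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    "occupation": ["retired", "retired", "retired",
                   "teacher", "teacher", "teacher", "engineer"],
    "size_bin":   ["small", "small", "medium", "small", "small", "large", "large"],
})

# Fractions within each occupation: every row sums to 1.
fractions = pd.crosstab(df["occupation"], df["size_bin"], normalize="index")

fig, (ax_grouped, ax_stacked) = plt.subplots(1, 2, figsize=(10, 4))

# Grouped bars: compare bins side by side within each occupation.
fractions.plot.bar(ax=ax_grouped, title="grouped")

# Stacked bars: each bar reaches 1, showing how an occupation is divided up.
fractions.plot.bar(stacked=True, ax=ax_stacked, title="stacked")
ax_stacked.set_ylabel("fraction within occupation")

fig.tight_layout()
fig.savefig("fractions.png")
```

Switching to fractions.plot.barh(...) gives the horizontal version of the same plots.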
The lecture on two or more variables showed that a visualization can carry more detail when we add extra variables to it. We can use color, marker style, point size, or separate panels to bring in a third or fourth variable. For example, a scatterplot shows two quantitative variables, and then hue, style, or size can add more information without needing a completely different plot. We also looked at violin plots, bar plots with hue, and FacetGrid, which makes it easier to compare groups side by side. The biggest takeaway from this lecture is to be careful when adding more variables: each one can make a visualization more useful, but it can also make the plot much harder to read if the encoding is not chosen well. Deciding which plot type and which visual elements to use is what keeps the plot readable and easy to interpret.
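A small sketch of those encodings with seaborn, on a made-up DataFrame. All column names (hours, score, section, mode, absences) are hypothetical. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    "hours":    [1, 2, 3, 4, 5, 6, 7, 8],
    "score":    [52, 55, 61, 64, 70, 72, 80, 85],
    "section":  ["A", "B", "A", "B", "A", "B", "A", "B"],
    "mode":     ["online", "online", "in person", "in person",
                 "online", "online", "in person", "in person"],
    "absences": [5, 4, 4, 3, 2, 2, 1, 0],
})

# One scatterplot, five variables: x, y, plus color, marker style, and point size.
ax = sns.scatterplot(data=df, x="hours", y="score",
                     hue="section", style="mode", size="absences")

# Faceting: one panel per section instead of packing everything into one plot.
# relplot returns a FacetGrid.
grid = sns.relplot(data=df, x="hours", y="score", hue="mode", col="section")
grid.savefig("facets.png")
```

The first plot is dense on purpose: with three extra encodings at once the legend gets long, which is the readability trade-off from the lecture. The faceted version spreads the same information across panels.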
My biggest takeaway from this week is a better understanding of the relationships between variables, and of how to work out what a question is really asking by paying attention to its specific wording. It also improved my sense of which plots suit which combinations of variables, although I think more practice would help. Even though I still have lingering questions about the best way to visualize “or” relationships, I feel much more confident interpreting and constructing crosstabs, and I’m beginning to appreciate how much clarity comes from framing the question correctly before choosing the method. This week ultimately helped me move from simply computing probabilities or generating plots to understanding how to select the right representation for the question at hand, and how thoughtful visualization can reveal structure that raw tables alone may hide.

