Week 2 CST383

This week in CST383 we covered the fundamentals of pandas in Python: creating Series and DataFrames and understanding how their operations, indexing, aggregation, grouping, and other features like string methods work together to create and manipulate data effectively. We also went over the types of variables used in data science and how to obtain data from different sources. We then covered descriptive statistics that summarize data, such as the mean, median, quantiles, variance, and standard deviation. Lastly, we looked at two functions that describe the distribution of a continuous variable: the PDF (probability density function) and the CDF (cumulative distribution function).
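All of those summary statistics are one-liners on a pandas Series. A minimal sketch, using a small made-up sample (the values here are invented just for illustration):

```python
import pandas as pd

# Hypothetical sample of ages, just to exercise the summary methods.
ages = pd.Series([22, 25, 31, 31, 40, 58])

print(ages.mean())          # arithmetic mean -> 34.5
print(ages.median())        # middle value -> 31.0
print(ages.quantile(0.25))  # first quartile (interpolated)
print(ages.var())           # sample variance (pandas uses ddof=1)
print(ages.std())           # sample standard deviation
```

`ages.describe()` prints most of these in one call, which is handy for a first look at a new dataset.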

One of the significant moments this week was learning to distinguish between implicit and explicit indexing. At first, .iloc, .loc, and even normal bracket indexing appeared straightforward, but in practice the labs made them a bit of a headache to manage. I eventually understood it, but it took some time during the week. What confused me was the inclusivity of slicing with .loc, because it goes against the usual Python slicing rules discussed the previous week: .iloc, or integer-based indexing, behaves like regular Python slicing, while .loc includes the end label. Another idea that caught my attention was that DataFrames interpret normal brackets as column selection first. In one of the homeworks I tried to access a DataFrame with df[[0, 1]], which threw a KeyError. It worried me at first because I was not familiar with the structure, but after looking into it I realized pandas was trying to select columns whose labels are literally 0 and 1. I had to resort to df.iloc[:, [0, 1]] to access those two columns by position.
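The difference is easy to see side by side. A small sketch with an invented DataFrame (string row labels make the .loc/.iloc contrast obvious):

```python
import pandas as pd

# Made-up data; the row labels are strings on purpose.
df = pd.DataFrame(
    {"age": [22, 35, 58], "height": [160, 175, 182]},
    index=["ana", "ben", "cam"],
)

# .iloc slices like regular Python: the end position is EXCLUDED.
first_two = df.iloc[0:2]            # rows "ana" and "ben" only

# .loc slices by label and INCLUDES the end label.
through_cam = df.loc["ana":"cam"]   # all three rows

# Plain brackets mean column selection by label, so df[[0, 1]]
# raises a KeyError here: there are no columns labeled 0 or 1.
# Positional column selection needs .iloc instead:
by_position = df.iloc[:, [0, 1]]    # same as df[["age", "height"]]
```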

Another concept that stood out to me was index alignment. It was interesting to learn that pandas aligns data by label even when the sizes don't match. Something I need to keep in mind is that when combining data I have to watch for the NaN values produced for missing index labels. Forgetting to do so will cause trouble later when I try to access specific data after combining two Series or DataFrames.

Aggregation, or combining of data, was straightforward, and it was interesting to see the different features of the aggregate function. Something that did confuse me was during one of the labs: passing a NumPy function like df['age'].aggregate(np.mean) gave a warning. At first I thought I had made a mistake, because the syntax followed what was in the lecture slides. After looking up the warning I learned how pandas handles callables: even when passed np.mean, pandas does not actually use the NumPy version but instead calls its own Series.mean(). The warning was essentially saying that in a future version pandas will use the np.mean callable directly rather than substituting its own method. It made me question whether this is legacy behavior in pandas, and wonder whether other NumPy functions passed this way do the same thing. Lastly, I found it fascinating to define a function myself and pass it through the aggregate function. I wonder if we'll be doing more of that in the future, especially with lambda functions. As I continued through the lectures, such as aggregation with groupby, I found that pandas is very similar to SQL, not entirely identical but quite close. The groupby function was one of the moments I realized it, because it essentially works the same way as SQL's GROUP BY (they even share the name). The biggest takeaway from the lectures was the split-apply-combine technique, which helped me understand what groupby actually does in pandas.
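A short sketch tying these pieces together, with an invented DataFrame. Passing the string "mean" sidesteps the warning entirely, a custom callable gets the whole Series, and groupby does the split-apply-combine:

```python
import pandas as pd

# Hypothetical data just for the sketch.
df = pd.DataFrame({
    "major": ["CS", "CS", "Bio", "Bio"],
    "age":   [20, 24, 21, 27],
})

# The string name avoids the FutureWarning raised for np.mean.
overall = df["age"].aggregate("mean")                       # 23.0

# A user-defined callable works too; pandas hands it the Series.
spread = df["age"].aggregate(lambda s: s.max() - s.min())   # 7

# Split-apply-combine: split rows by "major", apply mean to each
# group, combine the results into one Series. Much like SQL's
# SELECT major, AVG(age) ... GROUP BY major.
by_major = df.groupby("major")["age"].mean()   # Bio -> 24.0, CS -> 22.0
```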

Lastly, the most confusing part of this week was estimating probabilities from a PDF or CDF plot. I had to re-watch the lectures and re-read the slides to understand how to identify the mean, median, skew, and probabilities just by looking at the graph. What I understood is that skewed data, whether left or right, has a longer tail on that side, and the mean is pulled toward the tail while the peak stays near the mode. Practically, I need to keep practicing how to visually interpret these plots and build a better understanding of distribution shapes.
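The same readings can be checked numerically. A sketch using the empirical CDF of a small made-up right-skewed sample (the height of the CDF at x is just the fraction of values at or below x):

```python
import numpy as np

# Invented right-skewed sample: most values small, one long-tail value.
data = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 9.0])

def ecdf(sample, x):
    """Empirical CDF: fraction of observations <= x."""
    return np.mean(sample <= x)

# P(X <= 3) is the CDF height at 3.
p_le_3 = ecdf(data, 3.0)                       # 4/6 of the sample

# P(a < X <= b) is the difference of two CDF heights.
p_between = ecdf(data, 5.0) - ecdf(data, 2.0)  # 2/6 of the sample

# Right skew: the long tail pulls the mean above the median.
print(data.mean(), np.median(data))
```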
