Week 1 CST 383

This week in CST 383, Introduction to Data Science, we learned the fundamentals of machine learning and were introduced to some of its types, such as supervised learning and unsupervised learning. We also took a look at the Python ecosystem, with a large focus on the NumPy library, and at general environment setup with Anaconda.

Since I haven't used Python much since the Computer Networking course last semester, I found it helpful to revisit the essentials through this week's lectures and slides. One topic that was never fully covered before, and that I have now been introduced to, is Python slicing. I was surprised by how powerful slicing is within NumPy arrays compared to regular Python lists. Slicing a regular Python list creates a copy of the data, so modifying the sliced section does not alter the original list. NumPy array slicing, on the other hand, can create a view that references the same underlying data. In that case, modifying the slice of a NumPy array can also modify the original array, and vice versa: making changes to the original array can change what the sliced portion shows.
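A small sketch of the difference I described above (the arrays here are made-up examples, not from the labs):

```python
import numpy as np

# Slicing a regular Python list makes a copy: the original is untouched.
py_list = [1, 2, 3, 4, 5]
py_slice = py_list[1:4]
py_slice[0] = 99
print(py_list)      # [1, 2, 3, 4, 5] -- original unchanged

# Slicing a NumPy array returns a view onto the same underlying data.
arr = np.array([1, 2, 3, 4, 5])
arr_slice = arr[1:4]
arr_slice[0] = 99
print(arr)          # [ 1 99  3  4  5] -- original changed too

# An explicit .copy() breaks the link when independence is needed.
arr2 = np.array([1, 2, 3, 4, 5])
safe = arr2[1:4].copy()
safe[0] = 99
print(arr2)         # [1 2 3 4 5] -- original unchanged
```

The `.copy()` call at the end is how you opt back into list-like behavior when you want a slice you can safely modify.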

When starting the lecture this week, I assumed that slicing would behave the same way in NumPy as it does in regular Python lists, but that assumption changed when views and copies were discussed. It also gave me a deeper understanding of why NumPy is so efficient in terms of storage and operations on numerical data. In the case of slicing, the efficiency comes from NumPy returning views instead of copies, which avoids unnecessary memory usage and helps speed up operations on large numerical datasets. It made me question how many other NumPy operations rely on similar memory-efficient design choices that I have yet to notice. Reflecting on NumPy's performance, I also want to briefly mention something that intrigued me: vectorization. I am amazed that we can apply an operation to an entire array at once without writing a Python loop. I wonder how these vectorized operations work under the hood and why they are so much faster than Python loops. And are there situations where vectorized operations do not perform well compared to regular loops?
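A quick experiment, under the assumption that comparing a plain Python loop against the equivalent vectorized expression on the same data is a fair illustration (the array size here is arbitrary):

```python
import numpy as np
import time

n = 1_000_000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Explicit Python loop: one interpreted iteration per element.
start = time.perf_counter()
loop_result = [a[i] + b[i] for i in range(n)]
loop_time = time.perf_counter() - start

# Vectorized: the per-element loop runs in compiled C inside NumPy.
start = time.perf_counter()
vec_result = a + b
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")
```

Both versions produce the same sums, but the vectorized one skips the Python interpreter's per-element overhead, which is the main reason for the speed gap.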

As I was working on the homework assignment, I noticed many problems asking for the mean of a NumPy array. I was mostly using the np.mean() function, but I also came across a similar function, np.average(). At first I was not sure why both existed, which raised the question of whether they have the same functionality. I looked into it using a resource from one of the lecture slides and learned that np.mean() computes the standard arithmetic mean. While np.average() also does this, its focus is on computing weighted averages when weights are passed as an argument. It made me curious whether we are going to work with weighted averages later in this course, and how those weights would be determined. I realized that I should take a closer look at the NumPy documentation to see what other similar pairs of methods this library provides.
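A short sketch of the difference; the scores and weights below are hypothetical values I made up for illustration:

```python
import numpy as np

scores = np.array([80.0, 90.0, 70.0])

# np.mean: plain arithmetic mean.
plain_mean = np.mean(scores)                   # 80.0

# np.average with no weights behaves the same as np.mean.
unweighted = np.average(scores)                # 80.0

# np.average with weights computes sum(w_i * x_i) / sum(w_i).
weights = np.array([0.2, 0.5, 0.3])            # hypothetical grading weights
weighted = np.average(scores, weights=weights) # 0.2*80 + 0.5*90 + 0.3*70 = 82.0

print(plain_mean, unweighted, weighted)
```

So the two functions agree unless the `weights` argument is supplied, at which point np.average shifts the result toward the more heavily weighted values.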

Something I must review and make sure I understand for this week's quiz is retrieving sections of multidimensional arrays with slicing, such as problem 12 in the numpy-2D-lab, where I needed to get the last two rows of an array at the fourth column. Working with 1-D arrays was fairly easy to understand since I was looking at the data sequentially, but with multidimensional arrays I need to think carefully about both the row and the column parts of the slice. Additionally, I need to improve at identifying the potential errors that can occur, especially ones like problems #16 and #18 of the numpy-1D-lab. I need to develop a better understanding of how masking works with different-sized arrays and how broadcasting works with multidimensional arrays.
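To practice, here is a sketch of those three ideas on a made-up 5x6 array (not the lab's actual data), assuming "fourth column" means column index 3:

```python
import numpy as np

a = np.arange(30).reshape(5, 6)   # example 5x6 array

# 2-D slicing: last two rows, fourth column (row slice, then column index).
last_two_at_col4 = a[-2:, 3]
print(last_two_at_col4)           # [21 27]

# Masking: a boolean mask must match the array's shape.
x = np.array([1, 2, 3, 4])
mask = x > 2
print(x[mask])                    # [3 4]
# x[np.array([True, False])]     # raises an error: mask has length 2, x has length 4

# Broadcasting: a (5,6) array plus a length-6 row vector works,
# because the trailing dimensions match; a length-5 vector would not.
row = np.arange(6)
print((a + row).shape)            # (5, 6)
# a + np.arange(5)               # raises an error: shapes (5,6) and (5,) don't align
```

The commented-out lines show the kind of shape-mismatch errors the 1-D lab problems get at: masking fails when the mask's shape doesn't match, and broadcasting fails when trailing dimensions can't be aligned.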
