NumPy#
NumPy stands for Numerical Python. It is a package for storing (primarily) numeric information in multidimensional arrays and carrying out operations on those arrays efficiently. If you’re familiar with matrices in mathematics, often used in linear algebra, then you’re already familiar with two-dimensional arrays. If you’re not, we’ll get you up to speed here.
A reminder that to run any of the NumPy code below you’ll first need to import numpy. Conventionally, Pythonistas (people who code in Python) import NumPy using import numpy as np, so we’ll encourage that here. With this import statement, any time you want to reference an object in NumPy, you can do so with just the two letters np.
Be sure to import numpy using import numpy as np before attempting to run any of the included code in this section.
import numpy as np
Homogenous Data#
First and foremost, the NumPy package was developed around the concept of an array. Arrays are useful for storing homogenous data, that is, data that are all of the same type. Most often this is numeric information. For example, if you had numeric recordings from a bunch of participants across multiple visits to the lab, you could imagine storing each participant’s data in a row, with the recordings from each visit in columns. (If you’re picturing a spreadsheet with numbers in its cells, then you’ve got the right idea.) NumPy arrays are great for storing this type of information! Importantly, NumPy supports multidimensional arrays: 2D arrays, like data stored in rows and columns, as well as arrays with three or more dimensions.
Up to this point, if we wanted to store such information, we could have stored the numbers in a list, or maybe in a dictionary with the participant’s identifier as the key. And while lists of lists are possible, managing them can be a nightmare and operating on them is not trivial. This is why the development of the NumPy array was critical.
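To make the contrast concrete, here is a minimal sketch that averages each visit (column) across participants, first with a list of lists and then with a NumPy array. The recordings are made-up numbers for illustration, and the mean() method with its axis parameter is covered later in this section.

```python
import numpy as np

# Hypothetical recordings: one row per participant, one column per visit.
recordings = [[3, 7, 1, 8],
              [4, 9, 2, 6],
              [5, 3, 10, 7]]

# With a list of lists, averaging each visit (column) takes manual looping.
n_participants = len(recordings)
n_visits = len(recordings[0])
visit_means_list = [sum(row[i] for row in recordings) / n_participants
                    for i in range(n_visits)]

# With a NumPy array, the same calculation is a single method call.
recordings_array = np.array(recordings)
visit_means_array = recordings_array.mean(axis=0)

print(visit_means_list)
print(visit_means_array)
```

Both approaches give the same four averages; the array version simply states the intent (mean of each column) directly.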
The NumPy array#
In programming, the term array refers to a data structure that enables the storage and retrieval of data. When we reference NumPy arrays, we often discuss each number being stored in a “cell” of a grid.
A simple one-dimensional array would store homogenous (again, typically numeric) data in something that looks a lot like a list:
5 | 1 | 8 | 3 |
However, arrays really shine when we work with data in more than a single dimension, such as a two-dimensional array:
3 | 7 | 1 | 8 |
4 | 9 | 2 | 6 |
5 | 3 | 10 | 7 |
6 | 1 | 8 | 4 |
Arrays can be more than two dimensional; however, for our purposes, we’ll stick to only working with 2D arrays for now.
Now that we have an idea of the types of information stored in arrays and what they look like, we’ll discuss the ground rules for NumPy arrays. In NumPy, arrays must:
Store information of the same type (homogenous data)
Remain the same total size once created
Be rectangular (meaning every row of a 2D array must have the same number of columns)
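The three rules above can be seen in action with a short sketch. The variable names are only illustrative, and the behavior described for ragged input reflects recent NumPy releases as far as we know (older releases handled it differently), so that part is wrapped in a try/except rather than relied upon.

```python
import numpy as np

# Rule 1 (same type): mixing integers and a float gives an all-float array.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                    # float64

# Rule 2 (fixed size): np.append() does not grow the original array;
# it returns a new, larger array and leaves the original untouched.
original = np.array([1, 2, 3])
extended = np.append(original, 4)
print(original.size, extended.size)   # 3 4

# Rule 3 (rectangular): rows of different lengths do not form a 2D array.
# Recent NumPy versions raise a ValueError here; older versions instead
# built a 1D array of Python list objects.
try:
    ragged = np.array([[1, 2], [3]])
    print("ragged input produced dtype:", ragged.dtype)
except ValueError as error:
    print("ragged input rejected:", error)
```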
The principal object within NumPy is the ndarray (which stands for N-dimensional array). Here we’ll create our first two arrays:
array_0 = np.array([[1, 2], [3, 4]])
array_0
array([[1, 2],
[3, 4]])
array_1 = np.array([[5, 6], [7, 8]])
array_1
array([[5, 6],
[7, 8]])
In the output above, we can easily see that each of these arrays is two-dimensional. Each has two rows and two columns. In array_0, the first row contains the integers 1 and 2 and the second row 3 and 4.
We use arrays because they maintain this structure, making row and column operations feasible. If we were to simply use a list of lists instead of an array, Python would have no notion of rows and columns: the result is just a list whose elements happen to be lists, as demonstrated here:
[[1, 2], [3, 4]]
[[1, 2], [3, 4]]
Basic operations#
One of the many advantages of using ndarrays is that you can easily carry out operations on your arrays. For example, your two arrays can be added together using + to carry out matrix addition.
array_0 + array_1
array([[ 6, 8],
[10, 12]])
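Similarly, element-wise multiplication is equally simple with *. Note that * multiplies corresponding elements; matrix multiplication in the linear algebra sense uses the @ operator instead.
array_0 * array_1
array([[ 5, 12],
[21, 32]])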
While we won’t be covering the mathematical principles underlying these operations here, we can see that mathematically operating on an array is feasible in a way that was not possible with the variable types discussed up to this point.
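Because the difference between multiplying element by element and multiplying matrices is easy to miss, here is a small sketch comparing the two. The arrays a and b are stand-ins with the same starting values as array_0 and array_1; the @ operator is part of standard Python and NumPy, though it is not otherwise covered in this section.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# * multiplies each element by the element in the same position.
elementwise = a * b
print(elementwise)
# [[ 5 12]
#  [21 32]]

# @ carries out matrix multiplication (rows of a times columns of b).
matrix_product = a @ b
print(matrix_product)
# [[19 22]
#  [43 50]]
```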
Attributes#
Because the ndarray is the core object in NumPy, there are a number of helpful attributes (and methods; we’ll get there) associated with this object. This is one place where object-oriented programming is particularly helpful: the attributes and methods that are useful for working with data are attached to the ndarray object itself.
shape#
The first thing we often want to know about an array is its shape: how many rows? How many columns? The shape attribute stores this information:
array_0.shape
(2, 2)
The (2, 2) reports how many rows and how many columns are in the array. For a 2D array, the first number is always the number of rows and the second the number of columns.
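As a quick illustration of how shape reads for arrays that are not square, and for arrays with one or three dimensions, consider the sketch below. The array names are just for this example.

```python
import numpy as np

one_d = np.array([5, 1, 8, 3])
two_d = np.array([[3, 7, 1, 8],
                  [4, 9, 2, 6]])
three_d = np.zeros((2, 3, 4))

print(one_d.shape)    # (4,)       a single dimension with 4 elements
print(two_d.shape)    # (2, 4)     2 rows, 4 columns
print(three_d.shape)  # (2, 3, 4)  one number per dimension
```

The shape tuple always has one entry per dimension, so its length tells you how many dimensions the array has.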
size#
The total number of elements stored within an array can be accessed with the size attribute:
array_0.size
4
Here, we see that there are four total elements within the array_0 object.
dtype#
As noted above, ndarrays store homogenous information. Typically, these will be numbers, but they aren’t required to be numbers. To determine the data type stored in the array, the dtype attribute can be used:
array_0.dtype
dtype('int64')
Above, we see that the values stored within array_0 are all integers.
Note: There are additional array attributes, described in the NumPy documentation; however, most are beyond the scope of what is required here.
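For the curious, here is a brief sketch of a few of those additional attributes (ndim, itemsize, nbytes, and T), along with the astype() method for converting to another data type. None of these is required for what follows. The dtype is set explicitly to np.int64 so that the byte counts are the same on every platform.

```python
import numpy as np

example = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(example.ndim)      # 2   number of dimensions
print(example.itemsize)  # 8   bytes used by each int64 element
print(example.nbytes)    # 48  total bytes: 6 elements * 8 bytes
print(example.T.shape)   # (3, 2)  T is the transposed array

# astype() returns a new array with the requested data type.
as_floats = example.astype(np.float64)
print(as_floats.dtype)   # float64
```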
Indexing & Slicing#
In addition to knowing information about the array, we often want to be able to access particular elements of the array.
For example, if we want to index into an array and find the value stored at a particular position, we can do so using our typical approach to indexing ([]). Note, however, that to access a single value within a 2D array, we’ll need to provide both the row and the column location.
For example, to access the value in the first row and second column of array_0, we’d use the following:
array_0[0, 1]
2
A reminder that Python is zero-indexed, so the information in the first row is accessed with the index 0 and the information in the second column with the index 1.
Additionally, rows of data can be accessed using a single value when indexing. The following returns the first row of the array:
array_0[0]
array([1, 2])
Beyond accessing a single row, slices of the original array can also be accessed using the slice notation with which we’re familiar. For example, the following returns the first column of the array:
array_0[:, 0]
array([1, 3])
The : says to select all rows, whereas the 0 indicates to return only the first column.
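Slicing becomes more interesting on a larger array. The sketch below reuses the 4 x 4 grid of numbers pictured earlier in this section (the name grid is just for this example) and pulls out a few different pieces using the same start:stop:step notation used with lists.

```python
import numpy as np

grid = np.array([[3, 7, 1, 8],
                 [4, 9, 2, 6],
                 [5, 3, 10, 7],
                 [6, 1, 8, 4]])

# First two rows and first two columns.
top_left = grid[:2, :2]
print(top_left)
# [[3 7]
#  [4 9]]

# Negative indices count from the end, so -1 is the last row.
last_row = grid[-1]
print(last_row)          # [6 1 8 4]

# All rows, but only the columns at index 1 and 2.
middle_columns = grid[:, 1:3]
print(middle_columns.shape)   # (4, 2)

# Every other row, starting from the first.
every_other_row = grid[::2]
print(every_other_row)
# [[ 3  7  1  8]
#  [ 5  3 10  7]]
```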
Finally, as arrays are mutable, the ability to access particular elements or parts of an array enables values within the array to be updated after object creation. If I wanted to change the first value in array_0 to be the number 7 instead of 1, I could do so using the following assignment:
array_0[0, 0] = 7
array_0
array([[7, 2],
[3, 4]])
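Assignment also works on whole rows and columns, and there is one subtlety worth knowing: a slice of an array is a view onto the same data, not a separate copy. The sketch below uses a fresh array (values, a name chosen for this example) so that array_0 is left alone.

```python
import numpy as np

values = np.array([[1, 2], [3, 4]])

# Replace the entire second row.
values[1] = [30, 40]

# Set every element in the first column to the same number.
values[:, 0] = 0
print(values)
# [[ 0  2]
#  [ 0 40]]

# A slice is a view: changing it changes the original array too.
first_row = values[0]
first_row[1] = 99
print(values[0, 1])      # 99

# copy() gives an independent array that can be changed safely.
independent = values.copy()
independent[0, 0] = -1
print(values[0, 0])      # 0 (the original is unaffected)
```

If you plan to modify part of an array without touching the original, calling copy() first is the safe choice.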
Methods#
In addition to attributes and the ability to operate mathematically, ndarray objects have a number of helpful methods.
sum()#
For example, if you wanted to quickly compute the sum of all the values in an array, there’s the sum() method for that:
array_0.sum()
16
Helpfully, this method can also calculate the sum of each column, by specifying the value 0 for the axis parameter:
array_0.sum(axis=0)
array([10, 6])
…or the sum of each row by specifying the value 1:
array_0.sum(axis=1)
array([9, 7])
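With a square array it can be hard to tell which axis is which, so here is a sketch using an array with 2 rows and 3 columns (scores is a made-up name and the values are arbitrary). The length of each result shows which dimension was collapsed.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)

# axis=0 collapses the rows, leaving one sum per column.
column_sums = scores.sum(axis=0)
print(column_sums)   # [5 7 9]

# axis=1 collapses the columns, leaving one sum per row.
row_sums = scores.sum(axis=1)
print(row_sums)      # [ 6 15]

# With no axis, every element is added together.
total = scores.sum()
print(total)         # 21
```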
Aggregation functions#
Beyond sum(), there are a number of methods that calculate some statistic across your array.
For example, max() provides the largest value in the array, min() the smallest, mean() the average, and std() the standard deviation:
# smallest value
array_0.min()
2
# largest value
array_0.max()
7
# average
array_0.mean()
4.0
# standard deviation
array_0.std()
1.8708286933869707
As with sum(), the axis parameter can be used to carry out any of these operations by row or by column. For example, calculating the mean for each column:
array_0.mean(axis=0)
array([5., 3.])
While we won’t walk through examples of all of the existing methods in NumPy, we’ll summarize a few common ones here:
Function | Purpose |
---|---|
tolist() | Convert the array to a nested list |
fill() | Fill array with a particular value |
transpose() | Transposes the axes of the array |
all() | Returns True if all elements in array meet condition |
any() | Returns True if any element in array meets condition |
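Here is a minimal sketch of ndarray methods that match the descriptions in this table: tolist(), fill(), transpose(), all(), and any(). Pairing these particular names with the descriptions is our assumption. Note that all() and any() are typically called on the result of a comparison such as arr > 0, which is itself an array of True and False values.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# tolist() converts the array into an ordinary nested list.
as_list = arr.tolist()
print(as_list)               # [[1, 2, 3], [4, 5, 6]]

# transpose() swaps rows and columns.
transposed = arr.transpose()
print(transposed.shape)      # (3, 2)

# all() and any() check a condition across every element.
all_positive = (arr > 0).all()
any_above_five = (arr > 5).any()
print(all_positive, any_above_five)   # True True

# fill() overwrites every element in place (it returns None),
# so we fill a copy here to leave arr unchanged.
filled = arr.copy()
filled.fill(9)
print(filled)
# [[9 9 9]
#  [9 9 9]]
```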
Additional methods can be found in the NumPy documentation.
Functions#
While the ndarray object is the main object in NumPy, there are a number of additional functions provided within the package that add functionality when working with arrays.
Say you wanted to find all of the unique values in an array quickly. There’s a function (np.unique()) for that.
For example, if you had the following array:
array_dups = np.array([[1, 5, 5, 5, 7, 9, 10],
[1, 5, 5, 5, 7, 9, 10],
[5, 7, 9, 9, 8, 2, 3]])
array_dups
array([[ 1, 5, 5, 5, 7, 9, 10],
[ 1, 5, 5, 5, 7, 9, 10],
[ 5, 7, 9, 9, 8, 2, 3]])
…you could use the np.unique() function to extract an array of all the unique values:
np.unique(array_dups)
array([ 1, 2, 3, 5, 7, 8, 9, 10])
Note that, again, the axis parameter allows you to do the same across rows, returning only the unique rows:
np.unique(array_dups, axis=0)
array([[ 1, 5, 5, 5, 7, 9, 10],
[ 5, 7, 9, 9, 8, 2, 3]])
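If you also want to know how often each unique value occurs, np.unique() accepts a return_counts parameter. This goes slightly beyond what is covered above, so treat the following as an optional sketch using the same array_dups values.

```python
import numpy as np

array_dups = np.array([[1, 5, 5, 5, 7, 9, 10],
                       [1, 5, 5, 5, 7, 9, 10],
                       [5, 7, 9, 9, 8, 2, 3]])

# With return_counts=True, two arrays come back: the sorted unique
# values, and how many times each one appears in the whole array.
values, counts = np.unique(array_dups, return_counts=True)

print(values)   # [ 1  2  3  5  7  8  9 10]
print(counts)   # [2 1 1 7 3 1 4 2]
```

Reading the two outputs together: the value 5 appears seven times, 9 appears four times, and so on.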
While we won’t walk through examples of all of the existing functions in NumPy, we’ll summarize a few common ones here:
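Function | Purpose |
---|---|
np.where() | Identifies locations within an array where a condition is met |
np.flip() | Reverses the order of an array |
np.arange() | Creates an array of values within a range |
np.ones() | Creates an array of a given shape filled with ones |
np.zeros() | Creates an array of a given shape filled with zeroes |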
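Here is a minimal sketch of NumPy functions that match the descriptions in this table: np.where(), np.flip(), np.arange(), np.ones(), and np.zeros(). Pairing these particular names with the descriptions is our assumption, and the reshape() method used to turn the range into a 2D array is not otherwise covered in this section.

```python
import numpy as np

# np.zeros() and np.ones() take the desired shape as (rows, columns).
zeros = np.zeros((2, 3))
ones = np.ones((2, 3))
print(zeros.shape, ones.shape)   # (2, 3) (2, 3)

# np.arange(6) creates the values 0 through 5 (the stop value is excluded).
sequence = np.arange(6)
print(sequence)                  # [0 1 2 3 4 5]

# np.flip() reverses the order of the elements.
reversed_sequence = np.flip(sequence)
print(reversed_sequence)         # [5 4 3 2 1 0]

# reshape() rearranges the six values into 2 rows and 3 columns.
reshaped = sequence.reshape(2, 3)

# np.where() returns the row and column indices where the condition holds.
rows, cols = np.where(reshaped > 3)
print(rows, cols)                # [1 1] [1 2]
```

The last line says the values greater than 3 sit at positions (1, 1) and (1, 2), that is, the second row, second and third columns.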
Exercises#
Q1. Create three 2D NumPy arrays, each with 3 rows and 2 columns. Fill the first one with zeroes, the second with ones, and the third with a range of values.
Q2. Using NumPy attributes, double check that each array has the correct shape and size.
Q3. Calculate the minimum, maximum, mean, and standard deviation of the values in the array you created with a range of values.
Q4. Calculate the same metrics as in Q3, but by row.
Q5. Calculate the same metrics as in Q3, but by column.