{
"cells": [
{
"cell_type": "markdown",
"id": "eeafac29-de44-4bf5-991a-b13609cba540",
"metadata": {},
"source": [
"# `NumPy`\n",
"\n",
"[`NumPy`](https://numpy.org/) stands for numerical python and is a package that enables working with (primarily) numeric information in a multidimensional array and carrying out operations on those arrays efficiently. If you're familiar with matrices in mathematics, often used in linear algebra, then you're familiar with multidimensional arrays. If you're not, we'll get you up to speed here.\n",
"\n",
"A reminder that to run any of the `NumPy` code below you'll need to first `import` `numpy`. Conventionally, pythonistas (people who code in python) import `NumPy` using: `import numpy as np`, so we'll encourage that here. Using this import statement, any time you want to reference an object in `NumPy`, you can do so with just the two letters `np`.\n",
"\n",
"
\n",
"Be sure to import numpy
using import numpy as np
before attempting to run any of the included code in this section.\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6dab84ff-06c1-4831-9d0a-e76a7d7b39b8",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"id": "c4d7bc88-0a16-4f21-8ba2-7b5ded3b0f0f",
"metadata": {},
"source": [
"## Homogenous Data\n",
"\n",
"First and foremost, the `NumPy` package was developed around the concept of an **array**. Arrays are useful for storing homogenous data - or data that are **all of the same type**. Most often this is numeric information. For example, if you had numeric recordings from a bunch of participants from multiple visits to the lab, you could imagine storing each individual participant's data in a row and their numeric recordings in columns. (If you're picturing a spreadheet with numbers in cells of the spreadsheet, then you've got the right idea.) `NumPy` arrays are great for storing this type of information! Importantly, `NumPy` supports multidimensional arrays, including 2D arrays, like data stored in rows and columns, as well as multidimensional arrays.\n",
"\n",
"Up to this point, if we wanted to store such information, we could have stored the numbers in a list, or maybe in a dictionary with the participant's identifier as the key. And while, lists of lists are possible, managing them can be a nightmare and operating on them is not trivial. *This* is why the development of the `numpy` array was critical.\n",
"\n",
"## The `NumPy` array\n",
"\n",
"In programming, the term **array** refers to a data structure that enables the storage and retrieval of data. When we reference `NumPy` arrays, we often discuss each number being stored in a \"cell\" of a grid.\n",
"\n",
"A simple one-dimensional array would store homogenous (again, typically numeric) data in something that looks a lot like a list: \n",
"\n",
"\n",
" \n",
" 5 | \n",
" 1 | \n",
" 8 | \n",
" 3 | \n",
"
\n",
"
\n",
"\n",
"However, `arrays` really shine when we work in more than a single dimensional data, such as a two-dimesional array: \n",
"\n",
"\n",
" \n",
" 3 | \n",
" 7 | \n",
" 1 | \n",
" 8 | \n",
"
\n",
" \n",
" 4 | \n",
" 9 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" 5 | \n",
" 3 | \n",
" 10 | \n",
" 7 | \n",
"
\n",
" \n",
" 6 | \n",
" 1 | \n",
" 8 | \n",
" 4 | \n",
"
\n",
"
\n",
"\n",
"Arrays can be more than two dimensional; however, for our purposes, we'll stick to only working with 2D arrays for now. \n",
"\n",
"Now that we have an idea of the types of information stored in arrays and what they look like, we'll discuss the ground rules for `NumPy` arrays. In `NumPy`, arrays must:\n",
"1. Store information of the same type (homogenous data)\n",
"2. Remain the same total size once created\n",
"3. Be rectangular (meaning every row of a 2D array must have the same number of columns)\n",
"\n",
"The principle object within `NumPy` is the `ndarray` (which stands for N-dimensional array). Here we'll create our first two arrays:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5e2b5031-cb91-4eaf-b672-3b70ab64e839",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 2],\n",
" [3, 4]])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0 = np.array([[1, 2], [3, 4]])\n",
"array_0"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ca37d8e3-2607-443c-b58f-31af84056dc2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[5, 6],\n",
" [7, 8]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_1 = np.array([[5, 6], [7, 8]])\n",
"array_1"
]
},
{
"cell_type": "markdown",
"id": "715cc086-704c-439f-aef5-9a75fccfca71",
"metadata": {},
"source": [
"As a reminder, in the above we can easily see the structure of each of these arrays is a two dimensional array. Each has two rows and two columns. In `array_0` the first row contains the integers `1` and `2` and the second row `3` and `4`. \n",
"\n",
"The reason we use arrays is because it maintains this structure, making row and column operations feasible. If we were to simply use a list of lists, instead of an array, the dimensionality (rows and columns) would be lost, as demonstrated here:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "bca641e1-0ae7-4c37-af92-918b15434df3",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[[1, 2], [3, 4]]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[[1, 2], [3, 4]] "
]
},
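{
"cell_type": "markdown",
"id": "sketch-array-ground-rules",
"metadata": {},
"source": [
"To make the ground rules above concrete, here is a minimal sketch. The names `mixed`, `nested`, and `grid` are illustrative and not part of the original material. It assumes only that `numpy` is installed, and shows that an array built from mixed integers and floats ends up with a single data type, and that an array has a shape while a list of lists does not.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Rule 1: one data type per array. Mixing ints and a float\n",
"# gives an array in which every element is stored as a float.\n",
"mixed = np.array([1, 2.5, 3])\n",
"print(mixed.dtype)               # float64\n",
"\n",
"# Rule 3: a 2D array is rectangular, so it has a well-defined shape.\n",
"nested = [[1, 2], [3, 4]]\n",
"grid = np.array(nested)\n",
"print(grid.shape)                # (2, 2)\n",
"\n",
"# A plain list of lists only knows how many inner lists it holds.\n",
"print(len(nested))               # 2\n",
"print(hasattr(nested, 'shape'))  # False\n",
"```\n",
"\n",
"The list and the array print similarly, but only the array carries the row and column information that later operations rely on."
]
},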
{
"cell_type": "markdown",
"id": "51bbdce9-5e0e-4d56-b1ef-ea44ff0e7185",
"metadata": {},
"source": [
"## Basic operations\n",
"\n",
"One of the many advantages of using `ndarray`s is that you can then easily carry out operations on your arrays. For example, your two arrays can be added together using `+` to carry out matrix addition."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e72fdb0d-81c0-4510-bbb6-0f2e56e2d408",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([[12, 8],\n",
" [10, 12]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0 + array_1"
]
},
{
"cell_type": "markdown",
"id": "f9c2a4f0-80ea-4471-9f42-6c8a351fd6d7",
"metadata": {},
"source": [
"Similarly, matrix multiplication is now equally simple:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "0a126eb0-ec39-4883-8c67-9313e55072b4",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([[35, 12],\n",
" [21, 32]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0 * array_1"
]
},
{
"cell_type": "markdown",
"id": "4f74b24e-a999-4669-a07a-a6b9a0dbc086",
"metadata": {},
"source": [
"While we won't be covering the mathematical principles underlying these operations here, we can see that mathematically operating on an array is feasible in a way that was not possible with the variable types discussed up to this point."
]
},
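{
"cell_type": "markdown",
"id": "sketch-elementwise-vs-matmul",
"metadata": {},
"source": [
"The `*` operator on arrays works element by element, which is not the same as the matrix product from linear algebra. The sketch below contrasts the two on small arrays and also shows what `+` does to plain lists. The names `a`, `b`, `elementwise`, and `matrix_product` are illustrative, and the arrays are rebuilt here so the example does not depend on any other cell.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"a = np.array([[1, 2], [3, 4]])\n",
"b = np.array([[5, 6], [7, 8]])\n",
"\n",
"# `*` multiplies elements in matching positions (element-wise).\n",
"elementwise = a * b\n",
"print(elementwise.tolist())      # [[5, 12], [21, 32]]\n",
"\n",
"# `@` computes the matrix product (rows of a times columns of b).\n",
"matrix_product = a @ b\n",
"print(matrix_product.tolist())   # [[19, 22], [43, 50]]\n",
"\n",
"# With plain lists, `+` concatenates instead of adding numbers.\n",
"print([1, 2] + [3, 4])           # [1, 2, 3, 4]\n",
"```\n",
"\n",
"Which operator you want depends on the task: element-wise operations are common when treating an array as a table of data, while `@` is the one to reach for in linear algebra."
]
},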
{
"cell_type": "markdown",
"id": "3ee2a83b-0065-46f5-9c88-b39dc29a8af3",
"metadata": {},
"source": [
"## Attributes\n",
"\n",
"Because the `ndarray` is the core object in `NumPy`, there are a number of helpful attributes (and methods - we'll get there) associated with this object. Again, this is why object-oriented programming is particularly helpful. There are attributes attached to and methods associated with the `ndarray` object that are particularly helpful for working with data.\n",
"\n",
"### `shape`\n",
"\n",
"The first thing we often want to know about an array is its shape - how many rows? how many columns? The `shape` attribute stores this information:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "49756f73-3952-4030-95d9-6e189a30fb82",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2, 2)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0.shape"
]
},
{
"cell_type": "markdown",
"id": "233b9d86-4ba8-424d-95ee-98b280d8b80a",
"metadata": {},
"source": [
"The `(2, 2)` reports how many rows and how many columns are in the array. The first number will always be the number of rows and the second the number of columns. "
]
},
{
"cell_type": "markdown",
"id": "0ab9a337-ece6-4ed4-8b1b-9474a7410f96",
"metadata": {},
"source": [
"### `size`\n",
"\n",
"The total number of elements stored within an array can be accessed with the `size` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "fc2510de-b468-4891-82ed-a9b29f8f6e2a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0.size"
]
},
{
"cell_type": "markdown",
"id": "05a39d35-aa87-4a2b-bdeb-114dcf632009",
"metadata": {},
"source": [
"Here, we see that there are four total elements within the `array_0` object\n",
"\n",
"### `dtype`\n",
"\n",
"As noted above, `ndarray`s store homogenous information. Typically, these will be numbers, but they aren't required to be numbers. To determine the data type stored in the array, the `dtype` attribute can be used:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5c4395b1-04c1-4493-a293-2b65940cbb90",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('int64')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0.dtype"
]
},
{
"cell_type": "markdown",
"id": "f6cb900f-4704-41f4-84eb-259163837d85",
"metadata": {},
"source": [
"Above, we see that the information stored within `array_0` are all integers\n",
"\n",
"Note: There are additional array attributes, referenced [here](https://numpy.org/doc/stable/reference/arrays.ndarray.html#arrays-ndarray); however, mostare beyond the scope of knowledge required here."
]
},
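{
"cell_type": "markdown",
"id": "sketch-attributes-nonsquare",
"metadata": {},
"source": [
"Because `array_0` has the same number of rows and columns, its `shape` does not reveal which number is which. As a small sketch, the three attributes above are shown here on a non-square array. The name `recordings` and its values are made up for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Two rows (say, participants) and three columns (recordings).\n",
"recordings = np.array([[0.5, 1.2, 3.3],\n",
"                       [2.1, 0.7, 1.8]])\n",
"\n",
"print(recordings.shape)  # (2, 3): 2 rows, 3 columns\n",
"print(recordings.size)   # 6: 2 * 3 elements in total\n",
"print(recordings.dtype)  # float64: every element is a float\n",
"```\n",
"\n",
"Note that `size` is simply the product of the numbers in `shape`."
]
},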
{
"cell_type": "markdown",
"id": "eb851539-c616-47cf-aa14-c6ebd51c03c3",
"metadata": {},
"source": [
"## Indexing & Slicing\n",
"\n",
"In addition to knowing information about the array, we often want to be able to access particular elements of the array. \n",
"\n",
"For example, if you wanted to index into an array and find the value stored at a particular position, we can do so using our typical approach to indexing (`[]`). However, note here, that to access a single value within an array, we'll need to provide both the row and column location within the array.\n",
"\n",
"For example, to access the value in the first row but second column of `array_0`, we'd use the following:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4540cd06-7d0f-463f-803b-12df47870ca7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0[0, 1]"
]
},
{
"cell_type": "markdown",
"id": "b1c6cf77-cb9d-4342-b662-73c7638f81bd",
"metadata": {},
"source": [
"A reminder that Python is zero-indexed, so the information in the first row will be accessed with the index `0` and the information in the second column will be accessed with the index `1`.\n",
"\n",
"Additionally, rows of data can be accessed using a single value when indexing. The following returns the first row of the array:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1de18242-b41f-49c8-ba25-b63fe09bd2c0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0[0]"
]
},
{
"cell_type": "markdown",
"id": "1064b8bc-028e-4ba5-841a-c98b0f6042f6",
"metadata": {},
"source": [
"Beyond accessing a single row, slices of the original array can also be accessed using the slice notation with which we're familiar. For example, the following returns the first column of the array:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "06c70d3b-6b41-4f30-b8e1-3da2311ec03a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 3])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0[:, 0]"
]
},
{
"cell_type": "markdown",
"id": "1ef3b349-5ec7-4b37-853a-4891b34272ae",
"metadata": {},
"source": [
"The `:` says select all rows, whereas the `0` indicates to only return the first column.`"
]
},
{
"cell_type": "markdown",
"id": "e12f4498-b01b-4ef6-90a2-5e93e9d44859",
"metadata": {},
"source": [
"Finally, as arrays are mutable, the ability to access particular elements in or parts of an array enables values within the array to be updated after object creation. If I wanted to change the first value in `array_0` to be the number `7` instead of `1`, I could do so using the following assignment:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "bd0ff82f-303c-40e3-bc2b-1406c21f8fb5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[7, 2],\n",
" [3, 4]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0[0, 0] = 7\n",
"array_0"
]
},
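{
"cell_type": "markdown",
"id": "sketch-indexing-slicing",
"metadata": {},
"source": [
"The same indexing and slicing ideas are easier to see on a larger array. The sketch below uses a 3 x 4 array with the illustrative name `grid`; it is independent of `array_0`, so running it would not change any of the values used elsewhere in this section.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"grid = np.array([[3, 7, 1, 8],\n",
"                 [4, 9, 2, 6],\n",
"                 [5, 3, 10, 7]])\n",
"\n",
"print(grid[1, 2])               # 2: second row, third column\n",
"print(grid[2].tolist())         # [5, 3, 10, 7]: the whole third row\n",
"print(grid[:, 1].tolist())      # [7, 9, 3]: the whole second column\n",
"print(grid[0:2, 1:3].tolist())  # [[7, 1], [9, 2]]: rows 0-1, columns 1-2\n",
"\n",
"# Assignment through an index changes the array in place.\n",
"grid[0, 0] = 100\n",
"print(grid[0].tolist())         # [100, 7, 1, 8]\n",
"```\n",
"\n",
"As with lists, the stop value of a slice is excluded, so `0:2` selects the rows at index 0 and 1."
]
},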
{
"cell_type": "markdown",
"id": "85d969e7-15d1-4f4a-abb5-dae5efd13f01",
"metadata": {},
"source": [
"## Methods\n",
"\n",
"In addition to attributes and the ability operate mathematically, `ndarray` objects have a number of helpful methods. \n",
"\n",
"### `sum()`\n",
"\n",
"For example, if you wanted to quickly compute the sum of all the values in an array, there's the method `sum` for that:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "bf203132-e51c-4fa2-a054-f39244fc3fe8",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"16"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0.sum()"
]
},
{
"cell_type": "markdown",
"id": "e4f689c1-40f2-4278-8506-3c8b93698458",
"metadata": {},
"source": [
"Helpfully, this method can also operate to calculate the sum across the columns of arrays, by specifying the value 0 for the `axis` parameter:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8eada255-2e10-427a-b669-c924ac11c803",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([10, 6])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0.sum(axis=0)"
]
},
{
"cell_type": "markdown",
"id": "5f1bddfa-db2d-4cd3-bd4e-7d2b200903f5",
"metadata": {},
"source": [
"...or across rows by specifying the value 1:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "357bbf06-30d2-4204-b428-71d033417ce5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([9, 7])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"id": "b0899e2b-90d9-479f-8ed9-ab2cda1cdd1f",
"metadata": {},
"source": [
"### Aggregation functions\n",
"\n",
"Beyond `sum`, there are a number of methods that calculate some statistic across your array.\n",
"\n",
"For example, `max()` provides the largest value in the array, `min()` the smallest, `mean()` the average, and `std()` the standard deviation:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "248e72a2-5a90-42fa-a803-9b0b455757e6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# smallest value\n",
"array_0.min()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "e469aa50-fdd1-49d4-96b0-0621e8e9dbb8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# largest vallue\n",
"array_0.max()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "278d02d3-a40e-4c2d-9725-4109a2cfa9f2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4.0"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# average\n",
"array_0.mean()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "f17ecfcc-b484-4a3e-85df-41b577de8321",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.8708286933869707"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# standard devaation\n",
"array_0.std()"
]
},
{
"cell_type": "markdown",
"id": "31cee185-460a-4f2c-9636-254907cc1fa6",
"metadata": {},
"source": [
"As with `sum()`, the axis parameter would carry out any of the operations by row or column. For example, calculating the mean for each column:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "6c405d06-3d8e-4afa-a6e0-1cb5f54dad71",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([5., 3.])"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_0.mean(axis=0)"
]
},
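{
"cell_type": "markdown",
"id": "sketch-axis-parameter",
"metadata": {},
"source": [
"The `axis` parameter is easy to mix up, so here is a small sketch on a non-square array, where the length of each result shows which direction was collapsed. The name `scores` and its values are illustrative only.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"scores = np.array([[1, 2, 3],\n",
"                   [4, 5, 6]])        # 2 rows, 3 columns\n",
"\n",
"# axis=0 collapses the rows: one result per column (3 values).\n",
"print(scores.sum(axis=0).tolist())    # [5, 7, 9]\n",
"print(scores.max(axis=0).tolist())    # [4, 5, 6]\n",
"\n",
"# axis=1 collapses the columns: one result per row (2 values).\n",
"print(scores.sum(axis=1).tolist())    # [6, 15]\n",
"print(scores.mean(axis=1).tolist())   # [2.0, 5.0]\n",
"\n",
"# No axis: a single value for the whole array.\n",
"print(scores.sum())                   # 21\n",
"```\n",
"\n",
"One way to remember this: the axis you name is the one that disappears from the result."
]
},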
{
"cell_type": "markdown",
"id": "5577828a-951c-453a-bd4f-64ccb1afce52",
"metadata": {},
"source": [
"While we won't walk through examples of all of the existing methods in `NumPy`, we'll summarize a few common ones here:\n",
"\n",
"| Function | Purpose |\n",
"|-------------------|------------------------------------------------------------|\n",
"| `tolist()` | Convert the array to a nested list |\n",
"| `fill()` | Fill array with a particular value |\n",
"| `transpose()` | Transposes the axes of the array |\n",
"| `all()` | Returns True if all elements in array meet condition |\n",
"| `any()` | Returns True if any element in array meets condition |\n",
"\n",
"Additional methods can be found in the [`NumPy` Documentation](https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-ndarray-methods)."
]
},
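{
"cell_type": "markdown",
"id": "sketch-common-methods",
"metadata": {},
"source": [
"As a rough sketch of how the methods in the table behave, the example below applies each of them to a small array. The name `arr` is illustrative. Note that `all()` and `any()` are typically applied to the result of a comparison such as `arr > 0`, which is itself an array of `True`/`False` values.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"arr = np.array([[1, 2, 3],\n",
"                [4, 5, 6]])\n",
"\n",
"print(arr.tolist())            # [[1, 2, 3], [4, 5, 6]]\n",
"print(arr.transpose().shape)   # (3, 2): rows and columns are swapped\n",
"\n",
"# A comparison produces an array of True/False values,\n",
"# which all() and any() then summarize.\n",
"print((arr > 0).all())         # True: every element is positive\n",
"print((arr > 5).any())         # True: 6 is greater than 5\n",
"print((arr > 6).any())         # False: nothing is greater than 6\n",
"\n",
"# fill() overwrites every element in place and returns None.\n",
"arr.fill(0)\n",
"print(arr.tolist())            # [[0, 0, 0], [0, 0, 0]]\n",
"```\n",
"\n",
"Because `fill()` modifies the array it is called on, make a copy first if the original values are still needed."
]
},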
{
"cell_type": "markdown",
"id": "cdb9c512-63dc-4b51-b15a-9eb9ac9d9f78",
"metadata": {},
"source": [
"## Functions\n",
"\n",
"While the `ndarray` object is the main object in `NumPy`, there are a number of additional functions provided within the package that add additional functionality when working with arrays.\n",
"\n",
"Specifically, what if you wanted to find all of the unique values in an array quickly? There's a function (`np.array`) for that.\n",
"\n",
"For example, if you had the following array:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "cde36771-f5db-4682-ae01-6ef2728aec2f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1, 5, 5, 5, 7, 9, 10],\n",
" [ 1, 5, 5, 5, 7, 9, 10],\n",
" [ 5, 7, 9, 9, 8, 2, 3]])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_dups = np.array([[1, 5, 5, 5, 7, 9, 10],\n",
" [1, 5, 5, 5, 7, 9, 10],\n",
" [5, 7, 9, 9, 8, 2, 3]])\n",
"array_dups "
]
},
{
"cell_type": "markdown",
"id": "7710b1ad-268d-4f9d-94d1-99a647b6dd42",
"metadata": {},
"source": [
"...you could use the `np.unique()` function to return the following to extract an array of all the unique values:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "7b6e1dae-cd6c-4d55-a593-fcc1bd33f6f2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1, 2, 3, 5, 7, 8, 9, 10])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.unique(array_dups)"
]
},
{
"cell_type": "markdown",
"id": "211ce2dc-116a-4ffe-845b-09d2a85c938a",
"metadata": {},
"source": [
"Note that again the `axis` parameter would allow you to do the same across rows, returning only unique rows:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "19e5335e-999c-44ad-81c4-c070190a6e13",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1, 5, 5, 5, 7, 9, 10],\n",
" [ 5, 7, 9, 9, 8, 2, 3]])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.unique(array_dups, axis=0)"
]
},
{
"cell_type": "markdown",
"id": "00719331-d10b-422f-82bf-19e9836cd7f5",
"metadata": {},
"source": [
"While we won't walk through examples of all of the existing methods in `NumPy`, we'll summarize a few common ones here:\n",
"\n",
"| Function | Purpose |\n",
"|-------------------|------------------------------------------------------------|\n",
"| `np.where()` | Idenfies location within matrix where condition is met |\n",
"| `np.flip()` | Reverses the order of an array |\n",
"| `np.arange()` | Add values in range to array |\n",
"| `np.ones()` | Fills a 2D array with ones |\n",
"| `np.zeroes()` | Fills a 2D array with zeroes |\n"
]
},
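{
"cell_type": "markdown",
"id": "sketch-common-functions",
"metadata": {},
"source": [
"The functions in the table can be sketched briefly as follows. The names `values`, `ones`, `zeros`, and `grid` are illustrative, and the shapes are arbitrary choices. The sketch assumes the simplest call forms: `np.arange` with a single stop value, `np.ones` and `np.zeros` with a `(rows, columns)` tuple, and `np.where` with only a condition.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"values = np.arange(6)                # the integers 0 through 5\n",
"print(values.tolist())               # [0, 1, 2, 3, 4, 5]\n",
"print(np.flip(values).tolist())      # [5, 4, 3, 2, 1, 0]\n",
"\n",
"ones = np.ones((2, 4))               # 2 rows, 4 columns of 1.0\n",
"zeros = np.zeros((2, 4))             # 2 rows, 4 columns of 0.0\n",
"print(ones.shape, zeros.shape)       # (2, 4) (2, 4)\n",
"\n",
"# With only a condition, np.where returns one index array per\n",
"# dimension: here the row indices and the column indices.\n",
"grid = np.array([[1, 8], [9, 2]])\n",
"rows, cols = np.where(grid > 5)\n",
"print(rows.tolist(), cols.tolist())  # [0, 1] [1, 0]\n",
"```\n",
"\n",
"Reading the last line pairwise, the values greater than 5 sit at positions `(0, 1)` and `(1, 0)`."
]
},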
{
"cell_type": "markdown",
"id": "2b6ad7d0-4cab-42d2-b45f-b24669e3890f",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Exercises\n",
"\n",
"Q1. **Create three 2D numpy array, each with 3 rows and 2 columns**. Fill the first one with zeroes, teh second with ones, and the third one with a range of values.\n",
"\n",
"Q2. **Using, `NumPy` attributes, double check that each array has the correct shape and size.** \n",
"\n",
"Q3. **Calculate the minimum, maximum, mean, and standard deviation of the values in the array you created with a range of values**\n",
"\n",
"Q4. **Calculate the same metrics as in Q3, but by row.**\n",
"\n",
"Q5. **Calculate the same metrics as in Q3, but by column.**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}