NumPy stands for numerical Python. It's the backbone of all kinds of scientific and numerical computing in Python.
And since machine learning is all about turning data into numbers and then figuring out the patterns, NumPy often comes into play.
You can do numerical calculations using pure Python. In the beginning, you might think Python is fast, but once your data gets large, you'll start to notice slowdowns.
One of the main reasons to use NumPy is that it's fast. Behind the scenes, the code has been optimized to run in C, another programming language that can do things much faster than Python.
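As a rough illustration of the speed difference, the sketch below sums one million numbers with Python's built-in `sum()` and with `np.sum()`. The array size, the variable names and the use of `timeit` are assumptions made for this example, and the exact timings will vary from machine to machine.

```python
import timeit

import numpy as np

# One million numbers, stored as a Python list and as a NumPy array.
numbers_list = list(range(1_000_000))
numbers_array = np.arange(1_000_000, dtype=np.int64)

# Both approaches give the same total.
python_total = sum(numbers_list)
numpy_total = int(np.sum(numbers_array))
print(python_total == numpy_total)

# Time each approach over 10 runs (results depend on your machine).
python_time = timeit.timeit(lambda: sum(numbers_list), number=10)
numpy_time = timeit.timeit(lambda: np.sum(numbers_array), number=10)

print(f"Python sum(): {python_time:.4f} seconds")
print(f"np.sum():     {numpy_time:.4f} seconds")
```

On most machines the NumPy version should finish noticeably sooner, but treat the numbers as indicative rather than as a benchmark.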
NOTE: It's important to remember the main type in NumPy is ndarray. Even seemingly different kinds of arrays are still ndarrays. This means an operation you do on one array will work on another.
- Array - A list of numbers, can be multi-dimensional.
- Scalar - A single number (e.g. 7).
- Vector - A list of numbers with 1 dimension (e.g. np.array([1, 2, 3])).
- Matrix - A (usually) multi-dimensional list of numbers (e.g. np.array([[1, 2, 3], [4, 5, 6]])).
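To make the terms above concrete, here's a small sketch that builds one of each and checks its type, number of dimensions and shape. Wrapping the scalar `7` in `np.array()` is a choice made for this example so that all three can be inspected the same way; the variable names are illustrative.

```python
import numpy as np

scalar = np.array(7)                        # a single number
vector = np.array([1, 2, 3])                # 1 dimension
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2 dimensions

# Every one of these is an ndarray, only ndim and shape differ.
for name, arr in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, type(arr).__name__, arr.ndim, arr.shape)

# scalar ndarray 0 ()
# vector ndarray 1 (3,)
# matrix ndarray 2 (2, 3)
```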
This is to exemplify how NumPy is the backbone of many other libraries.
- np.random.rand(5, 3)
- np.random.randint(10, size=5)
- np.random.seed() - pseudo-random numbers
NumPy uses pseudo-random numbers, which means the numbers look random but aren't really: they're predetermined.
For consistency, you might want to keep the random numbers you generate the same throughout experiments.
To do this, you can use np.random.seed().
What this does is tell NumPy, "Hey, I want you to create random numbers but keep them aligned with the seed."
Let's see it.
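A minimal sketch of the idea, assuming a seed of 0 and a 5x3 array of random integers (the names `first` and `second` are illustrative):

```python
import numpy as np

# Set the seed, then generate some random integers between 0 and 9.
np.random.seed(0)
first = np.random.randint(10, size=(5, 3))

# Reset the seed to the same value and generate again.
np.random.seed(0)
second = np.random.randint(10, size=(5, 3))

print(first)
print(np.array_equal(first, second))  # True: same seed, same numbers
```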
Because np.random.seed() is set to 0, the random numbers are the same as in the other cell where np.random.seed() is also set to 0.
Setting np.random.seed() is not 100% necessary but it's helpful to keep numbers the same throughout your experiments.
For example, say you wanted to split your data randomly into training and test sets.
Every time you randomly split, you might get different rows in each set.
If you shared your work with someone else, they'd get different rows in each set too.
Setting np.random.seed() keeps the randomness but makes it repeatable, hence 'pseudo-random' numbers.
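Here's one way the training and test split could look, as a sketch only. The helper `split_indices`, the 10-row dataset and the 20% test size are all hypothetical choices for this example, not part of NumPy.

```python
import numpy as np


def split_indices(seed, n_rows=10, test_size=0.2):
    """Hypothetical helper: shuffle row indices and split them into train/test."""
    np.random.seed(seed)
    shuffled = np.random.permutation(n_rows)  # row indices in random order
    n_test = int(n_rows * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Same seed, so both people get the same rows in each set.
train_a, test_a = split_indices(seed=42)
train_b, test_b = split_indices(seed=42)

print(test_a, test_b)
print(np.array_equal(train_a, train_b))  # True
```

Anyone who runs this with the same seed should end up with the same rows in each set.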
Remember, because arrays and matrices are both ndarrays, they can be viewed in similar ways.
NumPy arrays get printed from outside to inside. This means the first number in the shape counts the outermost brackets, and the last number in the shape counts the values inside each innermost bracket.
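To see how the shape lines up with the printed output, the sketch below uses a 3-dimensional array of shape (2, 3, 4). The name `a3` and the values from `np.arange()` are assumptions made for the example.

```python
import numpy as np

# 24 numbers arranged as 2 blocks, each with 3 rows of 4 values.
a3 = np.arange(24).reshape(2, 3, 4)

print(a3.shape)        # (2, 3, 4)
print(a3)              # 2 outer blocks, 3 rows per block, 4 values per row

# Indexing peels off dimensions starting from the outside.
print(a3[0].shape)     # (3, 4)
print(a3[0, 0].shape)  # (4,)
print(a3[0, 0])        # [0 1 2 3]
```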
- +, -, *, /, //, **, %
- Dot product - np.dot()
- np.sum() - usually much faster on NumPy arrays than Python's built-in sum()
- np.argmin() - find index of minimum value
- np.argmax() - find index of maximum value
- These work on all ndarrays
- a4.min(axis=0) -- you can use axis as well
- Comparison operators
- x != 3
- x == 3
- np.sum(x > 3)
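The operations listed above can be sketched on a small array. The array `x` and the vector `v` below are made up for this example.

```python
import numpy as np

x = np.array([[1, 5, 3],
              [4, 2, 6]])
v = np.array([1, 2, 3])

# Arithmetic is applied element by element.
print(x + 1)
print(x // 2)
print(x ** 2)
print(x % 2)

# Dot product of each row of x with v.
print(np.dot(x, v))    # [20 26]

# Index of the smallest and largest values (in the flattened array).
print(np.argmin(x))    # 0
print(np.argmax(x))    # 5

# Minimum of each column.
print(x.min(axis=0))   # [1 2 3]

# Comparison operators return boolean arrays.
print(x != 3)
print(x == 3)
print(np.sum(x > 3))   # 3 values are greater than 3
```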
Aggregation - bringing things together by performing the same operation across a number of values (e.g. summing or averaging an array).
Mean is the same as average. You can find the average of a set of numbers by adding them up and dividing the total by how many there are.
Standard deviation is a measure of how spread out numbers are.
The variance is the average of the squared differences from the mean. The standard deviation is the square root of the variance.
To work out the variance, you:
- Work out the mean
- For each number, subtract the mean and square the result
- Find the average of the squared differences
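The steps above can be sketched by hand and checked against NumPy's own functions. The sample numbers are made up for the example, and `np.var()` and `np.std()` are used with their default settings.

```python
import numpy as np

numbers = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# 1. Work out the mean.
mean = numbers.sum() / numbers.size        # 5.0

# 2. For each number, subtract the mean and square the result.
squared_diffs = (numbers - mean) ** 2

# 3. Find the average of the squared differences.
variance = squared_diffs.mean()            # 4.0

# The standard deviation is the square root of the variance.
std = np.sqrt(variance)                    # 2.0

print(mean, variance, std)
print(np.isclose(variance, np.var(numbers)))  # True
print(np.isclose(std, np.std(numbers)))       # True
```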
The transpose of a matrix is obtained by moving the row data to the columns and the column data to the rows.
If we have an array of shape (X, Y) then the transpose of the array will have the shape (Y, X).
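A quick sketch with a (2, 3) array, using the `.T` attribute to transpose it (the array values are made up for the example):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
transposed = matrix.T

print(matrix.shape)      # (2, 3)
print(transposed.shape)  # (3, 2)
print(transposed)
# [[1 4]
#  [2 5]
#  [3 6]]
```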