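1 Optional Lab: Python, NumPy and Vectorization

A brief introduction to some of the scientific computing used in this course, in particular the NumPy scientific computing package and its use with Python.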

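2 Goals

In this lab, you will review the features of NumPy and Python that are used in the course.

Python is the programming language used in this course. The NumPy library extends the base capabilities of Python with a richer set of data types, including more numeric types, vectors, matrices, and many matrix functions. NumPy and Python work together fairly seamlessly: Python arithmetic operators work on NumPy data types, and many NumPy functions accept Python data types.

NumPy's basic data structure is an indexable, n-dimensional array containing elements of the same type (dtype). Here, dimension refers to the number of indexes of an array. A 1-D array has one index. In Course 1, we represent vectors as NumPy 1-D arrays. A 1-D array of shape (n,) has n elements indexed [0] through [n-1].

Data creation routines in NumPy generally take the shape of the object as their first parameter. This can be a single value for a 1-D result or a tuple (n, m, ...) specifying the shape of the result. Below are examples of creating vectors with these routines.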

import numpy as np    # it is an unofficial standard to use np for numpy
import time
# NumPy routines which allocate memory and fill arrays with value
a = np.zeros(4);                print(f"np.zeros(4) :   a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.zeros((4,));             print(f"np.zeros(4,) :  a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.random_sample(4); print(f"np.random.random_sample(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")

Output:

np.zeros(4) :   a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.zeros(4,) :  a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.random.random_sample(4): a = [0.40589302 0.63171453 0.69259702 0.54159911], a shape = (4,), a data type = float64

Some data creation routines do not take a shape tuple:

# NumPy routines which allocate memory and fill arrays with value but do not accept shape as input argument
a = np.arange(4.);              print(f"np.arange(4.):     a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.rand(4);          print(f"np.random.rand(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")

Output:

np.arange(4.):     a = [0. 1. 2. 3.], a shape = (4,), a data type = float64
np.random.rand(4): a = [0.54170759 0.00065357 0.46959253 0.09870197], a shape = (4,), a data type = float64

Values can also be specified manually:

# NumPy routines which allocate memory and fill with user specified values
a = np.array([5,4,3,2]);  print(f"np.array([5,4,3,2]):  a = {a},     a shape = {a.shape}, a data type = {a.dtype}")
a = np.array([5.,4,3,2]); print(f"np.array([5.,4,3,2]): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")

Output:

np.array([5,4,3,2]):  a = [5 4 3 2],     a shape = (4,), a data type = int32
np.array([5.,4,3,2]): a = [5. 4. 3. 2.], a shape = (4,), a data type = float64

These have all created a one-dimensional vector a with four elements. a.shape returns the dimensions. Here we see a.shape = (4,), indicating a 1-D array with 4 elements.
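The integer dtype printed above (int32) depends on the platform and NumPy version; on many systems the default integer type is int64 instead. A minimal sketch, assuming you want the same types everywhere, passes dtype explicitly. The names a_int and a_flt are illustrative and not part of the lab.

```python
import numpy as np

# Assumption: pinning dtype is optional; the lab itself relies on NumPy's defaults.
a_int = np.array([5, 4, 3, 2], dtype=np.int64)     # force 64-bit integers
a_flt = np.array([5, 4, 3, 2], dtype=np.float64)   # force 64-bit floats

print(f"a_int = {a_int}, shape = {a_int.shape}, dtype = {a_int.dtype}")
print(f"a_flt = {a_flt}, shape = {a_flt.shape}, dtype = {a_flt.dtype}")

# ndim is the number of indexes, size is the total number of elements
print(f"ndim = {a_int.ndim}, size = {a_int.size}")
```

Both arrays still have shape (4,); only the element type differs.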

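3 Operations on Vectors

3.1 Indexing

Elements of vectors can be accessed via indexing and slicing. NumPy provides a very complete set of indexing and slicing capabilities. We will explore only the basics needed for the course here. For more details, see the NumPy reference on slicing and indexing.

Indexing means referring to an element of an array by its position within the array.

Slicing means getting a subset of elements from an array based on their indices.

NumPy starts indexing at zero, so the 3rd element of a vector a is a[2].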

#vector indexing operations on 1-D vectors
a = np.arange(10)
print(a)

#access an element
print(f"a[2].shape: {a[2].shape} a[2]  = {a[2]}, Accessing an element returns a scalar")

# access the last element, negative indexes count from the end
print(f"a[-1] = {a[-1]}")

#indexes must be within the range of the vector or they will produce an error
try:
    c = a[10]
except Exception as e:
    print("The error message you'll see is:")
    print(e)

Output:

[0 1 2 3 4 5 6 7 8 9]
a[2].shape: () a[2]  = 2, Accessing an element returns a scalar
a[-1] = 9
The error message you'll see is:
index 10 is out of bounds for axis 0 with size 10

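3.2 Slicing

Slicing creates an array of indices using a set of three values (start:stop:step). A subset of the three values is also valid. Its use is best explained by example: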

#vector slicing operations
a = np.arange(10)
print(f"a         = {a}")

#access 5 consecutive elements (start:stop:step)
c = a[2:7:1];     print("a[2:7:1] = ", c)

# access 3 elements separated by two 
c = a[2:7:2];     print("a[2:7:2] = ", c)

# access all elements index 3 and above
c = a[3:];        print("a[3:]    = ", c)

# access all elements below index 3
c = a[:3];        print("a[:3]    = ", c)

# access all elements
c = a[:];         print("a[:]     = ", c)

Output:

a         = [0 1 2 3 4 5 6 7 8 9]
a[2:7:1] =  [2 3 4 5 6]
a[2:7:2] =  [2 4 6]
a[3:]    =  [3 4 5 6 7 8 9]
a[:3]    =  [0 1 2]
a[:]     =  [0 1 2 3 4 5 6 7 8 9]
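One detail the slicing examples do not show is that a basic slice is a view onto the original array's memory, not a copy. The sketch below illustrates this under that NumPy behavior; the names view and independent are illustrative.

```python
import numpy as np

a = np.arange(10)

view = a[2:7:2]      # basic slicing returns a view onto a's data
view[0] = 99         # this write goes through to a[2]
print(f"a after writing through the view: {a}")

independent = a[2:7:2].copy()   # .copy() allocates separate memory
independent[0] = -1             # a is not affected by this write
print(f"a[2] after writing to the copy:   {a[2]}")
```

Calling .copy() is the usual choice when a slice will be modified and the original vector must stay unchanged.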

3.3 Single vector operations

There are a number of useful operations that act on a single vector.

a = np.array([1,2,3,4])
print(f"a             : {a}")
# negate elements of a
b = -a 
print(f"b = -a        : {b}")

# sum all elements of a, returns a scalar
b = np.sum(a) 
print(f"b = np.sum(a) : {b}")

b = np.mean(a)
print(f"b = np.mean(a): {b}")

b = a**2
print(f"b = a**2      : {b}")

Output:

a             : [1 2 3 4]
b = -a        : [-1 -2 -3 -4]
b = np.sum(a) : 10
b = np.mean(a): 2.5
b = a**2      : [ 1  4  9 16]

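3.4 Vector-vector element-wise operations

Most of the NumPy arithmetic, logical and comparison operations apply to vectors as well. These operators work on an element-by-element basis.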

a = np.array([ 1, 2, 3, 4])
b = np.array([-1,-2, 3, 4])
print(f"Binary operators work element wise: {a + b}")

Output:

Binary operators work element wise: [0 0 6 8]

Of course, for this to work correctly, the vectors must be of the same size:

#try a mismatched vector operation
c = np.array([1, 2])
try:
    d = a + c
except Exception as e:
    print("The error message you'll see is:")
    print(e)

Output:

The error message you'll see is:
operands could not be broadcast together with shapes (4,) (2,)
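This section states that comparison and logical operations also work element-wise, but only addition is demonstrated. A short sketch of the other two kinds, reusing the same vectors, might look like this; the result names are illustrative.

```python
import numpy as np

a = np.array([ 1,  2, 3, 4])
b = np.array([-1, -2, 3, 4])

equal = (a == b)                      # element-wise comparison, gives a boolean array
greater = (a > b)                     # element-wise "greater than"
both = np.logical_and(a > 0, b > 0)   # element-wise logical AND

print(f"a == b          : {equal}")
print(f"a > b           : {greater}")
print(f"a > 0 and b > 0 : {both}")
```

Each result is a boolean array with the same shape (4,) as the inputs.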

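3.5 Scalar-vector operations

Vectors can be 'scaled' by scalar values. A scalar value is just a number. The scalar multiplies all the elements of the vector.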

a = np.array([1, 2, 3, 4])

# multiply a by a scalar
b = 5 * a 
print(f"b = 5 * a : {b}")

Output:

b = 5 * a : [ 5 10 15 20]

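3.6 Vector-vector dot product

The dot product is a mainstay of linear algebra and NumPy. It is an operation used extensively in this course. The dot product multiplies the values in two vectors element-wise and then sums the result. It requires the two vectors to have the same dimension. Note that the dot product returns a scalar value.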

def my_dot(a, b):
    """
    Compute the dot product of two vectors

    Args:
      a (ndarray (n,)):  input vector
      b (ndarray (n,)):  input vector with same dimension as a

    Returns:
      x (scalar): dot product of a and b
    """
    x = 0
    for i in range(a.shape[0]):
        x = x + a[i] * b[i]
    return x

# test 1-D
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
print(f"my_dot(a, b) = {my_dot(a, b)}")

Output:

my_dot(a, b) = 24

# test 1-D, this time with NumPy's built-in np.dot, which should match my_dot
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
c = np.dot(a, b)
print(f"NumPy 1-D np.dot(a, b) = {c}, np.dot(a, b).shape = {c.shape} ") 
c = np.dot(b, a)
print(f"NumPy 1-D np.dot(b, a) = {c}, np.dot(b, a).shape = {c.shape} ")

Output:

NumPy 1-D np.dot(a, b) = 24, np.dot(a, b).shape = () 
NumPy 1-D np.dot(b, a) = 24, np.dot(b, a).shape = ()
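The description of the dot product as "multiply element-wise, then sum" can be checked directly. The sketch below compares three equivalent ways of computing it for 1-D arrays; the names via_sum, via_dot and via_at are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])

via_sum = np.sum(a * b)   # multiply element-wise, then sum the products
via_dot = np.dot(a, b)    # NumPy's dot product routine
via_at = a @ b            # the @ operator on two 1-D arrays is also a dot product

print(f"np.sum(a * b) = {via_sum}, np.dot(a, b) = {via_dot}, a @ b = {via_at}")
```

All three give 24 here: (1)(-1) + (2)(4) + (3)(3) + (4)(2) = -1 + 8 + 9 + 8.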

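3.7 The Need for Speed: vector vs for loop

The NumPy library is used because it improves speed and memory efficiency. The following timing comparison demonstrates this: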

np.random.seed(1)
a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)
toc = time.time()  # capture end time

print(f"np.dot(a, b) =  {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
c = my_dot(a,b)
toc = time.time()  # capture end time

print(f"my_dot(a, b) =  {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")

del(a);del(b)  #remove these big arrays from memory

Output:

np.dot(a, b) =  2501072.5817
Vectorized version duration: 46.8779 ms 
my_dot(a, b) =  2501072.5817
loop version duration: 4033.1399 ms

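So, vectorization provides a large speed-up in this example (here roughly 47 ms versus 4033 ms; exact timings vary by machine). This is because NumPy makes better use of the data parallelism available in the underlying hardware. GPUs and modern CPUs implement Single Instruction, Multiple Data (SIMD) pipelines, allowing multiple operations to be issued in parallel. This is critical in machine learning, where data sets are often very large.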
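The timing comparison can be repeated in a self-contained form. The sketch below is an assumption-laden variant, not the lab's code: it uses time.perf_counter (a higher-resolution timer than time.time), a seeded np.random.default_rng generator, and much smaller arrays so it finishes quickly. loop_dot is a hypothetical helper equivalent to the explicit loop version.

```python
import time
import numpy as np

def loop_dot(a, b):
    """Hypothetical helper: dot product using an explicit Python loop."""
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

rng = np.random.default_rng(1)   # seeded generator for repeatable data
a = rng.random(100_000)          # far smaller than the lab's arrays, to keep the run short
b = rng.random(100_000)

tic = time.perf_counter()
c_vec = np.dot(a, b)
t_vec = time.perf_counter() - tic

tic = time.perf_counter()
c_loop = loop_dot(a, b)
t_loop = time.perf_counter() - tic

print(f"vectorized: {1000 * t_vec:.4f} ms, loop: {1000 * t_loop:.4f} ms")
print(f"results agree: {np.isclose(c_vec, c_loop)}")
```

The two results are compared with np.isclose rather than ==, because the two versions may accumulate floating-point rounding differently.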

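4 Matrices

Matrices are two-dimensional arrays. The elements of a matrix are all of the same type. In notation, matrices are denoted with capital, bold letters such as X. In this and other labs, m is often the number of rows and n the number of columns. The elements of a matrix can be referenced with a two-dimensional index. In math settings, the numbers in an index typically run from 1 to n. In computer science and these labs, indexing runs from 0 to n-1. In generic matrix notation, the first index is the row and the second is the column.

NumPy's basic data structure is an indexable, n-dimensional array containing elements of the same type (dtype), as described earlier. Matrices have a two-dimensional (2-D) index [m, n]. Below you will review:

data creation
slicing and indexing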

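4.1 Matrix creation

The same functions that create 1-D vectors also create 2-D arrays. Note how NumPy uses brackets to denote each dimension. Further, when printing, NumPy prints one row per line.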

a = np.zeros((1, 5))                                       
print(f"a shape = {a.shape}, a = {a}")                     

a = np.zeros((2, 1))                                                                   
print(f"a shape = {a.shape}, a = {a}") 

a = np.random.random_sample((1, 1))  
print(f"a shape = {a.shape}, a = {a}")

Output:

a shape = (1, 5), a = [[0. 0. 0. 0. 0.]]
a shape = (2, 1), a = [[0.]
 [0.]]
a shape = (1, 1), a = [[0.44236513]]

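One can also specify data manually. Dimensions are specified with additional brackets, matching the format of the printed output above.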

# NumPy routines which allocate memory and fill with user specified values
a = np.array([[5], [4], [3]]);   print(f" a shape = {a.shape}, np.array: a = {a}")
a = np.array([[5],   # One can also
              [4],   # separate values
              [3]]); #into separate rows
print(f" a shape = {a.shape}, np.array: a = {a}")

Output:

 a shape = (3, 1), np.array: a = [[5]
 [4]
 [3]]
 a shape = (3, 1), np.array: a = [[5]
 [4]
 [3]]
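The manual examples above all have a single column. A sketch with more than one column, using the m rows by n columns convention from the start of this section, could look like the following; the names X and Z are illustrative.

```python
import numpy as np

# Each inner bracket is one row; the outer bracket holds the rows.
X = np.array([[1, 2, 3],
              [4, 5, 6]])

m, n = X.shape   # m = number of rows, n = number of columns
print(f"X.shape = {X.shape}, m = {m}, n = {n}, X.ndim = {X.ndim}")

Z = np.zeros((m, n))   # the shape tuple can be reused to create a matching array
print(f"Z.shape = {Z.shape}")
```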

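4.2 Operations on Matrices

4.2.1 Indexing

Matrices have two indexes, which describe [row, column]. Access can return either a single element or a row/column. See below: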

#vector indexing operations on matrices
a = np.arange(6).reshape(-1, 2)   #reshape is a convenient way to create matrices
print(f"a.shape: {a.shape}, \na= {a}")

#access an element
print(f"\na[2,0].shape:   {a[2, 0].shape}, a[2,0] = {a[2, 0]},     type(a[2,0]) = {type(a[2, 0])} Accessing an element returns a scalar\n")

#access a row
print(f"a[2].shape:   {a[2].shape}, a[2]   = {a[2]}, type(a[2])   = {type(a[2])}")

Output:

a.shape: (3, 2), 
a= [[0 1]
 [2 3]
 [4 5]]

a[2,0].shape:   (), a[2,0] = 4,     type(a[2,0]) = <class 'numpy.int32'> Accessing an element returns a scalar

a[2].shape:   (2,), a[2]   = [4 5], type(a[2])   = <class 'numpy.ndarray'>

The last example is worth noting. Accessing a matrix by specifying only the row returns a 1-D vector.
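The example above accesses an element and a row, but not a column. A minimal sketch of column access, and of keeping a selected row two-dimensional, is shown here; the names col and row_2d are illustrative.

```python
import numpy as np

a = np.arange(6).reshape(-1, 2)   # 3 rows, 2 columns

col = a[:, 0]        # all rows, column 0: the result is a 1-D array
row_2d = a[2:3, :]   # slicing the row index (instead of a[2]) keeps the result 2-D

print(f"a[:, 0]   = {col}, shape = {col.shape}")
print(f"a[2:3, :] = {row_2d}, shape = {row_2d.shape}")
```

Indexing with an integer drops that dimension, while indexing with a slice keeps it.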

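Reshape: The example above uses reshape to set the shape of the array.

a = np.arange(6).reshape(-1, 2)

This line of code first creates a 1-D vector of six elements. It then reshapes that vector into a 2-D array using the reshape command. It could also have been written:

a = np.arange(6).reshape(3, 2)

to arrive at the same 3-row, 2-column array. The -1 argument tells the routine to compute the number of rows from the size of the array and the number of columns.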
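The -1 argument can stand in for either dimension, and reshape raises an error when the element count does not divide evenly. The sketch below illustrates both cases; the names v and failed are illustrative.

```python
import numpy as np

v = np.arange(6)

print(v.reshape(-1, 2).shape)   # rows inferred from 6 elements and 2 columns
print(v.reshape(3, -1).shape)   # columns inferred from 6 elements and 3 rows

failed = False
try:
    v.reshape(-1, 4)            # 6 elements cannot fill rows of 4 columns
except ValueError as e:
    failed = True
    print(f"reshape failed: {e}")
```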

4.2.2 Slicing

Slicing creates an array of indices using a set of three values (start:stop:step).

#vector 2-D slicing operations
a = np.arange(20).reshape(-1, 10)
print(f"a = \n{a}")

#access 5 consecutive elements (start:stop:step)
print("a[0, 2:7:1] = ", a[0, 2:7:1], ",  a[0, 2:7:1].shape =", a[0, 2:7:1].shape, "a 1-D array")

#access 5 consecutive elements (start:stop:step) in two rows
print("a[:, 2:7:1] = \n", a[:, 2:7:1], ",  a[:, 2:7:1].shape =", a[:, 2:7:1].shape, "a 2-D array")

# access all elements
print("a[:,:] = \n", a[:,:], ",  a[:,:].shape =", a[:,:].shape)

# access all elements in one row (very common usage)
print("a[1,:] = ", a[1,:], ",  a[1,:].shape =", a[1,:].shape, "a 1-D array")
# same as
print("a[1]   = ", a[1],   ",  a[1].shape   =", a[1].shape, "a 1-D array")

Output:

a = 
[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]]
a[0, 2:7:1] =  [2 3 4 5 6] ,  a[0, 2:7:1].shape = (5,) a 1-D array
a[:, 2:7:1] = 
 [[ 2  3  4  5  6]
 [12 13 14 15 16]] ,  a[:, 2:7:1].shape = (2, 5) a 2-D array
a[:,:] = 
 [[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]] ,  a[:,:].shape = (2, 10)
a[1,:] =  [10 11 12 13 14 15 16 17 18 19] ,  a[1,:].shape = (10,) a 1-D array
a[1]   =  [10 11 12 13 14 15 16 17 18 19] ,  a[1].shape   = (10,) a 1-D array
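The 2-D examples above all use a step of 1. As a short sketch, a larger step and a negative start work on matrix columns in the same way as on vectors; the names every_fifth and last_three are illustrative.

```python
import numpy as np

a = np.arange(20).reshape(-1, 10)   # 2 rows, 10 columns

every_fifth = a[:, ::5]   # all rows, columns 0 and 5
last_three = a[1, -3:]    # row 1, final three columns

print(f"a[:, ::5] = \n{every_fifth}, shape = {every_fifth.shape}")
print(f"a[1, -3:] = {last_three}, shape = {last_three.shape}")
```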

In this lab, you reviewed the features of Python and NumPy that are needed for the course.