最优化（Opetimization）

策略
总结

这一部分非常的关键。我们通过这个最优化的过程，来找到使得 Loss function 最小的参数 W。

策略

随机搜索

如何优化 W，这里有个比较简单的方式。方法就是不断的随机尝试，找到一个相对好的。

bestloss = float('inf')
for num in xrange(1000):
    W = np.random.randn(10, 3073) * 1.0001
    loss = L(X_train, Y_train, W)
    if loss < bestloss:
        bestloss = loss
        bestW = W
    print(in attemp %d the loss was %f, best %f' %(num, loss, bestloss))

通过测试集测试准确率约为 15.5%。

随机本地搜索

策略是向不同的方向去走，如果某个方向是下山的就往这个方向走。

W = np.random.randn(10, 3073) * 0.001
bestloss = float('inf')
for i in xrange(1000):
    step_size = 0.0001
    Wtry = W + np.random.randn(10, 3073) * step_size
    loss = L(Xtr_cols, Ytr, Wtry)
    if loss < bestloss:
        bestloss = loss
        W = Wtry

这种方式的准确率是 21.4% 左右。

梯度下降方式

上面的方式虽然是在下山但却不是最快的方式。我们通过计算一个函数的梯度，可以更加快速的找到一个局部的最小值。

\begin{align} \frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h} \end{align}

计算梯度有两种方式：一种是数值梯度法，一种是分析梯度法。

def eval_numerical_gradient(f, x):
    fx = f(x)
    grad = np.zeros(x.shape)
    h = 0.00001
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finshed:
        ix = it.multi_index # (1, 2) ... 位置
        old_value = x[ix]
        x[ix] = old_value + h
        fxh = f(x) # x 中只是 ix 出的 value 变了
        x[ix] = old_value

        grad[ix] = (fxh - fx) / h
        it.iternext()

    return grad

这个函数返回每个点处的梯度值。

def CIFAR10_fun(W):
    return L(X_train, y_train, W)

W = np.random.rand(10, 3073) * 0.001
df = eval_numerical_gradient(CIFAR10_fun, W)

梯度告诉我们损失函数在每个维度上的斜率，以此来进行更新：

loss_original = CIFAR10_loss_fun(W) # 初始损失值
print 'original loss: %f' % (loss_original, )

# 查看不同步长的效果
for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
    step_size = 10 ** step_size_log
    W_new = W - step_size * df # 权重空间中的新位置
    loss_new = CIFAR10_loss_fun(W_new)
    print 'for step size %f new loss: %f' % (step_size, loss_new)

在梯度负方向上更新：在上面的代码中，为了计算 W_new，要注意我们是向着梯度 df 的负方向去更新，这是因为我们希望损失函数值是降低而不是升高。

步长的不同会有很大的影响，上面的结果。

# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036

微分分析

使用有限差值近似计算梯度比较简单，但缺点在于终究只是近似（因为我们对于 h 值是选取了一个很小的数值，但真正的梯度定义中 h 趋向 0 的极限），且耗费计算资源太多。第二个梯度计算方法是利用微分来分析，能得到计算梯度的公式（不是近似），用公式计算梯度速度很快，唯一不好的就是实现的时候容易出错。为了解决这个问题，在实际操作时常常将分析梯度法的结果和数值梯度法的结果作比较，以此来检查其实现的正确性，这个步骤叫做梯度检查

用 SVM 的损失函数在某个数据点上的计算来举例：

\begin{align} L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right] \end{align}

下面是对 \(W_{y_i}\) 求导：

\begin{align} \nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i \end{align}

其中 1 是一个示性函数，如果括号中的条件为真，那么函数值为 1，如果为假，则函数值为 0。

梯度下降

while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad # 进行梯度更新

这个简单的循环在所有的神经网络核心库中都有。虽然也有其他实现最优化的方法（比如 LBFGS），但是到目前为止，梯度下降是对神经网络的损失函数最优化中最常用的方法。

小批量数据梯度下降（Mini-batch gradient descent）：在大规模的应用中（比如 ILSVRC 挑战赛），训练数据可以达到百万级量级。如果像这样计算整个训练集，来获得仅仅一个参数的更新就太浪费了。一个常用的方法是计算训练集中的小批量（batches）数据。例如，在目前最高水平的卷积神经网络中，一个典型的小批量包含 256 个例子，而整个训练集是多少呢？一百二十万个。这个小批量数据就用来实现一个参数更新：

while True:
    data_batch = sample_training_data(data, 256) # 256 个数据
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad # 参数更新