PRML Reading Notes

PRML Reading Group, Chapter 1: Introduction


[Kaggle] House Prices: Advanced Regression Techniques & Bayesian Optimization


Terms Mentioned in the Posts

Kernel Ridge Regression (KRR) & Support Vector Regression (SVR)



Bayesian Optimization

For automatic parameter tuning.

R package

A well-known Python implementation of Bayesian Optimization

Example using spearmint to train an NN against a Kaggle dataset

Bayesian Optimization for Hyperparameter Tuning



Readings on Kaggle

Anthony Goldbloom gives you the secret to winning Kaggle competitions


[Reading Notes] Efficient BackProp



  1. Explains common phenomena observed by practitioners
  2. Gives some tricks to avoid undesirable behaviors of backprop, and explains why they work
  3. Proposes a few methods that do not have the impractical limitations that most “classical” second-order methods have for large neural networks


Getting BackProp to work well, and sometimes to work at all, can seem more of an art than a science

  1. Many seemingly arbitrary choices, e.g. the number and types of nodes, layers, learning rates, training and test sets
  2. There is no foolproof recipe for deciding them because they are largely problem and data dependent
  3. There are heuristics or tricks for improving its performance
  4. Issues of convergence

Learning and Generalization

Many of the successful approaches can be categorized as gradient-based learning methods.

Decomposing the generalization error into two terms: bias and variance

  1. Bias: how much the network output, averaged over all possible datasets, differs from the desired function
  2. Variance: how much the network output varies between datasets
  3. Early in training:
    1. Bias is large, because the network output is far from the desired function
    2. Variance is very small, because the data has had little influence yet
  4. Late in training:
    1. Bias is small, because the network has learned the underlying function
    2. If overtrained, variance will be large because the noise varies between datasets
  5. The minimum total error occurs where the sum of bias and variance is smallest
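
The decomposition above can be written out for squared error. This is a sketch in notation that is not from the notes: f(x; D) is assumed to be the network output after training on dataset D, g(x) the desired function, and the expectation is over all possible datasets D. Note that in this standard form the bias enters squared.

```latex
\mathbb{E}_{D}\!\left[\bigl(f(x; D) - g(x)\bigr)^{2}\right]
  = \underbrace{\bigl(\mathbb{E}_{D}[f(x; D)] - g(x)\bigr)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\!\left[\bigl(f(x; D) - \mathbb{E}_{D}[f(x; D)]\bigr)^{2}\right]}_{\text{variance}}
```

The cross term vanishes because the deviation of f(x; D) from its own mean has zero expectation over D.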

How to make the model generalize well

  1. Choice of model (model selection)
  2. Architecture
  3. Cost function
  4. Tricks that increase the speed and quality of the minimization
    • The existence of overtraining has led several authors to suggest that inaccurate minimization algorithms can be better than good ones

Standard BackProp

Learning rate

A proper choice of η is important:

  1. In the simplest case, η is a scalar constant
  2. More sophisticated procedures use variable η
  3. In other methods η takes the form of a diagonal matrix, or is an estimate of the inverse Hessian matrix of the cost function (second derivative matrix)
    1. Newton
    2. Quasi-Newton
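
The difference between a scalar η and an inverse-Hessian η can be illustrated on a toy problem. The following is a minimal sketch, not the paper's code: the quadratic cost E(w) = 0.5 wᵀHw, the matrix `H`, the starting point `w0`, and the helper `descend` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic cost E(w) = 0.5 * w^T H w (minimum at w = 0).
H = np.array([[10.0, 2.0],
              [2.0, 1.0]])


def grad(w):
    """Gradient of the quadratic cost."""
    return H @ w


def descend(w0, eta, steps):
    """Apply w <- w - eta @ grad(w); eta may be a scalar or a matrix."""
    w = np.asarray(w0, dtype=float)
    eta = np.asarray(eta, dtype=float)
    if eta.ndim == 0:
        eta = eta * np.eye(w.size)  # scalar learning rate as a scaled identity
    for _ in range(steps):
        w = w - eta @ grad(w)
    return w


w0 = [1.0, 1.0]
w_scalar = descend(w0, 0.1, steps=10)                # scalar constant eta
w_newton = descend(w0, np.linalg.inv(H), steps=1)    # eta = inverse Hessian (Newton step)

print("scalar eta, 10 steps:", w_scalar)
print("Newton, 1 step:      ", w_newton)
```

On an exactly quadratic cost the Newton step lands on the minimum in one update, while the scalar rate is limited by the largest curvature and crawls along the flat direction. For a real network the Hessian is large and the cost is not quadratic, which is why the paper treats these methods with care.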


A Few Practical Tricks

BackProp can be very slow, particularly for multilayered networks, where the cost surface is typically non-quadratic, non-convex, and high dimensional with many local minima and/or flat regions.

There is no formula that guarantees that:

  1. The network will converge to a good solution
  2. Convergence is swift
  3. Convergence occurs at all

Below are a number of tricks that can greatly improve the chances of finding a good solution while also decreasing the convergence time, often by orders of magnitude.

Stochastic versus Batch Learning

The stochastic estimate of the gradient is noisy, so the weights may not move precisely down the gradient at each iteration

This “noise” at each iteration can be advantageous

Advantages of Stochastic Learning

  1. Usually much faster than batch learning
    1. Particularly on large redundant datasets
    2. In practice, examples rarely appear more than once in a dataset, but there are usually clusters of patterns that are very similar
  2. Also often results in better solutions because of the noise in the updates
    1. Batch learning will discover the minimum of whatever basin the weights are initially placed in
    2. In stochastic learning, the noise present in the updates can result in the weights jumping into the basin of another, possibly deeper, local minimum
  3. Can be used for tracking changes
    1. With batch learning, changes go undetected since we are likely to average over several rules
    2. Online learning will track the changes when the function being modeled is changing over time
      • A quite common scenario in industrial applications: data distribution changes gradually over time
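
The speed advantage on redundant data can be seen in a small experiment. This is a minimal sketch under assumptions that are not from the notes: a noiseless linear model, a dataset built by repeating 100 samples 10 times, a fixed learning rate `eta`, and hypothetical helper names `mse` and `batch_grad`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical redundant dataset: 100 distinct samples repeated 10 times.
X_base = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
X = np.tile(X_base, (10, 1))
y = X @ w_true
y_base = X_base @ w_true


def mse(w):
    """Mean squared error on the full (redundant) dataset."""
    return float(np.mean((X @ w - y) ** 2))


def batch_grad(A, b, w):
    """Gradient of the mean squared error of a linear model on (A, b)."""
    return 2.0 * A.T @ (A @ w - b) / len(A)


eta = 0.01
w_init = np.zeros(3)

# The batch gradient over all 1000 rows equals the one over the 100 distinct rows,
# so 90% of the batch computation is spent on repeated information.
g_full = batch_grad(X, y, w_init)
g_base = batch_grad(X_base, y_base, w_init)

# Batch learning: one weight update per pass over the data.
w_batch = w_init - eta * g_full

# Stochastic learning: one weight update per sample, 1000 updates in the same pass.
w_sgd = w_init.copy()
for i in rng.permutation(len(X)):
    err = X[i] @ w_sgd - y[i]
    w_sgd -= eta * 2.0 * err * X[i]

print("MSE after one pass, batch:     ", mse(w_batch))
print("MSE after one pass, stochastic:", mse(w_sgd))
```

Both methods look at every sample once, but the stochastic version makes progress after each sample instead of once per pass, so it ends the pass far closer to the solution.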

Advantages of Batch Learning

Many acceleration techniques (e.g. conjugate gradient) only operate in batch learning


No reading notes for the remainder of the paper; please just read the paper 🙂

