# R Caret Package

# PRML Reading Notes

# [Kaggle] House Prices: Advanced Regression Techniques & Bayesian Optimization

## Terms Mentioned in the Posts

Kernel Ridge Regression (KRR) & Support Vector Regression (SVR)

#### Comparison

http://scikit-learn.org/stable/auto_examples/plot_kernel_ridge_regression.html
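The gist of the linked comparison: KRR and SVR both learn a kernelized function, but with different losses, so they trade off fit and predict time differently. A minimal sketch, assuming a made-up toy sine dataset and illustrative (untuned) parameters:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

# Toy 1-D regression problem (illustrative only; parameters below are not tuned)
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

# KRR: squared-error loss + L2 penalty -> closed-form linear solve, dense solution
krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
# SVR: epsilon-insensitive loss -> QP at fit time, sparse support-vector solution
svr = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1).fit(X, y)

print(svr.support_.shape[0], "support vectors out of", X.shape[0])
```

Because KRR's solution is dense, prediction touches every training point; SVR predicts using only its support vectors, which is the usual reason it is faster at predict time despite a slower fit.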

Bayesian Optimization, for automatic hyperparameter tuning.

**R package**

https://cran.r-project.org/web/packages/rBayesianOptimization/rBayesianOptimization.pdf

**A well-known python implementation of Bayesian Optimization**

https://github.com/JasperSnoek/spearmint
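Spearmint implements GP-based Bayesian optimization; the core loop (fit a Gaussian-process surrogate, then evaluate wherever expected improvement is highest) can be sketched with scikit-learn. This is not spearmint's API; the 1-D objective, bounds, and candidate-sampling scheme below are all made up for illustration:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Toy 1-D objective to minimize (stands in for e.g. a CV error)."""
    return (x - 0.3) ** 2

def expected_improvement(X_cand, gp, y_best):
    # EI for minimization: expected amount by which a candidate beats y_best
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))           # a few random initial evaluations
y = objective(X).ravel()

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    cand = rng.uniform(0, 1, size=(250, 1))  # crude random candidates, not a real inner optimizer
    x_next = cand[[np.argmax(expected_improvement(cand, gp, y.min()))]]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0, 0]))

best_x = X[np.argmin(y), 0]
print("best x found:", best_x)
```

The point of the acquisition function is that each expensive evaluation is spent where the surrogate predicts either a low mean or high uncertainty, instead of on a grid.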

**Example using spearmint to train an NN against Kaggle dataset**

# TF-IDF
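As a quick reminder of the term: tf-idf weights a term by its frequency within a document, discounted by how many documents in the corpus contain it. A minimal sketch with a made-up toy corpus (scikit-learn's `TfidfVectorizer` uses a smoothed variant of this formula):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)            # term frequency within this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    idf = math.log(len(corpus) / df)           # rarer terms get a higher weight
    return tf * idf

print(tf_idf("the", docs[0], docs))  # in every doc -> idf = log(1) = 0
print(tf_idf("dog", docs[1], docs))  # in one doc -> highest idf
```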

# Readings on Kaggle

# [Reading Notes] Efficient BackProp

# Abstract

- Explains common phenomena observed by practitioners
- Gives tricks to avoid undesirable behaviors of backprop, and explains why they work
- Proposes a few methods that do not have the impractical limitations that most "classical" second-order methods have for large neural networks

# Introduction

Getting BackProp to work well, and sometimes to work at all, can seem more of **an art than a science**

- Many seemingly arbitrary choices, e.g. the number and types of nodes, layers, learning rates, training and test sets
- There is no foolproof recipe for deciding them because they are largely **problem and data dependent**
- There are **heuristics or tricks** for improving its performance
- Issues of **convergence**

# Learning and Generalization

Many of the successful approaches can be categorized as *gradient-based* learning methods.

## Decomposing the generalization error into two terms: bias and variance

- Bias: how much the network output, averaged over all possible data sets, differs from the desired function
- Variance: how much the network output varies between datasets
- Early in training:
  - Bias is large, because the network output is far from the desired function
  - Variance is very small, because the data has had little influence yet
- Late in training:
  - Bias is small, because the network has learned the underlying function
  - If overtrained, variance will be large because the noise varies between datasets

- Minimum total error = min(bias + variance)

## How to make the model generalize well

- Choice of model (model selection)
- Architecture
- Cost function
- Tricks that increase the speed and quality of the minimization
- The existence of overtraining has led several authors to suggest that **inaccurate minimization algorithms** can be better than good ones

# Standard BackProp

## Learning rate

A proper choice of η is important:

- In the simplest case, η is a **scalar constant**
- More sophisticated procedures use a **variable** η
- In other methods, η takes the form of a **diagonal** matrix, or is **an estimate of the inverse Hessian matrix** of the cost function (second-derivative matrix)
  - Newton
  - Quasi-Newton
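On a quadratic cost, replacing the scalar η with the inverse Hessian (Newton's method) reaches the minimum in a single step. A toy sketch with a made-up 2×2 Hessian:

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # Hessian of the quadratic cost
b = np.array([1.0, 2.0])
grad = lambda w: A @ w - b               # gradient of f(w) = 0.5 w^T A w - b^T w

w = np.zeros(2)
w_gd = w - 0.1 * grad(w)                       # plain gradient step: scalar eta
w_newton = w - np.linalg.solve(A, grad(w))     # "eta" = inverse Hessian

print(np.allclose(grad(w_newton), 0.0))  # True: one Newton step solves A w = b
```

Quasi-Newton methods approximate that inverse-Hessian step without forming or inverting the full Hessian, which is what makes second-order ideas usable beyond tiny networks.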

# A Few Practical Tricks

BackProp can be very slow particularly for multilayered networks where the cost surface is typically non-quadratic, non-convex, and high dimensional with many local minima and/or flat regions.

No formula to guarantee:

- The network will converge to a good solution
- Convergence is swift
- Convergence even occurs at all

Below are a number of tricks that can greatly improve the chances of finding a good solution while also decreasing the convergence time often by orders of magnitude.

## Stochastic versus Batch Learning

The estimate of the stochastic gradient is noisy, so the weights may not move precisely down the gradient at each iteration

This “**noise**” at each iteration can be advantageous

### Advantages of Stochastic Learning

- Usually **much faster** than batch learning
  - Particularly on large redundant datasets
  - In practice, examples rarely appear more than once in a dataset, but there are usually **clusters of patterns** that are very similar
- Also often results in better solutions because of the **noise** in the updates
  - Batch learning will discover the minimum of whatever basin the weights are initially placed in
  - In stochastic learning, the noise present in the updates can result in the weights jumping into the basin of another, possibly deeper, local minimum
- Can be used for **tracking changes**
  - With batch learning, changes go undetected since we are likely to average over several rules
  - Online learning will track the changes when the function being modeled changes over time
  - A quite common scenario in industrial applications: the data distribution changes gradually over time

### Advantages of Batch Learning

Many acceleration techniques (eg. conjugate gradient) only operate in batch learning
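The two update schemes contrasted above can be sketched on plain linear regression; the data, step size, and pass counts below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(200)

eta = 0.05

# Batch learning: one update per pass, using the gradient over the full dataset
w_batch = np.zeros(3)
for _ in range(100):
    grad = X.T @ (X @ w_batch - y) / len(y)
    w_batch -= eta * grad

# Stochastic learning: one (noisy) update per example
w_sgd = np.zeros(3)
for _ in range(5):                       # a few passes over shuffled data
    for i in rng.permutation(len(y)):
        g_i = X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= eta * g_i

print(np.round(w_batch, 2), np.round(w_sgd, 2))
```

Both arrive near `w_true`, but SGD makes 200 cheap updates per pass instead of one expensive one, which is why it tends to win on large redundant datasets; the per-example noise is also what lets it hop between basins and track a drifting data distribution.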

No reading notes for the remainder of the paper; please just read the paper 🙂