Pattern Recognition and Machine Learning

Rating: 9.6

Author: Christopher Bishop
Publisher: Springer
Publication date: 2007-10-01
Pages: 738
List price: USD 94.95
Binding: Hardcover
ISBN: 9780387310732


Synopsis:

The dramatic growth in practical applications for machine learning over the last ten years has been accompanied by many important developments in the underlying algorithms and techniques. For example, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic techniques. The practical applicability of Bayesian methods has been greatly enhanced by the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation, while new models based on kernels have had a significant impact on both algorithms and applications.

This completely new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. Familiarity with multivariate calculus and basic linear algebra is required, and some experience in the use of probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory.

The book is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. Extensive support is provided for course instructors, including more than 400 exercises, graded according to difficulty. Example solutions for a subset of the exercises are available from the book web site, while solutions for the remainder can be obtained by instructors from the publisher. The book is supported by a great deal of additional material, and the reader is encouraged to visit the book web site for the latest information.

Table of Contents:

1 Introduction 1
1.1 Example: Polynomial Curve Fitting 4
1.2 Probability Theory 12
1.2.1 Probability densities 17
1.2.2 Expectations and covariances 19
1.2.3 Bayesian probabilities 21
1.2.4 The Gaussian distribution 24
1.2.5 Curve fitting re-visited 28
1.2.6 Bayesian curve fitting 30
1.3 Model Selection 32
1.4 The Curse of Dimensionality 33
1.5 Decision Theory 38
1.5.1 Minimizing the misclassification rate 39
1.5.2 Minimizing the expected loss 41
1.5.3 The reject option 42
1.5.4 Inference and decision 42
1.5.5 Loss functions for regression 46
1.6 Information Theory 48
1.6.1 Relative entropy and mutual information 55
Exercises 58

2 Probability Distributions 67
2.1 Binary Variables 68
2.1.1 The beta distribution 71
2.2 Multinomial Variables 74
2.2.1 The Dirichlet distribution 76
2.3 The Gaussian Distribution 78
2.3.1 Conditional Gaussian distributions 85
2.3.2 Marginal Gaussian distributions 88
2.3.3 Bayes’ theorem for Gaussian variables 90
2.3.4 Maximum likelihood for the Gaussian 93
2.3.5 Sequential estimation 94
2.3.6 Bayesian inference for the Gaussian 97
2.3.7 Student’s t-distribution 102
2.3.8 Periodic variables 105
2.3.9 Mixtures of Gaussians 110
2.4 The Exponential Family 113
2.4.1 Maximum likelihood and sufficient statistics 116
2.4.2 Conjugate priors 117
2.4.3 Noninformative priors 117
2.5 Nonparametric Methods 120
2.5.1 Kernel density estimators 122
2.5.2 Nearest-neighbour methods 124
Exercises 127

3 Linear Models for Regression 137
3.1 Linear Basis Function Models 138
3.1.1 Maximum likelihood and least squares 140
3.1.2 Geometry of least squares 143
3.1.3 Sequential learning 143
3.1.4 Regularized least squares 144
3.1.5 Multiple outputs 146
3.2 The Bias-Variance Decomposition 147
3.3 Bayesian Linear Regression 152
3.3.1 Parameter distribution 153
3.3.2 Predictive distribution 156
3.3.3 Equivalent kernel 157
3.4 Bayesian Model Comparison 161
3.5 The Evidence Approximation 165
3.5.1 Evaluation of the evidence function 166
3.5.2 Maximizing the evidence function 168
3.5.3 Effective number of parameters 170
3.6 Limitations of Fixed Basis Functions 172
Exercises 173

4 Linear Models for Classification 179
4.1 Discriminant Functions 181
4.1.1 Two classes 181
4.1.2 Multiple classes 182
4.1.3 Least squares for classification 184
4.1.4 Fisher’s linear discriminant 186
4.1.5 Relation to least squares 189
4.1.6 Fisher’s discriminant for multiple classes 191
4.1.7 The perceptron algorithm 192
4.2 Probabilistic Generative Models 196
4.2.1 Continuous inputs 198
4.2.2 Maximum likelihood solution 200
4.2.3 Discrete features 202
4.2.4 Exponential family 202
4.3 Probabilistic Discriminative Models 203
4.3.1 Fixed basis functions 204
4.3.2 Logistic regression 205
4.3.3 Iterative reweighted least squares 207
4.3.4 Multiclass logistic regression 209
4.3.5 Probit regression 210
4.3.6 Canonical link functions 212
4.4 The Laplace Approximation 213
4.4.1 Model comparison and BIC 216
4.5 Bayesian Logistic Regression 217
4.5.1 Laplace approximation 217
4.5.2 Predictive distribution 218
Exercises 220

5 Neural Networks 225
5.1 Feed-forward Network Functions 227
5.1.1 Weight-space symmetries 231
5.2 Network Training 232
5.2.1 Parameter optimization 236
5.2.2 Local quadratic approximation 237
5.2.3 Use of gradient information 239
5.2.4 Gradient descent optimization 240
5.3 Error Backpropagation 241
5.3.1 Evaluation of error-function derivatives 242
5.3.2 A simple example 245
5.3.3 Efficiency of backpropagation 246
5.3.4 The Jacobian matrix 247
5.4 The Hessian Matrix 249
5.4.1 Diagonal approximation 250
5.4.2 Outer product approximation 251
5.4.3 Inverse Hessian 252
5.4.4 Finite differences 252
5.4.5 Exact evaluation of the Hessian 253
5.4.6 Fast multiplication by the Hessian 254
5.5 Regularization in Neural Networks 256
5.5.1 Consistent Gaussian priors 257
5.5.2 Early stopping 259
5.5.3 Invariances 261
5.5.4 Tangent propagation 263
5.5.5 Training with transformed data 265
5.5.6 Convolutional networks 267
5.5.7 Soft weight sharing 269
5.6 Mixture Density Networks 272
5.7 Bayesian Neural Networks 277
5.7.1 Posterior parameter distribution 278
5.7.2 Hyperparameter optimization 280
5.7.3 Bayesian neural networks for classification 281
Exercises 284

6 Kernel Methods 291
6.1 Dual Representations 293
6.2 Constructing Kernels 294
6.3 Radial Basis Function Networks 299
6.3.1 Nadaraya-Watson model 301
6.4 Gaussian Processes 303
6.4.1 Linear regression revisited 304
6.4.2 Gaussian processes for regression 306
6.4.3 Learning the hyperparameters 311
6.4.4 Automatic relevance determination 312
6.4.5 Gaussian processes for classification 313
6.4.6 Laplace approximation 315
6.4.7 Connection to neural networks 319
Exercises 320

7 Sparse Kernel Machines 325
7.1 Maximum Margin Classifiers 326
7.1.1 Overlapping class distributions 331
7.1.2 Relation to logistic regression 336
7.1.3 Multiclass SVMs 338
7.1.4 SVMs for regression 339
7.1.5 Computational learning theory 344
7.2 Relevance Vector Machines 345
7.2.1 RVM for regression 345
7.2.2 Analysis of sparsity 349
7.2.3 RVM for classification 353
Exercises 357

8 Graphical Models 359
8.1 Bayesian Networks 360
8.1.1 Example: Polynomial regression 362
8.1.2 Generative models 365
8.1.3 Discrete variables 366
8.1.4 Linear-Gaussian models 370
8.2 Conditional Independence 372
8.2.1 Three example graphs 373
8.2.2 D-separation 378
8.3 Markov Random Fields 383
8.3.1 Conditional independence properties 383
8.3.2 Factorization properties 384
8.3.3 Illustration: Image de-noising 387
8.3.4 Relation to directed graphs 390
8.4 Inference in Graphical Models 393
8.4.1 Inference on a chain 394
8.4.2 Trees 398
8.4.3 Factor graphs 399
8.4.4 The sum-product algorithm 402
8.4.5 The max-sum algorithm 411
8.4.6 Exact inference in general graphs 416
8.4.7 Loopy belief propagation 417
8.4.8 Learning the graph structure 418
Exercises 418

9 Mixture Models and EM 423
9.1 K-means Clustering 424
9.1.1 Image segmentation and compression 428
9.2 Mixtures of Gaussians 430
9.2.1 Maximum likelihood 432
9.2.2 EM for Gaussian mixtures 435
9.3 An Alternative View of EM 439
9.3.1 Gaussian mixtures revisited 441
9.3.2 Relation to K-means 443
9.3.3 Mixtures of Bernoulli distributions 444
9.3.4 EM for Bayesian linear regression 448
9.4 The EM Algorithm in General 450
Exercises 455

10 Approximate Inference 461
10.1 Variational Inference 462
10.1.1 Factorized distributions 464
10.1.2 Properties of factorized approximations 466
10.1.3 Example: The univariate Gaussian 470
10.1.4 Model comparison 473
10.2 Illustration: Variational Mixture of Gaussians 474
10.2.1 Variational distribution 475
10.2.2 Variational lower bound 481
10.2.3 Predictive density 482
10.2.4 Determining the number of components 483
10.2.5 Induced factorizations 485
10.3 Variational Linear Regression 486
10.3.1 Variational distribution 486
10.3.2 Predictive distribution 488
10.3.3 Lower bound 489
10.4 Exponential Family Distributions 490
10.4.1 Variational message passing 491
10.5 Local Variational Methods 493
10.6 Variational Logistic Regression 498
10.6.1 Variational posterior distribution 498
10.6.2 Optimizing the variational parameters 500
10.6.3 Inference of hyperparameters 502
10.7 Expectation Propagation 505
10.7.1 Example: The clutter problem 511
10.7.2 Expectation propagation on graphs 513
Exercises 517

11 Sampling Methods 523
11.1 Basic Sampling Algorithms 526
11.1.1 Standard distributions 526
11.1.2 Rejection sampling 528
11.1.3 Adaptive rejection sampling 530
11.1.4 Importance sampling 532
11.1.5 Sampling-importance-resampling 534
11.1.6 Sampling and the EM algorithm 536
11.2 Markov Chain Monte Carlo 537
11.2.1 Markov chains 539
11.2.2 The Metropolis-Hastings algorithm 541
11.3 Gibbs Sampling 542
11.4 Slice Sampling 546
11.5 The Hybrid Monte Carlo Algorithm 548
11.5.1 Dynamical systems 548
11.5.2 Hybrid Monte Carlo 552
11.6 Estimating the Partition Function 554
Exercises 556

12 Continuous Latent Variables 559
12.1 Principal Component Analysis 561
12.1.1 Maximum variance formulation 561
12.1.2 Minimum-error formulation 563
12.1.3 Applications of PCA 565
12.1.4 PCA for high-dimensional data 569
12.2 Probabilistic PCA 570
12.2.1 Maximum likelihood PCA 574
12.2.2 EM algorithm for PCA 577
12.2.3 Bayesian PCA 580
12.2.4 Factor analysis 583
12.3 Kernel PCA 586
12.4 Nonlinear Latent Variable Models 591
12.4.1 Independent component analysis 591
12.4.2 Autoassociative neural networks 592
12.4.3 Modelling nonlinear manifolds 595
Exercises 599

13 Sequential Data 605
13.1 Markov Models 607
13.2 Hidden Markov Models 610
13.2.1 Maximum likelihood for the HMM 615
13.2.2 The forward-backward algorithm 618
13.2.3 The sum-product algorithm for the HMM 625
13.2.4 Scaling factors 627
13.2.5 The Viterbi algorithm 629
13.2.6 Extensions of the hidden Markov model 631
13.3 Linear Dynamical Systems 635
13.3.1 Inference in LDS 638
13.3.2 Learning in LDS 642
13.3.3 Extensions of LDS 644
13.3.4 Particle filters 645
Exercises 646

14 Combining Models 653
14.1 Bayesian Model Averaging 654
14.2 Committees 655
14.3 Boosting 657
14.3.1 Minimizing exponential error 659
14.3.2 Error functions for boosting 661
14.4 Tree-based Models 663
14.5 Conditional Mixture Models 666
14.5.1 Mixtures of linear regression models 667
14.5.2 Mixtures of logistic models 670
14.5.3 Mixtures of experts 672
Exercises 674

Appendix A Data Sets 677
Appendix B Probability Distributions 685
Appendix C Properties of Matrices 695
Appendix D Calculus of Variations 703
Appendix E Lagrange Multipliers 707
References 711
