Linear Methods for Classification
4.1 Introduction¶
Since our predictor $G(x)$ takes values in a discrete set $\mathcal{G}$, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these decision boundaries are linear; this is what we mean by linear methods for classification.
To build the connection between the features $X$ and the discrete outcomes, we can rely on the logit transformation as follows:
\begin{align} \text{Pr}(G=1|X=x) &= \frac{\exp(\beta_0+\beta^Tx)}{1+\exp(\beta_0+\beta^Tx)},\\ \text{Pr}(G=2|X=x) &= \frac{1}{1+\exp(\beta_0+\beta^Tx)},\\ \end{align}where the monotone transformation is the logit transformation
$$ \log\frac{p}{1-p}, $$and in fact we see that
\begin{equation} \log\frac{\text{Pr}(G=1|X=x)}{\text{Pr}(G=2|X=x)} = \beta_0 + \beta^Tx. \end{equation}The decision boundary is the set of points for which the log-odds are zero, and this is a hyperplane defined by
$$ \left\lbrace x: \beta_0+\beta^Tx = 0 \right\rbrace. $$We will discuss two very popular but different methods that result in linear log-odds or logits: Linear discriminant analysis and linear logistic regression.
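To make this concrete, here is a tiny numeric sketch (with made-up coefficients $\beta_0$ and $\beta$, not estimated from any data) checking that the two posteriors above sum to one and that their log-odds equal the linear function $\beta_0+\beta^Tx$:
import numpy as np
beta0, beta = -0.5, np.array([1.2, -0.8])   # assumed illustrative coefficients
x = np.array([0.3, 1.0])                    # an arbitrary input point
eta = beta0 + beta @ x                      # linear predictor
p1 = np.exp(eta) / (1 + np.exp(eta))        # Pr(G=1 | X=x)
p2 = 1 / (1 + np.exp(eta))                  # Pr(G=2 | X=x)
print(p1 + p2)                              # 1.0
print(np.log(p1 / p2), eta)                 # identical: the log-odds are linear in x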
One could also classify discrete outcomes without using log-likelihood functions; that is, we can explicitly model the boundaries between the classes as linear.
We will look at two methods that explicitly look for "separating hyperplanes".
- The well-known perceptron model of Rosenblatt (1958), with an algorithm that finds a separating hyperplane in the training data, if one exists.
- Vapnik (1996) finds an optimally separating hyperplane if one exists, else finds a hyperplane that minimizes some measure of overlap in the training data.
When a separating hyperplane can be found, we say the classes are linearly separable; when they are not, we need more flexible methods such as neural networks to classify them.
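As a small preview of the separating-hyperplane idea, here is a minimal sketch of Rosenblatt's perceptron update rule on linearly separable two-class data (my own toy example, generated with make_blobs):
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=[[-2, -2], [2, 2]],
                  cluster_std=0.5, random_state=0)
y = np.where(y == 0, -1, 1)                    # recode labels as -1 / +1
Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1 for the intercept
w = np.zeros(Xb.shape[1])
for _ in range(100):                           # passes over the training data
    mistakes = 0
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:                 # misclassified point
            w += yi * xi                       # perceptron update
            mistakes += 1
    if mistakes == 0:                          # a separating hyperplane was found
        break
print(w, mistakes)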
import math
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
%config InlineBackend.figure_format = 'svg'
%%html
<style>
.output_svg div{
width: 80% !important;
height: 100% !important;
}
</style>
4.2. Linear Regression of an Indicator Matrix¶
Here each of the response categories is coded via an indicator variable. For example,
$$Y_3 = [0, 0, 1, 0, 0]$$assuming we have 5 classes. This is also called one-hot encoding.
Thus if $\mathcal{G}$ has $K$ classes, there will be $K$ such indicators $Y_k$, $k=1,\cdots,K$, with
$$ Y_k = 1 \text{ if } G = k \text{ else } 0. $$These are collected together in a vector $Y=(Y_1,\cdots,Y_K)$, and the $N$ training instances of these form an $N\times K$ indicator response matrix $\mathbf{Y}$, which is a matrix of $0$'s and $1$'s, with each row having a single $1$.
For example,
$$ Y = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ & & \vdots & & \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} $$We fit a linear regression model to each of the columns of $\mathbf{Y}$ simultaneously, and the fit is given by
\begin{equation} \hat{\mathbf{Y}} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y} = \mathbf{X}\hat{\mathbf{B}}. \end{equation}Note that we have a coefficient vector for each response column $\mathbf{y}_k$, and hence a $(p+1)\times K$ coefficient matrix $\hat{\mathbf{B}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}$. Here $\mathbf{X}$ is the model matrix with $p+1$ columns, including a leading column of $1$'s for the intercept.
A new observation with input $x$ is classified as follows:
- Compute the fitted output $\hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}}$, a $K$ vector.
- Identify the largest component and classify accordingly: $\hat{G}(x) = \arg\max_{k\in\mathcal{G}} \hat{f}_k(x)$.
Masked class with the regression approach¶
There is a serious problem with the regression approach when the number of classes $K\ge 3$, especially prevalent when $K$ is large. Because of the rigid nature of the regression model, classes can be masked by others. FIGURE 4.2 illustrates an extreme situation when $K=3$. The three classes are perfectly separated by linear decision boundaries, yet linear regression misses the middle class completely.
def generate_data(sample_size, feature_size, cluster_means, cluster_cov):
"""
sample_size: number of observations per class (n) \\
feature_size: number of features (m) \\
the number of classes (clusters) is fixed at 3
"""
# generate values for features x1 and x2
xx = np.random.multivariate_normal(
cluster_means, cluster_cov, size=(sample_size, feature_size)
).flatten(order='F').reshape(-1, feature_size)
# constant values
const = np.ones((sample_size*3, 1))
# stack all values
xmat = np.hstack(
[const, xx]
)
# generate labels
nplabel = np.repeat(
['c1', 'c2', 'c3'], sample_size).reshape(-1, 1)
column_names = ['const']
for i in range(feature_size):
temp = 'x'+str(i+1)
column_names.append(temp)
sdata = pd.DataFrame(
xmat,
columns=column_names
)
sdata['class'] = nplabel
ymat = pd.get_dummies(sdata['class'])
return sdata, xmat, ymat
# generate three clusters
np.random.seed(789)
sample_mean = [-4, 0, 4]
sample_cov = np.eye(3)
sdata, xmat, ymat = generate_data(300, 2, sample_mean, sample_cov)
sdata.head()
const | x1 | x2 | class | |
---|---|---|---|---|
0 | 1.0 | -5.108111 | -3.932767 | c1 |
1 | 1.0 | -4.425128 | -2.733287 | c1 |
2 | 1.0 | -2.815007 | -3.383044 | c1 |
3 | 1.0 | -1.927488 | -3.286930 | c1 |
4 | 1.0 | -2.506928 | -3.792087 | c1 |
# one hot encoding
ymat.head()
c1 | c2 | c3 | |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 0 |
# fit linear regression
beta = np.linalg.solve(xmat.T @ xmat, xmat.T @ ymat)
beta
array([[ 0.33526599, 0.33314592, 0.33158809], [-0.05184917, -0.00616798, 0.05801715], [-0.06709384, 0.00617052, 0.06092332]])
def estimate_class(beta, xmat):
y_est = xmat @ beta
estimated_class = y_est.argmax(axis=1)+1
estimated_class = estimated_class.astype('str')
estimated_class = np.core.defchararray.add(
np.array(['c']*900), estimated_class
)
return y_est, estimated_class
# calculate estimation
y_est, estimated_class = estimate_class(beta, xmat)
y_est.shape
(900, 3)
So far, we have
- computed the fitted output $\hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}}$, a $K$ vector.
Now, we will:
- Identify the largest component and classify accordingly:
pd.Series(y_est.argmax(axis=1)).value_counts()
0 448 2 441 1 11 dtype: int64
Notice that most observations from class 2 are masked as either class 1 or class 3. Now, we will plot the original dataset and the estimated classes.
sdata['est_class'] = estimated_class
sdata.head()
const | x1 | x2 | class | est_class | |
---|---|---|---|---|---|
0 | 1.0 | -5.108111 | -3.932767 | c1 | c1 |
1 | 1.0 | -4.425128 | -2.733287 | c1 | c1 |
2 | 1.0 | -2.815007 | -3.383044 | c1 | c1 |
3 | 1.0 | -1.927488 | -3.286930 | c1 | c1 |
4 | 1.0 | -2.506928 | -3.792087 | c1 | c1 |
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(x='x1', y='x2', hue='class',
data=sdata, ax=axes[0]);
axes[0].set_title("Original Dataset")
sns.scatterplot(x='x1', y='x2', hue='est_class',
data=sdata, ax=axes[1]);
axes[1].set_title("Estimated Class");
# add boundary line
xx1 = np.linspace(-7, 4)
y1 = -0.7*xx1-3.5
axes[0].plot(xx1, y1, 'k--');
xx2 = np.linspace(-3, 6)
y2 = -0.7*xx2+3
axes[0].plot(xx2, y2, 'k--');
y3 = -0.7*xx2
axes[1].plot(xx2, y3, 'k--');
The three classes are perfectly separated by linear decision boundaries, yet linear regression misses the middle class completely. Now, we will generate another dataset in which each observation has only one feature $x_1$.
# figure 4.3 another example of masking class
# with one feature
np.random.seed(234)
sdata2, xmat2, ymat2 = generate_data(300, 1, sample_mean, sample_cov)
sdata2.head()
const | x1 | class | |
---|---|---|---|
0 | 1.0 | -3.181208 | c1 |
1 | 1.0 | -3.078422 | c1 |
2 | 1.0 | -4.969733 | c1 |
3 | 1.0 | -2.574784 | c1 |
4 | 1.0 | -5.283554 | c1 |
# linear regression with degree 1
beta_degree1 = np.linalg.solve(xmat2.T @ xmat2, xmat2.T @ ymat2)
beta_degree1
array([[ 3.33554011e-01, 3.33332814e-01, 3.33113175e-01], [-1.12992566e-01, 2.65736676e-04, 1.12726830e-01]])
# fit with linear regression of degree 1 and degree 2
xmat2_sqr = np.hstack([xmat2, xmat2[:, 1:] * xmat2[:, 1:]])
beta_degree2 = np.linalg.solve(xmat2_sqr.T @ xmat2_sqr, xmat2_sqr.T @ ymat2)
beta_degree2
array([[ 0.12511596, 0.75642206, 0.11846198], [-0.11246881, -0.0007974 , 0.1132662 ], [ 0.01739609, -0.03531073, 0.01791463]])
y_est1, estimated1 = estimate_class(beta_degree1, xmat2)
sdata2['est_degree1'] = estimated1
y_est2, estimated2 = estimate_class(beta_degree2, xmat2_sqr)
sdata2['est_degree2'] = estimated2
sdata2.head()
const | x1 | class | est_degree1 | est_degree2 | |
---|---|---|---|---|---|
0 | 1.0 | -3.181208 | c1 | c1 | c1 |
1 | 1.0 | -3.078422 | c1 | c1 | c1 |
2 | 1.0 | -4.969733 | c1 | c1 | c1 |
3 | 1.0 | -2.574784 | c1 | c1 | c1 |
4 | 1.0 | -5.283554 | c1 | c1 | c1 |
pd.Series(y_est1.argmax(axis=1)).value_counts()
0 453 2 447 dtype: int64
pd.Series(y_est2.argmax(axis=1)).value_counts()
1 333 2 285 0 282 dtype: int64
We can see that the quadratic regression solves the masking issue.
# plot the results
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(
x='x1', y=-0.5, hue='class', data=sdata2, ax=axes[0]
)
axes[0].plot(xmat2[:, 1], y_est1[:, 0],
color='#2678B2', linestyle='dashed')
axes[0].plot(xmat2[:, 1], y_est1[:, 1],
color='#FD7F28', linestyle='dashed')
axes[0].plot(xmat2[:, 1], y_est1[:, 2],
color='#339F34', linestyle='dashed')
axes[0].set_title('Degree = 1; Error = 0.33')
sns.scatterplot(
x='x1', y=-1, hue='class', data=sdata2, ax=axes[1]
)
axes[1].scatter(xmat2[:, 1], y_est2[:, 0],
color='#2678B2', s=2)
axes[1].scatter(xmat2[:, 1], y_est2[:, 1],
color='#FD7F28', s=2)
axes[1].scatter(xmat2[:, 1], y_est2[:, 2],
color='#339F34', s=2)
axes[1].set_title('Degree = 2; Error = 0.04');
# get predicted class
sdata2['y_pred1'] = np.max(y_est1, axis=1)
sdata2['y_pred2'] = np.max(y_est2, axis=1)
sdata2.head()
const | x1 | class | est_degree1 | est_degree2 | y_pred1 | y_pred2 | |
---|---|---|---|---|---|---|---|
0 | 1.0 | -3.181208 | c1 | c1 | c1 | 0.693007 | 0.658953 |
1 | 1.0 | -3.078422 | c1 | c1 | c1 | 0.681393 | 0.636200 |
2 | 1.0 | -4.969733 | c1 | c1 | c1 | 0.895097 | 1.113709 |
3 | 1.0 | -2.574784 | c1 | c1 | c1 | 0.624486 | 0.530027 |
4 | 1.0 | -5.283554 | c1 | c1 | c1 | 0.930556 | 1.204979 |
# now we plot the results again
# plot the results
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(
x='x1', y=-0.5, hue='class', data=sdata2,
palette=['#2678B2', '#FD7F28', '#339F34'],
ax=axes[0]
)
sns.scatterplot(
x='x1', y='y_pred1', hue='est_degree1', data=sdata2,
legend=False,
palette=['#2678B2', '#339F34'],
ax=axes[0]
)
axes[0].set_title('Degree = 1; Error = 0.33')
axes[0].annotate('class 2 was masked completely',
(-4, -0.2));
sns.scatterplot(
x='x1', y=-1, hue='class', data=sdata2, ax=axes[1]
)
sns.scatterplot(
x='x1', y='y_pred2', hue='est_degree2', data=sdata2,
legend=False,
ax=axes[1]
)
axes[1].annotate('class 2 was not masked',
(-3.5, 1));
axes[1].set_title('Degree = 2; Error = 0.04');
For this simple example a quadratic rather than linear fit would solve the problem. However, if there were 4 classes, a quadratic would not come down fast enough, and a cubic would be needed as well. A loose but general rule is that if $K\ge 3$ classes are lined up, polynomial terms up to degree $K-1$ might be needed to resolve them.
Note also that these are polynomials along the derived direction passing through the centroids, which can have arbitrary orientation. So in $p$-dimensional input space, one would need general polynomial terms and cross-products of total degree $K-1$, $O(p^{K-1})$ terms in all, to resolve such worst-case scenarios.
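To illustrate the degree-$(K-1)$ rule, the following sketch (my own simulation, not from the book) lines up $K=4$ classes along a single feature and compares indicator regressions with polynomial terms up to degree 2 and degree 3:
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
means = [-6, -2, 2, 6]                        # four classes lined up on one feature
x = np.concatenate([rng.normal(m, 1.0, 300) for m in means])
labels = np.repeat(['c1', 'c2', 'c3', 'c4'], 300)
Y = pd.get_dummies(labels).values.astype(float)   # N x 4 indicator matrix
def fit_and_count(degree):
    # design matrix with columns 1, x, ..., x^degree
    X = np.vander(x, degree + 1, increasing=True)
    B = np.linalg.solve(X.T @ X, X.T @ Y)
    picked = (X @ B).argmax(axis=1)
    return pd.Series(picked).value_counts().sort_index()
print(fit_and_count(2))   # degree 2: the middle classes are expected to be largely masked
print(fit_and_count(3))   # degree 3 (= K-1): all four classes are expected to appear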
The example is extreme, but for large $K$ and small $p$ such maskings naturally occur. As a more realistic illustration, FIGURE 4.4 is a projection of the training data for a vowel recognition problem onto an informative two-dimensional subspace.
# figure 4.4
vowel = pd.read_csv('./data/vowel/vowel.train', index_col=0)
vowel.head()
y | x.1 | x.2 | x.3 | x.4 | x.5 | x.6 | x.7 | x.8 | x.9 | x.10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
row.names | |||||||||||
1 | 1 | -3.639 | 0.418 | -0.670 | 1.779 | -0.168 | 1.627 | -0.388 | 0.529 | -0.874 | -0.814 |
2 | 2 | -3.327 | 0.496 | -0.694 | 1.365 | -0.265 | 1.933 | -0.363 | 0.510 | -0.621 | -0.488 |
3 | 3 | -2.120 | 0.894 | -1.576 | 0.147 | -0.707 | 1.559 | -0.579 | 0.676 | -0.809 | -0.049 |
4 | 4 | -2.287 | 1.809 | -1.498 | 1.012 | -1.053 | 1.060 | -0.567 | 0.235 | -0.091 | -0.795 |
5 | 5 | -2.598 | 1.938 | -0.846 | 1.062 | -1.633 | 0.764 | 0.394 | -0.150 | 0.277 | -0.396 |
df_y = vowel['y']
df_x2d = vowel[['x.1', 'x.2']]
grouped = df_x2d.groupby(df_y)
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
for y, x in grouped:
x_mean = x.mean() # mean of (x1 and x2) for each group
color = next(ax._get_lines.prop_cycler)['color']
ax.scatter(x['x.1'], x['x.2'],
edgecolor=color, facecolors='none')
ax.plot(x_mean[0], x_mean[1], 'o', color=color, markersize=10,
markeredgecolor='black', markeredgewidth=3)
ax.set_xlabel('Coordinate 1 for Training Data')
ax.set_ylabel('Coordinate 2 for Training Data')
ax.set_title('Linear Discriminant Analysis');
There are K = 11 classes in p = 10 dimensions. This is a difficult classification problem, and the best methods achieve around 40% errors on the test data. The main point here is summarized in Table 4.1; linear regression has an error rate of 67%, while a close relative, linear discriminant analysis, has an error rate of 56%. It seems that masking has hurt in this case. While all the other methods in this chapter are based on linear functions of x as well, they use them in such a way that avoids this masking problem.
Technique | Training Error | Test Error |
---|---|---|
Linear regression | 0.48 | 0.67 |
Linear discriminant analysis | 0.32 | 0.56 |
Quadratic discriminant analysis | 0.01 | 0.53 |
Logistic regression | 0.22 | 0.51 |
To summarize, there are at least two reasons not to perform classification using a regression method:
- (a) a regression method cannot accommodate a qualitative response with more than two classes;
- (b) a regression method will not provide meaningful estimates of $Pr(Y|X)$, even with just two classes (see the short check below).
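As a quick check of point (b), the sketch below (simulated two-class data of my own) regresses a $0/1$ class indicator on a single feature by least squares and shows that the fitted values are not confined to $[0, 1]$, so they cannot be read as probabilities:
import numpy as np
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
y = np.repeat([0.0, 1.0], 100)               # 0/1 indicator of the second class
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta
print(fitted.min(), fitted.max())            # typically below 0 and above 1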
4.3. Linear Discriminant Analysis¶
Decision theory for classification tells us that we need to know the class posterior probabilities $\text{Pr}(G|X)$ for optimal classification. Suppose
- $f_k(x)$ is the class-conditional density of $X$ in class $G=k$,
- $\pi_k$ is the prior probability of class $k$, with $\sum\pi_k=1$.
A simple application of Bayes theorem gives us
\begin{equation} \text{Pr}(G=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^K f_l(x)\pi_l}. \end{equation}We see that in terms of ability to classify, it is enough to have the $f_k(x)$. Many techniques are based on models for the class densities:
- linear and quadratic discriminant analysis use Gaussian densities;
- more flexible mixtures of Gaussians allow for nonlinear decision boundaries ($\S$ 6.8);
- general nonparametric density estimates for each class density allow the most flexibility ($\S$ 6.6.2);
- Naive Bayes models are a variant of the previous case, and assume that the inputs are conditionally independent in each class; i.e., each of the class densities are products of marginal densities ($\S$ 6.6.3).
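To make the Bayes formula above concrete before specializing to Gaussian class densities, here is a small numeric sketch with two hypothetical univariate normal densities $f_k$ and made-up priors $\pi_k$:
import numpy as np
from scipy.stats import norm
priors = np.array([0.3, 0.7])                              # pi_k, summing to one
densities = [norm(loc=-1, scale=1), norm(loc=2, scale=1)]  # assumed class densities f_k
def posterior(x):
    fx = np.array([d.pdf(x) for d in densities])
    return fx * priors / np.sum(fx * priors)               # Pr(G=k | X=x)
print(posterior(0.0))   # the two posteriors at x = 0; they sum to one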
Suppose that we model each class density as multivariate Gaussian
\begin{equation} f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left\lbrace -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) \right\rbrace \end{equation}where $p$ is the dimension of the feature space.
Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix $\Sigma_k=\Sigma,\forall k$.
In comparing two classes $k$ and $l$, it is sufficient to look at the log-ratio of their posteriors, which turns out to be linear in $x$:
\begin{aligned} \log\frac{\text{Pr}(G=k|X=x)}{\text{Pr}(G=l|X=x)} &= \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} \\ &= \log\frac{\pi_k}{\pi_l} - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l + x^T\Sigma^{-1}(\mu_k-\mu_l) \\ &= \delta_k(x) - \delta_l(x), \end{aligned}where $\delta_k$ is the linear discriminant function
\begin{equation} \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k. \end{equation}This linear log-odds function implies that the decision boundary between classes $k$ and $l$
\begin{equation} \left\lbrace x: \delta_k(x) - \delta_l(x) = 0 \right\rbrace \end{equation}is linear in $x$; in $p$ dimensions a hyperplane. Also the linear discriminant functions are equivalent description of the decision rule, with
\begin{equation} G(x) = \arg\max_k \delta_k(x). \end{equation}
## why the decision boundary between classes k and l is
## delta_k(x) - delta_l(x) = 0:
## if the posterior pk > pl, then classify it as k
np.random.seed(889)
pk = np.random.uniform(0, 1, 100)
pl = np.random.uniform(0, 1, 100)
odds_ratio = pk/pl
cls_kl = np.where(odds_ratio > 1, 'class-k', 'class-l')
boundary_data = pd.DataFrame(
[pk, pl, odds_ratio, cls_kl]
)
boundary_data = boundary_data.transpose()
boundary_data.columns = ['pk', 'pl', 'odds_ratio', 'classification']
boundary_data.head()
pk | pl | odds_ratio | classification | |
---|---|---|---|---|
0 | 0.428559 | 0.444959 | 0.963144 | class-l |
1 | 0.807978 | 0.875746 | 0.922616 | class-l |
2 | 0.749596 | 0.364627 | 2.055786 | class-k |
3 | 0.405277 | 0.922467 | 0.43934 | class-l |
4 | 0.274735 | 0.40617 | 0.676402 | class-l |
fig, axes = plt.subplots(1, 1, figsize=(6, 4))
sns.scatterplot(
x='pl', y='pk', hue='classification',
data=boundary_data, ax=axes
)
xx = np.linspace(0, 1, 100)
axes.plot(xx, xx, 'k--');
The figure above shows why we set the decision boundary between classes $k$ and $l$ to be
\begin{equation} \left\lbrace x: \delta_k(x) - \delta_l(x) = 0 \right\rbrace \end{equation}since the boundary corresponds to an odds ratio of one and $\log 1 = 0$.
Note
For LDA, Bayesian estimation and maximum likelihood estimation are equivalent.
That's why we have the following decision rule too:
\begin{equation} G(x) = \arg\max_k \delta_k(x). \end{equation}
# figure 4.5, a simulated example
np.random.seed(666)
sample_size = 30
sample_mean = [-0.5, 0, 0.5]
# generate a positive semidefinite matrix
rand_mat = np.random.rand(3, 3)
sample_cov = rand_mat.T @ rand_mat / 10
sdata3, xmat3, ymat3 = generate_data(
sample_size, 2, sample_mean, sample_cov
)
# now we will shift x2 down
sdata3['x2roll'] = np.roll(sdata3['x2'], 30)
sdata3.head()
const | x1 | x2 | class | x2roll | |
---|---|---|---|---|---|
0 | 1.0 | -0.071273 | -0.182977 | c1 | 0.466982 |
1 | 1.0 | -0.670211 | -0.320429 | c1 | 0.441147 |
2 | 1.0 | -0.800962 | -0.615667 | c1 | 0.274339 |
3 | 1.0 | -0.691924 | -0.543417 | c1 | 0.713403 |
4 | 1.0 | -0.598135 | -0.498490 | c1 | 0.492381 |
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.kdeplot(
x='x1', y='x2', hue='class', alpha=0.5,
data=sdata3, ax=axes[0]
)
axes[0].set_title('Generated Data without rolling x2')
sns.move_legend(axes[0], 'upper left') # move legend
sns.kdeplot(
x='x1', y='x2roll', hue='class', alpha=0.5,
data=sdata3, ax=axes[1]
)
sns.scatterplot(
x='x1', y='x2roll', hue='class', data=sdata3, ax=axes[1]
)
axes[1].set_title('Generated Data with rolling x2');
Now, we will estimate the class via the linear discriminant functions:
$$ \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k. $$In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them using our training data:
- $\hat\pi_k = N_k/N$,
- $\hat\mu_k = \sum_{g_i = k} x_i/N_k$;
- $\hat\Sigma = \sum_{k=1}^K \sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T/(N-K)$.
## choose 80% of sdata3 as the training dataset
training_data = sdata3.sample(frac=0.8)
# estimate prior, mean and variance
prior_hat = (training_data.groupby(
['class']
).count()/training_data.shape[0])['const']
prior_hat
class c1 0.361111 c2 0.305556 c3 0.333333 Name: const, dtype: float64
# estimate mean
mean_hat = training_data.groupby(
['class']
).mean()[['x1', 'x2roll']]
mean_hat
x1 | x2roll | |
---|---|---|
class | ||
c1 | -0.542296 | 0.518047 |
c2 | 0.020427 | -0.556884 |
c3 | 0.507638 | 0.013990 |
# calculate the covariance for each group
cov_each_group = training_data.groupby(
['class']
)[['x1', 'x2roll']].cov()
cov_each_group
class | | x1 | x2roll |
---|---|---|---|
c1 | x1 | 0.110910 | -0.009916 |
 | x2roll | -0.009916 | 0.042135 |
c2 | x1 | 0.147984 | -0.025898 |
 | x2roll | -0.025898 | 0.101976 |
c3 | x1 | 0.032246 | -0.016600 |
 | x2roll | -0.016600 | 0.209065 |
# calculate the average covariance
# note: the textbook pooled estimate weights each class covariance by (n_k - 1)
# before dividing by (N - K); the simple sum below rescales cov_hat, which
# shrinks the discriminant scores but (with nearly equal priors) barely
# affects which class attains the maximum
cov_hat = (cov_each_group.loc['c1'] + cov_each_group.loc['c2'] +
cov_each_group.loc['c3'])/(training_data.shape[0]-3)
cov_hat
x1 | x2roll | |
---|---|---|
x1 | 0.004219 | -0.000760 |
x2roll | -0.000760 | 0.005118 |
With the following formula:
$$ \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k. $$
np.linalg.inv(cov_hat)
array([[243.50634645, 36.13821796], [ 36.13821796, 200.73295574]])
prior_hat
class c1 0.361111 c2 0.305556 c3 0.333333 Name: const, dtype: float64
# calculate linear discriminant scores
# x.shape = 1x2, covariance is 2x2 mean_hat should be 2x1
print(
f"X shape: {xmat3[:, 1:].shape}",
f"Cov shape: {cov_hat.shape}",
f"Mean shape: {mean_hat.shape}",
sep='\n'
)
X shape: (90, 2) Cov shape: (2, 2) Mean shape: (3, 2)
def discriminant_score(x_feature_vector):
"""
In our case, the feature dimension is p=2 and the number of classes is K=3 \\
x_feature_vector shape = 1 x 2 \\
cov_mat = 2 x 2 \\
mean_hat shape = 2 x 1
"""
cov_inv = np.linalg.inv(cov_hat)
mean_c1 = mean_hat.loc['c1'].values.reshape(2, 1)
mean_c2 = mean_hat.loc['c2'].values.reshape(2, 1)
mean_c3 = mean_hat.loc['c3'].values.reshape(2, 1)
c1 = (
x_feature_vector @ cov_inv @ mean_c1 -
1/2 * mean_c1.T @ cov_inv @ mean_c1 +
np.log(prior_hat['c1'])
)
c2 = (
x_feature_vector @ cov_inv @ mean_c2 -
1/2 * mean_c2.T @ cov_inv @ mean_c2 +
np.log(prior_hat['c2'])
)
c3 = (
x_feature_vector @ cov_inv @ mean_c3 -
1/2 * mean_c3.T @ cov_inv @ mean_c3 +
np.log(prior_hat['c3'])
)
return [c1[0], c2[0], c3[0]]
# extract features
xmat_features = training_data[['x1', 'x2roll']]
xmat_features.head()
x1 | x2roll | |
---|---|---|
54 | 0.020254 | 0.020241 |
88 | 0.421421 | 0.381563 |
51 | 0.571343 | -0.797902 |
73 | 0.289732 | 0.936149 |
21 | -0.080530 | 0.400724 |
# test the discriminant function
discriminant_score(
xmat_features.iloc[1, :].values.reshape(1, 2)
) # indeed, it gives the highest score for c3, the true class of this observation
[array([-69.16687308]), array([-80.70709791]), array([27.62736642])]
Now, we will classify our training dataset and calculate the accuracy. A held-out test dataset could then be used to evaluate the linear discriminant classifier in the same way.
# estimate the class
y_est = np.apply_along_axis(discriminant_score, 1, xmat_features)
estimated_class = y_est.argmax(axis=1)+1
estimated_class = estimated_class.astype('str')
estimated_class = np.core.defchararray.add(
np.array(['c']), estimated_class
)
estimated_class
training_data['estimated_class'] = estimated_class
training_data.head()
const | x1 | x2 | class | x2roll | estimated_class | |
---|---|---|---|---|---|---|
54 | 1.0 | 0.020254 | 0.475237 | c2 | 0.020241 | c3 |
88 | 1.0 | 0.421421 | 0.857843 | c3 | 0.381563 | c3 |
51 | 1.0 | 0.571343 | -0.334370 | c2 | -0.797902 | c2 |
73 | 1.0 | 0.289732 | 0.648200 | c3 | 0.936149 | c3 |
21 | 1.0 | -0.080530 | -0.797902 | c1 | 0.400724 | c1 |
# calculate accuracy
sum(
training_data['class'] == training_data['estimated_class']
) / training_data.shape[0]
0.875
# now we will plot the simulated dataset and estimated results
fig, axes = plt.subplots(1, 1, figsize=(6, 4))
sns.kdeplot(
x='x1', y='x2roll', hue='estimated_class', alpha=0.5,
levels=5,
data=training_data, ax=axes
)
sns.scatterplot(
x='x1', y='x2roll', hue='class',
data=training_data, ax=axes
)
sns.move_legend(axes, 'lower left')
axes.set_title('Simulated data with estimated contours');
training_data.head()
const | x1 | x2 | class | x2roll | estimated_class | |
---|---|---|---|---|---|---|
54 | 1.0 | 0.020254 | 0.475237 | c2 | 0.020241 | c3 |
88 | 1.0 | 0.421421 | 0.857843 | c3 | 0.381563 | c3 |
51 | 1.0 | 0.571343 | -0.334370 | c2 | -0.797902 | c2 |
73 | 1.0 | 0.289732 | 0.648200 | c3 | 0.936149 | c3 |
21 | 1.0 | -0.080530 | -0.797902 | c1 | 0.400724 | c1 |
Note
LDA is equivalent to linear regression with two classes (K=2). However, when K ≥ 3, LDA is not the same as linear regression of the class indicator matrix, and it avoids the masking problems.
Quadratic discriminant analysis¶
When we assume the covariance is different for each class, then based on the following equation
$$ f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left\lbrace -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) \right\rbrace $$the convenient cancellations do not occur. We then get quadratic discriminant functions (QDA),
\begin{equation} \delta_k(x) = -\frac{1}{2}\log|\Sigma_k| -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k \end{equation}The decision boundary between each pair of classes $k$ and $l$ is described by a quadratic equation $\left\lbrace x: \delta_k(x) = \delta_l(x) \right\rbrace$.
The estimates for QDA are similar to those for LDA, except that separate covariance matrices must be estimated for each class. When p is large this can mean a dramatic increase in parameters. Since the decision boundaries are functions of the parameters of the densities, counting the number of parameters must be done with care. For LDA, it seems there are $(K − 1) × (p + 1)$ parameters. Likewise for QDA there will be $(K-1) \times \{p(p+3)/2+1\}$ parameters. Both LDA and QDA perform well on an amazingly large and diverse set of classification tasks.
For a problem with $K$ classes, we only need $K-1$ discriminant functions, obtained by arbitrarily choosing one class to be the base class. Then, the total number of estimated parameters for LDA is $(K-1)(p+1)$. However, for QDA, for each class $k$ we need:
- number of means to be estimated: $p$
- number of covariance parameters to be estimated: $p(p+1)/2$
- number of class priors to be estimated: $1$
The total number of estimated parameters is $(K-1) \times \{p(p+3)/2+1\}$
Therefore, the number of parameters estimated in LDA increases linearly with p while that of QDA increases quadratically with p. We would expect QDA to have worse performance than LDA when the dimension p is large.
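A quick way to see this growth is to count the boundary parameters directly with the two formulas above (small helper functions of my own):
def lda_n_params(K, p):
    return (K - 1) * (p + 1)
def qda_n_params(K, p):
    return (K - 1) * (p * (p + 3) // 2 + 1)
# e.g. the vowel data: K = 11 classes, p = 10 features
print(lda_n_params(11, 10), qda_n_params(11, 10))   # 110 vs 660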
Both techniques are widely used, and entire books are devoted to LDA. It seems that whatever exotic tools are the rage of the day, we should always have available these two simple tools. The question arises why LDA and QDA have such a good track record. The reason is not likely to be that the data are approximately Gaussian, and in addition for LDA that the covariances are approximately equal. More likely a reason is that the data can only support simple decision boundaries such as linear or quadratic, and the estimates provided via the Gaussian models are stable. This is a bias variance tradeoff—we can put up with the bias of a linear decision boundary because it can be estimated with much lower variance than more exotic alternatives.
Regularized Discriminant Analysis¶
We can find a compromise between LDA and QDA by regularizing the individual class covariance matrices. Regularization means that we put a certain restriction on the estimated parameters. In this case, we require that individual covariance matrix shrinks toward a common pooled covariance matrix through a penalty parameter $\alpha$:
$$\hat{\Sigma}_k (\alpha) = \alpha \hat{\Sigma}_k + (1-\alpha) \hat{\Sigma}$$where $\hat{\Sigma}$ is the pooled covariance matrix as used in LDA. The pooled covariance matrix can also be regularized toward an identity matrix through a penalty parameter $\gamma$
$$\hat{\Sigma} (\gamma) = \gamma \hat{\Sigma} + (1-\gamma) \sigma^2 I$$In situations where the number of input variables greatly exceeds the number of samples ($p>N$), the covariance matrix can be poorly estimated. Shrinkage can hopefully improve estimation and classification accuracy.
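As a standalone illustration of these two shrinkage formulas (a helper of my own, not used by the class implemented below), the sketch forms $\hat{\Sigma}_k(\alpha)$ from a class covariance, a pooled covariance, and the additional $\gamma$ shrinkage toward a scaled identity:
import numpy as np
def regularized_cov(cov_k, cov_pooled, alpha=0.5, gamma=1.0, sigma2=1.0):
    p = cov_pooled.shape[0]
    # shrink the pooled covariance toward sigma2 * I
    cov_pooled_reg = gamma * cov_pooled + (1 - gamma) * sigma2 * np.eye(p)
    # shrink the class covariance toward the (regularized) pooled covariance
    return alpha * cov_k + (1 - alpha) * cov_pooled_reg
# alpha = 1 keeps the class covariance (QDA); alpha = 0 gives the pooled one (LDA)
cov_k = np.array([[2.0, 0.3], [0.3, 1.0]])
cov_pooled = np.array([[1.5, 0.1], [0.1, 1.2]])
print(regularized_cov(cov_k, cov_pooled, alpha=0.25, gamma=0.8))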
class DiscriminantAnalysis():
"""
Class for implementing Regularized Discriminant Analysis.
LDA is performed when alpha=0
QDA is performed when alpha=1
Linear, Quadratic and Regularized discriminant analysis.
Regularized discriminant analysis is a compromise between
linear discriminant analysis and quadratic discriminant analysis.
If you wish to add the constraint that the covariance matrix be
diagonal (independent features), use Naive Bayes instead.
Reference: https://github.com/christopherjenness/ML-lib
"""
def __init__(self, alpha=1.0) -> None:
self.is_model_fitted = False
self.alpha = alpha
self.class_categories = []
self.class_priors = {}
self.class_means = {}
self.regularized_covariances = {}
def fit(self, X, y):
"""
X: N by p matrix
y: N by 1 matrix
"""
# get unique classes
self.class_categories = np.unique(y)
# initialize the covariance
class_k_covs = {}
pooled_covs = 0
# estimate parameters: mean, covariance matrix and priors
for k in self.class_categories:
class_k_idx = np.where(y==k)[0]
class_k_features = X[class_k_idx, :]
self.class_priors[k] = float(len(class_k_idx)) / y.shape[0]
self.class_means[k] = np.mean(class_k_features, axis=0)
# each column as a variable
class_k_covs[k] = np.cov(class_k_features, rowvar=False)
# calculate pooled covariance
# alternative formula: pooled_covs += class_k_covs[k]
# pooled_covs = pooled_covs / (N-K)
pooled_covs += class_k_covs[k] * self.class_priors[k]
# calculate regularized covariance matrices for each class
# when alpha = 1, it is QDA and the pooled covariance is not used;
# when alpha = 0, it reduces to LDA
for k in self.class_categories:
self.regularized_covariances[k] = (
self.alpha * class_k_covs[k] +
(1-self.alpha) * pooled_covs
)
self.is_model_fitted = True
def predict(self, X):
"""
X: sample size by p matrix [sample_size, p]
return: the predicted class [sample_size, 1] matrix
"""
y_est = np.apply_along_axis(
self.__classify, 1, X
)
return y_est
def __classify(self, x):
"""
Private method
x: feature vector for one observation [1, p] dimension
Returns: classified category
"""
if not self.is_model_fitted:
raise NameError('Please fit the model first')
# calculate the determinant score
classified_scores = {}
for k in self.class_categories:
mean_deviation = x-self.class_means[k]
# pinv is preferred because floating-point error can make the matrix (near-)singular
cov_inv = np.linalg.pinv(self.regularized_covariances[k])
# use the formula in the equation (1)
# for score1, we do not use np.log() as it might have 0s
score1 = -0.5 * np.linalg.det(
self.regularized_covariances[k]
)
score2 = -0.5 * mean_deviation.T @ cov_inv @ mean_deviation
score3 = np.log(self.class_priors[k])
classified_scores[k] = score1 + score2 + score3
# foo = {'a': 1, 'b': 3000, 'c': 0}
# print(max(foo, key=foo.get))
return max(classified_scores, key=classified_scores.get)
# figure 4.7
# read vowel train
vowel = pd.read_csv('./data/vowel/vowel.train', index_col=0)
vowel.head()
y | x.1 | x.2 | x.3 | x.4 | x.5 | x.6 | x.7 | x.8 | x.9 | x.10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
row.names | |||||||||||
1 | 1 | -3.639 | 0.418 | -0.670 | 1.779 | -0.168 | 1.627 | -0.388 | 0.529 | -0.874 | -0.814 |
2 | 2 | -3.327 | 0.496 | -0.694 | 1.365 | -0.265 | 1.933 | -0.363 | 0.510 | -0.621 | -0.488 |
3 | 3 | -2.120 | 0.894 | -1.576 | 0.147 | -0.707 | 1.559 | -0.579 | 0.676 | -0.809 | -0.049 |
4 | 4 | -2.287 | 1.809 | -1.498 | 1.012 | -1.053 | 1.060 | -0.567 | 0.235 | -0.091 | -0.795 |
5 | 5 | -2.598 | 1.938 | -0.846 | 1.062 | -1.633 | 0.764 | 0.394 | -0.150 | 0.277 | -0.396 |
# read vowel test
vowel_test = pd.read_csv('./data/vowel/vowel.test', index_col=0)
vowel_test.head()
y | x.1 | x.2 | x.3 | x.4 | x.5 | x.6 | x.7 | x.8 | x.9 | x.10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
row.names | |||||||||||
1 | 1 | -1.149 | -0.904 | -1.988 | 0.739 | -0.060 | 1.206 | 0.864 | 1.196 | -0.300 | -0.467 |
2 | 2 | -2.613 | -0.092 | -0.540 | 0.484 | 0.389 | 1.741 | 0.198 | 0.257 | -0.375 | -0.604 |
3 | 3 | -2.505 | 0.632 | -0.593 | 0.304 | 0.496 | 0.824 | -0.162 | 0.181 | -0.363 | -0.764 |
4 | 4 | -1.768 | 1.769 | -1.142 | -0.739 | -0.086 | 0.120 | -0.230 | 0.217 | -0.009 | -0.279 |
5 | 5 | -2.671 | 3.155 | -0.514 | 0.133 | -0.964 | 0.234 | -0.071 | 1.192 | 0.254 | -0.471 |
alpha_values = np.linspace(0, 1, 50)
train_mis_rate = []
test_mis_rate = []
# construct x and y
vowel_X = vowel.iloc[:, 1:].values
vowel_y = vowel.iloc[:, 0].values.reshape(-1, 1)
vowel_test_X = vowel_test.iloc[:, 1:].values
for alpha in alpha_values:
vowel_rda = DiscriminantAnalysis(alpha)
vowel_rda.fit(vowel_X, vowel_y)
y_pred = vowel_rda.predict(vowel_X)
# accuracy rate
acc_rate = sum(vowel['y'] == y_pred) / vowel.shape[0]
train_mis_rate.append(1-acc_rate)
y_pred = vowel_rda.predict(vowel_test_X)
acc_rate = sum(vowel_test['y'] == y_pred) / vowel_test.shape[0]
test_mis_rate.append(1-acc_rate)
# figure 4.7
fig, axes = plt.subplots(1, 1, figsize=(6, 4))
axes.scatter(
alpha_values, train_mis_rate,
color='#5BB5E7', s=7, label='Train Data'
)
axes.scatter(
alpha_values, test_mis_rate,
color='#E49E25', s=7, label='Test Data'
)
axes.set_title("Regularized Discriminant Analysis on the Vowel Data")
axes.set_ylabel("Misclassification Rate")
axes.set_xlabel(r"$\alpha$")
axes.legend();
Figure 4.7 shows the results of RDA applied to the vowel data. Both the training and test error improve with increasing $\alpha$, although the test error increases sharply after $\alpha = 0.9$. The large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals, different in the training and test set.
In situations where the number of input variables greatly exceeds the number of samples ($p > N$), the covariance matrix can be poorly estimated. Shrinkage can hopefully improve estimation and classification accuracy. This is illustrated by the figure below.
def generate_data(n_samples, n_features):
"""
Generate random blob-ish data with noisy features.
K = 2, p = n_features
Only one feature contains discriminative information
The other features contain only noise.
"""
X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])
# add non-discriminative features
if n_features > 1:
X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
return X, y
n_train = 20 # samples for training
n_test = 200 # samples for testing
n_simulation = 30 # number of simulation
n_features_max = 75 # maximum number of features
step = 4 # step size for the calculation
acc_clf1, acc_clf2 = [], []
n_features_range = range(1, n_features_max + 1, step)
for n_features in n_features_range:
# run simulation
score_clf1, score_clf2 = 0, 0
for _ in range(n_simulation):
X, y = generate_data(n_train, n_features)
# train the model, with alpha=0.5 and 0 (LDA)
clf1 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=0.5).fit(X, y)
clf2 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X, y)
X, y = generate_data(n_test, n_features)
score_clf1 += clf1.score(X, y)
score_clf2 += clf2.score(X, y)
# calculate average of simulations
acc_clf1.append(score_clf1 / n_simulation)
acc_clf2.append(score_clf2 / n_simulation)
features_samples_ratio = np.array(n_features_range) / n_train
fig, axes = plt.subplots(1, 1, figsize=(6, 4))
axes.plot(features_samples_ratio, acc_clf1, linewidth=1.5,
label="LDA with shrinkage", color='#282A3A')
axes.plot(features_samples_ratio, acc_clf2, linewidth=1.5,
label="LDA", color='#FD7F28')
axes.set_xlabel('n_features / n_samples')
axes.set_ylabel('Classification accuracy')
axes.legend(prop={'size': 9});
Tips
- The above simulation also shows that LDA with shrinkage works well for high-dimensional data as long as one feature gives a clear classification boundary.
- The LDA implementation from sklearn performs much faster than the one we wrote.
Computations for LDA¶
As a lead-in to the next topic, we briefly digress on the computations required for LDA and especially QDA. Their computations are simplified by diagonalizing $\hat{\mathbf{\Sigma}}$ or $\hat{\boldsymbol{\Sigma}}_k$. For the latter, suppose we compute the eigendecomposition for each $\hat{\boldsymbol{\Sigma}}_k=\mathbf{U}_k \mathbf{D}_k \mathbf{U}_k^T$, where $\mathbf{U}_k$ is $p \times p$ orthonormal, and $\mathbf{D}_k$ a diagonal matrix of positive eigenvalues $d_{k \ell}$. Then the ingredients for $\delta_k(x)$ are
- $\left(x-\hat{\mu}_k\right)^T \hat{\boldsymbol{\Sigma}}_k^{-1}\left(x-\hat{\mu}_k\right) = \left[\mathbf{U}_k^T\left(x-\hat{\mu}_k\right)\right]^T \mathbf{D}_k^{-1}\left[\mathbf{U}_k^T\left(x-\hat{\mu}_k\right)\right]$;
- $\log \left|\hat{\boldsymbol{\Sigma}}_k\right|=\sum_{\ell} \log d_{k \ell}$.
In light of the computational steps outlined above, the LDA classifier can be implemented by the following pair of steps (sketched in code after the list):
- Sphere the data with respect to the common covariance estimate $\hat{\mathbf{\Sigma}}$ : $X^* \leftarrow \mathbf{D}^{-\frac{1}{2}} \mathbf{U}^T X$, where $\hat{\boldsymbol{\Sigma}}=\mathbf{U D U}^T$. The common covariance estimate of $X^*$ will now be the identity.
- Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$.
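As a sanity check on this two-step description, here is a small hand-rolled sketch (simulated data and variable names of my own) that spheres the data with the pooled within-class covariance and then classifies to the closest sphered centroid, adjusted by the log prior:
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
K = 3
# class centroids, priors, and pooled within-class covariance Sigma = U D U^T
centroids = np.array([X[y == k].mean(axis=0) for k in range(K)])
priors = np.array([np.mean(y == k) for k in range(K)])
W = sum((X[y == k] - centroids[k]).T @ (X[y == k] - centroids[k]) for k in range(K))
W = W / (X.shape[0] - K)
evals, U = np.linalg.eigh(W)
# step 1: sphere the data (and the centroids), X* = D^(-1/2) U^T x for each row x
sphere = U / np.sqrt(evals)        # equals U @ D^(-1/2)
X_star = X @ sphere
mu_star = centroids @ sphere
# step 2: closest centroid in the sphered space, modulo the log prior
d2 = ((X_star[:, None, :] - mu_star[None, :, :]) ** 2).sum(axis=2)
pred = (0.5 * d2 - np.log(priors)).argmin(axis=1)
print((pred == y).mean())          # training accuracy of this hand-rolled LDA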
Reduced-Rank LDA¶
When $p > K$, we do not have to work in all $p$ feature dimensions: the $K$ class centroids lie in an affine subspace of dimension at most $K-1$, and if $p$ is much larger than $K$, this is a considerable drop in dimension. Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they contribute equally to each class. Thus we might just as well project the $X^*$ onto this centroid-spanning subspace $H_{K-1}$, and make distance comparisons there.
Therefore there is a fundamental dimension reduction in LDA, namely, that we need only consider the data in a subspace of dimension at most $K-1$. If $K=3$, e.g., this could allow us to view the data in $\mathbb{R}^2$, color-coding the classes. In doing so we would not have relinquished any of the information needed for LDA classification.
Before we go through the algorithms, let's examine an example. For a wine classification problem with three different types of wines and 13 input variables, the plot visualizes the data in two discriminant coordinates found by LDA. In this two-dimensional space, the classes can be well-separated. In comparison, the classes are not as clearly separated using the first two principal components found by PCA.
from sklearn import datasets # import the dataset
from sklearn.decomposition import PCA
wine = datasets.load_wine()
X = wine.data
y = wine.target
target_names = wine.target_names
X_r_lda = LinearDiscriminantAnalysis(
n_components=2
).fit(X, y).transform(X)
X_r_pca = PCA(n_components=2).fit(X).transform(X)
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for i, target_name in zip([0, 1, 2], target_names):
sns.scatterplot(
x=X_r_lda[y == i, 0], y=X_r_lda[y == i, 1],
label=target_name,
alpha=0.7, ax=axes[0]
)
sns.scatterplot(
x=X_r_pca[y == i, 0], y=X_r_pca[y == i, 1],
label=target_name,
alpha=0.7, ax=axes[1]
)
sns.move_legend(axes[0], 'lower right')
axes[0].set_title('LDA for Wine dataset')
axes[1].set_title('PCA for Wine dataset')
axes[0].set_xlabel('Discriminant Coordinate 1')
axes[0].set_ylabel('Discriminant Coordinate 2')
axes[1].set_xlabel('PC 1')
axes[1].set_ylabel('PC 2');
Tips
We have observed from the previous section that LDA makes distance comparison in the space spanned by different class means. Two distinct points lie on a 1d line; three distinct points lie on a 2d plane.
For the data with $K$ classes, all means lie on a hyperplane with dimension at most $(K-1)$. In particular, the subspace spanned by the means is
$$H_{K-1} = \mu_1 \oplus \text{span}\{\mu_i - \mu_1, 2 \leq i \leq K\}.$$When making distance comparisons in this space, distances orthogonal to this subspace add no information since they contribute equally for each class. Hence, restricting distance comparisons to this subspace loses no information useful for LDA classification.
That means, we can safely transform our task from a $p$-dimensional problem to a $(K-1)$-dimensional problem by an orthogonal projection of the data onto this subspace.
How do we find the optimal subspace? Fisher defined optimal to mean that the projected centroids are spread out as much as possible in terms of variance, so that the subspace uses the maximum information we have (a short numeric sketch follows the list below):
- A large variance among the dataset classes (spread out)
- A small variance within each of the dataset classes (cluster within class)
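To make this criterion concrete, the short sketch below (my own, on simulated two-class data) computes a between-class scatter $B$ and a within-class covariance $W$, and compares the ratio $a^TBa / a^TWa$ for the Fisher direction $W^{-1}(\hat\mu_2-\hat\mu_1)$ against an uninformative direction:
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=200, centers=[[-2, 0], [2, 0]], random_state=0)
mu = X.mean(axis=0)
classes = np.unique(y)
B = sum(np.outer(X[y == k].mean(axis=0) - mu, X[y == k].mean(axis=0) - mu)
        for k in classes)
W = sum(np.cov(X[y == k], rowvar=False) for k in classes) / len(classes)
def fisher_ratio(a):
    return (a @ B @ a) / (a @ W @ a)
a_fisher = np.linalg.solve(W, X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
a_other = np.array([0.0, 1.0])     # a direction along which the class means barely differ
print(fisher_ratio(a_fisher), fisher_ratio(a_other))   # the Fisher direction should score far higher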
In summary then, finding the sequences of optimal subspaces for LDA involves the following steps ($\hat{\mu}_k$ is a column vector denoting the mean vector of class $k$):
- compute the $K \times p$ matrix of class centroids $M$ and the common (or pooled) covariance matrix $W$ (for within-class covariance);
- compute $M^* = MW^{-1/2}$ using the eigen-decomposition of $W$
- compute $B^*$, the covariance matrix of $M^*$ (B for between-class covariance)
where
$$\hat{\mu}^* = \sum_{k=1}^K \pi_k \hat{\mu}_k$$is the overall (weighted) sample mean used to center the centroids;
- PCA step: obtain the $L$ eigenvectors $v_l^*$ in $V^*$ of $B^* = V^*D_B V^{*T}$ corresponding to the $L$ largest eigenvalues. These define the coordinates of the optimal subspace.
- Obtain $L$ new (discriminant) variables $Z_l = (W^{-1/2} v_l^*)^T X$, for $l = 1, \cdots, L$.
Through this procedure, we reduce our data from $X_{(N \times p)}$ to $Z_{(N \times L)}$ and dimension from $p$ to $L$. Discriminant coordinate 1 and 2 in the previous wine plot are found by setting $L=2$. Repeating the previous LDA procedures for classification using the new data, $Z$, is called the reduced-rank LDA.
Fisher does not make any assumptions about the distribution of the data. Instead, he tries to find a “sensible” rule so that the classification task becomes easier. In particular, Fisher finds a linear combination of the original data, $Z=a^TX$, where the between-class variance, $B=cov(M)$, is maximized relative to the within-class variance, $W$.
The following figure shows why this rule makes intuitive sense. The rule sets out to find a direction, $a$, where, after projecting the data onto that direction, class means have maximum separation between them, and each class has minimum variance within them. The projection direction found under this rule, shown in the right plot, makes classification much easier.
Fisher's formula¶
Fisher constructs two numbers:
\begin{aligned} R \ni d_B & := \mathbf{a}^T \mathbf{B a} \\ R \ni d_W & := \mathbf{a}^T \mathbf{W a} \end{aligned}He wants to maximize $d_B$ (spread-out) and minimize $d_W$ (cluster within class). We can merge those two optimization problems as a regularized optimization problem:
$$\max_a \; d_B(a) - d_W(a), $$or: $$ \max _{\mathbf{a}} \frac{\mathbf{a}^T \mathbf{B a}}{\mathbf{a}^T \mathbf{W a}} . $$
The problem is equivalent to (see proof) $$ \begin{aligned} &\max _{\mathbf{a}} \mathbf{a}^T \mathbf{B a} \\ &\text { s.t. } \mathbf{a}^T \mathbf{W} \mathbf{a}=1 \end{aligned} $$ since the scaling of a does not affect the solution. Now, we could form the Lagrangian as
$$L = a^T B a - \lambda (a^T W a - 1)$$The solution is
$$Ba = \lambda W a \implies W^{-1}B a = \lambda a$$which is the eigenvalue and eigenvector problem and $a$ is the eigenvector of $W^{-1}B$. Because $W$ might be singular and not invertible, we will stretch the diagonal of $W$ slightly to make it full rank and invertible:
$$a = \text{eig} [(W+\epsilon I)^{-1} B] $$where $a$ is referred to as the Fisher direction or Fisher axis.
To summarize the developments so far:
Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to $W$, and classifying to the closest centroid (modulo $\log \pi_k$) in the sphered space.
Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.
Once we have transformed our dataset, we classify to the class $j = 1, \cdots, K$ for which
$$\frac{1}{2} \| \tilde{x} - \tilde{\mu}_j \|^2 - \log \hat{\pi}_j$$is smallest, where $\tilde{x} = Ax \in \mathbb{R}^{K-1}$ and $\tilde{\mu}_j = A \hat{\mu}_j$.
class FDA:
"""
Fisher Discriminant Analysis
"""
def __init__(self, n_components=None, kernel=None):
"""
n_components: the number of components (or coordinates)
kernel: kernel of X
"""
self.is_model_fitted = False
self.n_components = n_components
# projection matrix
self.proj_matrix = None
self.class_categories = []
self.class_priors = {}
self.class_means = {}
if kernel is not None:
self.kernel = kernel
else:
self.kernel = 'linear'
def fit(self, X, y):
"""
X: N by p matrix
y: N by 1 matrix
"""
# get the unique classes
self.class_categories = np.unique(y)
# feature dimension
p = X.shape[1]
N = X.shape[0]
K = len(self.class_categories)
# estimate sample mean
sample_mean = X.mean(axis=0).reshape(p, 1)
# estimate between class covariance
# between scatter matrix B or S, p by p
B = np.zeros((p, p))
for k in self.class_categories:
class_k_row_idx = np.where(y==k)[0]
n_k = len(class_k_row_idx)
X_filtered_class_k = X[class_k_row_idx, :]
self.class_priors[k] = float(n_k)/N
self.class_means[k] = np.mean(
X_filtered_class_k, axis=0
).reshape(p, 1)
between_mean_diff = self.class_means[k] - sample_mean
# calculate B
# some people use weighted one
# n_k = len(class_k_row_idx)
# B += n_k * between_mean_diff @ between_mean_diff.T
B += between_mean_diff @ between_mean_diff.T
# estimate pooled covariance W
class_k_covs = {}
W = 0
for k in self.class_categories:
class_k_row_idx = np.where(y==k)[0]
X_filtered_class_k = X[class_k_row_idx, :]
# each column as a variable
class_k_covs[k] = np.cov(X_filtered_class_k, rowvar=False)
# calculate pooled covariance
W += class_k_covs[k] * self.class_priors[k]
# rank the eigenvalues
epsilon = 0.00001
# use eig, not eigh: W^{-1} B is not symmetric in general, so eigh gives wrong eigenvectors
eig_val, eig_vec = np.linalg.eig(
np.linalg.pinv(W+epsilon*np.eye(W.shape[0])) @ B
)
# only take real values
eig_val = eig_val.real
eig_vec = eig_vec.real
# sort eigenvalues in descending order
eig_idx = eig_val.argsort()[::-1]
eig_val = eig_val[eig_idx]
eig_vec = eig_vec[:, eig_idx]
# get the coordinate matrix
if self.n_components is not None:
U = eig_vec[:, :self.n_components]
else:
U = eig_vec[:, :K-1]
self.proj_matrix = U
self.is_model_fitted = True
def transform(self, X):
"""
X: N by p matrix
projection matrix is p by K-1
"""
if not self.is_model_fitted:
raise NameError('Please fit the model first')
return X @ self.proj_matrix
def predict(self, X_transformed):
"""
X_transformed: sample size by p matrix [N, n_components <= K-1]
return: the predicted class [N, 1] matrix
"""
y_est = np.apply_along_axis(
self.__classify, 1, X_transformed
)
return y_est
def __classify(self, x_transformed):
"""
Private method
x_transformed:
feature vector for one observation [1, K-1] dimension
Returns: classified category
"""
if not self.is_model_fitted:
raise NameError('Please fit the model first')
# calculate the determinant score
classified_scores = {}
for k in self.class_categories:
# transform the class mean: (1 x p) @ (p x (K-1)) -> (1 x (K-1))
transformed_mean = self.class_means[k].T @ self.proj_matrix
mean_distance = np.linalg.norm(
x_transformed - transformed_mean
)
prior_hat = self.class_priors[k]
score = 0.5*np.square(mean_distance)-np.log(prior_hat)
classified_scores[k] = score
# !Warning: we want the minimal score now
return min(classified_scores, key=classified_scores.get)
# test the FDA
from sklearn import datasets # import the dataset
from sklearn.decomposition import PCA
wine = datasets.load_wine()
X = wine.data
y = wine.target
target_names = wine.target_names
# reduced dimension
wine_fda = FDA(n_components=2)
wine_fda.fit(X, y)
X_r_fda = wine_fda.transform(X)
X_r_pca = PCA(n_components=2).fit(X).transform(X)
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for i, target_name in zip([0, 1, 2], target_names):
sns.scatterplot(
x=X_r_fda[y == i, 0], y=X_r_fda[y == i, 1],
label=target_name,
alpha=0.7, ax=axes[0]
)
sns.scatterplot(
x=X_r_pca[y == i, 0], y=X_r_pca[y == i, 1],
label=target_name,
alpha=0.7, ax=axes[1]
)
sns.move_legend(axes[0], 'lower right')
axes[0].set_title('FDA for Wine dataset')
axes[1].set_title('PCA for Wine dataset')
axes[0].set_xlabel('Discriminant Coordinate 1')
axes[0].set_ylabel('Discriminant Coordinate 2')
axes[1].set_xlabel('PC 1')
axes[1].set_ylabel('PC 2');
# figure 4.10
dimension = range(1, 11)
train_mis_rate = []
test_mis_rate = []
# construct x and y
vowel_X = vowel.iloc[:, 1:].values
vowel_y = vowel.iloc[:, 0].values.reshape(-1, 1)
vowel_test_X = vowel_test.iloc[:, 1:].values
for d in dimension:
vowel_fda = FDA(n_components=d)
vowel_fda.fit(vowel_X, vowel_y)
vowel_X_transformed = vowel_fda.transform(vowel_X)
y_pred = vowel_fda.predict(vowel_X_transformed)
# accuracy rate
acc_rate = sum(vowel['y'] == y_pred) / vowel.shape[0]
train_mis_rate.append(1-acc_rate)
vowel_test_X_transformed = vowel_fda.transform(vowel_test_X)
y_pred = vowel_fda.predict(vowel_test_X_transformed)
acc_rate = sum(vowel_test['y'] == y_pred) / vowel_test.shape[0]
test_mis_rate.append(1-acc_rate)
# figure 4.10
fig, axes = plt.subplots(1, 1, figsize=(6, 4))
axes.plot(
dimension, test_mis_rate,
color='#E49E25', linestyle='--', marker='o',
markersize=4, dashes=(17, 5),
label='Test Data'
)
axes.plot(
dimension, train_mis_rate,
color='#5BB5E7', linestyle='--', marker='o',
markersize=4, dashes=(25, 5),
label='Train Data'
)
axes.set_title("LDA and Dimension Reduction on the Vowel Data")
axes.set_ylabel("Misclassification Rate")
axes.set_xlabel("Dimension")
axes.legend();
import matplotlib.pyplot as plt
from sklearn import datasets, svm, metrics
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
digits = datasets.load_digits()
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
plt.subplot(2, 4, index + 1)
plt.axis('off')
plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('Training: %i' % label)
plt.tight_layout()
X = digits.data
n_samples = X.shape[0]
y = digits.target
target_names = digits.target_names
# Create a classifier: a Fisher's LDA classifier
lda = LinearDiscriminantAnalysis(
n_components=4, solver='eigen', shrinkage=0.1
)
# Train lda on the first half of the digits
lda = lda.fit(X[:n_samples // 2], y[:n_samples // 2])
X_r_lda = lda.transform(X)
# Visualize transformed data on learnt discriminant coordinates
with plt.style.context('seaborn-talk'):
fig, axes = plt.subplots(1,2,figsize=[10,5])
for i, target_name in zip([0,1,2,3,4,5,6,7,8,9], target_names):
axes[0].scatter(
X_r_lda[y == i, 0], X_r_lda[y == i, 1],
alpha=.8, label=target_name,
marker='$%.f$'%i)
axes[1].scatter(
X_r_lda[y == i, 2], X_r_lda[y == i, 3],
alpha=.8, label=target_name,
marker='$%.f$'%i)
axes[0].set_xlabel('Discriminant Coordinate 1')
axes[0].set_ylabel('Discriminant Coordinate 2')
axes[1].set_xlabel('Discriminant Coordinate 3')
axes[1].set_ylabel('Discriminant Coordinate 4')
plt.tight_layout()
The above plot allows us to interpret the trained LDA classifier. For example, coordinate 1 helps to contrast 4’s and 2/3’s while coordinate 2 contrasts 0’s and 1’s. Subsequently, coordinate 3 and 4 help to discriminate digits not well-separated in coordinate 1 and 2. We test the trained classifier using the other half of the dataset. The report below summarizes the result.
n_samples = len(X)
# Predict the value of the digit on the second half:
expected = y[n_samples // 2:]
predicted = lda.predict(X[n_samples // 2:])
report = metrics.classification_report(expected, predicted)
print("Classification report:\n%s" % (report))
Classification report:
              precision    recall  f1-score   support
           0       0.96      0.98      0.97        88
           1       0.94      0.88      0.91        91
           2       0.97      0.90      0.93        86
           3       0.91      0.95      0.92        91
           4       1.00      0.91      0.95        92
           5       0.93      0.96      0.94        91
           6       0.97      0.99      0.98        91
           7       0.98      0.97      0.97        89
           8       0.91      0.85      0.88        88
           9       0.79      0.93      0.86        92
    accuracy                           0.93       899
   macro avg       0.94      0.93      0.93       899
weighted avg       0.93      0.93      0.93       899
The highest precision is 100% (digit 4), and the lowest is 79% (digit 9), a decent result knowing that the method was proposed some 70 years ago. Besides, we have not done anything to make the procedure better for this specific problem. For example, there is collinearity in the input variables, and the shrinkage parameter might not be optimal.
4.4 Logistic Regression¶
The logistic regression model arises from the desire to model the posterior probabilities of the $K$ classes via linear functions in $x$, ensuring the natural properties of the probability: They sum to one and remain in $[0,1]$.
The model has the form
\begin{aligned} \log\frac{\text{Pr}(G=1|X=x)}{\text{Pr}(G=K|X=x)} &= \beta_{10} + \beta_1^Tx \\ \log\frac{\text{Pr}(G=2|X=x)}{\text{Pr}(G=K|X=x)} &= \beta_{20} + \beta_2^Tx \\ &\vdots \\ \log\frac{\text{Pr}(G=K-1|X=x)}{\text{Pr}(G=K|X=x)} &= \beta_{(K-1)0} + \beta_{K-1}^Tx \\ \end{aligned}The model is specified in terms of $K-1$ log-odds or logit transformations, reflecting the constraint that the probabilities sum to one. The choice of denominator ($K$ in this case) is arbitrary in that the estimates are equivalent under this choice.
To emphasize the dependence on the entire parameter set $\theta = \left\lbrace \beta_{10}, \beta_1^T, \cdots, \beta_{(K-1)0}, \beta_{K-1}^T\right\rbrace$, we denote the probabilities
\begin{equation} \text{Pr}(G=k|X=x) = p_k(x;\theta) \end{equation}A simple calculation shows that
\begin{aligned} \text{Pr}(G=k|X=x) &= \frac{\exp(\beta_{k0}+\beta_k^Tx)}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)}, \text{ for } k=1,\cdots,K-1, \\ \text{Pr}(G=K|X=x) &= \frac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)}, \end{aligned}It is better to write it as the following form:
\begin{aligned} \text{Pr}(G=k|X=x) &= \frac{\exp(\beta_{k0}+\beta_k^Tx)}{\exp(0)+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)} \\ & = \frac{\exp(\beta_{k0}+\beta_k^Tx)}{\sum_{l=0}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)} \\ \text{Pr}(G=K|X=x) &= \frac{\exp(0)}{\sum_{l=0}^{K-1}\exp(\beta_{l0}+\beta_l^Tx)} \end{aligned}and they clearly sum to one.
When $K=2$, this model is especially simple, since there is only a single linear function.
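As a quick numeric check of the softmax form above (made-up coefficients, $K=3$, class $K$ as the reference with linear predictor $0$), the posteriors indeed sum to one:
import numpy as np
betas = [np.array([0.5, 1.0, -1.0]),      # (beta_10, beta_1) for class 1, assumed values
         np.array([-0.2, 0.3, 0.8])]      # (beta_20, beta_2) for class 2, assumed values
x = np.array([1.0, 0.4, -0.7])            # input with a leading 1 for the intercept
eta = np.array([b @ x for b in betas] + [0.0])   # class K gets linear predictor 0
probs = np.exp(eta) / np.exp(eta).sum()          # softmax
print(probs, probs.sum())                        # posteriors for classes 1..K; sum is 1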
Fitting logistic regression models¶
Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of $G$ given $X$. Since $\text{Pr}(G|X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate.
The log-likelihood for $N$ observations is
\begin{equation} l(\theta) = \sum_{i=1}^N \log p_{g_i}(x_i;\theta), \end{equation}where $p_k(x_i;\theta) = \text{Pr}(G=k|X=x_i;\theta)$
We discuss in detail the two-class case, since the algorithms simplify considerably. It is convenient to code the two-class $g_i$ via a $0/1$ response $y_i$, where $y_i=1$ when $g_i=1$, and $0$ otherwise. Then we can let
\begin{aligned} p_1(x;\theta) &= p(x;\theta) , \\ p_2(x;\theta) &= 1- p(x;\theta). \\ \end{aligned}It is easy to show that, for the two-class case,
$$ \log\frac{p(x_i;\beta)}{1-p(x_i;\beta)} = \beta^Tx_i.$$The log-likelihood can be written
\begin{aligned} l(\beta) &= \sum_{i=1}^N \left\lbrace y_i\log p(x_i;\beta) + (1-y_i)\log(1-p(x_i;\beta)) \right\rbrace \\ & = \sum_{i=1}^N \left\lbrace y_i [\log p(x_i;\beta) - \log(1-p(x_i;\beta))] + \log(1-p(x_i;\beta)) \right\rbrace \\ & = \sum_{i=1}^N \left\lbrace y_i \log \frac{p(x_i;\beta)}{1-p(x_i;\beta)} + \log(1-p(x_i;\beta)) \right\rbrace \\ &= \sum_{i=1}^N \left\lbrace y_i\beta^Tx_i - \log(1+\exp(\beta^Tx_i)) \right\rbrace, \end{aligned}where $\beta^T = \lbrace \beta_{10}, \beta_1^T \rbrace$, and we assume that the vector of inputs $x_i$ includes the constant term 1 to accommodate the intercept.
To maximize the log-likelihood, we set its derivatives to zero. These score equations are
$$ \frac{\partial l(\beta)}{\partial\beta} = \sum_{i=1}^N x_i(y_i-p(x_i;\beta)) = 0, $$which are $p+1$ equations nonlinear in $\beta$. Notice that since $x_{i1} =1$, the first score equation specifies that
$$ \sum_{i=1}^N y_i = \sum_{i=1}^N p(x_i;\beta), $$implying that the expected number of class ones matches the observed number (and hence also class twos).
Tips
- For the case $K = 2$, we use the sigmoid function;
- For the case $K > 2$, we use the softmax function, and the log-likelihood function will be different.
Newton-Raphson algorithm for binary case¶
To solve the score equation, we use the Newton-Raphson algorithm, which requires the second-derivative or Hessian matrix
\begin{equation} \frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} = -\sum_{i=1}^N x_ix_i^T p(x_i;\beta)(1-p(x_i;\beta)). \end{equation}Starting with $\beta^{\text{old}}$, a single Newton update is
\begin{equation} \beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} \right)^{-1} \frac{\partial l(\beta)}{\partial\beta}, \end{equation}where the derivatives are evaluated at $\beta^{\text{old}}$.
Iteration in matrix notation for binary case¶
Let
- $\mathbf{y}$ be the vector of $y_i$ values,
- $\mathbf{X}$ the $N\times (p+1)$ matrix of $x_i$ values,
- $\mathbf{p}$ the vector of fitted probabilities with $i$th element $p(x_i;\beta^{\text{old}})$, and
- $\mathbf{W}$ the $N\times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i;\beta^{\text{old}})(1-p(x_i;\beta^{\text{old}}))$.
Then we have
\begin{aligned} \frac{\partial l(\beta)}{\partial\beta} &= \mathbf{X}^T(\mathbf{y}-\mathbf{p}) \\ \frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} &= -\mathbf{X}^T\mathbf{WX}, \end{aligned}and thus the Newton step is
\begin{aligned} \beta^{\text{new}} &= \beta^{\text{old}} + (\mathbf{X}^T\mathbf{WX})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p}) \\ &= (\mathbf{X}^T\mathbf{WX})^{-1} \mathbf{X}^T\mathbf{W}\left( \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}) \right) \\ &= (\mathbf{X}^T\mathbf{WX})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}, \end{aligned}where we have re-expressed the Newton step as weighted least squares step, with the response
$$ \mathbf{z} = \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}), $$sometimes known as the adjusted response.
These equations for the Newton step get solved repeatedly, since at each iteration $\mathbf{p}$ changes, and hence so do $\mathbf{W}$ and $\mathbf{z}$. This algorithm is referred to as iteratively reweighted least squares or IRLS, since each iteration solves the weighted least squares problem:
$$ \beta^{\text{new}} \leftarrow \arg\min_\beta (\mathbf{z}-\mathbf{X}\beta)^T\mathbf{W}(\mathbf{z}-\mathbf{X}\beta) $$It seems that $\beta=0$ is a good starting value, although convergence is never guaranteed. Typically the algorithm does converge, since the log-likelihood is concave, but overshooting can occur. In the rare cases that the log-likelihood decreases, step size halving will guarantee convergence.
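To make the weighted least squares view concrete, here is a minimal sketch of one IRLS update (our own helper, assuming x, y and beta_old have the shapes described above):

import numpy as np

def irls_step(x, y, beta_old):
    # one Newton/IRLS update expressed as a weighted least squares solve
    p = 1 / (1 + np.exp(-(x @ beta_old)))           # fitted probabilities, n by 1
    w = (p * (1 - p)).flatten()                     # diagonal of W
    z = x @ beta_old + (y - p) / w.reshape(-1, 1)   # adjusted response
    xtw = x.T * w                                   # X^T W without building the n x n matrix
    return np.linalg.solve(xtw @ x, xtw @ z)        # (X^T W X)^{-1} X^T W z

Iterating irls_step until the change in beta falls below a tolerance reproduces the Newton iterations implemented later in this section.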
A simplified derivation for binary case¶
Using the sigmoid function, we have
$$p(x; \beta) = \sigma(z) = \frac{e^z}{1+e^z} = \frac{1}{1+ \exp(-z)}$$where
$$ z = X\beta := w\cdot x + b$$The derivative of the sigmoid is very elegant:
$$\frac{d \sigma }{d z } = \sigma(z) [1- \sigma(z)]$$Then, we have
\begin{aligned} \frac{d p}{d \beta} & = \frac{d \sigma}{d z} \frac{d z}{d \beta} \\ & = \sigma(z) [1- \sigma(z)] X^T \\ & = p(x; \beta)[1 - p(x; \beta)] X^T \end{aligned}For the following equation,
$$ \frac{\partial l(\beta)}{\partial\beta} = \sum_{i=1}^N x_i(y_i-p(x_i;\beta)) $$Its second derivative is easy to derive now:
$$ \frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} = -\sum_{i=1}^N x_ix_i^T p(x_i;\beta)(1-p(x_i;\beta)). $$Tips
Neural networks use the sigmoid and softmax functions heavily, and their loss function is just the negative of the log-likelihood function. Therefore, we will go through these derivations.
Deriving the first derivative takes longer and involves more steps, but we can still break it down into a few components.
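Before working through it, the sigmoid-derivative identity above is easy to confirm numerically with a central finite difference (a throwaway check, not part of the derivation; names are our own):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z0, h = 0.7, 1e-6
finite_diff = (sigmoid(z0 + h) - sigmoid(z0 - h)) / (2 * h)
analytic = sigmoid(z0) * (1 - sigmoid(z0))
abs(finite_diff - analytic)   # on the order of 1e-10, so the identity checks out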
For our log-likelihood function
$$ L(\beta) = \sum_{i=1}^N \left\lbrace y_i\log p(x_i;\beta) + (1-y_i)\log(1-p(x_i;\beta)) \right\rbrace $$it is easy to see that
$$\frac{d L}{d p} = \frac{y}{p} - \frac{1-y}{1-p}$$We have already shown that
$$\frac{d p}{d \beta} = p(x; \beta)[1 - p(x; \beta)] X^T = p(1 - p)X^T$$Now, let's put them together
\begin{aligned} \frac{dL}{d \beta} & = \frac{d L}{d p} \frac{dp}{d \beta} \\ & = \big [ \frac{y}{p} - \frac{1-y}{1-p} \big ] [p(1 - p)X^T] \\ & = \frac{y-p}{p(1-p)}[p(1 - p)X^T] \\ & = (y-p)X^T \end{aligned}which is equivalent to
$$ \frac{\partial l(\beta)}{\partial\beta} = \sum_{i=1}^N x_i(y_i-p(x_i;\beta)) $$Tips
Since $p$, $1-p$, and $p(1-p)$ are scalars for each observation, we place them on the diagonal of a matrix when we implement the computation in matrix form.
# implement a simple logistic regression for the binary case,
# fitted with the Newton-Raphson algorithm (there is no closed-form solution)
def sigmoid(x):
    # elementwise sigmoid function
    return 1/(1+np.exp(-x))
def gradient(x, y, beta):
    """
    The first derivative (score) of the log-likelihood function
    Based on equation (4.24 or 4.21)
    x: n by p+1 matrix (with constant)
    y: n by 1 vector
    beta: p+1 by 1 vector
    Return: (p+1) by 1 column vector
    """
    z = x @ beta         # n by 1 vector
    phat = sigmoid(z)    # n by 1 vector of fitted probabilities
    return x.T @ (y - phat)
def hessian(x, beta):
    """
    The second derivative (Hessian) of the log-likelihood function
    Based on equation (4.25 or 4.23)
    x: n by p+1 matrix
    beta: parameters, p+1 by 1
    Return: p+1 by p+1 matrix
    """
    z = x @ beta         # n by 1 vector
    phat = sigmoid(z)    # n by 1 vector
    # flatten phat so np.diag builds a matrix instead of extracting a diagonal
    # ! Important step
    phat = phat.flatten()
    # diagonal weight matrix with p(1-p) on the diagonal
    w = np.diag(phat*(1-phat))   # n by n matrix
    return -x.T @ w @ x
def newton(x, y, beta_0, G, H, epsilon, max_iter):
    """
    Newton's method to estimate the parameters
    beta_0: initial values, p+1 by 1
    epsilon: convergence tolerance
    G: gradient function
    H: hessian function
    """
    # !Important to copy, otherwise the caller's array would be modified in place
    beta = beta_0.copy()
    # initialize the difference between the old and new beta
    delta = 1   # should be positive
    iteration = 0
    while delta > epsilon and iteration < max_iter:
        # save the current value before updating
        beta_old = beta.copy()
        # Newton update: [p+1, p+1] @ [p+1, 1]
        beta = beta - np.linalg.inv(H(x, beta)) @ G(x, y, beta)
        # update the convergence measure
        delta = np.max(np.abs(beta - beta_old))
        iteration += 1
    # calculate the standard errors:
    # std err is the sqrt of the diagonal of the inverse of the negative Hessian
    neg_hessian = -H(x, beta)
    neg_hessian_inv = np.linalg.inv(neg_hessian)
    std_err = np.sqrt(np.diag(neg_hessian_inv))
    # one could also estimate the standard errors by
    # bootstrapping (e.g. 100 estimates with different resamples)
    return beta, std_err
Tips
Later we will simply use a loss function and gradient descent to estimate the coefficients; this typically requires more iterations, since it relies only on first-order derivative information.
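For comparison, a minimal gradient-ascent sketch (reusing the gradient function defined above; the function name, learning rate and iteration count are our own arbitrary choices) might look like:

def gradient_ascent(x, y, beta_0, G, lr=1e-5, max_iter=50000):
    # maximize the log-likelihood by repeatedly stepping along its gradient
    # (in practice, standardize the features first so a single learning rate works well)
    beta = beta_0.copy()
    for _ in range(max_iter):
        beta += lr * G(x, y, beta)   # ascent step, since we maximize l(beta)
    return beta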
Example: South African heart disease (binary case)¶
Here we present an analysis of binary data to illustrate the traditional statistical use of the logistic regression model.
# read heart disease data
sa_heart = pd.read_csv('./data/heart/SAheart.data', index_col=0)
sa_heart.head()
| row.names | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 160 | 12.00 | 5.73 | 23.11 | Present | 49 | 25.30 | 97.20 | 52 | 1 |
| 2 | 144 | 0.01 | 4.41 | 28.61 | Absent | 55 | 28.87 | 2.06 | 63 | 1 |
| 3 | 118 | 0.08 | 3.48 | 32.28 | Present | 52 | 29.14 | 3.81 | 46 | 0 |
| 4 | 170 | 7.50 | 6.41 | 38.03 | Present | 51 | 31.99 | 24.26 | 58 | 1 |
| 5 | 134 | 13.60 | 3.50 | 27.78 | Present | 60 | 25.99 | 57.34 | 49 | 1 |
# drop two of the variables that we will not use in this example
sa_heart.pop('adiposity')
sa_heart.pop('typea')
# create a 0/1 dummy variable for famhist
sa_heart['famhist'] = sa_heart['famhist'].map(
    {
        'Present': 1, 'Absent': 0
    }
)
sa_heart.head()
| row.names | sbp | tobacco | ldl | famhist | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|
| 1 | 160 | 12.00 | 5.73 | 1 | 25.30 | 97.20 | 52 | 1 |
| 2 | 144 | 0.01 | 4.41 | 0 | 28.87 | 2.06 | 63 | 1 |
| 3 | 118 | 0.08 | 3.48 | 1 | 29.14 | 3.81 | 46 | 0 |
| 4 | 170 | 7.50 | 6.41 | 1 | 31.99 | 24.26 | 58 | 1 |
| 5 | 134 | 13.60 | 3.50 | 1 | 25.99 | 57.34 | 49 | 1 |
# prepare x and y
sa_df_y = sa_heart.pop('chd')
sa_y = sa_df_y.values.reshape(-1, 1)
sa_x = sa_heart.values
# plot the scatter matrix, colored by chd status
colors = sa_df_y.apply(lambda y: 'C1' if y else 'C0')
pd.plotting.scatter_matrix(sa_heart, color=colors, figsize=(10, 10));
# fit logistic model
n, p = sa_x.shape
x_const = np.hstack([np.ones((n, 1)), sa_x])
beta_initial = np.zeros((p+1, 1))
beta_est, std_errors = newton(
x_const, sa_y, beta_initial, gradient, hessian,
1e-6, 100
)
print('{0:>15} {1:>15} {2:>15} {3:>15}'.format('Term', 'Coefficient',
                                               'Std. Error', 'Z Score'))
print('-'*64)
table_term = ['intercept'] + list(sa_heart.columns)
for term, coeff, std_err in zip(table_term, beta_est, std_errors):
    print('{0:>15} {1:>15.3f} {2:>15.3f} {3:>15.3f}'.format(
        term, float(coeff), float(std_err),
        float(coeff)/float(std_err)
    ))
           Term     Coefficient      Std. Error         Z Score
----------------------------------------------------------------
      intercept          -4.130           0.964          -4.283
            sbp           0.006           0.006           1.023
        tobacco           0.080           0.026           3.034
            ldl           0.185           0.057           3.218
        famhist           0.939           0.225           4.177
        obesity          -0.035           0.029          -1.187
        alcohol           0.001           0.004           0.136
            age           0.043           0.010           4.181
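As a sanity check on the first score equation derived earlier, the fitted probabilities should sum to the observed number of cases; a small sketch reusing the objects defined above:

# the intercept score equation implies sum(y_i) = sum(p(x_i; beta_hat))
fitted_p = sigmoid(x_const @ beta_est)
sa_y.sum(), fitted_p.sum()   # the two totals agree up to the convergence tolerance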
From the second derivative we obtain the Fisher information
$$I(\beta) = - E \left[\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T}\right]= E[\mathbf{X}^T\mathbf{W}\mathbf{X}]$$The approximate sampling distribution of $\hat{\beta}$ for large $N$ is
$$N\left[\beta, (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\right]$$How does one interpret the $\textsf{tobacco}$ coefficient of $0.081$ ($\text{Std. Error} = 0.026$), for example?
An increase of $1\text{kg}$ in lifetime tobacco usage accounts for an increase in the odds of coronary heart disease of $\exp(0.081)=1.084$ or $8.4\%$.
Incorporating the standard error we get an approximate $95\%$ confidence interval of
$$ \exp(0.081 \pm 2\times 0.026) = (1.03, 1.14). $$We see that some of the variables have nonlinear effects, and when modeled appropriately, are not excluded from the model.
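This interval is just the exponentiated endpoints of the coefficient plus or minus two standard errors; a one-line check:

import numpy as np

coef, se = 0.081, 0.026
np.exp(coef), np.exp([coef - 2 * se, coef + 2 * se])   # about 1.084 and (1.03, 1.14)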
LDA or Logistic Regression¶
It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors.
LDA can also be used as a dimension-reduction tool.
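To see how close the two methods are in practice, here is a minimal scikit-learn sketch comparing logistic regression and LDA on the same data (our own addition, reusing the sa_x and sa_y arrays prepared above):

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

logit_clf = LogisticRegression(max_iter=1000).fit(sa_x, sa_y.ravel())
lda_clf = LinearDiscriminantAnalysis().fit(sa_x, sa_y.ravel())
# the in-sample accuracies of the two linear classifiers are typically very close
logit_clf.score(sa_x, sa_y.ravel()), lda_clf.score(sa_x, sa_y.ravel())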