Logistic regression is used in a variety of fields when the goal of the analysis is to predict the likelihood, or probability, of a particular event occurring. We typically denote instances where the event occurs as a “success” and instances where it does not as a “failure”. These names, however, are often tailored to the particular application of interest. For instance, we may opt to use “spam” and “not spam” in email spam detection and “cat” and “dog” in object recognition. logitr was built with applications like these in mind. To better understand the code contained in the package, we provide a brief mathematical summary in the following paragraphs.
Let us consider a two-class classification problem. We denote the response for the \(i\)th observation by \(y_{i}\), which takes values in the set \(\left \{ 0, 1 \right \}\). In this setup (without loss of generality), “1” is the success and “0” is the failure. Furthermore, we collect our observed variables \(x_{1}, ..., x_{p}\) into a column vector \(x_{i} = (1, x_{i1}, x_{i2}, ..., x_{ip})^{T}\). Given the information in \(x_{i}\), the goal of logistic regression is to predict the probability that \(y_{i} = 1\); that is, the conditional probability \(P(Y_{i} = 1 | X_{i} = x_{i}) = 1 - P(Y_{i} = 0 | X_{i} = x_{i})\).
Once this probability is calculated, we can define a decision rule (denoted by \(\phi\)) that classifies observation \(i\) as either “0” or “1”. Intuitively, if the probability of observation \(i\) being in class “1” is greater than the probability of it being in class “0”, then we should classify observation \(i\) as class “1” (and “0” otherwise). Indeed, this is the optimal decision rule given to us by Bayes’ rule. The decision rule is the following:
\[ \phi(x_{i}) = \begin{cases} 1 & \text{ if } P(Y_{i} = 1 | X_{i} = x_{i}) \geq P(Y_{i} = 0 | X_{i} = x_{i}) \\ 0 & \text{ otherwise } \end{cases} \]
This decision rule can be summed up as choosing the class that has the greatest probability conditional on the observed information \(x_{i}\). An equivalent decision rule uses what is called the log-odds ratio:
\[ \phi(x_{i}) = \begin{cases} 1 & \text{ if } \log\left (\frac{P(Y_{i} = 1 | X_{i} = x_{i})}{P(Y_{i} = 0 | X_{i} = x_{i})} \right ) \geq 0 \\ 0 & \text{ otherwise } \end{cases} \]
where the \(\log(\cdot)\) term is the log-odds ratio.
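As a quick illustration, this rule is a one-liner in R (the probabilities below are made-up numbers, not output from the package):

# hypothetical conditional probabilities P(Y_i = 1 | X_i = x_i) for five observations
p1 = c(0.91, 0.42, 0.50, 0.08, 0.77)
# classify via the log-odds ratio: predict "1" when log(p1/(1 - p1)) >= 0,
# which is equivalent to predicting "1" when p1 >= 0.5
yhat = ifelse(log(p1/(1 - p1)) >= 0, 1, 0)
yhat
## [1] 1 0 1 0 1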
This formulation is convenient because in logistic regression we make the critical assumption that this log-odds ratio is linear in form. That is, we can approximate the log-odds ratio using some vector of coefficients \(\beta = (\beta_{0}, \beta_{1}, ..., \beta_{p})^{T}\):
\[ \log \left (\frac{P(Y_{i} = 1 | X_{i} = x_{i})}{P(Y_{i} = 0 | X_{i} = x_{i})} \right ) \approx x_{i}^{T}\beta \]
This implies the following:
\[\begin{align} \log &\left (\frac{P(Y_{i} = 1 | X_{i} = x_{i})}{P(Y_{i} = 0 | X_{i} = x_{i})} \right ) = \log \left (\frac{P(Y_{i} = 1 | X_{i} = x_{i})}{1 - P(Y_{i} = 1 | X_{i} = x_{i})} \right ) \approx x_{i}^{T}\beta \\ &\Rightarrow \frac{P(Y_{i} = 1 | X_{i} = x_{i})}{1 - P(Y_{i} = 1 | X_{i} = x_{i})} \approx \exp(x_{i}^{T}\beta) \\ &\Rightarrow P(Y_{i} = 1 | X_{i} = x_{i}) \approx \frac{\exp(x_{i}^{T}\beta)}{1 + \exp(x_{i}^{T}\beta)} \equiv p(x_{i}) \mbox{ (say)} \end{align}\]
We see that the conditional probability \(P(Y_{i} = 1 | X_{i} = x_{i})\) can be approximated by the function \(p(x_{i})\) – the inverse of the logit (log-odds) transform, often called the logistic function (hence the name of our package!). We will show how to solve for the optimal vector of coefficients (\(\hat{\beta}\)) in the next section.
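For concreteness, \(p(x_{i})\) is easy to write down in R. The helper below (our own illustrative code, not part of the package) maps a vector of linear predictors \(x_{i}^{T}\beta\) to probabilities:

# p(x) = exp(x^T beta)/(1 + exp(x^T beta)), applied to a vector of linear predictors
plogit = function(eta) {
    exp(eta)/(1 + exp(eta))
}
# example: linear predictors of -2, 0, and 3
plogit(c(-2, 0, 3))
## [1] 0.1192029 0.5000000 0.9525741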
The classical statistical method of optimization for linear and generalized linear models is called maximum likelihood estimation. In short, we want to maximize the probability of our observed data. This requires defining a probability distribution. Recall that our formulation assumes \(Y_{i}\) is some random variable that takes values in the set \(\left \{ 0, 1 \right \}\). By definition, this random variable follows a Bernoulli distribution:
\[ P(Y_{i} = y_{i} | X_{i} = x_{i}) = \begin{cases} p(x_{i}) & \text{ if } y_{i} = 1\\ 1 - p(x_{i}) & \text{ if } y_{i} = 0 \end{cases} \]
Equivalently, we can write this as:
\[ P(Y_{i} = y_{i} | X_{i} = x_{i}) = p(x_{i})^{y_{i}}(1 - p(x_{i}))^{1 - y_{i}} \]
Having defined a probability distribution, we can now construct our (log) likelihood function (assume an iid setting):
\[\begin{align} l(\beta) = \log L(\beta) &= \log \left [ \prod_{i = 1}^{n}p(x_{i})^{y_{i}}(1 - p(x_{i}))^{1 - y_{i}} \right ] \\ &= \log \left [\prod_{i = 1}^{n} \left (\frac{\exp(x_{i}^{T}\beta)}{1 + \exp(x_{i}^{T}\beta)} \right)^{y_{i}} \left (1 - \frac{\exp(x_{i}^{T}\beta)}{1 + \exp(x_{i}^{T}\beta)} \right)^{1 - y_{i}} \right] \\ &= \log \left [\prod_{i = 1}^{n} \left (\frac{\exp(x_{i}^{T}\beta)}{1 + \exp(x_{i}^{T}\beta)} \right)^{y_{i}} \left (\frac{1}{1 + \exp(x_{i}^{T}\beta)} \right)^{1 - y_{i}} \right] \\ &= \log \left [\prod_{i = 1}^{n} (\exp(x_{i}^{T}\beta))^{y_{i}}(1 + \exp(x_{i}^{T}\beta))^{-1} \right] \\ &= \sum_{i = 1}^{n}y_{i}(x_{i}^{T}\beta) - \sum_{i = 1}^{n}\log(1 + \exp(x_{i}^{T}\beta)) \end{align}\]
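The final expression translates directly into R. Below is a minimal sketch of \(l(\beta)\) (again an illustrative helper of ours, not the package internals), written for a design matrix X, a 0/1 response y, and a coefficient vector beta:

# log-likelihood: l(beta) = sum_i y_i x_i^T beta - sum_i log(1 + exp(x_i^T beta))
loglik = function(beta, X, y) {
    eta = as.numeric(X %*% beta)
    sum(y * eta) - sum(log(1 + exp(eta)))
}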
Maximum likelihood estimation involves finding the \(\beta\) vector that maximizes the log-likelihood function – or equivalently, minimizes the negative log-likelihood:
\[ \hat{\beta} = \arg\min_{\beta} -l(\beta) \]
In addition, we can add regularization terms if we so choose:
\[ \hat{\beta} = \arg\min_{\beta} -l(\beta) + \frac{\lambda}{2}\sum_{j = 1}^{p}\beta_{j}^{2} \]
\[ \hat{\beta} = \arg\min_{\beta} -l(\beta) + \lambda\frac{1}{\alpha}\sum_{j = 1}^{p}\left|\beta_{j} \right|^{\alpha} \]
where \(\lambda \geq 0\) and \(\alpha \in \left(1, 2 \right)\). Note that we do not penalize the intercept in either method. Algorithms for computing \(\hat{\beta}\) will be the subject of our next section.
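The two penalized objectives can be sketched in the same way (illustrative helpers that reuse loglik() above; here the first entry of beta is assumed to be the unpenalized intercept):

# ridge-penalized negative log-likelihood
obj_ridge = function(beta, X, y, lam) {
    -loglik(beta, X, y) + (lam/2) * sum(beta[-1]^2)
}
# bridge-penalized negative log-likelihood, alpha in (1, 2)
obj_bridge = function(beta, X, y, lam, alpha) {
    -loglik(beta, X, y) + (lam/alpha) * sum(abs(beta[-1])^alpha)
}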
Note the gradient and Hessian of the log-likelihood:
\[\begin{align*} \nabla l(\beta) &= \sum_{i = 1}^{n}y_{i}x_{i} - \sum_{i = 1}^{n}\left (\frac{x_{i}\exp(x_{i}^{T}\beta)}{1 + \exp(x_{i}^{T}\beta)} \right) = \sum_{i = 1}^{n}y_{i}x_{i} - \sum_{i = 1}^{n}x_{i}p(x_{i}) \\ & \\ \nabla^{2} l(\beta) &= -\sum_{i = 1}^{n}\left (\frac{x_{i}x_{i}^{T}\exp(x_{i}^{T}\beta)(1 + \exp(x_{i}^{T}\beta)) - x_{i}x_{i}^{T}\exp(x_{i}^{T}\beta)\exp(x_{i}^{T}\beta)}{(1 + \exp(x_{i}^{T}\beta))^{2}} \right) \\ &= -\sum_{i = 1}^{n}\left (\frac{x_{i}x_{i}^{T}\exp(x_{i}^{T}\beta)}{(1 + \exp(x_{i}^{T}\beta))^{2}} \right) = -\sum_{i = 1}^{n}x_{i}x_{i}^{T}p(x_{i})(1 - p(x_{i})) = -\sum_{i = 1}^{n}x_{i}x_{i}^{T}w(x_{i}) \mbox{ (say)} \end{align*}\]
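Both expressions can be coded up in a few lines (illustrative helpers reusing plogit() from above, not the package internals):

# gradient of l(beta): sum_i (y_i - p(x_i)) x_i = X^T (y - p)
loglik_grad = function(beta, X, y) {
    p = plogit(as.numeric(X %*% beta))
    as.numeric(t(X) %*% (y - p))
}
# Hessian of l(beta): -sum_i x_i x_i^T w(x_i) = -X^T W X, W = diag(p(x_i)(1 - p(x_i)))
loglik_hess = function(beta, X) {
    p = plogit(as.numeric(X %*% beta))
    w = p * (1 - p)
    -t(X) %*% (w * X)
}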
The first algorithm we propose for computing the optimal \(\beta\) is called iteratively reweighted least squares (IRLS). The reasoning behind the name will be clear once the method is fully derived. For convenience, in this section we will assume that there is no intercept.
By Taylor’s theorem (the basis of the Newton-Raphson method), a second-order expansion of the negative log-likelihood around \(\beta^{(k)}\) gives
\[ -l(\beta) \approx -l(\beta^{(k)}) + \frac{1}{2}(\beta - \beta^{(k)})^{T}(\nabla^{2} -l(\beta^{(k)}))(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) \]
where \(\beta^{(k)}\) is some estimate of the true \(\beta\); in this case, \(\beta^{(k)}\) is the estimate at the \(k\)th iteration. Dropping constants and adding our regularization term (only the ridge penalty here – the bridge penalty will be handled by the MM algorithm), we see the following:
\[\begin{align*} \beta^{(k + 1)} &= \arg\min_{\beta} \left \{ \frac{1}{2}(\beta - \beta^{(k)})^{T}(\nabla^{2} -l(\beta^{(k)}))(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) + \frac{\lambda}{2}\sum_{j = 1}^{p}\beta_{j}^{2} \right\} \\ &= \arg\min_{\beta} \left \{ \frac{1}{2}(\beta - \beta^{(k)})^{T}\left(\sum_{i = 1}^{n}x_{i}x_{i}^{T}w_{i}^{(k)}\right)(\beta - \beta^{(k)}) -\sum_{i = 1}^{n}(y_{i} - p_{i}^{(k)})x_{i}^{T}(\beta - \beta^{(k)}) + \frac{\lambda}{2}\sum_{j = 1}^{p}\beta_{j}^{2} \right \} \\ &= \arg\min_{\beta} \left \{ \frac{1}{2}\sum_{i = 1}^{n}w_{i}^{(k)}(x_{i}^{T}\beta - x_{i}^{T}\beta^{(k)})^{2} - \sum_{i = 1}^{n}w_{i}^{(k)}(x_{i}^{T}\beta - x_{i}^{T}\beta^{(k)})\frac{y_{i} - p_{i}^{(k)}}{w_{i}^{(k)}} + \frac{\lambda}{2}\sum_{j = 1}^{p}\beta_{j}^{2} \right \} \\ &= \arg\min_{\beta} \left \{\frac{1}{2}\sum_{i = 1}^{n}w_{i}^{(k)}\left (\frac{y_{i} - p_{i}^{(k)}}{w_{i}^{(k)}} + x_{i}^{T}\beta^{(k)} - x_{i}^{T}\beta \right)^{2} + \frac{\lambda}{2}\sum_{j = 1}^{p}\beta_{j}^{2} \right \} \\ &= \arg\min_{\beta} \left \{\frac{1}{2}\sum_{i = 1}^{n}w_{i}^{(k)}(z_{i}^{(k)} - x_{i}^{T}\beta)^{2} + \frac{\lambda}{2}\sum_{j = 1}^{p}\beta_{j}^{2} \right \} \\ &= \arg\min_{\beta} \left \{\frac{1}{2}(z^{(k)} - X\beta)^{T}W^{(k)}(z^{(k)} - X\beta) + \frac{\lambda}{2}\beta^{T}\beta \right \} \end{align*}\]
where \(z^{(k)}\) is the column vector with \(i\)th entry \(z_{i}^{(k)} = \frac{y_{i} - p_{i}^{(k)}}{w_{i}^{(k)}} + x_{i}^{T}\beta^{(k)}\) for the \(k\)th iteration. Further, \(W^{(k)}\) is the diagonal matrix with \(i\)th diagonal element \(w_{i}^{(k)} = p_{i}^{(k)}(1 - p_{i}^{(k)})\), where \(p_{i}^{(k)}\) is \(p(x_{i})\) evaluated at \(\beta^{(k)}\), and \(X\) is the \(n \times p\) matrix containing all \(n\) (sample size) observations.
We can see that this is a weighted least squares problem! The iterative aspect comes from the fact that we solve this minimization problem repeatedly for \(k = 0, 1, ..., K\) until convergence. Setting the gradient equal to zero implies that,
\[ \beta^{(k + 1)} = (X^{T}W^{(k)}X + \lambda I_{p})^{-1}X^{T}W^{(k)}z^{(k)} \]
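A single iteration of this update can be written directly with solve() (a bare-bones sketch reusing plogit() from above; no intercept, as assumed in this section):

# one IRLS update: beta^(k+1) = (X^T W X + lam I_p)^(-1) X^T W z
irls_step = function(beta, X, y, lam) {
    eta = as.numeric(X %*% beta)
    p = plogit(eta)        # p_i^(k)
    w = p * (1 - p)        # diagonal of W^(k)
    z = (y - p)/w + eta    # working response z^(k)
    solve(t(X) %*% (w * X) + lam * diag(ncol(X)), t(X) %*% (w * z))
}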
If \(p\) is particularly large (say \(p \gg n\)), then we can solve an alternative formulation:
\[\beta^{(k + 1)} = X^{T}\sqrt{W^{(k)}}(\sqrt{W^{(k)}}XX^{T}\sqrt{W^{(k)}} + \lambda I_{n})^{-1}\sqrt{W^{(k)}}z^{(k)} \]
This formulation makes use of what’s called the “kernel trick”. Note that in the latter formulation, we only invert an \(n \times n\) matrix (as opposed to a \(p \times p\) matrix). This allows us to save significant computation time when \(p\) is particularly large. The logitr package will solve this formulation by default when \(p > n\).
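The two formulations are algebraically identical (by the push-through identity \((X^{T}WX + \lambda I_{p})^{-1}X^{T}W^{1/2} = X^{T}W^{1/2}(W^{1/2}XX^{T}W^{1/2} + \lambda I_{n})^{-1}\)), which we can verify numerically on simulated data (a quick sanity check, not package code):

set.seed(1)
n = 10; p = 50                   # p >> n
Xs = matrix(rnorm(n * p), n, p)  # simulated design matrix
zs = rnorm(n)                    # simulated working response
ws = runif(n, 0.1, 0.25)         # weights p(x_i)(1 - p(x_i)) lie in (0, 1/4]
lam = 0.5
W_half = diag(sqrt(ws))
# primal form: invert a p x p matrix
b1 = solve(t(Xs) %*% diag(ws) %*% Xs + lam * diag(p), t(Xs) %*% diag(ws) %*% zs)
# kernel ("dual") form: invert an n x n matrix
b2 = t(Xs) %*% W_half %*% solve(W_half %*% Xs %*% t(Xs) %*% W_half + lam * diag(n), W_half %*% zs)
max(abs(b1 - b2))                # effectively zero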
Putting everything together, we have the IRLS algorithm:
Algorithm:
1. Initialize \(\beta^{(0)}\) (for instance, at zero).
2. For \(k = 0, 1, ..., K\): compute \(p_{i}^{(k)}\), \(w_{i}^{(k)}\), and the working response \(z^{(k)}\); then update \(\beta^{(k + 1)} = (X^{T}W^{(k)}X + \lambda I_{p})^{-1}X^{T}W^{(k)}z^{(k)}\) (or the kernel formulation when \(p > n\)).
3. Stop when the gradient of the penalized objective is sufficiently close to zero, or when the maximum number of iterations is reached.
Below you can find two snippets of code from the logitr package. The first snippet solves the weighted least squares problem. Note that this portion uses the singular value decomposition (SVD), which will not be discussed in this report. The second snippet is the code for IRLS.
Weighted least squares code snippet:
Note that this is not the actual code; the real code is written in C++.
#calculate the coefficient estimates for weighted least squares
linearr = function(X, y, lam = 0, weights = NULL, intercept = TRUE,
    kernel = FALSE) {
    # (intercept handling is omitted in this simplified sketch; the real code
    # adds an unpenalized intercept column when intercept = TRUE)
    X. = as.matrix(X)
    y. = as.matrix(y)
    # absorb the observation weights by scaling X and y by sqrt(W)
    if (!is.null(weights)) {
        X. = sqrt(weights) * X.
        y. = sqrt(weights) * y.
    }
    n = nrow(X.)
    p = ncol(X.)
    # if p > n, linear kernel ridge regression
    if ((p > n) | (kernel == TRUE)) {
        # SVD
        svd = svd(X.)
        # adjust d vector for regularization and diagonalize
        d_adj = diag(1/(svd$d^2 + lam))
        # calculate beta estimates
        betas = t(X.) %*% svd$u %*% d_adj %*% t(svd$u) %*%
            y.
        rownames(betas) = NULL
    } else {
        # compute normal ridge regression
        # SVD
        svd = svd(X.)
        # adjust d vector for regularization and diagonalize
        d_adj = diag(svd$d/(svd$d^2 + lam))
        # calculate beta estimates
        betas = svd$v %*% d_adj %*% t(svd$u) %*% y.
        rownames(betas) = NULL
    }
    returns = list(coefficients = betas)
    return(returns)
}
IRLS code snippet:
Note that this is not the actual code; the real code is written in C++.
# calculates the coefficient estimates for logistic
# regression (IRLS)
IRLS = function(X, y, lam = 0, intercept = TRUE, tol = 10^(-5),
    maxit = 1e+05, vec) {
    X = as.matrix(X)
    y = as.matrix(y)
    # add an intercept column if requested
    if (intercept) {
        X = cbind(intercept = 1, X)
    }
    # initialize the coefficients, gradient, and iteration counter
    betas = matrix(0, nrow = ncol(X))
    grads = gradient_IRLS_logistic(betas, X, y, lam, vec)  # internal helper (not shown)
    iteration = 0
    # IRLS algorithm
    while ((iteration < maxit) & (max(abs(grads)) > tol)) {
        # update working data
        Xb = X %*% betas
        P = logitr(Xb)  # fitted probabilities p(x_i)
        weights = as.numeric(P * (1 - P))
        z = (y - P)/weights + Xb
        # calculate new betas via weighted (ridge) least squares
        betas = linearr(X = X, y = z, lam = lam, weights = weights,
            intercept = intercept, kernel = FALSE)$coefficients
        # calculate updated gradients
        grads = gradient_IRLS_logistic(betas, X, y, lam,
            vec)
        iteration = iteration + 1
    }
    returns = list(coefficients = betas, total.iterations = iteration, gradient = grads)
    return(returns)
}
The second algorithm we propose to compute \(\hat{\beta}\) is called majorize-minimization (MM). Unlike the IRLS algorithm, the MM algorithm can be used with the ridge penalty as well as the bridge penalty. The main idea is the following: at each iteration we construct a surrogate function \(Q(\beta | \beta^{(k)})\) that majorizes (lies above) the penalized objective and touches it at \(\beta^{(k)}\), and we minimize \(Q\) instead of the original objective.
Similar to IRLS, the MM algorithm will be looped for \(k = 0, 1, ..., K\) until convergence. Unlike IRLS, we will assume an intercept for this problem. We will now show how to find \(Q\) for penalized logistic regression. Let us state our objective:
\[ \hat{\beta} = \arg\min_{\beta} -l(\beta) + \lambda\left(\gamma\frac{1}{2}\sum_{j = 1}^{p}\beta_{j}^{2} +(1 - \gamma) \frac{1}{\alpha}\sum_{j = 1}^{p}\left|\beta_{j} \right|^{\alpha}\right) \]
where \(\alpha \in \left(1, 2 \right)\) and \(\gamma \in \left\{0, 1 \right\}\) acts as an indicator. If penalty = "ridge" then \(\gamma = 1\); if penalty = "bridge" then \(\gamma = 0\) and the appropriate terms will be set to zero.
Note the following fact: if for some function \(\phi\), \(\phi''(x) < 0\) for all \(x\) (that is, \(\phi\) is concave), then
\[ \phi(x) \leq \phi(x^{(k)}) + \phi'(x^{(k)})(x - x^{(k)}) \]
Let \(\phi(x) = x^{\alpha/2}\) for \(x > 0\) and \(\alpha \in (1, 2)\):
\[\frac{d}{dx}(x^{\alpha/2}) = \frac{\alpha}{2}x^{\alpha/2-1} > 0 \]
\[\frac{d^{2}}{dx^{2}}(x^{\alpha/2}) = \frac{\alpha}{2}\left(\frac{\alpha}{2} - 1\right)x^{\alpha/2-2} < 0 \]
This implies that the following inequality must hold:
\[ \left|\beta_{j}\right|^{\alpha} = (\beta_{j}^{2})^{\alpha/2} \leq (\beta_{j}^{(k)^{2}})^{\alpha/2} + \frac{\alpha}{2}(\beta_{j}^{(k)^{2}})^{\alpha/2 - 1}(\beta_{j}^{2} - \beta_{j}^{(k)^{2}}) \]
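A quick numeric check of this majorization in R (illustrative only; the values of alpha and \(\beta_{j}^{(k)}\) are arbitrary):

alpha = 1.5
bk = 0.8                     # current iterate beta_j^(k)
b = seq(-2, 2, by = 0.01)    # candidate values of beta_j
lhs = abs(b)^alpha
rhs = (bk^2)^(alpha/2) + (alpha/2) * (bk^2)^(alpha/2 - 1) * (b^2 - bk^2)
all(lhs <= rhs + 1e-12)      # the surrogate lies above |beta_j|^alpha everywhere
## [1] TRUE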
We now derive our \(Q\) function:
\[\begin{align} -l(\beta) &+ \lambda\left(\gamma\frac{1}{2}\sum_{j = 1}^{p}\beta_{j}^{2} +(1 - \gamma) \frac{1}{\alpha}\sum_{j = 1}^{p}\left|\beta_{j} \right|^{\alpha}\right) \\ &= -l(\beta^{(k)}) + \frac{1}{2}(\beta - \beta^{(k)})^{T}(\nabla^{2} -l(\beta^{(k)}))(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) \\ &+ \lambda\left(\gamma\frac{1}{2}\sum_{j = 1}^{p}\beta_{j}^{2} +(1 - \gamma) \frac{1}{\alpha}\sum_{j = 1}^{p}\left|\beta_{j} \right|^{\alpha}\right) \\ &\leq \frac{1}{2}(\beta - \beta^{(k)})^{T}(\nabla^{2} -l(\beta^{(k)}))(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) \\ &+ \lambda\left(\gamma\frac{1}{2}\sum_{j = 1}^{p}\beta_{j}^{2} +(1 - \gamma) \frac{1}{\alpha}\sum_{j = 1}^{p}\left[(\beta_{j}^{(k)^{2}})^{\alpha/2} + \frac{\alpha}{2}(\beta_{j}^{(k)^{2}})^{\alpha/2 - 1}(\beta_{j}^{2} - \beta_{j}^{(k)^{2}}) \right]\right) + const. \\ &\leq \frac{1}{2}(\beta - \beta^{(k)})^{T}(\nabla^{2} -l(\beta^{(k)}))(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) \\ &+ \lambda\left(\frac{\gamma}{2}(v \circ \beta)^{T}\beta +(1 - \gamma) \frac{1}{2}\sum_{j = 1}^{p}(\beta_{j}^{(k)^{2}})^{\alpha/2 - 1}\beta_{j}^{2} \right) + const. \\ &\leq \frac{1}{2}(\beta - \beta^{(k)})^{T}(\nabla^{2} -l(\beta^{(k)}))(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) \\ &+ \lambda\left(\frac{\gamma}{2}(v \circ \beta)^{T}\beta +(1 - \gamma) \frac{1}{2}\beta^{T}diag(d^{(k)})\beta \right) + const. \\ & \nonumber \\ &\mbox{ where $v \circ \beta$ is the element-wise (Hadamard) product of $\beta$ and the $(p + 1) \times 1$ vector $v$ whose entries} \nonumber \\ &\mbox{ are all equal to one except the first, which is zero ($v = (0, 1, 1, 1,...)^{T}$); $d^{(k)}$ is the column vector with} \nonumber \\ &\mbox{ $d_{1}^{(k)} = 0$ and $d_{j}^{(k)} = (\beta_{j}^{(k)^{2}})^{\alpha/2 -1}$ for $j = 2, ..., p + 1$; and $\delta = 10^{-5}$ (some small number)} \nonumber \\ & \nonumber \\ &= \frac{1}{2}(\beta - \beta^{(k)})^{T}(\nabla^{2} -l(\beta^{(k)}))(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) \\ &+ \frac{\lambda}{2}\left(\beta^{T} diag\left[\gamma(v - d^{(k)}) + d^{(k)} \right]\beta \right) + const. \\ &\leq \frac{1}{2}(\beta - \beta^{(k)})^{T}X^{T}(\frac{1}{4} + \delta)X(\beta - \beta^{(k)}) + (\nabla - l(\beta^{(k)}))^{T}(\beta - \beta^{(k)}) \\ &+ \frac{\lambda}{2}\left(\beta^{T} diag\left[\gamma(v - d^{(k)}) + d^{(k)} \right]\beta \right) + const. \\ &=: Q(\beta | \beta^{(k)}) \end{align}\]
That was a lot of effort to find our \(Q\) function, but we eventually see that \(Q\) is quadratic in form. Therefore, we can find a closed-form solution for each update. Plugging in the gradient and Hessian of our log-likelihood function, dropping constants, and setting the gradient of \(Q\) equal to zero, we obtain:
\[\beta^{(k + 1)} = \left(X^{T}\left(\frac{1}{4} + \delta\right)X +\lambda\, diag \left[\gamma(v - d^{(k)}) + d^{(k)}\right] \right)^{-1}\left(X^{T}(y - p^{(k)}) + X^{T}\left(\frac{1}{4} + \delta\right)X\beta^{(k)} \right) \]
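As with IRLS, each MM update is therefore a single linear solve. Below is a minimal sketch of one update (reusing plogit() from earlier; the function name mm_step and its defaults are our own illustrative choices, not the package internals):

# one MM update for penalized logistic regression
mm_step = function(beta, X, y, lam, alpha = 1.5, gamma = 1, delta = 1e-05) {
    p_k = plogit(as.numeric(X %*% beta))    # p^(k)
    v = c(0, rep(1, length(beta) - 1))      # the intercept is not penalized
    d = c(0, (beta[-1]^2)^(alpha/2 - 1))    # d^(k) (assumes current entries are nonzero)
    H = (1/4 + delta) * t(X) %*% X          # majorizer of the Hessian of -l
    A = H + lam * diag(gamma * (v - d) + d)
    b = t(X) %*% (y - p_k) + H %*% beta
    solve(A, b)
}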
Finally, putting everything together, we have the MM algorithm for penalized logistic regression:
Algorithm:
1. Initialize \(\beta^{(0)}\).
2. For \(k = 0, 1, ..., K\): compute \(p^{(k)}\) and \(d^{(k)}\), then update \(\beta^{(k + 1)}\) using the closed-form expression above.
3. Stop when the gradient of the penalized objective is sufficiently close to zero, or when the maximum number of iterations is reached.
Below you can find a snippet of code for the MM function.
MM code snippet:
Note that this is not the actual code; the real code is written in C++.
# calculates the coefficient estimates for logistic
# regression (MM)
MM = function(X, y, lam = 0, alpha = 1.5, gamma = 1, intercept = TRUE,
    tol = 10^(-5), maxit = 1e+05, vec = NULL) {
    X = as.matrix(X)
    y = as.matrix(y)
    # add an intercept column if requested
    if (intercept) {
        X = cbind(intercept = 1, X)
    }
    # vec is the indicator vector v = (0, 1, 1, ...)^T (the intercept is unpenalized)
    if (is.null(vec)) {
        vec = c(0, rep(1, ncol(X) - 1))
    }
    # initialize the coefficients (small nonzero values so that d is finite),
    # the majorized Hessian Z = X^T (1/4 + delta) X, the gradient, and the counter
    betas = matrix(10^(-5), nrow = ncol(X))
    Z = (1/4 + 10^(-5)) * t(X) %*% X
    grads = gradient_MM_logistic(betas, X, y, lam, alpha, gamma, vec)  # internal helper (not shown)
    iteration = 0
    # MM algorithm
    while ((iteration < maxit) & (max(abs(grads)) > tol)) {
        # update d vector
        d = as.numeric((betas^2)^(alpha/2 - 1))
        d[1] = 0
        # qrsolve
        betas = qr.solve(Z + lam * diag(gamma * (vec - d) +
            d), t(X) %*% (y - logitr(X %*% betas)) + Z %*%
            betas)
        rownames(betas) = NULL
        # calculate updated gradients
        grads = gradient_MM_logistic(betas, X, y, lam, alpha,
            gamma, vec)
        iteration = iteration + 1
    }
    returns = list(coefficients = betas, total.iterations = iteration,
        gradient = grads)
    return(returns)
}
Let us walk through some functions in the logitr package. We will use the iris data set for illustration.
library(logitr)
library(microbenchmark)
#we will use the iris data set
X = dplyr::select(iris, -c(Species, Sepal.Length))
y = dplyr::select(iris, Sepal.Length)
y_class = ifelse(dplyr::select(iris, Species) == "setosa", 1, 0)
#ridge regression (specify lam = 0.1)
linearr(X, y, lam = 0.1, penalty = "ridge")
##
## Call: linearr(X = X, y = y, lam = 0.1, penalty = "ridge")
##
## Tuning parameters:
## lam alpha
## [1,] 0.1 NaN
##
## MSE:
## [1] 0.09631325
##
## Coefficients:
## [,1]
## intercept 1.87785
## 0.64624
## 0.70231
## -0.54160
#ridge regression (use CV for optimal lam; options default to seq(0, 2, 0.1))
linearr(X, y, penalty = "ridge")
##
## Call: linearr(X = X, y = y, penalty = "ridge")
##
## Tuning parameters:
## lam alpha
## [1,] 0.2 NaN
##
## MSE:
## [1] 0.09634343
##
## Coefficients:
## [,1]
## intercept 1.89913
## 0.64175
## 0.69573
## -0.52729
#ridge regression (use CV for optimal lam; specify seq(0, 2, 0.01))
linearr(X, y, lam = seq(0, 2, 0.01), penalty = "ridge")
##
## Call: linearr(X = X, y = y, lam = seq(0, 2, 0.01), penalty = "ridge")
##
## Tuning parameters:
## lam alpha
## [1,] 0.2 NaN
##
## MSE:
## [1] 0.09634343
##
## Coefficients:
## [,1]
## intercept 1.89913
## 0.64175
## 0.69573
## -0.52729
#ridge logistic regression (IRLS)
logisticr(X, y_class, penalty = "ridge")
##
## Call: logisticr(X = X, y = y_class, penalty = "ridge")
##
## Iterations:
## [1] 18
##
## Tuning parameters:
## lam alpha
## [1,] 0 NaN
##
## MSE:
## [1] 8.0815e-14
##
## logloss:
## [1] 6.929591e-06
##
## misclassification:
## [1] 0
##
## Coefficients:
## [,1]
## intercept 7.88535
## 8.89485
## -7.54863
## -18.60425
#ridge logistic regression (MM) (specify lam = 0.1)
logisticr(X, y_class, lam = 0.1, penalty = "ridge", method = "MM")
##
## Call: logisticr(X = X, y = y_class, lam = 0.1, penalty = "ridge", method = "MM")
##
## Iterations:
## [1] 5459
##
## Tuning parameters:
## lam alpha
## [1,] 0.1 NaN
##
## MSE:
## [1] 5.134801e-05
##
## logloss:
## [1] 0.3956563
##
## misclassification:
## [1] 0
##
## Coefficients:
## [,1]
## intercept 6.27623
## 1.54082
## -3.64177
## -1.63050
#bridge logistic regression (MM)
fit = logisticr(X, y_class, lam = 0.1, alpha = 1.2, penalty = "bridge")
## [1] "using MM algorithm..."
fit
##
## Call: logisticr(X = X, y = y_class, lam = 0.1, alpha = 1.2, penalty = "bridge")
##
## Iterations:
## [1] 26021
##
## Tuning parameters:
## lam alpha
## [1,] 0.1 1.2
##
## MSE:
## [1] 2.749702e-05
##
## logloss:
## [1] 0.1878654
##
## misclassification:
## [1] 0
##
## Coefficients:
## [,1]
## intercept 13.08031
## 0.61701
## -5.71684
## -0.23004
## $fitted.values
## [,1]
## [1,] 0.9992467
## [2,] 0.9989747
## [3,] 0.9994881
##
## $class
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
##
## $MSE
## [1] 6.269174e-07
##
## $log.loss
## [1] 0.002291457
##
## $misclassification
## [1] 0
Let us now consider a real-life data set. We will use the High Time Resolution Universe Survey (HTRU2) data set. The goal of this study was to classify pulsar (a type of star) candidates. From the website: “Pulsars are a rare type of Neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter” (1). This data set has 17,898 rows and 9 columns – what each column means is not of interest to us (we are only interested in computation time).
Link: HTRU2 Data
In the following code, we see that it takes well under a second (roughly 90 milliseconds on average) to compute our estimates using the IRLS method and only about 3 seconds to compute them using the MM algorithm.
dim(data)
## [1] 17898 9
head(data)
## V1 V2 V3 V4 V5 V6 V7
## 1 140.56250 55.68378 -0.23457141 -0.6996484 3.199833 19.11043 7.975532
## 2 102.50781 58.88243 0.46531815 -0.5150879 1.677258 14.86015 10.576487
## 3 103.01562 39.34165 0.32332837 1.0511644 3.121237 21.74467 7.735822
## 4 136.75000 57.17845 -0.06841464 -0.6362384 3.642977 20.95928 6.896499
## 5 88.72656 40.67223 0.60086608 1.1234917 1.178930 11.46872 14.269573
## 6 93.57031 46.69811 0.53190485 0.4167211 1.636288 14.54507 10.621748
## V8 V9
## 1 74.24222 0
## 2 127.39358 0
## 3 63.17191 0
## 4 53.59366 0
## 5 252.56731 0
## 6 131.39400 0
X = dplyr::select(data, -V9)
y = dplyr::select(data, V9)
#time IRLS (specify lam = 0.1)
microbenchmark(logisticr(X, y, lam = 0.1, penalty = "ridge"))
## Unit: milliseconds
## expr min lq mean
## logisticr(X, y, lam = 0.1, penalty = "ridge") 86.82767 88.84533 92.008
## median uq max neval
## 89.92341 92.06625 129.2564 100
#time MM (specify lam = 0.1)
microbenchmark(logisticr(X, y, lam = 0.1, penalty = "ridge", method = "MM"), times = 5)
## Unit: seconds
## expr min
## logisticr(X, y, lam = 0.1, penalty = "ridge", method = "MM") 2.861886
## lq mean median uq max neval
## 2.875933 3.012516 3.028524 3.097979 3.198258 5
#time IRLS (CV for optimal lam)
microbenchmark(logisticr(X, y, penalty = "ridge"), times = 5)
## Unit: seconds
## expr min lq mean median
## logisticr(X, y, penalty = "ridge") 7.615897 7.810411 7.860075 7.812191
## uq max neval
## 7.925333 8.136543 5
#ridge logistic regression (IRLS; lam = 0.1)
logisticr(X, y, lam = 0.1, penalty = "ridge")
##
## Call: logisticr(X = X, y = y, lam = 0.1, penalty = "ridge")
##
## Iterations:
## [1] 9
##
## Tuning parameters:
## lam alpha
## [1,] 0.1 NaN
##
## MSE:
## [1] 0.01715031
##
## logloss:
## [1] 1307.936
##
## misclassification:
## [1] 0
##
## Coefficients:
## [,1]
## intercept -8.91880
## 0.02937
## -0.03488
## 6.51787
## -0.60957
## -0.02853
## 0.05315
## 0.04646
## -0.00470
## $fitted.values
## [,1]
## [1,] 0.001018225
## [2,] 0.018336697
## [3,] 0.009291594
##
## $class
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
##
## $MSE
## [1] 0.0001412017
##
## $log.loss
## [1] 0.02886067
##
## $misclassification
## [1] 0
R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society 459 (1), 1104-1123, DOI: 10.1093/mnras/stw656
R. J. Lyon, HTRU2, DOI: 10.6084/m9.figshare.3080389.v1.