require(rms)
# Fetch the old lrm.fit (function olrm)
require(orms)
# Define simple timer functions with and without printing
<- function(...) {
stime <- system.time(...)['elapsed']
ti cat('Elapsed time: ', ti, 's\n', sep='')
invisible(ti)
}
<- function(...) system.time(...)['elapsed']
stim
# Define a function that will time a series of lines of code, repeating each
# reps times (default is 10). When only one line of code is given, the
# elapsed execution time is printed and the result of the line is returned.
# Otherwise, a vector of run times corresponding to all the lines of code is
# returned, and the values returned from the lines are stored in the global
# environment in object Res, with components named according to inputs to tim.
# When reps > 1 the values from running code lines are from the last rep.
<- function(...) {
tim <- 10
reps <- sys.call()
w <- names(w)
n <- length(n)
m <- 2 : m
k if('reps' %in% n) {
<- which(n == 'reps')
i <- eval(w[[i]])
reps <- setdiff(k, i)
k
}<- n[k]
n <- length(n)
m <- numeric(m)
r <<- vector('list', m) # put in global environment
Res names(r) <- n
names(Res) <<- n
<- 0
l for(i in k) {
<- l + 1
l <- stim(for(j in 1 : reps)
s <- eval(w[[i]], parent.frame())) / reps
R <- s
r[l] <<- R
Res[[l]]
}label(r) <- paste('Per-run execution time in seconds, averaged over', reps, 'runs')
if(m == 1) {
print(r)
return(R)
}
r
}
<- function(x) max(abs(x))
m <- function(a, b) c(mad = mean(abs(a - b)),
mad relmad = 2 * mean(abs(a - b) / (abs(a) + abs(b))))
<- function(r) exp(max(abs(log(r)))) # worst ratio, whether < or > 1.0
wratio
# Function creating a table of matrix dimensions for matrices in a list
<- function(w) {
mdim <- sapply(w, is.matrix)
i <- function(x) c(rows=nrow(x), columns=ncol(x))
g sapply(w[i], g)
}
# Function to summarize a series of model fits stored in Res
<- function() {
smod <- sapply(Res, function(x) if(length(x$u)) m(x$u) else NA)
max_abs_u <- sapply(Res, function(x) if(length(x$iter)) tail(x$iter, 1) else NA)
iter <- sapply(Res, function(x) tail(x$deviance, 1))
deviance print(data.frame(deviance, max_abs_u, iter))
<- length(Res)
l <- names(Res)
n if(l > 1) {
for(i in 1 : l) {
<- Res[[i]]
r # See if polr or BFGS
<- inherits(r, 'polr') || (length(r$opt_method) && r$opt_method=='BFGS')
a $var <- if(inherits(r, 'orm'))
rvcov(r, intercepts='all') else if(! a) vcov(r)
if(inherits(r, 'polr'))
$coefficients <- c(-r$zeta, coef(r))
r<- r
Res[[i]]
}
cat('\nMaximum |difference in coefficients|,',
'Maximum |relative difference|\n',
'worst ratio of covariance matrices\n\n')
<- NULL
d for(i in 1 : (l-1))
for(j in (i+1) : l) {
<- Res[[i]]; rj <- Res[[j]]
ri <- paste(n[i], 'vs.', n[j])
comp <- mad(ri$coefficients, rj$coefficients)
co <- data.frame(Comparison = comp,
w `Max |difference|` = co[1],
`Max |rel diff|` = co[2],
`Cov ratio` = NA, check.names=FALSE)
if(length(ri$var) && length(rj$var))
$`Cov ratio` <- wratio(ri$var / rj$var)
w<- rbind(d, w)
d
}
}$`Cov ratio` <- ifelse(is.na(d$`Cov ratio`), '', format(d$`Cov ratio`))
drownames(d) <- NULL
d }
Statistical Computing Approaches to Maximum Likelihood Estimation
rms
package logistic regression function lrm
I explored several of these issues. Comparisons of execution time in R vs. Fortran are given. Different coding styles in both R and Fortran are also explored. Hopefully some of these explorations will help others who may not have studied MLE optimization and related statistical computing algorithms.
Overview
Maximum likelihood estimation (MLE) is a gold standard estimation procedure in non-Bayesian statistics, and the likelihood function is central to Bayesian statistics (even though it is not maximized in the Bayesian paradigm). MLE may be unpenalized (the standard approach) or various penalty functions such as L1 (lasso, absolute value penalty), and L2 (ridge regression; quadratic) penalties may be added to the log-likelihood to achieve shrinkage (aka regularization). I have been doing MLE my entire career, using mainly a Newton-Raphson algorithm with step-halving to achieve rapid convergence. I never stopped to think about other optimization algorithms, and the R language has a good many excellent algorithms included in the base stats
package. There is also an excellent maxLik
R package devoted to MLE (not used here), and two of its vignettes provide excellent introductions to MLE. General MLE background and methods including more about penalization may be found here, which includes details about how to do QR factorization and to back-transform after the fit.
In this article I’ll explore several optimization strategies and variations of them, in the context of binary and ordinal logistic regression. The programming scheme that is used here is the one used by many R packages: Write functions to compute the scaler objective function (here, -2 log likelihood), the gradient vector (vector of first derivatives), and the hessian matrix (matrix of second derivatives), and pass those functions to general optimization functions. Convergence (at least to local minima for -2 LL (log of the likelihood function)) is achieved for smooth likelihood functions when the gradient vector values are all within a small tolerance (say \(10^{-7}\)) of zero or when the -2 LL objective function completely settles down in, say, the \(8^\text{th}\) significant digit. The gradient vector is all zeros if the parameter values are exactly the MLEs (local minima achieved on -2 LL).
This article is based on rms
version 7.0-0, a major new release of the package, which will likely be available on CRAN around 2025-01-08.
History
The R rms
package lrm
function is dedicated to maximum likelihood estimation (MLE) for fitting binary and ordinal (proportional odds) logistic regression models using the logit link, with or without quadratic (ridge) penalization. Semiparamteric regression models, also called ordinal regression models, allow one to do efficient analyses without depending on how the dependent variable Y is transformed. Ordinal models encode the entire cumulative distribution function of Y by having an intercept for each distinct Y level less one. For ordinal models, versions of lrm
before rms
6.9-0 were efficient for up to 400 distinct Y-values (399 intercepts) in the sense that execution time was under 10 seconds for 10,000 observations on 10 predictors. The rms
orm
function is intended for modeling continuous outcome Y variables and was efficient for up to 8000 intercepts prior to rms
7.0-0. orm
implements 4 link functions other than the logit. For rms
7.0-0, lrm
and orm
run in 2.5s for a sample size of 300,000 with continuous Y and 20 predictors, i.e., with 299,999 intercepts (9.5s for 40 predictors). lrm
uses the R function lrm.fit
for its heavy lifting, and likewise orm
uses orm.fit
. Much of lrm.fit
was written in 1980 and served as the computational engine for the first SAS procedure for logistic regression, PROC LOGIST
. It used Fortran 77 for computationally intensive work, and used only a Newton-Raphson algorithm with step-halving for iterative MLE. On rare occasions when serious collinearities were present, such as when multiple continuous variables were fitted using restricted cubic splines (which use a truncated power basis), lrm.fit
would fail to converge.
Are Intercepts Regular Parameters?
Let \(F\) be a smooth cumulative distribution function. The cumulative probability class of ordinal semiparametric models can be written as follows. It is more traditional to state the model in terms of \(P(Y \leq y | X)\) but we use \(P(Y \geq y | X)\) so that higher predicted values are associated with higher \(Y\), and when \(F\) is the logistic distribution \(\frac{1}{1 + \exp(-x)}\) the ordinal logistic (proportional odds) model reduces exactly to a binary logistic model. Let the ordered distinct values of \(Y\) be denoted by \(y_{1}, y_{2}, \dots, y_{k}\) and let the \(k\) intercepts associated with \(y_{1}, \dots, y_{k}\) be \(\alpha_{1}, \alpha_{2}, \dots, \alpha_{k}\), where \(\alpha_{1} = \infty\) because \(P(Y \geq y_{1}) = 1\). Let \(\alpha_{y} = \alpha_{i}, i:y_{i}=y\). Then the cumulative probability semiparametric model is
\[ P(Y \geq y | X) = F(\alpha_{y} + X\beta) \]
When \(F\) is the extreme value type I (also called the Gumbel minimum value) distribution \(F(x) = \exp(-\exp(-x))\), the inverse function is the \(\log-\log\) link and the model is a proportional hazards (PH) model. The Cox 1972 PH model uses the only generating distribution \(F(x)\) such that the marginal (getting rid of the \(\alpha\)s) distribution of the ranks of \(Y\) can be evaluated without multi-dimensional integrals. This gives rise to Cox’s partial likelihood, which until time-dependent covariates are included can be computationally fast for any number of distinct failure times. The Cox approach requires a second step to estimate the intercepts (underlying survival curve for a person with some reference value of \(X\beta\)). There is some arbitrariness to which second-step estimator is used, e.g., the Breslow estimator or the Kalbfleisch-Prentice estimator. And variances of \(\hat{P}(Y \geq y | X)\) are complicated because uncertainties from both of the steps must be included.
In fact, a partial likelihood, which also makes it difficult to handle interval censoring, is only needed until you realize that
- the maximum likelihood estimate (MLE) of the vector of intercepts \(\alpha\) when \(\beta=0\) are just the link function of all the one minus cumulative probabilities in the absence of censoring; this gives rise to instantly-computed and convergence-accelerating initial values for iterative MLE estimation
- the \(\alpha\)s are always in descending order (ascending if using the more popular statement of ordinal models)
- the gradient (first derivative) for the log likelihood function can be computed quickly no matter how large is \(k\)
- the hessian (second derivatives of the log likelihood function) can be computed in just over twice the time needed to compute the gradient, and takes twice the amount of array storage size as the gradient, so MLE scales wonderfully for large \(k\)
- for most needs, the entire information matrix (negative hessian) never needs to be inverted; portions of the inverse of the whole can be quickly computed without inverting the whole information matrix
- the R
Matrix
package is made for efficient storage and calculation on such sparse hessians
Because of the strict ordering of \(\alpha\), MLE iterations are fast and the effective degrees of freedom of the model are more like \(4 + p\) where \(p\) is the number of \(\beta\)s. The \(4\) comes from the following line of reasoning. In the no-covariate case, consider confidence bands for the empirical cumulative distribution function (ECDF) for \(Y\) , then fit a 4-parameter parametric distribution to the raw \(Y\) values. Confidence intervals for \(F(y)\) from this parametric fit will be about the same widths as those from the ECDF.
Re-Write of lrm.fit
and orm.fit
To modernize the Fortran code, better condition the design matrix \(X_{n\times p}\) (\(n\) observations and \(p\) columns), and to explore a variety of optimization algorithms, I did a complete re-write of the rms
package lrm
and lrm.fit
functions in November, 2024, for rms
version 6.9-0. To reduce optimization divergence when there are extreme collinearities, and to better scale \(X\), I was interested in implementing mean centering followed by a QR factorization of \(X\) to orthogonalize its columns. Details about how this is done and how the parameter estimates and covariance matrix are converted back to the original \(X\) space may be found here. QR can be turned on by setting the lrm.fit
argument transx
to TRUE
. When QR is in play, the rotated columns of \(X\) are scaled to have standard deviation 1.0.
Besides dealing with fundamental statistical computing issues, I changed lrm.fit
to use Therneau’s survival
package concordancefit
function to compute concordance probabilities used by various rank correlation indexes such as Somers’ \(D_{xy}\). This got rid of a good deal of code. Previously, lrm.fit
binned predicted probabilities into 501 bins to calculate rank measures almost instantly. But I decided it was time to use exact calculations now that concordancefit
is so fast. Though not used in rms
, concordancefit
also computes accurate standard errors.
A new Fortran 2018 subroutine lrmll
was written to efficiently calculate the -2 log-likelihood function (the deviance), the gradient vector, and the hessian matrix of all second partial derivatives of the log-likelihood with respect to intercept(s) \(\alpha\) and regression coefficients (slopes) \(\beta\).
lrm.fit
now implements several optimization algorithms. When Y is binary and there is no penalization, it has the option to use glm.fit(..., family=binomial())
which runs iteratively reweighted least squares, a fast-converging algorithm for binary logistic regression (but does not extend to ordinal regression). lrm.fit
implements 8 optimization algorithms.
- optional for proportional odds models (according to
lrm.fit
initglm
argument), does a first pass withglm.fit
that fits a binary logistic regression for the probability that Y is greater than or equal to the median Y; this fit can then be used for starting values after shifting the default starting intercept values so that the middle intercept matches the intercept from the binary fit nlminb
: a function in thestats
package, tied withNR
as the fastest algorithm in general; uses hessians. This uses Fortran routines in the Bell Labs port library for which the original paper may be found here.NR
: Newton-Raphson iteration with step-halving, implemented here as R functionnewtonr
. This algorithm is the default because it has the advantage of having full control over convergence criteria. It requires convergence with respect to 3 simultaneous criteria: changes in -2 LL, changes in parameter estimates, and nearness of gradient to zero. The user can relax any of the 3 criteria thresholds to relax conditions. This is the optimization method used in the oldlrm.fit
function, but written in Fortran there for slightly increased speed. Defaults for tolerance parameters are such thateps
(iteration-to-iteration change in -2 LL) will usually dictate when convergence called.LM
: Levenberg-Marquart algorithm, which is a kind of Newton method with generalized step-halvingglm.fit
: for binary Y without penalization onlynlm
: thestats
function that is usually recommended for maximum likelihood, but I found it is slower thannlminb
without offering other advantagesBFGS
andL-BFGS-B
using thestats
optim
function: fast general-purpose algorithms that do not require the hessian, so these can be used with an unlimited number of intercepts as long as the user sets thelrm.fit
parametercompvar
toFALSE
so that the hessian is not calculated once after convergenceCG
andNelder-Mead
: seeoptim
MASS
package polr
functionThe last four methods do not involve computing the hessian, which is the most computationally intensive calculation in MLE. But they are still slower overall if you want to get absolute convergence, due to requiring many more evaluations of the object function (-2 LL).
See this for useful comparisons of some of the algorithms.
For rms
7.0-0, lrm.fit
and orm.fit
were changed to use the R Matrix
package for much more efficient handling of sparse Hession/information matrices. Now there are no limitations on the number of distinct Y-values analyzed by lrm
and orm
. The primary differences between the two modeling procedures are
lrm
only implements a single link function (logistic)lrm
implements multiple optimization methodsorm
only implements Newton-Raphson optimization (with step-halving)lrm
output (fromprint.lrm
) includes rank correlation model performance indexes that are more suitable for discrete Yorm
output (fromprint.orm
) includes only Spearman’s \(\rho\) as a rank predictive discrimination measure; this is more suitable for continuous Yorm
has aQuantile
and anExProb
method; bothlrm
andorm
haveMean
methods, as means work on discrete numeric Y (unlike quantiles)lrm
implementstransx
forQR
orthonormalization of the design matrix \(X\)orm
implementsscale
for mean-centering and standard deviation scaling of \(X\)
Background: Convergence
The two most commonly used convergence criteria for MLE are
- relative convergence: stop iterations when the change in deviance is small or when the relative change in parameter estimates is small
- absolute convergence: stop iterations when the first derivative of the log likelihood is small in absolute value for all parameters, or there is a very small absolute change in parameter values
Absolute convergence with respect to the first derivatives (gradients) is similar to demanding that none of the regression parameters change very much since the last iteration. From the standpoint of what is important statistically, convergence of the deviance (what the gold standard likelihood ratio test is based on) is sufficient with respect to what matters. Changing parameter values when the deviance does not change in the \(6^\text{th}\) decimal place will be buried in the noise. Absolute convergence may affect \(\hat{\beta}\), but relative convergence tends to result in a very stable \(\frac{\hat{\beta}}{\text{s.e.}}\). You might deem convergence satisfactory when parameter estimates between successive iterations change by less than 0.05 standard errors (which would require evaluation of the hessian to know), but approximately this corresponds to relative convergence judged by the deviance.
Despite achieving statistically relevant convergence more easily, there is a real issue of reproducibility. Different algorithms and software may result in semi-meaningful differences in \(\hat{\beta}\) taken in isolation (ignoring how much of \(\hat{\beta}\) is noise). Only by having all the implementations achieve absolute convergence will different analysts be very likely to reproduce each others’ work. If this is important to you, either avoid the BFGS algorithms (for which the R optim
function does not have an absolute convergence criterion) or use a highly stringent relative convergence criterion, e.g., specify the lrm.fit
argument reltol
as \(10^{-11}\). Below I explore how convergence and execution time are affected by reltol
.
Overview of Findings
opt_method='NR'
and 'nlminb'
are the fastest methods, even slightly faster than glm.fit
.
Limited tests of transx
in lrm.fit
to use QR factorization does not show not much benefit, but see below for details.
For ordinal Y, using opt_method='BFGS'
with compstats=FALSE
for lrm
mimics the polr
function in the MASS
package. For a large number of intercepts, lrm.fit
is much faster due to computing the deviance and derivatives in highly efficient compiled Fortran and capitalizing on sparse matrices.
Setting initglm=TRUE
tells lrm.fit
to get initial ordinal model parameter values from a binary logistic glm.fit
run when cutting Y at the median. This does not seem to offer much benefit over setting starting values to covariate-less MLEs of \(\alpha\) (which are calculated instantly when \(\beta=0\)) and setting \(\beta=0\). In one example using nlminb
the algorithm actually diverged with initglm=TRUE
but ran fine without it. This is probably due to large intercept values with many distinct Y values.
Extensive tests of opt_method='BFGS'
show good stochastic performance (convergence to what matters, deviance-wise), but unimpressive execution time if you want absolute convergence by setting reltol
to a small number.
The best overall algorithm that uses the hessian is NR
in terms of speed and convergence, with nlminb
and LM
close seconds. NR
is the method used in the old lrm.fit
, so for most datasets, the new optimization options are not needed.
Even though lrm.fit
is optimized for the logistic link function, there is not much difference in execution time between lrm.fit
and orm.fit
for binary and ordinal logistic models.
Validation
Two kinds of validations appear below.
- Validation of the Fortran-calculated deviance and derivatives: for a given small dataset, get parameter estimates from
rms::orm
with tight convergence criteria, and test the Fortran code by evaluating the deviance and derivatives at these parameter values. The gradient (first derivative) should be very close to zero, and the deviance should be identical to that fromorm
. The inverse of the negative of the hessian matrix should equal the variance-covariance matrix computed byorm
. - Validate
lrm.fit
overall by letting it pick its usual starting values for iteration, and compare its output to that fromorm
and other fitting functions including the last pre-6.9-0 version oflrm.fit
which is namedolrm
below.
Check -2 Log Likelihood and Derivatives for a Simple Model
First define an R function that makes it easy to run the Fortran subroutine.
<- function(x, y, alpha, beta, what=2L, debug=0L, penhess=1L,
rfort offset=rep(0., n), wt=rep(1., n),
penmat=matrix(0., p, p)) {
<- as.matrix(x)
x <- ncol(x)
p <- nrow(x)
n <- sort(unique(y))
yd <- max(yd)
k <- as.integer(k + p)
nv if(length(yd) != k + 1 || any(yd != 0 : k))
stop('y must be coded 0-k for lrmll')
storage.mode(x) <- storage.mode(offset) <- storage.mode(wt) <- storage.mode(penmat) <-
storage.mode(alpha) <- storage.mode(beta) <- 'double'
<- .Fortran('lrmll', n, as.integer(k), p, x, as.integer(y), offset, wt, penmat, alpha, beta,
w logL=numeric(1), grad=numeric(nv),
a=matrix(0e0, k, 2), b=matrix(0e0, p, p), ab=matrix(0e0, k, p),
what=as.integer(what), debug=as.integer(debug),
penhess=as.integer(penhess), salloc=integer(1))
# lrmll creates 3 compact hessian submatrices
# Put them together into a single hessian
$hess <- infoMxop(w[c('a', 'b', 'ab')])
w
w }
Binary Y
<- 1 : 10
x <- c(0, 1, 0, 0, 0, 1, 0, 1, 1, 1)
y
# From orm. Deviance = 13.86294, 10.86673
<- -2.4412879506377 ; beta <- 0.4438705364796
alpha <- rfort(x, y, alpha, beta)
w $logL w
[1] 10.86673
$grad w
[1] -1.814104e-13 -1.156408e-12
<- rfort(x, y, alpha, beta, what=3L)
w $hess # negative inverse of covariance matrix w
[,1] [,2]
[1,] -1.813852 -9.976185
[2,] -9.976185 -66.123262
- solve(vcov(glm(y ~x, family=binomial(), control=list(epsilon=1e-12))))
(Intercept) x
(Intercept) -1.813852 -9.976185
x -9.976185 -66.123262
<- list(glm = glm(y ~ x, family=binomial(), control=list(epsilon=1e-12)),
Res olrm = olrm(x, y, eps=1e-10),
lrm.fit = lrm.fit(x, y, reltol=1e-12),
orm = orm(y ~ x, eps=1e-10) )
smod()
deviance max_abs_u iter
glm 10.86673 NA 4
olrm 10.86673 2.220446e-15 NA
lrm.fit 10.86673 3.270271e-08 5
orm 10.86673 3.552714e-15 5
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 glm vs. olrm 5.551115e-16 4.320309e-16 1
2 glm vs. lrm.fit 5.551115e-16 4.320309e-16 1
3 glm vs. orm 0.000000e+00 0.000000e+00 1
4 olrm vs. lrm.fit 0.000000e+00 0.000000e+00 1
5 olrm vs. orm 5.551115e-16 4.320309e-16 1
6 lrm.fit vs. orm 5.551115e-16 4.320309e-16 1
Cov ratio
in the above output is the anti-log of the absolute value of the log of the ratio of elements of two covariance matrices. So it represents the worst disagreement in the two matrices, with 1.0 being perfect agreement to 7 decimal places. Max |difference|
is the highest absolute difference in estimated regression coefficients between two methods, and Max |rel diff|
is the maximum ratio of absolute differences to the sum of absolute values of two coefficient estimates.Y=0, 1, 2
<- 1 : 10
x <- c(0, 2, 0, 1, 0, 2, 2, 1, 1, 2)
y <- orm(y ~ x, eps=1e-10) # deviance 21.77800 19.79933
f - solve(vcov(f, intercepts='all'))
y>=1 y>=2 x
y>=1 -2.336337 1.148893 -5.103426
y>=2 1.148893 -2.657167 -10.455125
x -5.103426 -10.455125 -110.128337
- solve(vcov(lrm(y ~ x, eps=1e-10)))
y>=1 y>=2 x
y>=1 -2.336337 1.148893 -5.103426
y>=2 1.148893 -2.657167 -10.455125
x -5.103426 -10.455125 -110.128337
- solve(olrm(x, y, eps=1e-10)$var)
y>=1 y>=2 x[1]
y>=1 -2.336337 1.148893 -5.103426
y>=2 1.148893 -2.657167 -10.455125
x[1] -5.103426 -10.455125 -110.128337
- solve(vcov(MASS::polr(factor(y) ~ x)))
x 0|1 1|2
x -110.12802 5.103400 10.455131
0|1 5.10340 -2.336318 1.148877
1|2 10.45513 1.148877 -2.657156
# Note a problem with VGAM
- solve(vcov(VGAM::vglm(y ~ x, VGAM::cumulative(reverse=TRUE, parallel=TRUE))))
(Intercept):1 (Intercept):2 x
(Intercept):1 -2.434082 1.155134 -5.148614
(Intercept):2 1.155134 -2.580855 -9.505256
x -5.148614 -9.505256 -100.063658
- solve(vcov(ordinal::clm(factor(y) ~ x)))
0|1 1|2 x
0|1 -2.336337 1.148893 5.103426
1|2 1.148893 -2.657167 10.455125
x 5.103426 10.455125 -110.128337
<- c(-0.8263498291155, -2.3040967379853)
alpha <- 0.3091154153068
beta
# Analytically compute 2nd derivative of log L wrt beta
<- function(x) {
info <- plogis(alpha[1] + x * beta)
p1 <- plogis(alpha[2] + x * beta)
p2 <- p1 - p2
d <- p1 * (1 - p1)
v1 <- p2 * (1 - p2)
v2 - v2
v1 <- p1 * (1 - p1) * (1 - 2 * p1)
w1 <- p2 * (1 - p2) * (1 - 2 * p2)
w2 - w2
w1 * x * ((w1 - w2) * d - (v1 - v2)^2) / d / d
x
}
# Compute 2nd derivative of log(p1 - p2) wrt beta numerically
<- function(x, beta) log(plogis(alpha[1] + x * beta) - plogis(alpha[2] + x * beta))
dif = 1e-6
del <- function(x) ((dif(x, beta + del) - dif(x, beta)) / del - (dif(x, beta) - dif(x, beta - del)) / del) / del
d2 c(info(4), info(8), info(9))
[1] -6.88269 -24.55641 -27.93077
<- c(d2(4), d2(8), d2(9))
num print(num)
[1] -6.882495 -24.556135 -27.931213
sum(num)
[1] -59.36984
<- rfort(x, y, alpha, beta)
w $logL w
[1] 19.79933
$grad w
[1] -1.896261e-13 -1.408873e-13 -2.330580e-12
<- rfort(x, y, alpha, beta, what=3L)
w $hess w
3 x 3 sparse Matrix of class "dgCMatrix"
[1,] -2.336337 1.148893 -5.103426
[2,] 1.148893 -2.657167 -10.455125
[3,] -5.103426 -10.455125 -110.128337
Simple Ordinal Model With Weights, Offsets, and Penalties
We first ignore weights, offsets, and penalties, then incorporate them.
set.seed(1)
<- rnorm(50)
x1 <- rnorm(50)
x2 <- cbind(x1, x2)
X <- sample(0:5, 50, TRUE)
y <- runif(50)
wt <- wt / sum(wt)
wt <- rnorm(50)
of <- cbind(c(1.2, 0.6), c(0.6, 1.2))
pm
<- olrm(X, y, eps=1e-15)
f $deviance f
[1] 177.1350 176.8656
<- coef(f)
cof <- rfort(X, y, cof[1:5], cof[6:7])
w $logL w
[1] 176.8656
$grad w
[1] 3.330669e-16 3.552714e-15 2.886580e-15 -2.886580e-15 1.998401e-15
[6] 1.998401e-15 1.332268e-15
<- rfort(X, y, cof[1:5], cof[6:7], what=3L)
w range(w$hess + solve(vcov(f)))
[1] -1.421085e-14 1.065814e-14
# Needed reltol=1e-15 to get gradient to 1e-8 with BFGS
# CG achieved 1e-6 with default, with 475 function evaluations
<- lrm.fit(X, y, trace=1, opt_method='nlm', gradtol=1e-14,
g transx=TRUE, compstats=FALSE)
iteration = 0
Step:
[1] 0 0 0 0 0 0 0
Parameter:
[1] 1.51634749 0.57536414 -0.08004271 -0.84729786 -2.19722458 0.00000000
[7] 0.00000000
Function Value
[1] 177.135
Gradient:
[1] 1.199041e-14 -2.664535e-15 -1.243450e-14 7.993606e-15 3.330669e-15
[6] 2.510851e+00 3.400018e+00
iteration = 4
Parameter:
[1] 1.53972787 0.60066749 -0.06107634 -0.83760533 -2.19876783 -0.07124309
[7] -0.10606549
Function Value
[1] 176.8656
Gradient:
[1] -8.388246e-11 -3.561773e-11 8.475087e-11 1.565637e-11 7.530643e-13
[6] 5.342615e-12 9.455770e-13
Last global step failed to locate a point lower than x.
Either x is an approximate local minimum of the function,
the function is too non-linear for this algorithm,
or steptol is too large.
m(g$u); g$iter # L-BFGS-B tool factor as low as 1e2 to get u=3e-6, 50 iter
[1] 4.237544e-11
[1] 4
mad(cof, g$coefficients)
mad relmad
1.377462e-12 4.676386e-12
range(f$var / vcov(g))
[1] 1 1
Now use weights, offset, and penalties.
<- olrm(X, y, offset=of, weights=wt, penalty.matrix=2*pm, eps=1e-12)
f $deviance f
[1] 3.475801 3.803296 3.799336
<- coef(f)
cof <- rfort(X, y, cof[1:5], cof[6:7], offset=of, wt=wt, penmat=2*pm)
w $logL w
[1] 3.799336
$grad w
[1] -1.578598e-16 1.526557e-16 -6.938894e-17 3.642919e-17 -4.250073e-17
[6] 1.734723e-17 1.387779e-17
<- rfort(X, y, cof[1:5], cof[6:7], offset=of, wt=wt, penmat=2*pm, what=3L)
w range(w$hess + solve(vcov(f)))
[1] -6.661338e-16 1.332268e-15
<- lrm.fit(X, y, trace=3, offset=of, weights=wt,
g penalty.matrix=2e0*pm, opt_method='nlminb', compstats=FALSE)
0: 3.8301785: 1.23911 0.611204 -0.0435624 -0.772679 -2.45652
3: 3.8032957: 1.50965 0.925273 0.280639 -0.513375 -2.39790
0: 3.8032957: 1.50965 0.925273 0.280639 -0.513375 -2.39790 0.00000 0.00000
3: 3.7993356: 1.50984 0.927634 0.283085 -0.513316 -2.39750 0.0149656 -0.0431891
m(g$u); g$iter
[1] 3.404934e-09
iterations evaluations.function evaluations.gradient
3 4 3
mad(coef(f), coef(g))
mad relmad
7.703453e-09 7.127106e-09
$iter g
iterations evaluations.function evaluations.gradient
3 4 3
$u g
y>=1 y>=2 y>=3 y>=4 y>=5
3.404934e-09 6.040723e-10 -1.193834e-09 -1.123210e-09 -1.671228e-09
x1 x2
3.391934e-10 7.970911e-11
$deviance f
[1] 3.475801 3.803296 3.799336
$deviance g
[1] 3.475801 3.803296 3.799336
range(vcov(f) / vcov(g))
[1] 0.9999999 1.0000002
Check Accuracy Against Old lrm.fit
For a Variety of Levels of Y
set.seed(1)
<- 150
n <- NULL
w for(i in 1 : 40) {
<- sample(1 : 20, 1, TRUE)
k <- sample(0 : k, n, TRUE)
y <- runif(n)
x1 <- exp(rnorm(n))
x2 <- cbind(x1, x2)
X <- olrm(X, y, eps=1e-10)
f <- lrm.fit( X, y, opt_method='nlminb', compstats=FALSE)
g <- coef(f) - coef(g)
d <- wratio(vcov(f) / vcov(g))
r <- rbind(w, data.frame(i, k, mad.beta=m(d), Cov.ratio=r))
w
}range(w$Cov.ratio)
[1] 1 1
with(w, plot(k, mad.beta, log='y'))
Study Convergence and Timings
Fortran vs. R
Regarding execution speed, a key question is whether it’s worth the effort to code part of the calculations in a compiled language such as Fortran, C, or C++, as compared to just using R. Let’s explore this by coding the gradient vector calculation in R and timing it against the new Fortran code. Also write a function making the Fortran routine easy to call when computing the gradient.
The code below makes use of the facts that \(\frac{\partial \log \text{expit}(x)}{\partial x} = \text{expit}(-x) = 1 - \text{expit}(x)\) and \(\frac{\partial \text{expit}(x)}{\partial x} = \text{expit(x)}(1 - \text{expit(x)})\). The philosophy of this code is that nothing is calculated unless it is relevant, which makes for more lines of code. For example, the code does not create extra intercepts to yield probabilities of 0 or 1, but instead handles each case Y=0, Y=\(k\), \(0 < \text{Y} < k\) separately. For the non-interior levels of Y the gradient is very simple as probabilities are expits and not differences.
<- function(alpha, beta, x, y) {
grad <- length(alpha)
k <- length(beta)
p <- plogis
f <- as.vector(x %*% matrix(beta, nrow=p))
xb <- as.vector(x %*% beta)
xb <- P2 <- numeric(n)
P1 <- y == 0
i0 <- y == k
ik <- y > 0 & y < k
ib <- 1e0
P1[i0] <- f(alpha[1] + xb[i0])
P2[i0] <- f(alpha[k] + xb[ik])
P1[ik] <- 0e0
P2[ik] <- f(alpha[y[ib] ] + xb[ib])
P1[ib] <- f(alpha[y[ib] + 1] + xb[ib])
P2[ib] <- P1 * (1e0 - P1)
pq1 <- P2 * (1e0 - P2)
pq2 <- P1 - P2
P <- rep(0e0, k + p)
U
1] <- - sum(1e0 - P[i0])
U[<- U[k] + sum(1e0 - P[ik])
U[k]
# Gradiant for intercepts
if(k > 1) for(m in 1 : k) { # only interior y values create complexity
if(m < k) U[m] <- U[m] + sum(pq1[y == m] / P[y == m])
if(m > 1) U[m] <- U[m] - sum(pq2[y == m - 1] / P[y == m - 1])
}# Gradient for slopes
for(m in 1 : p) {
+ m] <- - sum(x[i0, m] * (1e0 - P[i0]))
U[k + m] <- U[k + m] + sum(x[ik, m] * (1e0 - P[ik]))
U[k if(k > 1) for(i in 1 : (k - 1)) {
<- y == i
j + m] <- U[k + m] + sum(x[j, m] * (pq1[j] - pq2[j]) / P[j])
U[k
}
}
U
}
<- function(alpha, beta, x, y) rfort(x, y, alpha, beta)$grad
fgrad
# This calls a version of the Fortran code using the alpha extension approach
<- function(alpha, beta, x, y) {
fgrad2 <- as.matrix(x)
x <- nrow(x)
n <- ncol(x)
p <- max(y) # y assumed to be coded 0-k
k .Fortran('lrmll2',
as.integer(n), as.integer(k), as.integer(p),
rep(0e0, n), rep(1e0, n), matrix(0e0, nrow=p, ncol=p),
x, y, numeric(1), u=numeric(k + p), numeric(1), 2L, 0L, 0L)$u
alpha, beta, }
But is the elegance of following the letter of the proportional odds model’s definition worth the trouble? What if we used the trick that is most often used in MLE and Bayesian modeling where we define extra intercepts so that all values of Y appear to be interior values and the same difference in probabilities can be computed everywhere, and we did not use special cases to compute derivatives of log likelihood components? Specifically we can write the model as the following, with y = \(0, 1, \ldots, k\) and \(\text{expit}(x) = \frac{1}{1 + \exp(-x)}\).
\[\Pr(Y \geq y) = \text{expit}(\alpha_y + X\beta)\]
by expanding the original vector of intercepts \(\alpha\) by adding \(\alpha_0 = 100\) and \(\alpha_{k+1} = -100\). Then \(\Pr(Y = y) = \text{expit}(\alpha_y + X\beta) - \text{expit}(\alpha_{y+1} + X\beta)\) for any \(y=0, \ldots, k\). The \(\pm 100\) is chosen so that \(\text{expit}(x)\) is indistinguishable from 1 and 0.
We need to run some computational tests to make sure that the upcoming shortcuts do not cause any computational inefficiencies or inaccuracies.
# Check that R computes expit very quickly for extreme values of x
tim(smallvals = plogis(rep( 1, 100000)),
m50 = plogis(rep(-50, 100000)),
p50 = plogis(rep( 50, 100000)))
Per-run execution time in seconds, averaged over 10 runs
smallvals m50 p50
0.0010 0.0010 0.0011
# Check that taking the log of probabilities is as accurate as
# using plogis' special log probability calculation
<- seq(-50, 50, by=1); m(log(plogis(x)) - plogis(x, log=TRUE)) x
[1] 8.881784e-16
So it appears that the “intercept extension” approach will not cause any numerical problems. To code this method while computing the gradient, we need the derivative of the log of the difference in probabilities (call this \(Q = P_1 - P_2\)) given above. Consider a general parameter \(\theta\) which may be one of the (interior) \(\alpha\)s or one of the \(\beta\)s.
\[\frac{\partial \log(Q)}{\partial\theta} = \frac{\frac{\partial P_1}{\partial\theta} - \frac{\partial P_2}{\partial\theta}}{Q}\] Since the main part of \(\frac{\partial \text{expit}(x)}{\partial\theta} = \text{expit(x)}(1 - \text{expit(x)})\),
\[\frac{\partial P_1}{\partial\theta} = P_1 (1 - P_1) \frac{\partial(\alpha_y + X\beta)}{\partial\theta}\] \[\frac{\partial P_2}{\partial\theta} = P_2 (1 - P_2) \frac{\partial(\alpha_{y + 1} + X\beta)}{\partial\theta}\]
\(\frac{\partial(\alpha_{y} + X\beta)}{\partial\theta}\) is the 0/1 indicator function \([y=j]\) when \(\theta = \alpha_{j}\) and is \(X_{j}\) when \(\theta = \beta_{j}\). Now code this in R.
<- function(alpha, beta, x, y) {
grad2 <- length(alpha)
k <- length(beta)
p <- as.vector(x %*% matrix(beta, nrow=p))
xb <- as.vector(x %*% beta)
xb <- c(100e0, alpha, -100e0)
alpha # Must add 1 to y to compute P1 and P2 since index starts at 1, not 0
<- plogis(alpha[y + 1] + xb)
P1 <- plogis(alpha[y + 2] + xb)
P2 <- P1 - P2
Q <- P1 * (1e0 - P1)
pq1 <- P2 * (1e0 - P2)
pq2 <- numeric(k + p)
U
# Gradiant for intercepts
for(m in 1 : k)
<- sum((pq1 * (y == m) - pq2 * (y + 1 == m)) / Q)
U[m]
# Use element-wise multiplication then get the sum for each column
# Element-wise = apply the same value of the weights to each row of x
+ 1) : (k + p)] <- colSums(x * (pq1 - pq2) / Q)
U[(k
U }
Check that grad
and grad2
yield the same answer and check their relative speeds.
set.seed(1)
<- 50000; p <- 50; k <- 100
n <- matrix(rnorm(n * p), nrow=n)
x <- sample(0 : k, n, TRUE)
y stopifnot(length(unique(y)) == k + 1)
<- seq(-6, 6, length=k)
alpha <- runif(p, -0.5, 0.5)
beta tim(g1 = grad( alpha, beta, x, y),
g2 = grad2(alpha, beta, x, y),
reps = 5) # creates Res
Per-run execution time in seconds, averaged over 5 runs
g1 g2
0.9576 0.0646
m(Res$g1 - Res$g2) # maximum absolute difference
[1] 4.774847e-11
Even though the streamlined code in grad2
required evaluating a few quantities that are known to be 0 or 1, its vectorization resulted in significantly faster R code. Later this will be compared with the speed of Fortran code.
Here is the central part of the Fortran code for computing the gradient vector. Fortran is blazing fast and easier to learn than C and C++, so more users may wish to translate some execution-time critical portions of their R code to Fortran 2018. R makes it easy to include Fortran code in packages, and it is also easy to include Fortran functions in RStudio
sessions.
= 0_dp
u
! All obs with y=0
! The derivative of log expit(x) wrt x is expit(-x)
! Prob element is expit(-alpha(1) - lp)
1) = - sum(wt(i0) * (1_dp - d(i0)))
u(if(p > 0) then
do l = 1, p
+ l) = - sum(wt(i0) * x(i0, l) * (1_dp - d(i0)))
u(k end do
end if
! All obs with y=k
! Prob element is expit(alpha(k) + lp)
= u(k) + sum(wt(ik) * (1_dp - d(ik)))
u(k) if(p > 0) then
do l = 1, p
+ l) = u(k + l) + sum(wt(ik) * x(ik, l) * (1_dp - d(ik)))
u(k end do
end if
! All obs with 0 < y < k
if(nb > 0) then
do ii = 1, nb
= ib(ii)
i = y(i)
j ! For p1, D() = 1 for alpha(j), 0 for alpha(j+1)
! For p2, D() = 0 for alpha(j), 1 for alpha(j+1)
= u(j) + wt(i) * v1(i) / d(i)
u(j) + 1) = u(j + 1) - wt(i) * v2(i) / d(i)
u(j if(p > 0) then
do l = 1, p
+ l) = u(k + l) + wt(i) * x(i, l) * (v1(i) - v2(i)) / d(i)
u(k end do
end if
end do
This code can be streamlined using the \(\alpha\) extension approach:
= [100d0, alpha, -100d0]
ealpha = expit(ealpha(y + 1) + lp)
p1 = expit(ealpha(y + 2) + lp)
p2 = p1 - p2
q = p1 * (1_dp - p1)
pq1 = p2 * (1_dp - p2)
pq2 do j = 1, k
= sum((pq1 * merge(1_dp, 0_dp, y == j) - &
u(j) * merge(1_dp, 0_dp, y + 1 == j)) / (q / wt))
pq2 end do
if(p > 0) then
do j = 1, p
+ j) = sum(x(:, j) * (pq1 - pq2) / (q / wt))
u(k end do
end if
But this code runs faster (this is the code tested below):
do i = 1, n
= q(i) / wt(i)
w do j = 1, k
if(y(i) == j) u(j) = u(j) + pq1(i) / w
if(y(i) + 1 == j) u(j) = u(j) - pq2(i) / w
end do
if(p > 0) then
do j = 1, p
+ j) = u(k + j) + x(i, j) * (pq1(i) - pq2(i)) / w
u(k end do
end if
end do
First let’s compare accuracy and speed of two ways of coding the gradient calculation in Fortran (click above to see both versions).
tim(Fortran = fgrad (alpha, beta, x, y),
Fortran2 = fgrad2(alpha, beta, x, y), reps=44/10 )
Per-run execution time in seconds, averaged over 4.4 runs
Fortran Fortran2
0.02568182 0.01022727
m(Res$Fortran - Res$Fortran2)
[1] 6.82121e-12
Though it produces the same result to within \(7\times 10^{-12}\), the streamlined Fortran is \(1.5\times\) slower than the longer Fortran code.
Run the R grad2
function defined above, and the Fortran routine included in the new rms
package, for \(n=50,000\), \(p=50\) predictors, and \(k=100\) intercepts.
# Check agreement of R and Fortran code
tim(R = grad2(alpha, beta, x, y),
Fortran = fgrad(alpha, beta, x, y), reps=40 )
Per-run execution time in seconds, averaged over 40 runs
R Fortran
0.06005 0.01415
m(Res$R - Res$Fortran)
[1] 6.82121e-12
We see that the compiled Fortran code is \(> 7\times\) faster than the R code. In a nutshell Fortran allows you to not worry about vectorizing calculations, allowing for simpler code (there are many functions in Fortran for vectorizing operations but these are used more for brevity than for speed).
As streamlined as this code is, it does not improve execution time over my R grad
function, taking 1.1s to run on the data given above while yielding the wrong answer.
<- function(alpha, beta, X, Y) {
gradient_proportional_odds <- nrow(X) # Number of observations
n <- ncol(X) # Number of predictors
p <- length(alpha) # Number of thresholds (max Y value)
k
# Compute linear predictors
<- X %*% beta # n x 1 vector
eta
# Expand eta to match dimensions with alpha
<- matrix(eta, n, k, byrow = FALSE) # n x k matrix
eta_matrix
# Compute expit(alpha_y + eta) for all thresholds y
<- eta_matrix + matrix(alpha, n, k, byrow = TRUE)
eta_alpha <- 1 / (1 + exp(-eta_alpha)) # n x k matrix of expit values
expit_vals
# Compute probabilities for P(Y = y)
<- cbind(1, expit_vals) # P(Y >= 0) = 1
expit_upper <- cbind(expit_vals, 0) # P(Y >= k+1) = 0
expit_lower <- expit_upper[, 1:k] - expit_lower[, 1:k] # P(Y = y)
prob_Y
# Indicator matrix for observed Y
<- matrix(0, n, k)
Y_ind for (i in 1:k) Y_ind[, i] <- as.numeric(Y == (i - 1))
# Compute weights (observed minus predicted probabilities)
<- (Y_ind - prob_Y)
weights
# Gradients w.r.t. alpha
<- colSums(weights)
grad_alpha
# Gradients w.r.t. beta
<- expit_vals * (1 - expit_vals) # Derivative of expit
d_expit <- numeric(p)
grad_beta for (j in 1:p) {
<- sum(weights * d_expit * X[, j])
grad_beta[j]
}
# Combine gradients
<- c(grad_alpha, grad_beta)
grad return(grad)
}
The hessian requires about a factor of \(\frac{k + p}{2}\) more calculations than the gradient when computed inefficiently, so the Fortran code pays off even more when using hessian-based optimization algorithms or computing the final covariance matrix. The lrmll
Fortran code called by the new lrm.fit
, and ormll
called by orm.fit
, capitalize on the tri-band diagonal form of the hessian for the cumulative probability model. In rms
7.0-0 lrm.fit
and orm.fit
were completely re-written to benefit from Fortran code anx to use the Matrix
package for more efficient sparse matrix handling.
Efficient Computation of the hessian for General Cumulative Probability Models
The lrmll
Fortran subroutine computes the hessian very efficiently for the proportional odds model (logit link), making use of simplifications for the logistic model. Now consider general links. For \(Y = 1, ..., k - 1, P(Y = j)\) is written in terms of the cumulative probability function \(f\) by \(f(\alpha_{j} + X\beta) - f(\alpha_{j+1} + X\beta)\). We need all the second partial derivatives of the log of this difference in cumulative probabilities. Let’s simplify the expression to \(f(a + \beta x) - f(b + \beta x)\). ChatGPT provided the following results, which I simplified somewhat and corrected a sign error in \(\frac{\partial^2 g}{\partial b \ \partial \beta}\).
- Let \(g(a, b, \beta, x) = \log\left(f(a + \beta x) - f(b + \beta x)\right)\)
- Let \(F(a, b, \beta, x) = f(a + \beta x) - f(b + \beta x)\)
- Substitute \(d\) for \(F(a, b, \beta, x)\) post differentiation
- Divide all the second partial derivatives below by \(d^2\)
\[ \begin{aligned} \frac{\partial^2 g}{\partial a^2} &= d f''(a + \beta x) - f'(a + \beta x)^2 \\ \frac{\partial^2 g}{\partial b^2} &= -d f''(b + \beta x) - f'(b + \beta x)^2 \\ \frac{\partial^2 g}{\partial \beta^2} &= x^2 d \big(f''(a + \beta x) - f''(b + \beta x)\big) - x^{2} \big(f'(a + \beta x) - f'(b + \beta x)\big)^2 \\ \frac{\partial^2 g}{\partial a \partial b} &= f'(a + \beta x) \cdot f'(b + \beta x) \\ \frac{\partial^2 g}{\partial a \partial \beta} &= x d f''(a + \beta x) - x f'(a + \beta x) \cdot \big(f'(a + \beta x) - f'(b + \beta x)\big) \\ \frac{\partial^2 g}{\partial b \partial \beta} &= -x d f''(b + \beta x) + x f'(b + \beta x) \cdot \big(f'(a + \beta x) - f'(b + \beta x)\big) \end{aligned} \]
For derivatives with respect only to \(\beta\), substitute \(x_{i} x_{j}\) for \(x^2\) to get second partial derivatives with respect to \(\beta_{i}\) and \(\beta_{j}\).
Let \(d = 1 - f\) for \(Y=0\) and \(d = f\) otherwise. For the interior values of \(Y\), first consider \(Y=0\) which has probability \(d = 1 - f(a + x \beta)\). The second partial derivatives are, when multiplied by \((1 - f)^{2} = d^{2}\),
\[ \begin{aligned} \frac{\partial^2 g}{\partial a^2} &= - f'^{2} - df'' \\ \frac{\partial^2 g}{\partial \beta^2} &= - x^{2} \big(f'^{2} + df''\big) \\ \frac{\partial^2 g}{\partial a\partial\beta} &= - x\big(f'^{2} - df''\big) \end{aligned} \]
Here \(f\) stands for \(f(a + \beta x)\) etc.
For \(Y=k\) the hessian elements are, when multiplied by \(f^2\),
\[ \begin{aligned} \frac{\partial^2 g}{\partial a^2} &= d f'' - f'^{2} \\ \frac{\partial^2 g}{\partial \beta^2} &= x^{2} \big(d f'' - f'^{2}\big) \\ \frac{\partial^2 g}{\partial a\partial\beta} &= x\big(d f'' - f'^{2}\big) \end{aligned} \]
All these formulas are implemented in the ormll
Fortran subroutine used in the 6.9-1 version of rms
.
Sometimes it’s easier to derive second derivatives on the original scale without logging, and using a Jacobian to translate to the other scale. ChatGPT provided the following Jacobian solution for our example. This is not used in the Fortran code because it might slow it down a bit. I believe that MASS::polr
uses the Jacobian approach.
To systematically compute and represent the transformation of the second partial derivatives of \(F(a, b, \beta, x)\) into the second partial derivatives of \(g(a, b, \beta, x) = \log(F(a, b, \beta, x))\), we organize the derivatives into a hessian matrix and apply the transformation rules using Jacobian operations.
Step 1: Define the hessian Matrix for \(F\)
The hessian matrix of \(F(a, b, \beta, x)\) with respect to \((a, b, \beta)\) is: \[ \mathbf{H}_F = \begin{bmatrix} \frac{\partial^2 F}{\partial a^2} & \frac{\partial^2 F}{\partial a \partial b} & \frac{\partial^2 F}{\partial a \partial \beta} \\ \frac{\partial^2 F}{\partial b \partial a} & \frac{\partial^2 F}{\partial b^2} & \frac{\partial^2 F}{\partial b \partial \beta} \\ \frac{\partial^2 F}{\partial \beta \partial a} & \frac{\partial^2 F}{\partial \beta \partial b} & \frac{\partial^2 F}{\partial \beta^2} \end{bmatrix}. \]
Using the expressions for second partial derivatives derived earlier: \[ \begin{aligned} \frac{\partial^2 F}{\partial a^2} &= f''(a + \beta x), \quad \frac{\partial^2 F}{\partial b^2} = -f''(b + \beta x), \quad \frac{\partial^2 F}{\partial \beta^2} = x^2 \big(f''(a + \beta x) - f''(b + \beta x)\big), \\ \frac{\partial^2 F}{\partial a \partial b} &= 0, \quad \frac{\partial^2 F}{\partial a \partial \beta} = x f''(a + \beta x), \quad \frac{\partial^2 F}{\partial b \partial \beta} = -x f''(b + \beta x). \end{aligned} \]
Thus: \[ \mathbf{H}_F = \begin{bmatrix} f''(a + \beta x) & 0 & x f''(a + \beta x) \\ 0 & -f''(b + \beta x) & -x f''(b + \beta x) \\ x f''(a + \beta x) & -x f''(b + \beta x) & x^2 \big(f''(a + \beta x) - f''(b + \beta x)\big) \end{bmatrix}. \]
Step 2: Define the Transformation for \(g = \log(F)\)
We apply the chain rule: \[ g = \log(F) \implies \frac{\partial g}{\partial z} = \frac{1}{F} \frac{\partial F}{\partial z}, \] \[ \frac{\partial^2 g}{\partial z \partial w} = \frac{1}{F} \frac{\partial^2 F}{\partial z \partial w} - \frac{1}{F^2} \frac{\partial F}{\partial z} \frac{\partial F}{\partial w}. \]
This can be written compactly in matrix form.
Step 3: Matrix Transformation
Let:
- \(\nabla F = \begin{bmatrix} \frac{\partial F}{\partial a} & \frac{\partial F}{\partial b} & \frac{\partial F}{\partial \beta} \end{bmatrix}^\top\)
- \(\mathbf{H}_F\) be the hessian matrix of \(F\)
- \(F\) be the scalar value of \(F(a, b, \beta, x)\).
The hessian of \(g(a, b, \beta, x)\) is given by: \[ \mathbf{H}_g = \frac{1}{F} \mathbf{H}_F - \frac{1}{F^2} \nabla F \nabla F^\top. \]
Explanation:
- \(\frac{1}{F} \mathbf{H}_F\): Scales the hessian of \(F\) by \(\frac{1}{F}\)
- \(\frac{1}{F^2} \nabla F \nabla F^\top\): Accounts for the interaction of first derivatives of \(F\)
Step 4: Components of \(\nabla F\)
From earlier: \[ \frac{\partial F}{\partial a} = f'(a + \beta x), \quad \frac{\partial F}{\partial b} = -f'(b + \beta x), \quad \frac{\partial F}{\partial \beta} = x \big(f'(a + \beta x) - f'(b + \beta x)\big). \]
Thus: \[ \nabla F = \begin{bmatrix} f'(a + \beta x) \\ -f'(b + \beta x) \\ x \big(f'(a + \beta x) - f'(b + \beta x)\big) \end{bmatrix}. \]
Step 5: Compute \(\nabla F \nabla F^\top\)
The outer product \(\nabla F \nabla F^\top\) is: \[ \nabla F \nabla F^\top = \begin{bmatrix} f'(a + \beta x)^2 & -f'(a + \beta x) f'(b + \beta x) & f'(a + \beta x) x \big(f'(a + \beta x) - f'(b + \beta x)\big) \\ -f'(a + \beta x) f'(b + \beta x) & f'(b + \beta x)^2 & -f'(b + \beta x) x \big(f'(a + \beta x) - f'(b + \beta x)\big) \\ f'(a + \beta x) x \big(f'(a + \beta x) - f'(b + \beta x)\big) & -f'(b + \beta x) x \big(f'(a + \beta x) - f'(b + \beta x)\big) & x^2 \big(f'(a + \beta x) - f'(b + \beta x)\big)^2 \end{bmatrix}. \]
Final Hessian of \(g\)
The hessian of \(g(a, b, \beta, x)\) is: \[ \mathbf{H}_g = \frac{1}{F} \mathbf{H}_F - \frac{1}{F^2} \nabla F \nabla F^\top. \]
Substitute \(\mathbf{H}_F\) and \(\nabla F \nabla F^\top\) explicitly to get the full matrix. This provides the second derivatives of \(g\) in terms of \(F\), its derivatives, and the structure of \(f\).
Check Speed of NR
, nlminb
, and glm.fit
glm.fit
is tailored to be efficient when Y is binary and there is no penalization, by using iteratively weighted least squares. How does it stack up against NR
and nlminb
? Let’s try fitting a large binary logistic model.
set.seed(1)
<- 100000; p <- 100
n <- matrix(rnorm(n * p), nrow=n)
X <- sample(0 : 1, n, TRUE)
y tim(NR = lrm.fit(X, y, compstats=FALSE),
nlminb = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE),
glm.fit = glm.fit(cbind(1, X), y), reps=3)
Per-run execution time in seconds, averaged over 3 runs
NR nlminb glm.fit
1.302667 1.295333 1.650667
# transx=TRUE adds 2.3s to lrm.fit
Check Convergence Under Complete Separation
Consider a simple example where there is complete separation because the predictor values are identical to the response. In this case the MLE of the intercept is \(-\infty\) and the MLE of the slope is \(\infty\). The MLEs are approximated by \(\alpha=-50, \beta=100\), yielding predicted logits of \(-50\) and \(50\) because the \(\text{expit}\)s are sufficiently close to 0.0 and 1.0.
plogis(-50, log=TRUE)
[1] -50
plogis( 50, log=TRUE)
[1] -1.92875e-22
exp(plogis(-50, log=TRUE))
[1] 1.92875e-22
exp(plogis( 50, log=TRUE))
[1] 1
See how 4 optimization algorithms fare with their default parameters and with some adjustments. In some of the trace output, the first floating point number listed is -2LL, and following that are the current \(\alpha, \beta\) estimates.
set.seed(1)
<- sample(0 : 1, 20, TRUE)
x <- x
y <- try(lrm.fit(x, y, opt_method='NR', trace=1)) # default opt_method w
Iteration:1 -2LL:26.92047 Max |gradient|:9.6 Max |change in parameters|:4.166667
Iteration:2 -2LL:4.704224 Max |gradient|:1.754066 Max |change in parameters|:2.249045
Iteration:3 -2LL:1.588994 Max |gradient|:0.616103 Max |change in parameters|:2.08089
Iteration:4 -2LL:0.5685801 Max |gradient|:0.2233137 Max |change in parameters|:2.028578
Iteration:5 -2LL:0.2071309 Max |gradient|:0.0817248 Max |change in parameters|:2.010364
Iteration:6 -2LL:0.07592908 Max |gradient|:0.03000809 Max |change in parameters|:2.003793
Iteration:7 -2LL:0.02789647 Max |gradient|:0.01103173 Max |change in parameters|:2.001393
Iteration:8 -2LL:0.01025764 Max |gradient|:0.004057316 Max |change in parameters|:2.000512
Iteration:9 -2LL:0.003772913 Max |gradient|:0.001492464 Max |change in parameters|:2.000188
Iteration:10 -2LL:0.001387888 Max |gradient|:0.000549028 Max |change in parameters|:2.000069
Iteration:11 -2LL:0.0005105632 Max |gradient|:0.0002019736 Max |change in parameters|:2.000025
Iteration:12 -2LL:0.0001878241 Max |gradient|:7.430157e-05 Max |change in parameters|:2.000009
Iteration:13 -2LL:6.90964e-05 Max |gradient|:2.733397e-05 Max |change in parameters|:2.000003
Iteration:14 -2LL:2.541911e-05 Max |gradient|:1.00556e-05 Max |change in parameters|:2.000001
<- try(lrm.fit(x, y, opt_method='LM', trace=1)) w
Iteration:1 -2LL:4.726435 Max |gradient|:9.6 Max |change in parameters|:4.15697
Iteration:2 -2LL:1.596267 Max |gradient|:1.760569 Max |change in parameters|:2.249698
Iteration:3 -2LL:0.5711236 Max |gradient|:0.6183991 Max |change in parameters|:2.081206
Iteration:4 -2LL:0.2080488 Max |gradient|:0.2241376 Max |change in parameters|:2.028698
Iteration:5 -2LL:0.07626438 Max |gradient|:0.08202485 Max |change in parameters|:2.010408
Iteration:6 -2LL:0.0280195 Max |gradient|:0.03011806 Max |change in parameters|:2.003809
Iteration:7 -2LL:0.01030286 Max |gradient|:0.01107213 Max |change in parameters|:2.001399
Iteration:8 -2LL:0.003789542 Max |gradient|:0.004072171 Max |change in parameters|:2.000514
Iteration:9 -2LL:0.001394004 Max |gradient|:0.001497928 Max |change in parameters|:2.000189
Iteration:10 -2LL:0.0005128133 Max |gradient|:0.0005510379 Max |change in parameters|:2.00007
Iteration:11 -2LL:0.0001886518 Max |gradient|:0.0002027129 Max |change in parameters|:2.000026
Iteration:12 -2LL:6.94009e-05 Max |gradient|:7.457357e-05 Max |change in parameters|:2.000009
Iteration:13 -2LL:2.553113e-05 Max |gradient|:2.743404e-05 Max |change in parameters|:2.000003
Iteration:14 -2LL:9.392375e-06 Max |gradient|:1.009241e-05 Max |change in parameters|:2.000001
<- try(lrm.fit(x, y, opt_method='nlminb', trace=1)) w
0: 26.920467: -0.405465 0.00000
1: 5.9001181: -1.83776 3.67925
2: 1.9469377: -2.99693 5.99700
3: 0.69222129: -4.04687 8.09673
4: 0.25163200: -5.06435 10.1316
5: 0.092171572: -6.07067 12.1442
6: 0.033854569: -7.07298 14.1489
7: 0.012447190: -8.07383 16.1506
8: 0.0045780905: -9.07414 18.1512
9: 0.0016840535: -10.0743 20.1514
10: 0.00061951084: -11.0743 22.1515
11: 0.00022790289: -12.0743 24.1515
12: 8.3840460e-05: -13.0743 26.1515
13: 3.0843137e-05: -14.0743 28.1515
14: 1.1346550e-05: -15.0743 30.1515
15: 4.1741617e-06: -16.0743 32.1515
16: 1.5355882e-06: -17.0743 34.1515
17: 5.6491130e-07: -18.0743 36.1515
18: 2.0781925e-07: -19.0743 38.1515
19: 7.6452432e-08: -20.0743 40.1515
20: 2.8125276e-08: -21.0743 42.1515
21: 1.0346714e-08: -22.0743 44.1515
22: 3.8063437e-09: -23.0743 46.1515
23: 1.4002772e-09: -24.0743 48.1515
24: 5.1513283e-10: -25.0743 50.1515
25: 1.8950352e-10: -26.0743 52.1515
26: 6.9714901e-11: -27.0743 54.1515
27: 2.5648816e-11: -28.0743 56.1515
28: 9.4360075e-12: -29.0743 58.1515
29: 3.4692249e-12: -30.0743 60.1515
30: 1.2789769e-12: -31.0743 62.1515
31: 4.7073456e-13: -32.0743 64.1515
32: 1.6875390e-13: -33.0743 66.1515
33: 6.2172489e-14: -34.0743 68.1515
34: 2.6645353e-14: -35.0743 70.1515
35: 8.8817842e-15: -36.0743 72.1515
36: 0.0000000: -37.0743 74.1515
37: 0.0000000: -37.0743 74.1515
<- try(lrm.fit(x, y, opt_method='glm.fit', trace=1)) w
Deviance = 3.368709 Iterations - 1
Deviance = 1.16701 Iterations - 2
Deviance = 0.4207147 Iterations - 3
Deviance = 0.1536571 Iterations - 4
Deviance = 0.0563787 Iterations - 5
Deviance = 0.02072057 Iterations - 6
Deviance = 0.007619969 Iterations - 7
Deviance = 0.002802865 Iterations - 8
Deviance = 0.001031067 Iterations - 9
Deviance = 0.0003793016 Iterations - 10
Deviance = 0.0001395364 Iterations - 11
Deviance = 5.133244e-05 Iterations - 12
Deviance = 1.888413e-05 Iterations - 13
Deviance = 6.947082e-06 Iterations - 14
Deviance = 2.555688e-06 Iterations - 15
Deviance = 9.401851e-07 Iterations - 16
Deviance = 3.458748e-07 Iterations - 17
Deviance = 1.272402e-07 Iterations - 18
Deviance = 4.680906e-08 Iterations - 19
Deviance = 1.722009e-08 Iterations - 20
Deviance = 6.33492e-09 Iterations - 21
Deviance = 2.330491e-09 Iterations - 22
Deviance = 8.573409e-10 Iterations - 23
Deviance = 3.15401e-10 Iterations - 24
Deviance = 1.160316e-10 Iterations - 25
Deviance = 4.268585e-11 Iterations - 26
Deviance = 1.570299e-11 Iterations - 27
Deviance = 5.782042e-12 Iterations - 28
<- try(lrm.fit(x, y, opt_method='BFGS', trace=1)) w
initial value 26.920467
iter 10 value 0.000153
iter 20 value 0.000017
iter 30 value 0.000004
iter 40 value 0.000001
iter 50 value 0.000000
final value 0.000000
stopped after 50 iterations
nlminb
took 37 iterations, going so far that the hessian matrix was singular. It should have stopped with 10.glm.fit
took 28 iterations; should have stopped with 10BFGS
: stopped after 50 iterations; should have stopped with 10
Now specify arguments to lrm.fit
that are tuned to this task.
<- lrm.fit(x, y, opt_method='nlminb', abstol=1e-3, trace=1) w
0: 26.920467: -0.405465 0.00000
1: 5.9001181: -1.83776 3.67925
2: 1.9469377: -2.99693 5.99700
3: 0.69222129: -4.04687 8.09673
4: 0.25163200: -5.06435 10.1316
5: 0.092171572: -6.07067 12.1442
6: 0.033854569: -7.07298 14.1489
7: 0.012447190: -8.07383 16.1506
8: 0.0045780905: -9.07414 18.1512
9: 0.0016840535: -10.0743 20.1514
10: 0.00061951084: -11.0743 22.1515
<- lrm.fit(x, y, opt_method='glm.fit', reltol=1e-3, trace=1) w
Deviance = 3.368709 Iterations - 1
Deviance = 1.16701 Iterations - 2
Deviance = 0.4207147 Iterations - 3
Deviance = 0.1536571 Iterations - 4
Deviance = 0.0563787 Iterations - 5
Deviance = 0.02072057 Iterations - 6
Deviance = 0.007619969 Iterations - 7
Deviance = 0.002802865 Iterations - 8
Deviance = 0.001031067 Iterations - 9
Deviance = 0.0003793016 Iterations - 10
Deviance = 0.0001395364 Iterations - 11
Deviance = 5.133244e-05 Iterations - 12
<- lrm.fit(x, y, opt_method='BFGS', reltol=1e-4, trace=1) w
initial value 26.920467
iter 10 value 0.000153
final value 0.000100
converged
The tolerance parameters are too large to use when infinite coefficients are not a problem.
Check 4 Algorithms With k=1000
For timings that follow, compstats=FALSE
is specified to lrm.fit
so that we can focus on computationally efficiencies of various optimization algorithms. Some of the differences in run times may not seem to be consequential, but once extremely large datasets are analyzed or one needs to fit models in a bootstrap or Monte Carlo simulation loop, the differences in speed will matter.
When the number of distinct Y values is large, and this far exceeds the number of predictors (\(k >> p\)), the rms
lrm
and orm
function are highly efficient. They take into account the sparsity of the intercept portion of the hessian matrix, which is tri-band diagonal and has only \(2\times k\) nonzero values for \(k+1\) distinct Y values, taking the matrix’s symmetry into account. Outside of lrm
, orm
and SAS JMP
, ordinal regression fitting software treats the hessian matrix as being \(k\times k\) and does not capitalize on sparsity.
set.seed(1)
<- 10000; k <- 1000
n <- rnorm(n); y <- sample(0:k, n, TRUE)
x length(unique(y))
[1] 1001
tim(orm = rms::orm.fit(x, y, eps=1e-10),
bfgs = lrm.fit(x, y, opt_method='BFGS', compstats=FALSE, maxit=100),
nlminb = lrm.fit(x, y, opt_method='nlminb', compstats=FALSE),
nr = lrm.fit(x, y, opt_method='NR', compstats=FALSE),
nlm = lrm.fit(x, y, opt_method='nlm', compstats=FALSE),
polr = MASS::polr(factor(y) ~ x, control=list(reltol=1e-10)),
reps= 2)
Per-run execution time in seconds, averaged over 2 runs
orm bfgs nlminb nr nlm polr
0.0125 0.0325 0.3300 0.0105 6.3640 6.1450
smod()
deviance max_abs_u iter
orm 137142.5 1.118337e-09 4
bfgs 137142.5 2.376188e+00 4
nlminb 137142.5 1.431033e-09 3
nr 137142.5 7.625981e-05 2
nlm 137142.5 7.625981e-05 1
polr 137142.5 NA NA
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 orm vs. bfgs 6.991343e-06 3.950698e-04
2 orm vs. nlminb 1.219205e-13 1.169461e-13 1.000000
3 orm vs. nr 1.219707e-13 1.168651e-13 1.000000
4 orm vs. nlm 4.136310e-07 3.812141e-07 1.001368
5 orm vs. polr 9.637588e-06 1.359060e-05
6 bfgs vs. nlminb 6.991343e-06 3.950698e-04
7 bfgs vs. nr 6.991343e-06 3.950698e-04
8 bfgs vs. nlm 6.776204e-06 3.949523e-04
9 bfgs vs. polr 9.425718e-06 4.030520e-04
10 nlminb vs. nr 1.629917e-16 3.095490e-16 1.000000
11 nlminb vs. nlm 4.136309e-07 3.812140e-07 1.001368
12 nlminb vs. polr 9.637588e-06 1.359060e-05
13 nr vs. nlm 4.136309e-07 3.812140e-07 1.001368
14 nr vs. polr 9.637588e-06 1.359060e-05
15 nlm vs. polr 9.231301e-06 1.326760e-05
nlminb
is slower then NR
because to run nlminb
requires converting the Hession from a sparse Matrix
into a regular non-sparse matrix
.
Check Timing and Agreement for n=100000, k=10, p=5
Check timing and calculation agreement for n=100000, 10 intercepts, 5 predictors.
set.seed(1)
<- 100000
n <- sample(0 : 10, n, TRUE)
y <- rnorm(n)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- cbind(x1, x2, x3, x4, x5)
X
tim(old.lrm.fit = olrm(X, y, eps=1e-7),
nr = lrm.fit(X, y, opt_method='NR', compstats=FALSE),
lm = lrm.fit(X, y, opt_method='LM', compstats=FALSE),
nlm = lrm.fit(X, y, opt_method='nlm', compstats=FALSE),
nlminb = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE, transx=TRUE),
nlminb.notransx = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE, transx=FALSE),
bfgs = lrm.fit(X, y, opt_method='BFGS', compstats=FALSE),
bfgs.reltol = lrm.fit(X, y, opt_method='BFGS',
compstats=FALSE, reltol=1e-12),
polr = MASS::polr(factor(y) ~ X,
control=list(reltol=1e-10)),
reps = 5)
Per-run execution time in seconds, averaged over 5 runs
old.lrm.fit nr lm nlm nlminb
0.1476 0.0786 0.0772 0.4906 0.1060
nlminb.notransx bfgs bfgs.reltol polr
0.0802 0.4040 0.6006 2.3722
smod()
deviance max_abs_u iter
old.lrm.fit 479562.9 8.754331e-12 NA
nr 479562.9 1.992452e-07 3
lm 479562.9 4.246614e-05 3
nlm 479562.9 5.051226e-02 1
nlminb 479562.9 1.992456e-07 3
nlminb.notransx 479562.9 1.992552e-07 3
bfgs 479562.9 6.987317e-01 7
bfgs.reltol 479562.9 3.683979e-05 19
polr 479562.9 NA NA
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 old.lrm.fit vs. nr 5.540598e-16 1.373265e-14 1.000000
2 old.lrm.fit vs. lm 1.964277e-09 3.686250e-08 1.000000
3 old.lrm.fit vs. nlm 3.004074e-06 2.122140e-05 1.000202
4 old.lrm.fit vs. nlminb 1.053797e-11 4.463604e-11 1.000000
5 old.lrm.fit vs. nlminb.notransx 1.053798e-11 4.462038e-11 1.000000
6 old.lrm.fit vs. bfgs 9.057305e-06 7.174198e-04
7 old.lrm.fit vs. bfgs.reltol 5.712939e-08 1.703834e-06
8 old.lrm.fit vs. polr 5.977469e-06 2.409730e-03
9 nr vs. lm 1.964277e-09 3.686251e-08 1.000000
10 nr vs. nlm 3.004074e-06 2.122140e-05 1.000202
11 nr vs. nlminb 1.053798e-11 4.462390e-11 1.000000
12 nr vs. nlminb.notransx 1.053798e-11 4.460824e-11 1.000000
13 nr vs. bfgs 9.057305e-06 7.174198e-04
14 nr vs. bfgs.reltol 5.712939e-08 1.703834e-06
15 nr vs. polr 5.977469e-06 2.409730e-03
16 lm vs. nlm 3.002323e-06 2.123986e-05 1.000202
17 lm vs. nlminb 1.954251e-09 3.688525e-08 1.000000
18 lm vs. nlminb.notransx 1.954251e-09 3.688523e-08 1.000000
19 lm vs. bfgs 9.056084e-06 7.173837e-04
20 lm vs. bfgs.reltol 5.903326e-08 1.726187e-06
21 lm vs. polr 5.978924e-06 2.409739e-03
22 nlm vs. nlminb 3.004063e-06 2.122136e-05 1.000202
23 nlm vs. nlminb.notransx 3.004063e-06 2.122136e-05 1.000202
24 nlm vs. bfgs 8.867603e-06 7.344671e-04
25 nlm vs. bfgs.reltol 3.059332e-06 2.243215e-05
26 nlm vs. polr 7.995997e-06 2.398944e-03
27 nlminb vs. nlminb.notransx 2.695688e-17 2.221290e-14 1.000000
28 nlminb vs. bfgs 9.057301e-06 7.174198e-04
29 nlminb vs. bfgs.reltol 5.713989e-08 1.703866e-06
30 nlminb vs. polr 5.977476e-06 2.409730e-03
31 nlminb.notransx vs. bfgs 9.057301e-06 7.174198e-04
32 nlminb.notransx vs. bfgs.reltol 5.713989e-08 1.703866e-06
33 nlminb.notransx vs. polr 5.977476e-06 2.409730e-03
34 bfgs vs. bfgs.reltol 9.069540e-06 7.167658e-04
35 bfgs vs. polr 1.347117e-05 2.800575e-03
36 bfgs.reltol vs. polr 5.951212e-06 2.410693e-03
Other Speed Tests
n=1,000,000, p=100 predictors, k=50 intercepts
lrm.fit
: 13.5s (13.6 withopt_method=nlminb
, 120s withopt_method='BFGS'
)orm.fit
: 9sMASS::polr
: 70s without the hessian
n=300,000, p=20, k=299,999
lrm.fit
: 2.25sorm.fit
: 2s
Execution time is proportional to \(k \times p\)
Check Impact of initglm
and transx
Generate a sample of 500 with 30 predictors and 269 levels of Y where a subset of the predictors are strongly related to Y and there are collinearities.
set.seed(1)
<- 500
n <- 30
p <- matrix(runif(n * p), nrow=n)
x 1:8] <- x[, 1:8] + 2 * x[, 9]
x[, <- varclus(~ x)$hclust
s plot(as.dendrogram(s), horiz=TRUE, axes=FALSE,
xlab=expression(paste('Spearman ', rho^2)))
<- seq(0, 1, by=0.1)
rh axis(1, at=1 - rh, labels=format(rh))
<- x[, 1] + 2 * x[, 2] + 3 * x[, 3] + 4 * x[, 4]+ 5 * x[, 5] +
y 3 * runif(n, -1, 1)
<- round(y, 1)
y length(unique(y))
[1] 269
<- function(..., opt_method='NR')
f lrm.fit(x, y, compstats=FALSE, opt_method=opt_method, maxit=1000, ...)
# f(initglm=TRUE) (using nlminb) would not work: NA/NaN gradient evaluation
# This did not happen without collinearities
tim(default = f(),
transx = f(transx=TRUE),
nlm = f(opt_method='nlm'),
bfgs = f(opt_method='BFGS'),
nlminb = f(opt_method='NR'),
reps = 10 )
Per-run execution time in seconds, averaged over 10 runs
default transx nlm bfgs nlminb
0.0128 0.0170 0.5476 0.2795 0.0124
smod()
deviance max_abs_u iter
default 3879.848 6.112532e-05 7
transx 3879.848 6.112532e-05 7
nlm 3879.848 6.112532e-05 6
bfgs 3879.848 9.721526e-03 587
nlminb 3879.848 6.112532e-05 7
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 default vs. transx 7.418667e-15 8.758898e-16 1.000000
2 default vs. nlm 1.063035e-05 5.990158e-07 1.000291
3 default vs. bfgs 1.918945e-05 3.544624e-06
4 default vs. nlminb 0.000000e+00 0.000000e+00 1.000000
5 transx vs. nlm 1.063035e-05 5.990158e-07 1.000291
6 transx vs. bfgs 1.918945e-05 3.544624e-06
7 transx vs. nlminb 7.418667e-15 8.758898e-16 1.000000
8 nlm vs. bfgs 2.249980e-05 3.725436e-06
9 nlm vs. nlminb 1.063035e-05 5.990158e-07 1.000291
10 bfgs vs. nlminb 1.918945e-05 3.544624e-06
lrm.fit
vs. orm.fit
as k \(\uparrow\)
The fitting function for rms::orm
, orm.fit
, uses sparse hessian matrices so that the computation time is roughly proportional to \(n (3k + p^{2} + kp)\) where \(k\) is the number of intercepts and \(p\) is the number of predictors. Computation of the hessian in lrm.fit
needs about \(n (k + p)^{2}\) computations, but some parts of the computation are faster and there is some overhead of handling sparse matrices in orm.fit
. Let’s explore execution time as a function of \(k\) when \(n=10000, p=30\) and \(k\) varies. There should be very little difference.
if(! file.exists('breakeven.rds')) {
set.seed(1)
<- 10000
n <- 30
p <- seq(100, 10000, by=200)
ks <- length(ks)
l <- t2 <- d <- numeric(l)
t1 <- matrix(rnorm(n * p), nrow=n)
x for(i in 1 : l) {
cat(ks[i], ' ')
<- rep(0 : ks[i], length=n)
y <- stim(for(j in 1:20) f <- rms::lrm.fit(x, y)) / 20
t1[i] <- stim(for(j in 1:20) g <- rms::orm.fit(x, y)) / 20
t2[i] <- m(coef(f) - coef(g))
d [i]
}<- llist(ks, t1, t2, d)
w saveRDS(w, 'breakeven.rds')
else {
} <- readRDS('breakeven.rds')
w <- w$ks; t1 <- w$t1; t2 <- w$t2; d <- w$d
ks
}# Make sure coefficients agree
range(d)
[1] 8.881784e-16 1.532996e-12
plot(ks, t1, type='b', xlab='k', ylab='Time, seconds', ylim=c(0.02, 0.18))
points(ks, t2, col='red')
lines (ks, t2, col='red')
Execution time for both functions is linear in k. orm
is consistently a little faster than lrm
. Since the code in the orm.fit
Fortran ormll
subroutine is more general, currently implementing 5 link functions, there is no real reason to maintain separate code. In the future I plan to merge the functions to minimize duplication, and having an lrm
front-end for orm
for backward compatibility.
Better Understanding Convergence with BFGS Optimizer
Using the same simulated data just used with k=20, use BFGS to fit an ordinal model with relative tolerance varying from \(10^{-2}\) to \(10^{-20}\). Estimates are compared to orm
. In addition to comparing parameter estimates as done above we also compute differences in units of standard errors as computed by orm
.
set.seed(1)
<- sample(0:20, n, TRUE)
y <- orm(y ~ x, eps=1e-10)
g <- sqrt(diag(vcov(g, intercepts='all')))
se length(se)
[1] 50
if(file.exists('bfgs-reltol.rds')) d <- readRDS('bfgs-reltol.rds') else {
<- NULL
d for(i in 2 : 20) {
cat(i, '')
<- stim({
s for(j in 1:5)
<- lrm.fit(x, y, compstats=FALSE,
f opt_method='BFGS',
maxit=1000,
reltol=10^(-i))
} )
<- data.frame(i, elapsed=s / 5, maxu=m(f$u),
w maxbeta=m(coef(g) - coef(f)),
maxbeta.per.se=m((coef(g) - coef(f)) / se),
deviance=tail(f$deviance, 1),
iter=tail(f$iter, 1))
<- rbind(d, w)
d
}rownames(d) <- NULL
saveRDS(d, 'bfgs-reltol.rds')
} d
i elapsed maxu maxbeta maxbeta.per.se deviance iter
1 2 0.0150 1.474752e+02 2.559219e-02 1.485374e+00 60853.10 1
2 3 0.0156 1.474752e+02 2.559219e-02 1.485374e+00 60853.10 1
3 4 0.0520 4.754120e+01 8.310188e-03 4.823241e-01 60843.74 3
4 5 0.0676 2.731752e+01 6.488538e-03 3.765954e-01 60843.47 4
5 6 0.1164 2.570406e+01 4.628291e-03 1.152693e-01 60842.69 10
6 7 0.1634 3.232073e+00 4.015687e-03 9.284606e-02 60842.60 16
7 8 0.4220 1.849727e+00 4.627223e-04 1.288221e-02 60842.57 55
8 9 0.4202 4.694068e-01 8.486176e-04 2.144243e-02 60842.57 53
9 10 0.5654 1.027168e-01 1.576543e-04 3.340893e-03 60842.57 75
10 11 0.3670 3.118309e-02 5.997566e-05 1.292625e-03 60842.57 52
11 12 0.5104 1.435433e-02 6.238079e-06 1.344461e-04 60842.57 71
12 13 0.3986 9.525472e-03 1.005773e-05 2.167691e-04 60842.57 56
13 14 0.3882 1.054183e-03 1.688144e-06 3.638369e-05 60842.57 57
14 15 0.5556 1.669547e-03 1.222338e-06 2.832486e-05 60842.57 70
15 16 0.4710 4.720818e-04 1.373869e-07 7.956884e-06 60842.57 60
16 17 0.3992 4.720818e-04 1.373869e-07 7.956884e-06 60842.57 60
17 18 0.3950 4.720818e-04 1.373869e-07 7.956884e-06 60842.57 60
18 19 0.4126 4.720818e-04 1.373869e-07 7.956884e-06 60842.57 60
19 20 0.4522 4.720818e-04 1.373869e-07 7.956884e-06 60842.57 60
<- c(deviance=8, beta=10, beta.se=6, grad=15) # minimum i for which success achieved
z <- function() abline(v=z, col=gray(0.60))
h <- function() {} # remove this line to show reference lines
h par(mfrow=c(2, 3), mar=c(4, 4, 1, 1), las=1, mgp=c(2.8, .45, 0))
with(d, {
plot(i, elapsed, type='l'); h()
plot(i, maxu, type='l', log='y'); h()
plot(i, maxbeta, type='l', log='y'); h()
plot(i, maxbeta.per.se, type='l'); h()
plot(i, deviance - 3879, type='l', log='y'); h()
plot(i, iter, type='l', log='y'); h()
})
NULL
See that stochastic convergence, as judged by deviance, occurs by the time the relative tolerance is \(10^{-8}\), by \(10^{-10}\) to control maximum absolute parameter difference, by \(10^{-6}\) to get parameter estimates to within 0.06 standard error, and by \(10^{-15}\) to achieve gradients \(< 10^{-4}\) in absolute value.
When using BFGS a recommendation is to use a relative tolerance of \(10^{-8}\) to nail down estimates to the extent that it matters precision-wise, and use \(10^{-10}\) to achieve reproducibility.
Recall that BFGS is only appealing when the number of intercepts is large, you don’t need the covariance matrix, and you are not using orm
.
Matrix Inversion
The information matrix (negative hessian) must be inverted to compute the variance-covariance matrix. The default inversion method in R is the solve
function, which defaults to using the LU decomposition. This is a fast algorithm, but the Cholesky decomposition is faster and behaves as well as LU numerically. Another approach uses the QR decomposition, implemented in the qr.solve
function. Let’s compare speed and accuracy of the three approaches when applied to a high-dimensional almost-singular matrix.
# ChatGPT created this function to generate an almost-singular symmetric
# positive definite matrix of dimension p x p
<- function(p, epsilon = 1e-8) {
genas # Generate a random symmetric positive definite matrix
<- matrix(rnorm(p^2), p, p)
A <- A + t(A) + p * diag(p) # Symmetric and positive definite
A
# Modify eigenvalues to make the matrix almost singular
<- eigen(A)
eigen_decomp $values[p] <- epsilon # Set the smallest eigenvalue close to zero
eigen_decomp$vectors %*% diag(eigen_decomp$values) %*% t(eigen_decomp$vectors)
eigen_decomp
}set.seed(3)
<- genas(750, 1e-9)
x 1:3, 1:3] x[
[,1] [,2] [,3]
[1,] 748.051108 -1.0242634 0.3929600
[2,] -1.024263 747.2599805 0.9261079
[3,] 0.392960 0.9261079 745.6809913
# LU method
stime(for(i in 1:3) a <- solve(x))
Elapsed time: 0.645s
mad(x, solve(a)) # reverse the inversion and compare to x
mad relmad
0.0001921486 0.0006909615
m(diag(750) - x %*% a) # compare x inverse * x to identity matrix
[1] 3.083591e-05
# QR
stime(for(i in 1:3) b <- qr.solve(x, tol=1e-13))
Elapsed time: 1.43s
mad(x, qr.solve(b, tol=1e-13))
mad relmad
0.0002050283 0.0007250491
m(diag(750) - x %*% b)
[1] 3.917762e-05
# Cholesky
stime(for(i in 1:3) ch <- chol2inv(chol(x)))
Elapsed time: 0.324s
mad(x, chol2inv(chol(ch))) # mean |difference| and mean relative difference
mad relmad
0.0002323721 0.0008362380
m(diag(750) - x %*% ch) # mean |difference|
[1] 1.71175e-05
QR takes significantly longer and offers no accuracy advantange. Inversion via Cholesky decomposition was almost twice as fast as LU, though both methods took less than \(\frac{1}{4}\) of a second to invert a \(750\times 750\) matrix. Cholesky was a little more accurate in getting the product of the original matrix and its inverse closer to an identity matrix. Cholesky was very slightly worse in recovering the original matrix by inverting its inverse.
One of the reasons lrm.fit
and orm.fit
are more efficient in rms
7.0 is that the entire information matrix is not inverted upon convergence when creating the final fit object. Much use is made of the Matrix
package for efficient storage and computation of sparse matrices, and only the 3 minimal submatrices that make up the information matrix are stored. These are operated on quite generally by the new rms
infoMxop
function, which is called by the vcov
method to invert pieces of the information matrix only as needed. During Newton-type updating, the Matrix
solve
function is used, which is quite fast as it uses sparse representations and does not actually invert the Hession but solves for the inverse of the hessian multiplied by the gradient vector.
Here is an example where only elements 10:20 from the inverse of a 1000 x 1000 matrix are obtained. This type of coding is used in infoMxop
.
# Create a 1000 x 1000 symmetric positive definite matrix
set.seed(1)
<- matrix(rnorm(10000 * 1000), ncol=1000)
x <- crossprod(x)
v <- 10:20 # submatrix of v inverse that we want
i <- length(i)
l <- matrix(0, 1000, l)
w cbind(i, 1:l)] <- 1 # w is all zeros except for 1 in i elements
w[sum(w)
[1] 11
stime(vi <- solve(v, w)[i, , drop=FALSE])
Elapsed time: 0.121s
dim(vi)
[1] 11 11
stime(vi_slow <- solve(v))
Elapsed time: 0.498s
range(vi - vi_slow[i, i])
[1] 0 0
What about inverting the kind of sparse matrices that ordinal models deal with? Let’s build one using the Matrix
package and rms::infoMxop
.
require(Matrix)
<- 200
p <- 20000
k set.seed(1)
<- list(a = cbind(runif(k), runif(k)),
w b = v[1:p, 1:p],
ab= matrix(runif(k * p), k, p) )
stime(z <- infoMxop(w))
Elapsed time: 0.207s
object.size(w)
32641088 bytes
object.size(z)
97282280 bytes
# Sparse representation of intercept components used by infoMxop
<- Matrix::bandSparse(k, k=c(0,1), diagonals=w$a, symmetric=TRUE)
ia object.size(ia)
561720 bytes
# Time to invert this sparse matrix
stime(via <- Matrix::solve(ia))
Elapsed time: 1.922s
dim(via)
[1] 20000 20000
# The inverse of a tri-band diagonal matrix is dense but can be represented efficiently
object.size(via)
383980824 bytes
length(via@x)
[1] 31991591
# Compute size needed if did not make use of sparsity
8 * (p + k) ^ 2
[1] 3264320000
# Get covariate portion of inverted matrix
stime(ub <- infoMxop(w, i='x'))
Elapsed time: 0.371s
# Get the first intercept and beta portion of the inverse
<- c(1, (k + 1) : (k + p))
i stime(u <- infoMxop(w, i=i)) # 26s for i=one element
Elapsed time: 48.352s
dim(u); dim(ub); range(u[-1, -1] - ub)
[1] 201 201
[1] 200 200
[1] -5.247539e-17 5.086263e-17
# Don't try this: infoMxop(w, invert=TRUE))
The specialized method with i='x'
for getting just the portion of the inverse corresponding to the \(\beta\)s is very fast. Otherwise there are speed challenges but the sparse representation does allow the overall inverse to be computed, something not possible with naive matrix calculations.
Now consider execution time for computing the standard errors of predicted values . Consider a predicted value of the form \(\hat{\alpha}_{j} + X\hat{\beta}\) involving a single intercept \(\hat{\alpha}_j\). Let \(V\) be the variance-covariance matrix for \(\theta = [\alpha_{j}, \beta]\) where \(\beta\) is a vector of length \(p\). Let the \(m\times (p + 1)\) matrix \(X\) consisting of a column of ones followed by \(p\) columns of predictor settings. \(m\) is the number of observations for which predictions are sought. Standard errors of interest are square roots of the diagonal of the variance-covariance matrix \(W\) for \(\hat{Y} = X\hat{\theta}\), where \(W = XVX'\) is an \(m\times m\) matrix.
rms::infoMxop
makes computations such as \(W\) efficient using the following strategy. Let \(J\) be a matrix that is mainly zeros but with ones in positions that indicate which elements of \(V^{-1}\) to compute. Let \(I\) denote the information matrix for the entire model, with \(k + p\) rows and columns. Concentrate on computation of \(VX'\). Instead of computing \(I^{-1}\) by itself, compute only the needed elements of it by computing solve(I, J)
to get \(V\). But we quickly want to post-multiply by \(X'\) so use solve(I, JX')
. Let’s see how fast this is when there are 299,999 intercepts.
set.seed(1)
<- 300000; p <- 10; x <- matrix(runif(n * p), ncol=p); y <- 1 : n
n <- length(unique(y)) - 1
k <- orm(y ~ x)
f # Get intercept number corresponding to median of y
<- f$interceptRef
j # Compute all parameter numbers needed
<- c(j, (k + 1) : (k + p))
h <- f$info.matrix
info mdim(info) # show dimensions of submatrices
a b ab
rows 299999 10 299999
columns 2 10 10
# Form covariate values for 2 observations for predicting
<- rbind(c(1, rep(.2, p)), c(1, rep(.6, p)))
X X
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] 1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
[2,] 1 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
# Get VX' the slow way
# system.time(infoMxop(info, invert=TRUE)[h, h] %*% t(X)) 5.4s for k=9999
<- stime(a <- infoMxop(info, i=h, B=t(X))) et
Elapsed time: 0.33s
a
[,1] [,2]
y>=150000 1.933352e-04 -4.675175e-05
x[1] -3.589965e-05 1.195570e-05
x[2] -3.618740e-05 1.220834e-05
x[3] -3.635284e-05 1.205578e-05
x[4] -3.595975e-05 1.195893e-05
x[5] -3.587224e-05 1.212975e-05
x[6] -3.568397e-05 1.180846e-05
x[7] -3.607189e-05 1.206576e-05
x[8] -3.581018e-05 1.194599e-05
x[9] -3.613558e-05 1.194314e-05
x[10] -3.610594e-05 1.212468e-05
stime(infoMxop(info, i=h)) # time required to compute needed submatrix of V
Elapsed time: 0.418s
stime(infoMxop(info, i=100)) # time required to retrieve a single intercept
Elapsed time: 0.311s
stime(infoMxop(info, i='x')) # time to get beta part of v
Elapsed time: 0.134s
0.33 seconds to get the variance of predicted values when there are 299999 intercepts is quite sufficient! The time required to compute the portion of the inverse needed is only 0.1s longer however.
Note that in the source code for lrm
and orm
you’ll see a shortcut for computing the diagonal elements:
<- ncol(X)
nx <- cbind(1, X)
X <- infoMxop(info, i=c(f$interceptRef, (nrp + 1) : (nrp + nx)), B=t(X))
v <- drop(sqrt((t(v) * X) %*% rep(1, nx + 1))) se
What is Fast and What is Slow When \(k\) is Large
For continuous Y when there is a large number \(k\) of intercepts, here is a breakdown of what kind of computations involving ordinal regression models are fast:
- solving for the MLEs
- computing the covariance matrix for the \(\beta\)s alone and using computing it for \(\beta\) and a small number of intercepts
- any assessment that is relative (e.g., odds ratios as opposed to absolute risk estimates)
- contrasts
- Wald \(\chi^2\) tests
- likelihood ratio \(\chi^2\) tests (which don’t require covariance matrices)
- standard errors (SEs) and confidence bands for differences on the link (e.g., logit) scale
- predicted absolute quantities (exceedance probabilities, cell probabilities, means, quantiles) without SEs or confidence intervals
Slow operations for very large \(k\), e.g., \(k > 10000\):
- computing the covariance matrix for all the intercepts or for all parameters combined
For some computations it will be faster to bootstrap the model fit rather than to compute SEs and CLs. The rms
bootcov
function does this efficiently since it uses lrm.fit
or orm.fit
with compstats=FALSE
to streamline the computation of MLEs from bootstrap samples.
bootcov
needs all the intercepts to be represented in all bootstrap samples. To minimally group Y values to make this happen, see the new rms
ordGroupBoot
function.But for some “absolute” computations, run time is still exceptionally fast for large \(k\), because the entire information matrix does not need inverting, but instead the inverse is multiplied by a vector as was done above, so that solve
can be used to quickly solve a system of equations instead of fully inverting. As an example let’s time the calculation of the estimation of mean Y without confidence limits, and then with them, for \(k=15000\). The Mean
function uses the \(\delta\)-method to estimate the needed standard error for the normal approximation for the confidence interval for a covariate-specific population mean..
set.seed(1)
<- 15000
n <- 1:n
y <- rnorm(n)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- datadist(x1, x2, x3); options(datadist='dd')
dd <- orm(y ~ x1 + x2 + x3, x=TRUE)
f <- data.frame(x1=0, x2=0, x3=0)
d <- predict(f, d, type='x') # need original design matrix for accurate CLs of means
X X
x1 x2 x3
1 0 0 0
<- Mean(f)
M stime(print(M(predict(f, d))))
1
7500.703
Elapsed time: 0.022s
stime(print(M(predict(f, d), conf.int=0.95, X=X)))
1
7500.703
attr(,"limits")
attr(,"limits")$lower
1
7431.4
attr(,"limits")$upper
1
7570.006
Elapsed time: 0.043s
stime(print(Predict(f, x1=0, x2=0, x3=0, fun=M)))
x1 x2 x3 yhat lower upper
1 0 0 0 7500.703 7431.4 7570.006
Response variable (y):
Limits are 0.95 confidence limits
Elapsed time: 0.049s
This is fast because the limiting step is like this, inside the M
R function:
<- f$info.matrix
info mdim(info) # show dimensions of 3 submatrices
a b ab
rows 14999 3 14999
columns 2 3 3
<- sum(dim(info$ab))
np # total no. parameters = # rows and cols of info matrix np
[1] 15002
# Multiply info inverse times B
stime(infoMxop(info, B=matrix(runif(np), ncol=1)))
Elapsed time: 0.009s
Other Resources
- Video by Richad McElreath explaining numerical accuracy issues in log likelihood calculations.
- R CRAN Task Views on optimizers
- Vignettes by John Nash et al.
- R
ucminf
package - R
nloptr
package - R code:
lrm
andlrm.fit
- Fortran code:
lrmll
Computing Environment
::cite_packages(pkgs='Session', output='paragraph', out.dir='.',
gratefulcite.tidyverse=FALSE, omit=c('grateful', 'ggplot2'))
We used R version 4.4.0 (R Core Team 2024) and the following R packages: Hmisc v. 5.2.2 (Harrell Jr 2025a), Matrix v. 1.7.1 (Bates, Maechler, and Jagan 2024), orms v. 1.0.0 (Harrell Jr 2010), rms v. 7.0.0 (Harrell Jr 2025b), survival v. 3.8.3 (Terry M. Therneau and Patricia M. Grambsch 2000; Therneau 2024).
The code was run on macOS 15.2 on a Macbook Pro M2 Max, running on a single core.