Chapter 1 Introduction
Inferential statistics consists of two primary components: estimation and prediction. Estimation refers to the process by which we infer the values, properties, and behavior of individual parameters in our model. Prediction, on the other hand, refers to the process by which we draw inference about an event that has not yet occurred. The latter component is abundant in industry, where the demand to predict stock prices, the movies individuals will prefer, and even new friend connections in a network has sparked huge investments in research and given rise to research groups whose primary goal is to push the boundaries of predictive modeling.
Though most attention is directed towards prediction, these two components are not distinct. In fact, the success and reliability of many predictive models are contingent upon good, efficient parameter estimation. Many of these models require the estimation of a precision matrix, the inverse of the covariance matrix (frequently denoted \(\Omega\)), which describes the dependence structure among the random variables. For this reason, the last decade has seen an ever-expanding community devoted to precision matrix estimation.
Among this community of researchers are Professor Adam Rothman and Aaron Molstad, Ph.D., whose research will be a focal point of this manuscript. The two have published work on indirect multivariate response linear regression (Molstad and Rothman 2016) and classification with matrix-valued predictors (Molstad and Rothman 2018), but the focus of this manuscript is their 2017 paper titled Shrinking Characteristics of Precision Matrix Estimators (Molstad and Rothman 2017). In it, they outline a framework for shrinking a characteristic of a precision matrix, a concept that exploits the fact that in many predictive models the precision matrix is needed only through its product with another quantity. They write in their manuscript that “to fit many predictive models, only a characteristic of the population precision matrix needs to be estimated… In binary linear discriminant analysis, the population precision matrix is needed for prediction only through the product of the precision matrix and the difference between the two conditional distribution mean vectors.” The research detailed here began with the desire to expand on this concept and to explore avenues that were mentioned but not further investigated.
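To make the quoted example concrete, the following minimal sketch computes that characteristic on simulated data. All objects here (the dimension, the class means, the covariance matrix) are illustrative and not taken from the paper; the point is only that prediction requires \(\Omega\) exclusively through the product \(\Omega(\mu_{1} - \mu_{2})\).

```r
set.seed(1)
p <- 5
mu1 <- rnorm(p)                                          # class 1 mean vector
mu2 <- rnorm(p)                                          # class 2 mean vector
Sigma <- crossprod(matrix(rnorm(20 * p), 20, p)) / 20    # a valid covariance matrix
Omega <- solve(Sigma)                                    # the precision matrix

# The only characteristic of Omega needed for binary LDA prediction:
w <- Omega %*% (mu1 - mu2)

# Classify a new observation x by the sign of its discriminant score (equal priors)
x <- rnorm(p)
score <- crossprod(x - (mu1 + mu2) / 2, w)
ifelse(score > 0, "class 1", "class 2")
```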
One of the research directions mentioned in the original paper was the application of their framework to regression. By exploiting the fact that the population regression coefficient matrix satisfies \(\beta \equiv \Omega_{x}\Sigma_{xy}\), where \(\Sigma_{xy}\) is the cross-covariance matrix between the predictors \(X\) and the responses \(Y\) and \(\Omega_{x}\) is the precision matrix of \(X\), their framework allows for the simultaneous estimation of \(\beta\) and \(\Omega_{x}\) with an embedded assumption potentially useful for superior prediction performance. In close communication and collaboration with Professor Rothman, we wanted to explore this research direction further. However, in order to build upon their work and contribute new material, a number of concepts needed to be learned along the way, and this document will follow that journey.
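As a sanity check of this identity, the sketch below recovers the ordinary least squares coefficient matrix from sample moments on simulated data. The variable names are illustrative and this is not the SCPME estimator itself, only the population relationship evaluated with sample estimates.

```r
set.seed(1)
n <- 100; p <- 4; r <- 2
X <- matrix(rnorm(n * p), n, p)
B <- matrix(rnorm(p * r), p, r)                       # true coefficient matrix
Y <- X %*% B + matrix(rnorm(n * r, sd = 0.1), n, r)

Sigma_x  <- cov(X)      # sample covariance of the predictors
Sigma_xy <- cov(X, Y)   # sample cross-covariance between predictors and responses
Omega_x  <- solve(Sigma_x)

beta_hat <- Omega_x %*% Sigma_xy   # beta = Omega_x %*% Sigma_xy

# Matches the (centered) least squares fit up to numerical error
all.equal(unname(beta_hat), unname(coef(lm(Y ~ X))[-1, ]))
```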
We will begin Chapter 2 with a brief introduction to precision matrix estimation and the Gaussian log-likelihood function. This section will mention popular estimation methods and algorithms, but most discussion will be directed towards the ADMM algorithm. Discussion of the ADMM algorithm will be useful as we begin detailing the shrinking characteristics of precision matrix estimators framework (referred to as SCPME), the so-called augmented ADMM algorithm, and later the framework’s application to regression. Lastly, the document will end with two brief tutorials for the R packages ADMMsigma and SCPME. I developed these packages to aid in simulation experiments and to make it easier to branch into related research directions; both have since been published on CRAN.
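Assuming both packages remain available on CRAN under these names, installation and loading follow the usual pattern; the tutorials in the final chapter cover their functions in detail.

```r
# Install from CRAN (one time only), then load
install.packages(c("ADMMsigma", "SCPME"))

library(ADMMsigma)   # penalized precision matrix estimation via ADMM
library(SCPME)       # shrinking characteristics of precision matrix estimators
```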
1.0.1 Notation and Definitions
For strictly positive integers \(n\) and \(p\), we will denote by \(\mathbb{R}^{n \times p}\) the class of real matrices with dimension \(n \times p\). The class of real, symmetric matrices with dimension \(p \times p\) will be denoted by \(\mathbb{S}^{p}\), and by \(\mathbb{S}^{p}_{+}\) if we further require the matrices to be positive definite. The sample size and the dimension of the predictor vector in a given data set will most often be denoted by \(n\) and \(p\), respectively. If the dimension of the response vector exceeds one, we will denote it by \(r\).
Most matrices will take the form of either \(\Sigma\), the population covariance matrix, or \(\Omega\), the population precision matrix. Note that the precision matrix is simply the inverse of the covariance matrix (\(\Omega \equiv \Sigma^{-1}\)), and a subscript may be added to each when more than one random vector is considered in a problem (e.g., \(\Omega_{x}\)). A subscript star may also be added if the object is oracle, that is, known a priori (\(\Omega_{*}\)). The estimator of such an object that optimizes a pre-specified objective function will be denoted with a hat (\(\hat{\Omega}\)).
There will be significant matrix algebra notation throughout the manuscript. The trace operator, which sums the diagonal elements of a matrix, will take the form \(tr\left(\cdot\right)\), and the exponential trace operator will be denoted similarly as \(etr\left(\cdot\right)\). The vector operator, \(vec\left(\cdot\right)\), stacks the columns of a matrix into a single column vector. The determinant of a matrix \(\mathbf{A}\) will be denoted as \(\left|\mathbf{A}\right|\) but may also take the form \(det\left(\mathbf{A}\right)\). The Kronecker product of two matrices \(\mathbf{A}\) and \(\mathbf{B}\) will be denoted as \(\mathbf{A} \otimes \mathbf{B}\), and the element-wise (Hadamard) product will be denoted as \(\mathbf{A} \circ \mathbf{B}\). Lastly, the Frobenius norm is defined as \(\left\|\mathbf{A}\right\|_{F} := \sqrt{\sum_{i, j}\mathbf{A}_{ij}^{2}}\) and we will define \(\left\|\mathbf{A}\right\|_{1} := \sum_{i, j}\left|\mathbf{A}_{ij}\right|\), where the \(i\)-\(j\)th element of a matrix \(\mathbf{A}\) is denoted as \(\left(\mathbf{A}\right)_{ij}\) or simply \(\mathbf{A}_{ij}\).
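For readers who prefer to see the notation computationally, the following base R sketch evaluates each operator on small, arbitrary example matrices.

```r
A <- matrix(c(2, 1, 1, 3), 2, 2)   # a symmetric positive definite matrix
B <- matrix(1:4, 2, 2)

solve(A)              # the inverse; the precision matrix when A plays the role of Sigma
sum(diag(A))          # tr(A): sum of the diagonal elements
exp(sum(diag(A)))     # etr(A): the exponential trace
c(A)                  # vec(A): columns stacked into one column vector
det(A)                # |A|: the determinant
kronecker(A, B)       # the Kronecker product (also A %x% B)
A * B                 # the element-wise (Hadamard) product
norm(A, type = "F")   # Frobenius norm: square root of the sum of squared entries
sum(abs(A))           # the element-wise 1-norm defined above
```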
References
Molstad, Aaron J., and Adam J. Rothman. 2016. “Indirect Multivariate Response Linear Regression.” Biometrika 103 (3): 595–607.
Molstad, Aaron J., and Adam J. Rothman. 2017. “Shrinking Characteristics of Precision Matrix Estimators.” Biometrika.
Molstad, Aaron J., and Adam J. Rothman. 2018. “A Penalized Likelihood Method for Classification with Matrix-Valued Predictors.” Journal of Computational and Graphical Statistics: 1–12.