Some notes for matrix algebra.

Basics of matrix

  • Trace of matrix: square matrix

  • Null matrix

  • $Rank(A)=dim(ran(A))$, dimension of range space of the matrix

  • Hadamard products: $A \cdot B = $

  • Products with diag(): premultiplication weights the rows and post multiplication weights the column

  • Present a graph using matrix

  • Special matrix

  • Projection of a vector: least square is actually a projection. we have the form $\hat{y}=X(X^T X)^{-1}X^T y$, consider a space $S$ which is spanned by columns of $X$, least square project the vector $y$ onto the space $S$ and the projection in $S$ space is $\hat{y}$

  • Span of a set of vectors $(v_1,v_2,…,v_n)$ is all the linear combinations of vectors within the set

  • Range space of a matrix = column space of a matrix, if $A=[a_1,a_2,…,a_n]$, then colsp $(A)$ = span $(a_1,a_2,…,a_n)$

  • Given a matrix $A$, the action of $Ax$ will return a linear combination of columns of $A$ weighted by the coordinates of $x$, thus $y=Ax$ must be in the column space of $A$.

  • The columns of a matrix spans the column space, but they are not necessarily linear independent, some subset of them can form a basis (as for basis we mean linear independent). The dimension of the column space is the Rank of the matrix.

  • Rank can also be interpreted as the number of pivots in the reduced row echelon form, which is the maximum number of linearly independent columns that can be chosen.

  • Work with cells

  • Matrix operation: center, scale, reorder

  • Center: for matrix $A$, matrix $\bar{J}$ can derive the colMeans and rowMeans of $A$. colMeans by pre-multiply and rowMeans by post-multiply. Thus we can easily get the centering matrix for any arbitrary matrix, which is $C=I-\bar{J}_n=I-\frac{1}{n}J$

  • Scale: if we want to scale a matrix, it can be achieved by multiplying by a diagonal matrix $D$, scale columns by post-multiply, rows by pre-multiply. Thus for a given matrix $A$ if we want to center and scale it, we can first get the diagonal matrix $D$ with diagonals being the standard deviation of each column and then multiple with $CA$ which is $A^*=CAD$ where $C=I-\frac{1}{n}J$, $D=diag(\frac{1}{sd_1}, \frac{1}{sd_2},…\frac{1}{sd_p})$. $A^* \in n \times p$.

  • Reorder: pre-multiple row exchange; post-multiple column exchange.

  • Idempotent: $H=X(X’X)^{-1}X’$ is idempotent, the centering matrix $C=I-\frac{1}{n}J$ is also idempotent, $(I-H)y$ which is the residual is orthogonal to $X$.

  • Quadratic form: $x’Ax$ to be simple, quadratic form is the weighted sum of products. usually we denote $A$ as a symmetric matrix. it is very useful in statistics for instance, the distribution of quadratic from with x from normal distribution follows $\chi ^2$ distribution; $X’ \Sigma X$ is the weighted sum of square etc.

  • Positive definite: we can easily verify that even if $A$ is semi-p.d., $(A+I)$ will be p.d. this property is very important for making a matrix invertible. In a linear regression model, if the design matrix $X$ has dimension with $p\times n$, $X’X$ is singular and not invertible. thus we will use the form $(A+\lambda I)$ to make it invertible. In shrinkage method like ridge regression or penalized polynomial we use this form a lot too. Other ways of solving this problem is to use the generalized inverse.

  • Orthogonal matrix: orthogonal in columns $A’A=I$, orthogonal in rows $AA’=I$

  • Why row/column operations do not change the determinant of a matrix? row/column operations can be achieved by pre-multiple (row operation) or post-multiple (column operation) a elementary operator matrix $E$. the determinant of $E$ is always equal to 1. So the determinant will not change by row/column operation.

  • Row/column exchange (odd times) will change the sign of the determinant.

  • The inverse of matrix comes from solving system of linear equations. For instance ordinary least square is obtained by $\hat{\beta}_{ols}=argmin_{\beta} (y-x\beta)’(y-x\beta) $, take derivative with respect to each $\beta_i$ gives the equation $(X’X)\beta=X’y$

  • Property of projection matrix: given arbitrary vector $a$, a projection matrix $P$ can project $a$ to the space spanned by the column/row vectors of matrix $P$. Thus $PPa = Pa$ should holds for a projection matrix. [An idempotent matrix can be projection matrix]

  • Full-rank factorization: find the LIN (linear independent) columns of a matrix $A$ and represent that matrix by a full rank matrix $R$ multiple a transform matrix $T$ [ie: $A=RT$] which means that matrix $A$ is the linear combinations of the full rank matrix $R$.

  • Elementary operator matrix:
    • i) $E$, interchange rows/colums;

    • ii)$R$, diagonal matrix, scale rows/columns;

    • iii)$P$, add scale of a row/column to another row/column.

  • Canonical Form:
    • i) the initial concept of canonical form can be derived from finding the rank of a matrix $A$. First we do row operations to make $A$ a upper triangular matrix. this can be done by pre-multiply $A$ by a set of elementary operator matrices $E_1E_2E_3…$. Then we do column operation in which we post-multiply $A$ by a set of elementary operator matrices $E_1E_2E_3…$ getting a diagonal matrix $D$.The row/column number of that diagonal matrix is the rank of $A$ which is $r(A)$. It can be written as $D=PAQ$ Further if we multiple on both side by scaling matrix after that we can get a identity matrix $C$ which is denoted as canonical form under equivalence. This can be written as $C=R_1AR_2$ where $R_1$ and $R_2$ are multiplication of a set of elementary operator matrices.

    • ii) From $D=PAQ$ and the invertibility of $P\&Q$, we can also derive the form $A=P^{-1}DQ^{-1}$ which is the foundation for matrix decomposition.

    • iii) also if we get identity matrix at the end, $A=KL$ will yields the full rank factorization of a matrix.

Matrix decomposition

  • Singular value decomposition

$G=UDV’$ where $U$, $V$ are orthonormal matrix;


The inverse of a diagonal matrix is easy to calculate.

  • Eigen value decomposition

$G=Q \Lambda Q’=Q \Lambda Q^{-1}$

$G^{-1}=(Q \Lambda Q’)^{-1}=Q \Lambda^{-1} Q’$

The diagonal elements of $\Lambda$ is the eigen values of $G$

  • Cholesky decomposition

$G=LL’$ where $L$ is a triangular matrix


It is also easy to calculate the inverse of a triangular system.

  • Scaling does not change the rank of a matrix, but centering will change the rank of a matrix. To think intuitively, we subtract a mean matrix from the original matrix which will decrease the degree of freedom of the original matrix.Thus the rank of centered matrix $G_{(c)}$ will be $n-1$. but after centering the matrix becomes singular, how can we calculate the inverse? we solve it by adding a small scale to the diagonal elements of the $G_{(c)}$, which is $(G+\lambda I)$. Then the matrix have an inverse.

  • Inverse: $AA^{-1}=A^{-1}A=I$.

  • Generalized Inverse: $AA^{-}\not=I, A^{-}A\not=I$, $AGA=A$.

  • linear equations: for linear equation $Ax=y$, the existence of solution of the linear equation is defined by the linear relationship between rows of $A$ should be the same as between the rows of $y$, i.e., if we add one $y$ as a column to $A$, the rank will not change.

  • All possible solutions can be written as $\tilde{x}=Gy+(GA-I)z$ where G is the generalized inverse of $A$.

  • For least square, generalized inverse generate different solutions for $\beta$, the estimate is not stable, however there is invariant property of those $\beta$ because some linear combination of $\beta$ is always a constant, ie $a’\beta=C$, that is called estimable function in which case even though solution of $\beta$ is not unique, the function is still estimable.

  • one case is the fitted value for $y$. we know that $\beta$ is not unique since $\hat{\beta}=(X’X)^{-}X’y$ and $(X’X)^{-}$ is not unique, however, the fitted value won’t change for all $\beta$ as $\hat{y}=X(X’X)^{-}X’y$ and the hat matrix $H=X(X’X)^{-}X’$ is always equals to $X_1(X_1’X_1)^{-1}X_1’$ where $X_1$ is the submatrix of $X$ with full column rank, i.e., $X=[X_1,X_2]$. Thus even if $\beta$ is not unique, the fitted value is still estimable.

  • $\hat{y}=Hy=X(X’X)^{-}X’y$ is actually projection $y$ on vector spaces of $X$, where $H$ is the projection matrix.

  • $\hat{y_i}=\sum_{j=1}^{n}{H[i,j]y[j]}=H[i,i]y[i]+\sum_{k \neq i}{H[i,k]y[k]}$ means the fitted value of $y_i$ is the weighted sum of all the observations, weighted on itself plus information from other observations.

Partition Matrix

PM include properties of partition matrix (det & inverse) and some special operations.

  • Given a partition matrix $P$, We study its properties by decomposing it into two triangular matrices $P=LU$. Since the determinant and inverse of a lower(upper) triangular matrix is easy to calculate, it will help us find the det/inverse of the partition matrix

  • $E(X_1 | X_2) = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (X_2-\mu_2)$

  • The expectation of $X_1$ given $X_2$ can be considered as regression between $\mu_1$ and $\mu_2$. It is the mean of $X_1$ plus a deviation of $X_1$ apart from $\mu_1$ scaled by the variance and covariance between $X_1$ and $X_2$.

  • Schur complements: $A-BD^{-1}C$ schur complement to $D$ or $D-CA^{-1}B$ schur complement to $A$ for a matrix partitioned as components $A,B//C,D$

  • $Var(X_1 | X_2) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1}\Sigma_{21}=\Sigma_{11}(I-\frac{\Sigma_{12} \Sigma_{21}}{\Sigma_{11} \Sigma_{22}})=\Sigma_{11}(1-r^2)$

  • The variance of $X_1$ conditional on $X_2$ can be seen as the variance explained by $X_1$ minus the variance explained by the regression of $X_1$ conditional on $X_2$.

  • Linear space: we can project higher dimensional space to lower dimensional space but we cannot go from lower dimensional space to higher dimensional space.

Eigen Value and Eigen vectors

  • The difference between EVD and SVD is that EVD only works for square matrix while SVD works for arbitrary matrix.

  • SVD: $X=UDV’$. $U$ is the orthonormal basis for row spaces and $V’$ span the column space of $X$. Think of $XX’$ and $X’X$. $XX’=UD^{2}U$ is correlation matrix among individuals (rows). $X’X=VD^{2}V’$ is correlation matrix among predictors (columns). It means that in order to describe rows we only need info about $U$ while to describe columns we need $V$.

  • Singular values are the square root of the eigen values. If symmetric, $U$ and $V$ will be the same since the vector space for rows and columns are the same.

Differential Operators

  • Dimension will increase when we take derivative with respect to vector variables: the derivative of a scalar is a vector and the derivative of a vector will be a matrix.

  • .