Matrix Algebra
Some notes on matrix algebra.
Basics of matrices

Trace of a matrix: defined for a square matrix as the sum of its diagonal elements, $tr(A)=\sum_i a_{ii}$

Null matrix: a matrix whose entries are all zero

$rank(A)=dim(ran(A))$: the rank is the dimension of the range space of the matrix

Hadamard product: $(A \circ B)_{ij} = a_{ij} b_{ij}$, the element-wise product of two matrices of the same dimensions

Products with diag(): premultiplication by a diagonal matrix weights the rows and postmultiplication weights the columns
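
A minimal numpy sketch (the matrices and weights are illustrative, not from the notes) of this row/column weighting:

```python
import numpy as np

# Premultiplying by a diagonal matrix rescales rows;
# postmultiplying rescales columns.
A = np.arange(1, 7, dtype=float).reshape(2, 3)
d_row = np.diag([10.0, 100.0])        # weights for the 2 rows
d_col = np.diag([1.0, 0.5, 2.0])      # weights for the 3 columns

print(d_row @ A)   # row i of A multiplied by d_row[i, i]
print(A @ d_col)   # column j of A multiplied by d_col[j, j]
```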

Representing a graph using a matrix

Special matrices

Projection of a vector: least squares is actually a projection. We have the form $\hat{y}=X(X^T X)^{-1}X^T y$; consider the space $S$ spanned by the columns of $X$. Least squares projects the vector $y$ onto the space $S$, and the projection in $S$ is $\hat{y}$.
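
A short numpy sketch (random $X$ and $y$, purely illustrative) showing that applying the hat matrix to $y$ gives the same result as the least-squares fit:

```python
import numpy as np

# Least squares as a projection of y onto the column space of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix
y_hat = H @ y                          # projection of y onto span(X)

# y_hat agrees with the usual least-squares fitted values
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat, X @ beta))    # True
```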

Span of a set of vectors $(v_1,v_2,\dots,v_n)$: all the linear combinations of the vectors within the set

Range space of a matrix = column space of a matrix: if $A=[a_1,a_2,\dots,a_n]$, then colsp $(A)$ = span $(a_1,a_2,\dots,a_n)$

Given a matrix $A$, the action of $Ax$ will return a linear combination of columns of $A$ weighted by the coordinates of $x$, thus $y=Ax$ must be in the column space of $A$.

The columns of a matrix span the column space, but they are not necessarily linearly independent; some subset of them can form a basis (a basis requires linear independence). The dimension of the column space is the rank of the matrix.

Rank can also be interpreted as the number of pivots in the reduced row echelon form, which is the maximum number of linearly independent columns that can be chosen.

Work with cells

Matrix operations: center, scale, reorder

Center: for a matrix $A$, the matrix $\bar{J}_n=\frac{1}{n}J$ (with $J$ a matrix of ones) produces the colMeans of $A$ by premultiplication and the rowMeans by postmultiplication. Thus we can easily get the centering matrix for an arbitrary matrix, which is $C=I-\bar{J}_n=I-\frac{1}{n}J$

Scale: scaling a matrix can be achieved by multiplying by a diagonal matrix $D$: columns are scaled by postmultiplying, rows by premultiplying. Thus, to center and scale a given matrix $A$, we first form the diagonal matrix $D$ whose diagonal entries are the reciprocals of the column standard deviations and then multiply it with $CA$, which is $A^*=CAD$ where $C=I-\frac{1}{n}J$, $D=diag(\frac{1}{sd_1}, \frac{1}{sd_2},\dots,\frac{1}{sd_p})$, and $A^* \in \mathbb{R}^{n \times p}$.
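
A numpy sketch (random data, dimensions illustrative) checking that $A^*=CAD$ matches direct column standardization:

```python
import numpy as np

# Center and scale A via A* = C A D, with C = I - (1/n) J and
# D = diag(1/sd_1, ..., 1/sd_p), compared against direct standardization.
rng = np.random.default_rng(1)
n, p = 6, 3
A = rng.normal(size=(n, p))

C = np.eye(n) - np.ones((n, n)) / n        # centering matrix
D = np.diag(1.0 / A.std(axis=0))           # reciprocal column standard deviations
A_star = C @ A @ D

A_direct = (A - A.mean(axis=0)) / A.std(axis=0)
print(np.allclose(A_star, A_direct))       # True
```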

Reorder: premultiply (by a permutation matrix) to exchange rows; postmultiply to exchange columns.

Idempotent: $H=X(X'X)^{-1}X'$ is idempotent, and the centering matrix $C=I-\frac{1}{n}J$ is also idempotent; $(I-H)y$, the residual, is orthogonal to the columns of $X$.
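
A quick numpy check (random $X$ and $y$ for illustration) of the idempotency of $H$ and $C$ and of the orthogonality of the residual to $X$:

```python
import numpy as np

# H = X (X'X)^{-1} X' and C = I - (1/n) J are idempotent,
# and the residual (I - H) y is orthogonal to the columns of X.
rng = np.random.default_rng(2)
n, p = 15, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
C = np.eye(n) - np.ones((n, n)) / n

print(np.allclose(H @ H, H))                        # H is idempotent
print(np.allclose(C @ C, C))                        # C is idempotent
print(np.allclose(X.T @ (np.eye(n) - H) @ y, 0))    # residual orthogonal to X
```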

Quadratic form: $x'Ax$. Put simply, a quadratic form is a weighted sum of products, and $A$ is usually taken to be symmetric. It is very useful in statistics: for instance, the quadratic form of a standard normal $x$ with an idempotent $A$ follows a $\chi^2$ distribution, and $x' \Sigma x$ is a weighted sum of squares, etc.

Positive definite: we can easily verify that even if $A$ is only positive semi-definite, $(A+I)$ will be positive definite. This property is very important for making a matrix invertible. In a linear regression model, if the design matrix $X$ is $n\times p$ with more predictors than observations ($p>n$), $X'X$ is singular and not invertible, so we use the form $(A+\lambda I)$ to make it invertible. Shrinkage methods such as ridge regression or penalized polynomials use this form a lot too. Another way of handling this problem is to use a generalized inverse.
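
A minimal sketch (dimensions are illustrative) of the ridge-style fix: with $p>n$ the cross-product $X'X$ is singular, but $X'X + \lambda I$ is invertible:

```python
import numpy as np

# With p > n, X'X is rank-deficient, but adding lambda * I makes it invertible.
rng = np.random.default_rng(3)
n, p = 5, 8                                  # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X                                # p x p, rank at most n < p
print(np.linalg.matrix_rank(XtX))            # 5 -> singular

lam = 0.1
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)  # now solvable
print(beta_ridge.shape)                      # (8,)
```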

Orthogonal matrix: orthonormal columns give $A'A=I$; orthonormal rows give $AA'=I$

Why do some row/column operations not change the determinant of a matrix? A row/column operation can be achieved by premultiplying (row operation) or postmultiplying (column operation) by an elementary operator matrix $E$. For the operation that adds a multiple of one row/column to another, $det(E)=1$, so the determinant is unchanged; scaling a row/column by $c$ multiplies the determinant by $c$.

Row/column exchange (odd times) will change the sign of the determinant.

The inverse of a matrix arises from solving systems of linear equations. For instance, ordinary least squares is obtained from $\hat{\beta}_{ols}=argmin_{\beta} (y-X\beta)'(y-X\beta)$; taking the derivative with respect to each $\beta_i$ gives the normal equations $(X'X)\beta=X'y$

Property of a projection matrix: given an arbitrary vector $a$, a projection matrix $P$ projects $a$ onto the space spanned by the column/row vectors of $P$. Thus $PPa = Pa$ should hold for a projection matrix. [An idempotent matrix can be a projection matrix.]

Full-rank factorization: find the linearly independent (LIN) columns of a matrix $A$ and represent the matrix as a full-rank matrix $R$ multiplied by a transformation matrix $T$ [i.e., $A=RT$], which means that matrix $A$ consists of linear combinations of the columns of the full-rank matrix $R$.
 Elementary operator matrices (see the sketch after this list):

i) $E$: interchange rows/columns;

ii) $R$: a diagonal matrix that scales rows/columns;

iii) $P$: add a multiple of one row/column to another row/column.
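
A small numpy sketch (the 3×3 matrices are illustrative) of the three elementary operators acting on rows by premultiplication, together with their determinants, which also illustrates the determinant remarks above:

```python
import numpy as np

# The three elementary operator matrices, applied to A by premultiplication
# so they act on rows; their determinants are -1, 3, and 1.
A = np.arange(9, dtype=float).reshape(3, 3)

E = np.eye(3)[[1, 0, 2]]          # i)  interchange the first two rows
R = np.diag([1.0, 3.0, 1.0])      # ii) scale the second row by 3
P = np.eye(3)
P[2, 0] = 5.0                     # iii) add 5 * (first row) to the third row

print(E @ A)                      # first two rows swapped
print(R @ A)                      # second row scaled by 3
print(P @ A)                      # third row plus 5 * first row
print(np.linalg.det(E), np.linalg.det(R), np.linalg.det(P))  # -1.0, 3.0, 1.0
```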

 Canonical Form:

i) The initial concept of a canonical form can be derived from finding the rank of a matrix $A$. First we do row operations to make $A$ an upper triangular matrix; this can be done by premultiplying $A$ by a set of elementary operator matrices $E_1E_2E_3\dots$. Then we do column operations, postmultiplying $A$ by a set of elementary operator matrices, to get a diagonal matrix $D$. The number of nonzero diagonal entries of $D$ is the rank of $A$, $r(A)$. This can be written as $D=PAQ$. Further, if we multiply on both sides by scaling matrices, we get a matrix $C$ with an identity block of size $r(A)$, denoted the canonical form under equivalence. This can be written as $C=R_1AR_2$ where $R_1$ and $R_2$ are products of elementary operator matrices.

ii) From $D=PAQ$ and the invertibility of $P$ and $Q$, we can also derive the form $A=P^{-1}DQ^{-1}$, which is the foundation for matrix decomposition.

iii) Also, if we obtain the identity block at the end, $A=KL$ yields the full-rank factorization of the matrix.

Matrix decomposition
 Singular value decomposition
$G=UDV'$ where $U$ and $V$ are orthonormal matrices;
$G^{-1}=(UDV')^{-1}=VD^{-1}U'$
The inverse of a diagonal matrix is easy to calculate.
 Eigenvalue decomposition
$G=Q \Lambda Q'=Q \Lambda Q^{-1}$
$G^{-1}=(Q \Lambda Q')^{-1}=Q \Lambda^{-1} Q'$
The diagonal elements of $\Lambda$ are the eigenvalues of $G$
 Cholesky decomposition
$G=LL'$ where $L$ is a lower triangular matrix
$G^{-1}=(LL')^{-1}=(L')^{-1}L^{-1}$
It is also easy to calculate the inverse of a triangular system.
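
A numpy sketch (a random symmetric positive definite $G$, purely illustrative) computing $G^{-1}$ through each of the three decompositions above and checking they agree:

```python
import numpy as np

# Inverse of a symmetric positive definite G via SVD, EVD, and Cholesky.
rng = np.random.default_rng(4)
B = rng.normal(size=(4, 4))
G = B @ B.T + 4 * np.eye(4)            # symmetric positive definite

# SVD: G = U D V'  ->  G^{-1} = V D^{-1} U'
U, d, Vt = np.linalg.svd(G)
G_inv_svd = Vt.T @ np.diag(1.0 / d) @ U.T

# EVD: G = Q Lambda Q'  ->  G^{-1} = Q Lambda^{-1} Q'
lam, Q = np.linalg.eigh(G)
G_inv_evd = Q @ np.diag(1.0 / lam) @ Q.T

# Cholesky: G = L L'  ->  G^{-1} = (L')^{-1} L^{-1}
L = np.linalg.cholesky(G)
L_inv = np.linalg.inv(L)
G_inv_chol = L_inv.T @ L_inv

G_inv = np.linalg.inv(G)
print(np.allclose(G_inv_svd, G_inv),
      np.allclose(G_inv_evd, G_inv),
      np.allclose(G_inv_chol, G_inv))   # True True True
```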

Scaling does not change the rank of a matrix, but centering will. To think about it intuitively, we subtract a mean matrix from the original matrix, which reduces the degrees of freedom of the original matrix, so the rank of the centered matrix $G_{(c)}$ will be $n-1$. But after centering the matrix becomes singular, so how can we calculate the inverse? We solve it by adding a small value to the diagonal elements of $G_{(c)}$, i.e., $(G_{(c)}+\lambda I)$. Then the matrix has an inverse.

Inverse: $AA^{-1}=A^{-1}A=I$.

Generalized inverse $G=A^{-}$: $AA^{-}\neq I$ and $A^{-}A\neq I$ in general, but $AGA=A$.

Linear equations: for the linear system $Ax=y$, a solution exists when the linear relationships among the rows of $A$ also hold among the elements of $y$, i.e., appending $y$ as a column to $A$ does not change the rank.

All possible solutions can be written as $\tilde{x}=Gy+(GA-I)z$, where $G$ is a generalized inverse of $A$ and $z$ is arbitrary.

For least squares, different generalized inverses generate different solutions for $\beta$, so the estimate is not stable. However, there is an invariance property: some linear combinations of $\beta$ are always constant, i.e., $a'\beta=C$; these are called estimable functions. Even though the solution for $\beta$ is not unique, such functions are still estimable.

One case is the fitted value for $y$. We know that $\beta$ is not unique, since $\hat{\beta}=(X'X)^{-}X'y$ and $(X'X)^{-}$ is not unique; however, the fitted value does not change across the possible $\beta$, because $\hat{y}=X(X'X)^{-}X'y$ and the hat matrix $H=X(X'X)^{-}X'$ always equals $X_1(X_1'X_1)^{-1}X_1'$, where $X_1$ is a submatrix of $X$ with full column rank, i.e., $X=[X_1,X_2]$. Thus even if $\beta$ is not unique, the fitted value is still estimable.
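
A numpy sketch (a deliberately rank-deficient design, made up for illustration) showing that two different generalized inverses of $X'X$ give different $\beta$'s but identical fitted values:

```python
import numpy as np

# Rank-deficient design: the third column is 2 * the second, so rank(X) = 2.
rng = np.random.default_rng(5)
n = 12
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, 2 * x])
y = rng.normal(size=n)

XtX = X.T @ X
G1 = np.linalg.pinv(XtX)                     # Moore-Penrose g-inverse
G2 = np.zeros((3, 3))                        # another g-inverse: invert a
G2[:2, :2] = np.linalg.inv(XtX[:2, :2])      # full-rank 2x2 block, zeros elsewhere
print(np.allclose(XtX @ G2 @ XtX, XtX))      # True: G2 satisfies A G A = A

beta1 = G1 @ X.T @ y
beta2 = G2 @ X.T @ y
print(np.allclose(beta1, beta2))             # False: beta is not unique
print(np.allclose(X @ beta1, X @ beta2))     # True: fitted values are estimable
```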

$\hat{y}=Hy=X(X'X)^{-}X'y$ is actually the projection of $y$ onto the column space of $X$, where $H$ is the projection matrix.

$\hat{y}_i=\sum_{j=1}^{n}{H[i,j]y[j]}=H[i,i]y[i]+\sum_{k \neq i}{H[i,k]y[k]}$ means the fitted value of $y_i$ is a weighted sum of all the observations: a weight on itself plus information from the other observations.
Partitioned Matrices
This section covers properties of partitioned matrices (determinant and inverse) and some special operations.

Given a partitioned matrix $P$, we study its properties by decomposing it into two triangular matrices, $P=LU$. Since the determinant and inverse of a lower (upper) triangular matrix are easy to calculate, this helps us find the determinant/inverse of the partitioned matrix.

$E(X_1 \mid X_2) = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (X_2-\mu_2)$

The expectation of $X_1$ given $X_2$ can be viewed as a regression of $X_1$ on $X_2$: it is the mean of $X_1$ plus the deviation of $X_2$ from $\mu_2$, scaled by the covariance between $X_1$ and $X_2$ relative to the variance of $X_2$.

Schur complements: for a matrix partitioned as $\begin{pmatrix} A & B \\ C & D \end{pmatrix}$, $A-BD^{-1}C$ is the Schur complement of $D$ and $D-CA^{-1}B$ is the Schur complement of $A$.
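
As a sanity check, a short numpy sketch (random blocks, sizes illustrative) of the standard partitioned-determinant identity $\det\begin{pmatrix}A & B\\ C & D\end{pmatrix}=\det(A)\det(D-CA^{-1}B)$:

```python
import numpy as np

# Determinant of a partitioned matrix via the Schur complement of A.
rng = np.random.default_rng(6)
A = rng.normal(size=(2, 2)) + 3 * np.eye(2)
B = rng.normal(size=(2, 3))
C = rng.normal(size=(3, 2))
D = rng.normal(size=(3, 3)) + 3 * np.eye(3)

P = np.block([[A, B], [C, D]])
schur_A = D - C @ np.linalg.inv(A) @ B        # Schur complement of A in P
print(np.isclose(np.linalg.det(P),
                 np.linalg.det(A) * np.linalg.det(schur_A)))   # True
```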

$Var(X_1 \mid X_2) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1}\Sigma_{21}=\Sigma_{11}\left(1-\frac{\Sigma_{12} \Sigma_{21}}{\Sigma_{11} \Sigma_{22}}\right)=\Sigma_{11}(1-r^2)$ (the last two equalities hold in the scalar case)

The variance of $X_1$ conditional on $X_2$ can be seen as the total variance of $X_1$ minus the variance explained by the regression of $X_1$ on $X_2$.
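
A small numerical sketch (the bivariate mean vector and covariance matrix are made up for illustration) of the conditional mean and variance formulas above:

```python
import numpy as np

# Conditional mean and variance of X_1 | X_2 for a bivariate normal,
# using the Schur complement of Sigma_22.
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 2.4],
                  [2.4, 9.0]])

x2 = 0.5                                             # observed value of X_2
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] * Sigma[1, 0] / Sigma[1, 1]

r2 = Sigma[0, 1] ** 2 / (Sigma[0, 0] * Sigma[1, 1])  # squared correlation
print(cond_mean)
print(cond_var, Sigma[0, 0] * (1 - r2))              # equal: Sigma_11 (1 - r^2)
```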

Linear spaces: we can project a higher-dimensional space onto a lower-dimensional space, but we cannot go from a lower-dimensional space to a higher-dimensional one.
Eigenvalues and Eigenvectors

The difference between EVD and SVD is that EVD only works for square matrices while SVD works for arbitrary matrices.

SVD: $X=UDV'$. The columns of $U$ form an orthonormal basis for the column space of $X$ and the columns of $V$ for the row space. Think of $XX'$ and $X'X$: $XX'=UD^{2}U'$ captures the relationships among individuals (rows), while $X'X=VD^{2}V'$ captures the relationships among predictors (columns). It means that in order to describe the rows we only need $U$ (and $D$), while to describe the columns we need $V$.

Singular values are the square roots of the eigenvalues of $X'X$ (equivalently of $XX'$). If $X$ is symmetric (and positive semi-definite), $U$ and $V$ will be the same, since the row and column spaces coincide.
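
A quick numpy check (random $X$, purely illustrative) of these relationships between the SVD of $X$ and the eigendecompositions of $X'X$ and $XX'$:

```python
import numpy as np

# Singular values of X are the square roots of the eigenvalues of X'X,
# and U / V diagonalize X X' / X'X respectively.
rng = np.random.default_rng(7)
X = rng.normal(size=(6, 4))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
eig_XtX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]      # descending order

print(np.allclose(d, np.sqrt(eig_XtX)))                   # True
print(np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt))    # X'X = V D^2 V'
print(np.allclose(X @ X.T, U @ np.diag(d**2) @ U.T))      # X X' = U D^2 U'
```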
Differential Operators

Dimension increases when we take derivatives with respect to vector variables: the derivative of a scalar with respect to a vector is a vector (the gradient), and the derivative of a vector with respect to a vector is a matrix (the Jacobian).
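
A small sketch (random $A$ and $x$, illustrative) of a vector derivative: the gradient of the quadratic form $f(x)=x'Ax$ with respect to $x$ is $(A+A')x$, checked here against central finite differences:

```python
import numpy as np

# Gradient of f(x) = x' A x is (A + A') x; verify numerically.
rng = np.random.default_rng(8)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

grad_analytic = (A + A.T) @ x

eps = 1e-6
grad_numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)          # perturb one coordinate at a time
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```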
