## I Introduction

### I-A Motivation and Background

One of the most fundamental problems in statistical data analysis is to
learn the relationship between the samples of a *dependent* or
*response* variable (e.g., the malignancy of a tumor, the health of a
network) and the samples of *independent* or *predictor* variables
(e.g., the expression data of genes, the traffic data in the network). This
problem was relatively easy in the data-starved world of yesteryears. We had
$n$ samples and $p$ predictors, and our inability to observe too many
variables meant that we lived in the “$n \geq p$”
world. Times have changed. The data-rich world of today has enabled us to
simultaneously observe an unprecedented number of variables per sample. In
many of these instances it is nearly impossible to collect as many, or more,
samples as the number of predictors. Imagine, for example, collecting
hundreds of thousands of thyroid tumors in a clinical setting. The
“$n < p$” world is no longer a theoretical construct in statistical
data analysis. It has finally arrived, and it is here to stay.

This paper concerns statistical inference in the “$n < p$” setting for the case when the response variable depends linearly on the predictors. Mathematically, a model of this form can be expressed as

$$y_i \;=\; \sum_{j=1}^{p} x_{ij}\,\beta_j \;+\; \varepsilon_i, \qquad i = 1, \dots, n. \tag{1}$$

Here, $y_i$ denotes the $i$-th sample of the response variable, $x_{ij}$
denotes the $i$-th sample of the $j$-th predictor, $\varepsilon_i$ denotes the
error in the model, and the parameters $\beta_1, \dots, \beta_p$ are called
*regression coefficients*. This relationship between the samples of the response variable and those of the predictors can be expressed compactly in matrix-vector form as $y = X\beta + \varepsilon$. The matrix $X$ in this form, termed the *design matrix*, is an $n \times p$ matrix whose $j$-th column comprises the $n$ samples of the $j$-th predictor. In tumor classification, for example, an entry in the response variable $y$ could correspond to the malignancy (expressed as a number) of a tumor sample, while the corresponding row in $X$ would correspond to the expression levels of the $p$ genes in that tumor sample.
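As a concrete illustration, the following sketch simulates data from the matrix-vector form of the model; the dimensions and sparsity level are arbitrary choices for illustration, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 200                      # n samples, p predictors (here p > n)
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)      # unit-norm columns, as assumed later in the paper

beta = np.zeros(p)
beta[:5] = rng.standard_normal(5)   # only a few predictors influence the response
noise = 0.01 * rng.standard_normal(n)

y = X @ beta + noise                # the linear model y = X beta + eps

assert y.shape == (n,)
```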

The linear model $y = X\beta + \varepsilon$, despite its mathematical
simplicity, continues to make a profound impact in countless application areas
[1]. Such models are used for various
inferential purposes. In this paper, we focus on the problem of *model
selection* in high-dimensional linear models, which involves determining a
small subset of predictors that are responsible for the majority (or all) of
the variation in the response variable $y$. High-dimensional model
selection can be used, for example, to implicate a small number of genes in the
development of cancerous tumors or to identify a small number of genes that
primarily affect the prognosis of a disease.

### I-B Group Model Selection and Our Contributions

There exist many applications in statistical model selection where the
implication of a single predictor in the response variable implies the presence
of other related predictors in the true model. This happens, for instance, in
the case of microarray data when the genes (predictors) share the same
biological pathway [2]. In such situations, it is
better to reformulate the problem of model selection in a “group” setting.
Specifically, the response variable $y$ in
high-dimensional linear models in group settings can be best explained by a
small number of *groups* of predictors:

$$y \;=\; \sum_{i \in \mathcal{S}} X_i\,\beta_i \;+\; \varepsilon, \tag{2}$$

where $X_i$, an $n \times m_i$ submatrix of $X$, denotes the $i$-th group of predictors, $\beta_i$ denotes the group of regression coefficients associated with the group of predictors $X_i$, and the set $\mathcal{S}$ denotes the underlying true (group) model, corresponding to the groups of predictors that explain $y$.
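A minimal simulation of the group-sparse model in (2), with hypothetical dimensions and a randomly chosen true model $\mathcal{S}$; the noiseless case is used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, num_groups = 60, 4, 50        # m predictors per group, p = m * num_groups
p = m * num_groups
X = rng.standard_normal((n, p))

k = 3                                # number of active groups
S = rng.choice(num_groups, size=k, replace=False)   # true (group) model

beta = np.zeros(p)
for i in S:
    beta[i * m:(i + 1) * m] = rng.standard_normal(m)

# y is explained by a small number of groups of predictors (noiseless here)
y = sum(X[:, i * m:(i + 1) * m] @ beta[i * m:(i + 1) * m] for i in S)

assert np.allclose(y, X @ beta)
```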

One of the main contributions of this paper is a comprehensive understanding of
a polynomial-time algorithm, which we term *Group Thresholding* (GroTh),
that returns an estimate $\widehat{\mathcal{S}}$ of the true (group) model $\mathcal{S}$ for the
general case of arbitrary (random or deterministic) design matrices and
arbitrary nonzero regression coefficients. To this end, we make use of two
computable geometric measures of *group coherence* of a design
matrix, namely the *worst-case group coherence* $\mu_X$ and the
*average group coherence* $\nu_X$, to provide a *nonasymptotic*
analysis of GroTh (Algorithm 1). We in particular establish
that if $X$ satisfies a verifiable *group coherence property* then, for
all but a vanishingly small fraction of possible models $\mathcal{S}$, GroTh: (i)
handles linear scaling of the total number of predictors contributing to the
response,^1 and (ii) returns indices of the
groups of predictors whose contributions to the response,
$\{\|X_i\beta_i\|_2\}_{i \in \mathcal{S}}$, are above a certain *self-noise
floor* that is a function of both $\mu_X$ and $\nu_X$.

^1 Recall that $f(n) = O(g(n))$ if there exist positive $c$ and $n_0$ such that $f(n) \leq c\,g(n)$ for all $n \geq n_0$. Also, $f(n) = \Omega(g(n))$ if $g(n) = O(f(n))$, and $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.

### I-C Relationship to Previous Work

The basic idea of using grouped predictors for inference in linear models has been explored by various researchers in recent years. Some notable works in this direction in the “$n < p$” setting include [3, 4, 5, 6, 7, 8, 9]. Despite these inspiring results, more work needs to be done for high-dimensional group model selection. This is because the results reported in [3, 4, 5, 6, 7, 8, 9] do not guarantee linear scaling of the total number of predictors contributing to the response for the case of arbitrary design matrices and nonzero regression coefficients.

The work in this paper is also related to another body of work in the statistics and signal processing literature that studies the high-dimensional linear model for the restrictive case of $X$ having a Kronecker structure: $X = I \otimes A$ for some matrix $A$, where $\otimes$ denotes the Kronecker product. An incomplete list of works in this direction includes [10, 11, 12, 13, 14, 15, 16]. These restrictive works, however, also fail to guarantee linear scaling of the total number of predictors contributing to the response for the case of arbitrary nonzero regression coefficients.

Finally, note that the group model selection procedure studied in this paper
is based on analyzing the *marginal correlations*, $X^T y$, of
predictors with the response variable. Therefore, our work is algorithmically
similar to the group thresholding approaches of
[13, 14, 8].
The main appeal of such approaches is their low computational complexity of
$O(np)$, which is much smaller than the typical computational complexity
associated with other model selection procedures [17]. In
addition to the scaling limitations of the total number of influential
predictors discussed earlier, however, the works in
[13, 14, 8]
also incorrectly conclude that performance of thresholding-based approaches
is inversely proportional to the dynamic range, $\|\beta_{(1)}\|_2/\|\beta_{(k)}\|_2$, of the nonzero groups of
regression coefficients.

### I-D Mathematical Convention

The predictors and the response variable are assumed to be real valued
throughout the paper, with the understanding that extensions to a
complex-valued setting can be carried out in a straightforward manner.
Uppercase letters are reserved for matrices, while lowercase letters are used
for both vectors and scalars. Constants that do not depend upon the problem
parameters (such as $n$, $p$, $m$, and $k$) are denoted by $c$, $c'$, etc.
The notation $[r]$ for $r \in \mathbb{N}$ is a shorthand for the set
$\{1, \dots, r\}$, while the notation $\stackrel{d}{=}$ signifies *equality
in distribution*. The transpose operation is denoted by $(\cdot)^T$ and the
spectral norm of a matrix is denoted by $\|\cdot\|_2$. Finally, the mixed
$\ell_{a,b}$ norm of a vector $v = [v_1^T \dots v_k^T]^T$ with each $v_i \in \mathbb{R}^m$ is defined as $\|v\|_{a,b} \triangleq \big\|\big(\|v_1\|_a, \dots, \|v_k\|_a\big)\big\|_b$ for $a, b \geq 1$,
where $\|\cdot\|_a$ denotes the usual $\ell_a$ norm. Note that $\|v\|_{a,a} = \|v\|_a$
and $\|v\|_{a,\infty} \leq \|v\|_{a,b}$
for $b \geq 1$.

### I-E Organization

In Section II, we mathematically formulate the problem of group model selection, rigorously define the notions of worst-case group coherence, average group coherence, and the group coherence property, and state and discuss the main result of the paper. In Section III, we prove the main result of the paper. Finally, we present some numerical results in Section IV and conclude in Section V.

## II Group Model Selection Using GroTh

### II-A Problem Formulation

The object of attention in this paper is the high-dimensional linear model $y = X\beta$ relating the response variable to the $p$ predictors comprising the columns of the design matrix $X$. Since scalings of the columns of $X$ can be absorbed into the regression vector $\beta$, we assume without loss of generality that the columns of $X$ have unit $\ell_2$ norms. There are three simplifying assumptions we make in this paper that will be relaxed in a sequel to this work. First, the modeling error is zero, $\varepsilon = 0$, and thus the response variable is exactly equal to a parsimonious linear combination of grouped predictors: $y = \sum_{i \in \mathcal{S}} X_i \beta_i$. Second, the groups of predictors are characterized by the same number of predictors per group: $m_i \equiv m$ with $p/m$ an integer. Third, the groups of predictors are orthonormalized: $X_i^T X_i = I_m$ for all $i \in [p/m]$.
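The three assumptions above are easy to enforce on a generic design. The sketch below orthonormalizes each group of columns with a QR factorization, which is one standard way to do this (not necessarily the paper's); the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, num_groups = 100, 5, 8
X = rng.standard_normal((n, m * num_groups))

# Orthonormalize each group of columns (so X_i^T X_i = I_m); this also
# gives every column unit norm, matching the paper's normalization.
for i in range(num_groups):
    Q, _ = np.linalg.qr(X[:, i * m:(i + 1) * m])
    X[:, i * m:(i + 1) * m] = Q

Xi = X[:, :m]
assert np.allclose(Xi.T @ Xi, np.eye(m), atol=1e-10)
assert np.allclose(np.linalg.norm(X, axis=0), 1.0)
```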

The main goal of this paper is characterization of the performance of a group
model selection procedure, termed GroTh, that returns an estimate $\widehat{\mathcal{S}}$ of
the true model $\mathcal{S}$ by sorting the groups of marginal correlations $\{X_i^T y\}$ according to their $\ell_2$-norms, $\|X_i^T y\|_2$, in descending
order and setting $\widehat{\mathcal{S}}$ to be the indices of the first $k$ sorted groups of
marginal correlations (see Algorithm 1). Instead of focusing on
the worst-case performance of GroTh, however, we seek to characterize its
performance for an arbitrary (but fixed) set of nonzero (grouped) regression
coefficients supported on *most* models. Specifically, we do not impose
any statistical prior on the set of nonzero regression coefficients, while we
assume that the true (group) model $\mathcal{S}$ is a uniformly random $k$-subset of $[p/m]$. Finally, the metrics of
goodness we use in this paper are the *false-discovery proportion*
(fdp) and the *non-discovery proportion* (ndp), defined as

$$\mathrm{fdp}(\widehat{\mathcal{S}}) \triangleq \frac{|\widehat{\mathcal{S}} \setminus \mathcal{S}|}{|\widehat{\mathcal{S}}|} \qquad \text{and} \qquad \mathrm{ndp}(\widehat{\mathcal{S}}) \triangleq \frac{|\mathcal{S} \setminus \widehat{\mathcal{S}}|}{|\mathcal{S}|}, \tag{3}$$

respectively. These two metrics have gained widespread usage in multiple
hypothesis testing problems in recent years. In particular, the expectation
of the fdp is the well-known *false-discovery rate* (fdr)
[18, 19].
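Under one plausible reading of the description above (Algorithm 1 itself is not reproduced here), GroTh and the two metrics can be sketched as follows; the toy design is the identity matrix so that recovery is exact.

```python
import numpy as np

def groth(X, y, m, k):
    """One plausible reading of GroTh: sort groups of marginal correlations
    X_i^T y by their l2 norms and keep the indices of the k largest."""
    f = X.T @ y                          # marginal correlations, O(np)
    group_norms = np.linalg.norm(f.reshape(-1, m), axis=1)
    return set(np.argsort(group_norms)[::-1][:k])

def fdp(S_hat, S):
    return len(S_hat - S) / max(len(S_hat), 1)

def ndp(S_hat, S):
    return len(S - S_hat) / max(len(S), 1)

# Toy check with orthonormal groups (X = I) so recovery is exact.
m, num_groups = 2, 6
X = np.eye(m * num_groups)
S = {1, 4}
beta = np.zeros(m * num_groups)
for i in S:
    beta[i * m:(i + 1) * m] = 1.0
S_hat = groth(X, X @ beta, m, k=2)

assert S_hat == S and fdp(S_hat, S) == 0.0 and ndp(S_hat, S) == 0.0
```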

### II-B Main Result and Discussion

Heuristically, successful group model selection requires the groups of predictors contributing to the response variable to be sufficiently distinguishable from the ones outside the true model. In this paper, we capture the notion of distinguishability of predictors through two easily computable, global geometric measures of the design matrix, namely, the worst-case group coherence $\mu_X$ and the average group coherence $\nu_X$. The worst-case group coherence of $X$ is defined as

$$\mu_X \;\triangleq\; \max_{i \neq j}\; \big\|X_i^T X_j\big\|_2, \tag{4}$$

while the average group coherence of is defined as

$$\nu_X \;\triangleq\; \max_{i}\; \frac{1}{p/m - 1}\,\Big\|\sum_{j \neq i} X_i^T X_j\Big\|_2. \tag{5}$$

Note that $\mu_X$ is a trivial upper bound on $\nu_X$. It is also worth pointing out that the worst-case group coherence and its variants have appeared in earlier literature [7, 20], whereas the average group coherence is defined here for the first time.
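For concreteness, the following sketch computes the two group coherence measures under one plausible reading of (4) and (5), namely spectral norms of the cross-Gram blocks $X_i^T X_j$; the averaging convention used for $\nu_X$ is an assumption.

```python
import numpy as np

def group_coherences(X, m):
    """Worst-case and average group coherence of X, under one plausible
    reading of (4)-(5): spectral norms of the cross-Gram blocks X_i^T X_j."""
    num_groups = X.shape[1] // m
    blocks = [X[:, i * m:(i + 1) * m] for i in range(num_groups)]
    mu = 0.0
    nu = 0.0
    for i in range(num_groups):
        acc = np.zeros((m, m))
        for j in range(num_groups):
            if i == j:
                continue
            G = blocks[i].T @ blocks[j]
            mu = max(mu, np.linalg.norm(G, 2))   # ord=2 is the spectral norm
            acc += G
        nu = max(nu, np.linalg.norm(acc, 2) / (num_groups - 1))
    return mu, nu

rng = np.random.default_rng(3)
X = rng.standard_normal((64, 32))
X /= np.linalg.norm(X, axis=0)
mu, nu = group_coherences(X, m=4)
assert 0 <= nu <= mu   # by the triangle inequality, averaging cannot exceed the worst case
```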

The central thesis of this paper is that group model selection using GroTh
can be successful if these two measures of group coherence of $X$ are small
enough. In particular, we address the question of how small these two
measures should be in terms of the *group coherence property*.

###### Definition 1 (The Group Coherence Property).

The design matrix $X$ is said to satisfy the group coherence property if the following two conditions hold for some positive constants $c_1$ and $c_2$:

$$\mu_X \;\leq\; \frac{c_1}{\sqrt{\log p}}, \tag{GroCP-1}$$

$$\nu_X \;\leq\; \frac{c_2\,\mu_X}{\sqrt{n}}. \tag{GroCP-2}$$

It is straightforward to observe from the above definition that the group coherence property is a global property of $X$ that can be explicitly verified in polynomial time. Finally, we define $\beta_{(r)}$ to be the $r$-th largest group of nonzero regression coefficients: $\|\beta_{(1)}\|_2 \geq \|\beta_{(2)}\|_2 \geq \dots \geq \|\beta_{(k)}\|_2$. We are now ready to state the main result of this paper.

###### Theorem 1 (Group Model Selection Using GroTh).

Suppose the design matrix satisfies the group coherence property with parameters and . Next, fix parameters , , and define parameter . Then, under the assumptions , , and

, we have with probability exceeding

that

(6) |

resulting in and , where is defined to be the largest integer for which the inequality

holds. Here, the probability is with respect to the uniform distribution of the true model $\mathcal{S}$ over all possible models.

A proof of this theorem is given in Section III. We now provide a brief discussion of the significance of this result. First, Theorem 1 indicates that a polynomial-time verifiable property, namely the group coherence property, of the design matrix $X$ can be checked to ascertain whether GroTh, which has computational complexity of $O(np)$, is well suited for group model selection. Second, it states that if $X$ satisfies the group coherence property then GroTh handles linear scaling of the total number of predictors contributing to the response for all but a vanishingly small fraction of models. This is in stark contrast to the earlier works [13, 14, 8] on thresholding-based approaches in high-dimensional linear models, which do not guarantee such linear scaling for the case of arbitrary nonzero regression coefficients. Note that while we do not provide in this paper explicit examples of design matrices satisfying the group coherence property, numerical results in Section IV show that the set of such design matrices is not empty.

Finally, Theorem 1 offers a nice interpretation of the price
one might have to pay in estimating the true model using only marginal
correlations. Specifically, (6) in the theorem implies that
group thresholding of marginal correlations effectively gives rise to a
*self-noise floor*. In words, the estimate returned by GroTh is
guaranteed to include the indices of all the groups of predictors whose
contributions to the response variable (in the $\ell_2$ sense) are above this
self-noise floor (cf. (6)). This is again a significant improvement over
the earlier works
[13, 14, 8],
which suggest that performance of thresholding-based approaches is inversely
proportional to the dynamic range, $\|\beta_{(1)}\|_2/\|\beta_{(k)}\|_2$, of the nonzero groups of
regression coefficients. In order to expand on this, we observe from
(6) that

(7) |

Theorem 1 and the left-hand side of (7)
indicate that inclusion of the $i$-th group of predictors in the estimate
$\widehat{\mathcal{S}}$ is in fact related to the ratio of the *energy contributed by
the $i$-th group of predictors* to the *average energy contributed per
group of nonzero predictors*: $\|X_i\beta_i\|_2^2 \big/ \big(\frac{1}{k}\sum_{j \in \mathcal{S}} \|X_j\beta_j\|_2^2\big)$.
Further, this implies that an increase in the dynamic range that comes from a
decrease in $\|\beta_{(k)}\|_2$ cannot affect the performance
of GroTh too much, since this energy ratio
increases for most groups of predictors in this case. This is indeed
confirmed by the numerical experiments reported in Section IV.

## III Proof of the Main Result

We begin by developing some notation to facilitate the forthcoming analysis. Notice that the $p$-dimensional vector of marginal correlations, $f \triangleq X^T y$, can be written as $p/m$ groups of $m$-dimensional marginal correlations: $f = [f_1^T \dots f_{p/m}^T]^T$ with the vector $f_i \triangleq X_i^T y$. In the following, we use $X_{\mathcal{S}}$ (an $n \times km$ submatrix of $X$), $\beta_{\mathcal{S}}$ (a $km$-dimensional subvector of $\beta$), and $f_{\mathcal{S}}$ (a $km$-dimensional subvector of $f$) to denote the groups of predictors, groups of regression coefficients, and the marginal correlations corresponding to the true model $\mathcal{S}$, respectively. Similarly, we use $X_{\mathcal{S}^c}$ and $f_{\mathcal{S}^c}$ to denote the groups of predictors and the marginal correlations corresponding to the complement set $\mathcal{S}^c \triangleq [p/m] \setminus \mathcal{S}$, respectively.
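The grouping of the marginal correlation vector and its restriction to a model and its complement can be sketched as follows (the dimensions and candidate model are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, num_groups = 30, 3, 10
X = rng.standard_normal((n, m * num_groups))
y = rng.standard_normal(n)

f = X.T @ y                     # p-dimensional vector of marginal correlations
F = f.reshape(num_groups, m)    # one row per group: f_i = X_i^T y

S = np.array([0, 4, 7])         # a candidate true model
mask = np.zeros(num_groups, dtype=bool)
mask[S] = True
f_S, f_Sc = F[mask], F[~mask]   # correlations on the model and its complement

assert np.allclose(F[1], X[:, m:2 * m].T @ y)
assert f_S.shape == (3, m) and f_Sc.shape == (num_groups - 3, m)
```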

### III-A Lemmata

Proof of Theorem 1 requires understanding the behaviors of the group vector $f_{\mathcal{S}}$ and the group vector $f_{\mathcal{S}^c}$. In this subsection, we state and prove two lemmas that help us toward this goal. We will then leverage these two lemmas to provide a proof of Theorem 1.

Before proceeding, recall that $\mathcal{S}$ is taken to be a uniformly random $k$-subset of $[p/m]$, while the set of nonzero group regression coefficients is considered to be deterministic (and fixed) but unknown. It therefore follows that the $km$-dimensional group vector $f_{\mathcal{S}}$ can be equivalently expressed as

$$f_{\mathcal{S}} \;\stackrel{d}{=}\; X_{\Pi_{[k]}}^T X_{\Pi_{[k]}}\,\bar\beta, \tag{8}$$

where $\Pi$ is a random permutation of $[p/m]$, $\Pi_{[k]}$ denotes the first $k$ elements of $\Pi$, $X_{\Pi_{[k]}}$ is an $n \times km$ submatrix of $X$, and $\bar\beta$ is a $km$-dimensional (group) vector of nonzero regression coefficients. Similarly, the $(p - km)$-dimensional group vector $f_{\mathcal{S}^c}$ can be expressed as

$$f_{\mathcal{S}^c} \;\stackrel{d}{=}\; X_{\Pi_{[k]^c}}^T X_{\Pi_{[k]}}\,\bar\beta, \tag{9}$$

where $\Pi_{[k]^c}$ denotes the last $p/m - k$ elements of $\Pi$ and $X_{\Pi_{[k]^c}}$ is an $n \times (p - km)$ submatrix of $X$.
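The random-permutation representation used above rests on the fact that the first $k$ elements of a uniformly random permutation form a uniformly random $k$-subset; a quick empirical check with arbitrary small dimensions:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
num_groups, k = 5, 2

# The first k elements of a random permutation of {0,...,num_groups-1}
# form a uniformly random k-subset -- the representation used in (8)-(9).
counts = Counter()
trials = 20000
for _ in range(trials):
    perm = rng.permutation(num_groups)
    counts[frozenset(perm[:k])] += 1

num_subsets = 10                      # C(5, 2) possible models
assert len(counts) == num_subsets
for c in counts.values():             # each subset appears ~ trials/10 times
    assert abs(c / trials - 1 / num_subsets) < 0.02
```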

###### Lemma 1.

Fix and . Next, assume and let denote the first elements of a random permutation of . Then for any fixed group vector

(10) |

where is an absolute constant.

###### Proof.

The proof of this lemma relies heavily on Banach-space-valued Azuma’s inequality stated in the Appendix. To begin, note that

(11) |

We next fix an and define the event for . Then conditioned on , we have

(12) |

In order to make use of the concentration inequality in Proposition 1 in the Appendix for upper bounding (12), we construct an -valued Doob martingale on . We first define and then define the Doob martingale as follows:

where denotes the first elements of . The next step involves showing that the constructed martingale has bounded differences. In order for this, we use to denote the -th element of and define

(13) |

for and . It can then be established using
techniques very similar to the ones used in the *method of bounded
differences* for scalar-valued martingales that
[21, 22]

(14) |

In order to upper bound , we first define an random matrix

(15) |

Next, we notice that for every , the random variable conditioned on has a uniform distribution over , while conditioned on has a uniform distribution over . Therefore, we get

(16) |

In order to evaluate for , we consider three cases for the index . In the first case of , it can be seen that for every and for . In the second case of , it can similarly be seen that for every and , while for . In the final case of , it can be argued that for every , for , and for . Consequently, regardless of the initial choice of , we have

(17) |

where is due to the triangle inequality and the submultiplicative nature of the induced norm, while primarily follows since . We now have from (14) and (17) that with

(18) |

The next step needed to upper bound (12) involves providing an upper bound on . To this end, note that

(19) |

where follows since conditioned on has a uniform distribution over and is a consequence of the definition of average group coherence. Finally, we note from [23, Lemma B.1] that defined in Proposition 1 satisfies for . Consequently, under the assumption that , it can be seen from our construction of the Doob martingale that

(20) |

where follows from Banach-space-valued Azuma’s inequality stated in the Appendix. Further, it can be established using (18) through tedious algebraic manipulations that

(21) |

where follows from the condition . Combining all these facts together, we finally obtain from (20) and (21) the following concentration inequality:

(22) |

where , follows from the union bound and the fact that ’s are identically distributed, while follows since has a uniform distribution over the set . ∎

###### Lemma 2.

Fix and . Next, assume , and let and denote the first elements and the last elements of a random permutation of , respectively. Then for any fixed group vector

(23) |

where is an absolute constant.
