Author: Marta Cousido-Rocha
Centro Oceanográfico de Vigo, SiDOR Research Group and CINBIO, Vigo, Spain
Abstract: A recurring theme in statistics today is the analysis of low-sample-size, high-dimensional data, in which the number of variables or features is large while the sample size is relatively small. This type of data arises in many different areas of
science, such as genetics, medicine, pharmacy and social sciences. In microarray data, for example,
the variables typically represent the expression levels of a large set of genes for a small number of
individuals. The high data dimension has forced statisticians to renovate or rewrite existing methods, or even to propose new ones, since classical methods may be inappropriate in such a setting. In our research
we have developed and investigated new statistical procedures to solve classical k-sample problems in the context of high-dimensional data. In many applications the variables are not independent; this is why we focus, more precisely, on solving classical k-sample problems in the context of dependent high-dimensional data. Our first proposal is a formal test addressing the classical k-sample problem in the dependent high-dimensional framework (see Cousido-Rocha et al. 2019a). The test can be useful, for example, in cluster analysis, since it allows one to verify whether clustering is indeed necessary, preventing a spurious
clustering. The test statistic is based on a comparison of the k sample-specific kernel density estimates with a kernel density estimate computed using all the data pooled together. After an intuitive explanation of the proposed test, including its asymptotic null distribution and power properties, we focus on the practical use of the test through our R package Equalden.HD, available on CRAN. The usage of Equalden.HD is illustrated through real data examples, and its performance is compared with standard methods (e.g. Anderson-Darling and Kruskal-Wallis), providing evidence of the disadvantages and limitations of such methods. Another k-sample problem with high-dimensional data appears when the researcher aims to compare two groups of individuals for which a large number of variables are recorded. For example, in genetics, the first data set may contain the gene expression levels measured on patients with a certain type of tumor, whereas the second may contain the same variables measured on patients with a different type of tumor (which may be related to the first one). One relevant problem here is to test the null hypothesis of equality of the marginal distributions for the two groups (e.g. two tumors). We then have a k-sample problem with k = 2, but the number of variables p whose marginal distributions are compared goes to infinity. In this context we propose a formal test for this global null hypothesis whose test statistic is motivated by the simple idea of comparing, for each of the p variables, the empirical characteristic functions computed from the two samples (see Cousido-Rocha et al. 2019b). After an intuitive explanation of the test and a discussion of its advantages over existing statistical procedures, we focus on illustrating its use through our
user-friendly TwoSampleTest.HD R package, available on CRAN. Our new methodology and the corresponding R packages open new doors for the statistical analysis of dependent low-sample-size, high-dimensional data, overcoming the limitations of existing methods, which are also discussed through practical illustrations.
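
The two ideas behind the test statistics can be illustrated with a minimal Python sketch. It is not the implementation in Equalden.HD or TwoSampleTest.HD, and it does not reproduce the standardization or null calibration of the published tests. All function names, the fixed bandwidth h = 0.5, the evaluation grid, and the standard normal weight used for the characteristic functions are assumptions made only for this illustration. The first function averages the integrated squared distance between each sample-specific kernel density estimate and the pooled estimate; the second averages, over the p variables, a weighted squared distance between the empirical characteristic functions of the two groups.

```python
import numpy as np


def gaussian_kde_on_grid(sample, grid, h):
    """Gaussian kernel density estimate of a 1-D sample, evaluated on grid."""
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))


def kde_discrepancy(samples, h=0.5, grid_size=512):
    """Average integrated squared distance between each sample's KDE and the
    pooled KDE. `samples` has shape (k, n): one row per sample.
    The bandwidth h and the grid are illustrative assumptions."""
    samples = np.asarray(samples, dtype=float)
    pooled = samples.ravel()
    grid = np.linspace(pooled.min() - 4 * h, pooled.max() + 4 * h, grid_size)
    dx = grid[1] - grid[0]
    f_pooled = gaussian_kde_on_grid(pooled, grid, h)
    total = 0.0
    for row in samples:
        f_i = gaussian_kde_on_grid(row, grid, h)
        total += np.sum((f_i - f_pooled) ** 2) * dx  # Riemann-sum integral
    return total / samples.shape[0]


def ecf_distance(x, y):
    """Integral of |ecf_x(t) - ecf_y(t)|^2 against a standard normal weight.
    With this (assumed) weight the integral has a closed form, because the
    weighted integral of cos(t * d) equals exp(-d^2 / 2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def mean_kernel(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * d**2).mean()

    return mean_kernel(x, x) + mean_kernel(y, y) - 2.0 * mean_kernel(x, y)


def ecf_discrepancy(X, Y):
    """Average of the per-variable ECF distances.
    X has shape (p, n) and Y has shape (p, m): one row per variable."""
    return float(np.mean([ecf_distance(xj, yj) for xj, yj in zip(X, Y)]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))           # p = 50 variables, n = 10 individuals
    Y = rng.normal(loc=2.0, size=(50, 10))  # second group with shifted marginals
    print("KDE discrepancy:", kde_discrepancy(X))
    print("ECF discrepancy:", ecf_discrepancy(X, Y))
```

Both quantities are close to zero when the distributions coincide and grow as they differ. Turning them into formal tests requires the null distributions under dependence derived in the cited papers.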