Beijing Institute of Technology Achieved Research Results in Optimal Sampling of Big Data


  Recently, Assistant Professor Yu Jun of the School of Mathematics and Statistics, Beijing Institute of Technology, and his collaborators published a research paper titled "Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators with Massive Data" in the "Journal of the American Statistical Association", one of the four top international journals in statistics. Taking a sampling perspective grounded in optimal experimental design theory, the paper proposes a fast solution to the problem of extracting useful information from massive data held in distributed storage.

      With the advent of the big data era, the data sources available to people keep growing exponentially. Treating these data as carriers of information and extracting useful information from them has always been one of the core research topics in statistics and data science. Statistical analysis of massive data usually faces two particularly challenging problems. The first is that the data set is too large to be stored in full on a single computer, so traditional statistical analysis algorithms cannot be applied to it directly. The second is that, even when the amount of data is moderate, the limited speed and capacity of existing computers mean that the analysis often takes a long time, and the results researchers want cannot be obtained within a limited time.

      To overcome these two problems, statistical analysis methods for large data sets can be roughly divided into two categories. The first is the parallel computing method: the entire data set is divided into several sub-data sets, each sub-data set is analyzed separately, and the results are then combined to obtain an inference for the entire data set. The second is the subsampling method: a set of informative subsamples is drawn from the entire data set, and statistical inference is performed on the subsamples only. The idea is to let the sample stand in for the whole, using the subsample estimate to approximate the full-sample estimate, which saves computation time. Although a large body of research shows that subsampling can effectively solve statistical inference problems for big data, how to efficiently select, as the subsample, data that carry a large amount of information and improve the accuracy of inference remains one of the pressing problems in big data analysis.
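
      The two categories can be illustrated with a minimal sketch for the simplest possible target, a sample mean. This is an illustration only and is not taken from the paper: the function names, the simulated data, the number of blocks and the subsample size are all assumptions chosen for the example, and the subsample here is drawn uniformly rather than by any optimality criterion.

```python
import random
from statistics import fmean

def split_into_blocks(data, n_blocks):
    """Partition data into n_blocks roughly equal contiguous blocks."""
    size = -(-len(data) // n_blocks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def divide_and_combine_mean(data, n_blocks):
    """Parallel-style approach: one mean per block, combined by block size."""
    blocks = split_into_blocks(data, n_blocks)
    total = sum(len(block) * fmean(block) for block in blocks)
    return total / len(data)

def subsample_mean(data, r, rng):
    """Subsampling approach: estimate the mean from r uniformly drawn points."""
    return fmean(rng.sample(data, r))

# Simulated data (hypothetical): 100,000 draws from a normal distribution.
rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(100_000)]

full = fmean(data)                                   # full-data answer
combined = divide_and_combine_mean(data, n_blocks=8)  # touches every point
approx = subsample_mean(data, r=2_000, rng=rng)       # touches 2% of the points

print(f"full={full:.4f} combined={combined:.4f} subsample={approx:.4f}")
```

      In this toy setting the divide-and-combine result reproduces the full-data mean up to rounding, while the subsample estimate trades a small amount of accuracy for reading only a fraction of the data.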

      Building on the idea of optimal design, the paper by Assistant Professor Yu Jun and his collaborators gives a principled method for efficiently selecting data rich in information about the statistical model. To take advantage of distributed computing, the method first draws subsamples from the data sets stored on different computers, and then combines the estimates obtained from each subsample into an optimal approximate estimate for the entire data set. The paper establishes the soundness and feasibility of the method through both theory and simulation.
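
      The general "subsample on each machine, then combine" pattern can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: the model is a one-parameter regression through the origin, the sampling probabilities are taken proportional to |x| purely for illustration (the paper derives its own optimal probabilities, which are not reproduced here), and every name and parameter value is hypothetical.

```python
import random

def subsample_block(xs, ys, r, rng):
    """Draw r points with replacement, with probability proportional to |x|.

    Returns inverse-probability-weighted estimates of the block sums
    sum(x * y) and sum(x * x), so that unequal sampling does not bias them.
    """
    scores = [abs(x) for x in xs]
    total = sum(scores)
    idx = rng.choices(range(len(xs)), weights=scores, k=r)
    num = den = 0.0
    for i in idx:
        w = total / (r * scores[i])  # 1 / (r * p_i), with p_i = scores[i] / total
        num += w * xs[i] * ys[i]
        den += w * xs[i] * xs[i]
    return num, den

def combine(block_stats):
    """Pool the per-machine summaries into one slope estimate."""
    num = sum(n for n, _ in block_stats)
    den = sum(d for _, d in block_stats)
    return num / den

# Simulated setting (hypothetical): 4 machines, 20,000 points each, true slope 2.0.
rng = random.Random(1)
true_beta = 2.0
machines = []
for _ in range(4):
    xs = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
    ys = [true_beta * x + rng.gauss(0.0, 1.0) for x in xs]
    machines.append((xs, ys))

# Each machine reports only two numbers computed from 500 sampled points.
stats = [subsample_block(xs, ys, r=500, rng=rng) for xs, ys in machines]
beta_hat = combine(stats)

# Full-data least squares slope, for comparison only.
beta_full = (sum(x * y for xs, ys in machines for x, y in zip(xs, ys))
             / sum(x * x for xs, _ in machines for x in xs))

print(f"subsample estimate={beta_hat:.3f} full-data estimate={beta_full:.3f}")
```

      Only the small per-machine summaries cross the network in this sketch, which is the practical appeal of combining subsample-based estimates instead of moving the raw data to one place.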

      This research was completed by Assistant Professor Yu Jun in collaboration with Professor Ai Mingyao of Peking University and Assistant Professor Wang Haiying of the Department of Statistics, University of Connecticut. Assistant Professor Yu Jun is the first author. The research was funded by BIT’s Young Teachers’ Academic Startup Program.

Paper link:
Profile of the research team and its members:

      The experimental design team of the School of Mathematics and Statistics, Beijing Institute of Technology, actively carries out cooperative research and academic exchanges at home and abroad. The team leader, Professor Tian Yubin, and team members Dr. Kong Xiangshun, Dr. Wang Dianpeng and Dr. Yu Jun have established long-term cooperative relationships with well-known experimental design scholars in China and abroad, such as Academician C.F. Jeff Wu, Professor Ai Mingyao and Professor Roshan Vengazhiyil Joseph. The team members carry out research on both the theory and the application of experimental design, and the team shows strong development momentum.

      Yu Jun is an assistant professor and a core member of the experimental design team of the School of Mathematics and Statistics, Beijing Institute of Technology. He received his bachelor's degree from Nankai University and his Ph.D. from Peking University, and was a visiting scholar at Georgia State University in the United States. He works mainly on experimental design, sampling theory and related applied statistics. He has published many high-level academic papers in authoritative statistical journals such as "Journal of the American Statistical Association", "Computational Statistics & Data Analysis", "Statistica Sinica" and "Journal of Statistical Planning and Inference".