Title: RSP Data Model and Technology for Analysis of Big Data in Terabytes and Beyond
Abstract: The value in big data can only be extracted through deep analysis of the data. In the era of big data, datasets with hundreds of millions of objects and thousands of features have become commonplace rather than exceptional. Such datasets are often terabytes in size and can easily exceed the total memory of a cluster system. Current big data analysis technologies do not scale to such datasets because of this memory limitation. Effectively processing and analyzing big data in terabytes and beyond is therefore becoming a major challenge.
In this talk, I will present a new approach to big data processing and analysis that is based on the divide-and-conquer strategy, statistical estimation, ensemble learning, and approximate and distributed computing. First, I present the RSP data model, which represents a big dataset as a set of distributed random sample data blocks. Each block is a random sample of the big dataset, so it can be used to estimate the statistics of the big dataset and to build a classification or prediction model for it. Second, I introduce an asymptotic ensemble learning framework, named the Alpha Framework, that builds ensemble models stepwise from selected random sample data blocks to model the big data. Using this set of new technologies, we are able to analyze terabyte-scale big data effectively on a small cluster without being limited by its memory. In this new architecture, data analysis engines can be separated from the storage of big data in data centers, and analysis of big data across multiple data centers becomes possible.
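The two ideas in the abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: the function names `make_rsp_blocks` and `stepwise_mean`, the batch size, and the stopping tolerance are all assumptions made for illustration, and a plain mean stands in for the classification or prediction models that the Alpha Framework would build. The sketch shuffles a dataset and splits it into blocks so that each block is a random sample, then estimates the mean from a growing selection of blocks until the estimate stabilizes.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records and split them into num_blocks near-equal blocks.

    After a uniform shuffle, each block is a random sample of the full
    dataset (hypothetical helper, in-memory for illustration only).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Strided slicing yields blocks whose sizes differ by at most one.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def stepwise_mean(blocks, batch_size=2, tol=0.05, seed=0):
    """Estimate the mean from a growing selection of blocks.

    Assumed stopping rule: stop once adding another batch of blocks
    changes the estimate by less than tol, or when no blocks remain.
    Averaging block means assumes the blocks have (near-)equal sizes.
    """
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)  # select blocks in random order
    block_means = []
    estimate = None
    for start in range(0, len(order), batch_size):
        for idx in order[start:start + batch_size]:
            block_means.append(statistics.fmean(blocks[idx]))
        new_estimate = statistics.fmean(block_means)
        if estimate is not None and abs(new_estimate - estimate) < tol:
            return new_estimate, len(block_means)
        estimate = new_estimate
    return estimate, len(block_means)


# Synthetic data stands in for a big dataset.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=50)
estimate, used = stepwise_mean(blocks)
full_mean = statistics.fmean(data)
print(f"estimate from {used} of {len(blocks)} blocks: {estimate:.3f}")
print(f"mean of the full dataset: {full_mean:.3f}")
```

In this toy setting only a few blocks are typically needed before the estimate settles, which is the property that would let an analysis run on a subset of blocks rather than loading the whole dataset into memory.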
Dr. Joshua Zhexue Huang is a distinguished professor at the College of Computer Science and Software in Shenzhen University. He is the founding director of the Big Data Institute of Shenzhen University and Deputy Director of the National Engineering Laboratory for Big Data System Computing Technology. Prof. Huang is known for his contributions to the development of a series of k-means type clustering algorithms in data mining, such as k-modes, fuzzy k-modes, k-prototypes and w-k-means, which are widely cited and used and some of which have been included in commercial software. He has extensive industry expertise in business intelligence, data mining and big data analysis, and has been involved in numerous consulting projects in Australia, Hong Kong, Taiwan and mainland China. Dr. Huang received his PhD degree from the Royal Institute of Technology in Sweden. He has published over 200 research papers in conferences and journals. In 2006, he received the first PAKDD Most Influential Paper Award. He was the program chair of PAKDD 2011, the local organization chair of ICDM 2014 and the conference co-chair of PAKDD 2016.