Contents |
|
I am a researcher at IBM TJ Watson Research Center. My research focus is on data mining, especially, large-scale data analysis in high dimensional data such as time series, data cubes and social networks. I have received ICDM best research paper in 2007 and KDD Dissertation runner-up award in 2008 and SDM best research paper in 2007. I graduated from CMU on the fall 2007. I was a PhD student in Computer science department at Carnegie Mellon University from 2003 to 2007. I feel extremely honored to study at CMU. It is a GREAT place for students. On top of that, I was very lucky to have worked with my advisor Prof. Christos Faloutsos. I spent four fruitful years there. Before that, I got my bachelor degree at the Hong Kong University of Science and Technology (HKUST) in June 2002 (CS major and Math minor). I was a recipient of the HKUST Academic Achievement Awards. After that, I was lucky to work with Prof. Dimitris Papadias on spatio-temporal databases for another year, and obtained the Master of Philosophy in Computer Science in July 2003. My publication can be found at #Publications or on my DBLP entry. |
| ||||||||||||
Oct 2007–Present IBM Thomas J. Watson Research Center, Hawthorne, NY, USA Research Staff Member, Service Research
Jun–Aug 2006 IBM Thomas J. Watson Research Center, Hawthorne, NY, USA Research Intern, Software Tools and Techniques
Jun-Aug 2005 PricewaterhouseCoopers, Center for advanced research, San Jose, CA Senior Research Associate
Jun-Aug 2004 PricewaterhouseCoopers, Center for advanced research, San Jose, CA Senior Research Associate
The key challenge of data mining has always been the scalability. Now with the Map-Reduce and fault-tolerant large clusters, the scalability is feasible to reach (at least in a constraint sense). Many traditional data mining algorithms can be implemented in the Map-Reduce framework (see Chu et al. NIPS'06)
In general, a distributed data mining process involves several steps: data gathering, pre-processing, analysis and post-processing,many of which involve distributed processing through either a data storage layer (such as GFS, HDFS or KFS), or a higher-level data access and job description language, such as Sawzall, Pig, or Cascading.
Our [4] gives a case-study on co-clustering application on Map-Reduce. The implementation can be downloaded from Spiros' page.
We presented a tutorial on Large-scale data mining: MapReduce and Beyond:
Given a data matrix A(e.g., document-by-term matrix), how to select the representative features(columns) or samples(rows) such that some key characteristics of A is preserved. One key characteristics of a numeric matrix A is the principal row or column subspace. Based on this key characteristics, the problem can be described as a (row or column) subset selection problem: Formally, given a N-by-M matrix A, how to select K columns C such that ||A-CC+A||.
[5] proposes a method to cluster features columns first then select one representative feature from each cluster. The performance beats many classic subset selection methods using rank-revealing QR-decomposition.
[3][6] introduce CMD, a more efficient method in terms of space and time compare to CUR decomposition. Our [7] propose an incremental computation of CUR.
In our [8], we address the scalability becomes a key challenge of social media and data. There are two main aspects of the problems that arise: 1) data volume: how to manage and analyze huge datasets to efficiently extract patterns, 2) data understanding: how to facilitate understanding of the patterns by users?
To address both aspects of the scalability challenge, we present a hybrid approach that leverages two complementary disciplines, data mining and information visualization. In particular, we propose 1) an analytic data model for content-based networks using tensors; 2) an efficient high-order clustering framework for analyzing the data; 3) a scalable context-sensitive graph visualization to present the clusters. We evaluate the proposed methods using both synthetic and real datasets. In terms of computational efficiency, the proposed methods are an order of magnitude faster compared to the baseline. In terms of effectiveness, we present several case studies of real corporate social networks.
In our [1], we propose a Memory Efficient Tucker (MET) decomposition to do efficient Tucker decompositions for large sparse tensors. Here is the abstract of that work
One major challenge is how to deal with high-dimensional, sparse data. In other words, how do we compute decompositions of tensors where most of the entries of the tensor are zero. Specialized techniques are needed for computing the Tucker decompositions for sparse tensors because standard algorithms do not account for the sparsity of the data. As a result, a surprising phenomenon is observed by practitioners: Despite the fact that there is enough memory to store both the input tensors and the factorized output tensors, memory overflows occur during the tensor factorization process. To address this intermediate blowup problem, we propose Memory-Efficient Tucker (MET). Based on the available memory, MET adaptively selects the right execution strategy during the decomposition. We provide quantitative and qualitative evaluation of MET on real tensors. It achieves over 1000X space reduction without sacrificing speed; it also allows us to work with much larger tensors that were too big to handle before. Finally, we demonstrate a data mining case-study using MET.
In our [9], we propose multi-model tensor analysis (2-heads method). Here are the abstract of this work:
Data stream values are often associated with multiple aspects. For example, each value observed at a given time-stamp from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Time-stamp, type and location are the three aspects, which can be modeled using a tensor (high-order array). However, the time aspect is special, with a natural ordering, and with successive time-ticks having usually correlated values. Standard multiway analysis ignores this structure. To capture it, we propose 2 Heads Tensor Analysis (2-heads), which provides a qualitatively different treatment on time. Unlike most existing approaches that use a PCA-like summarization scheme for all aspects, 2-heads treats the time aspect carefully. 2-heads combines the power of classic multilinear analysis (PARAFAC, Tucker, DTA/STA [10][11], WTA [12][11]) with wavelets, leading to a powerful mining tool. Furthermore, 2-heads has several other advantages as well: (a) it can be computed incrementally in a streaming fashion, (b) it has a provable error guarantee and, (c) it achieves significant compression ratio against competitors. Finally, we show experiments on real datasets, and we illustrate how 2-heads reveals interesting trends in the data.
Another interesting extension is the "2Heads" method[9] that utilizes two different summarization schemes for different modes, e.g., wavelet for the time mode, maximal variance approach for the other modes.
Related Publications:[10][12][11][9] Note that there is a typo in Line 5 of the OTA pseudocode in [10] should be U_l instead of U_l*U_l^T.
In [13], we propose GraphScope, a method using parameter-free mining of time-evolving graphs. Here is the abstract.
How can we find communities in dynamic networks of social interactions, such as who calls whom, who emails whom, or who sells to whom? How can we spot discontinuity timepoints in such streams of graphs, in an on-line, any-time fashion? We propose GraphScope, that addresses both problems, using information theoretic principles. Contrary to the majority of earlier methods, it needs no user-defined parameters. Moreover, it is designed to operate on large graphs, in a streaming fashion. We demonstrate the efficiency and effectiveness of our GraphScope on real datasets from several diverse domains. In all cases it produces meaningful time-evolving patterns that agree with human intuition.
In large social networks, nodes (users, entities) are influenced by others for various reasons. For example, the colleagues have strong influence on one’s work, while the friends have strong influence on one’s daily life. How to differentiate the social influences from different angles(topics)? How to quantify the strength of those social influences? How to estimate the model on real large networks?
To address these fundamental questions, we propose Topical Affinity Propagation (TAP) [14] to model the topic-level social influence on large networks. In particular, TAP can take results of any topic modeling and the existing network structure to perform topic-level influence propagation. With the help of the influence analysis, we present several important applications on real data sets such as 1) what are the representative nodes on a given topic? 2) how to identify the social influences of neighboring nodes on a particular node? To scale to real large networks, TAP is designed with efficient distributed learning algorithms that is implemented and tested under the Map-Reduce framework. We further present the common characteristics of distributed learning algorithms for Map-Reduce. Finally, we demonstrate the effectiveness and efficiency of TAP on real large data sets.
This paper [15] aims at discovering community structure in rich media social networks, through analysis of time-varying, multi-relational data. Community structure represents the latent social context of user actions. It has important applications in information tasks such as search and recommendation. Social media has several unique challenges. (a) In social media, the context of user actions is constantly changing and co-evolving; hence the social context contains time-evolving multi-dimensional relations. (b) The social context is determined by the available system features and is unique in each social media website. In this paper we propose MetaFac (MetaGraph Factorization), a framework that extracts community structures from various social contexts and interactions. Our work has three key contributions: (1) metagraph, a novel relational hypergraph representation for modeling multi-relational and multi-dimensional social data; (2) an efficient factorization method for community extraction on a given metagraph; (3) an on-line method to handle time-varying relations through incremental metagraph factorization. Extensive experiments on real-world social data collected from the Digg social media website suggest that our technique is scalable and is able to extract meaningful communities based on the social media contexts. We illustrate the usefulness of our framework through prediction tasks. We outperform baseline methods (including aspect model and tensor analysis) by an order of magnitude.
Figure on left illustrates the idea: the top panel plots the original data streams (potentially a large number of them); the middle panel shows the hidden variables computed from the original streams; the bottom panel plots the reconstruction of original stream based on the hidden variables. The idea is that it is hard to monitor a large number of data streams (in the top panel), but it is probably OK to look at the hidden variables (in the middle panel). Furthermore, how good the hidden variables capture the original data streams is reflected in the reconstruction. The closer the original and reconstruction are, the better the hidden variables capture the streams. This is a very important observation since it gives a hint on 1) how to determine the number of hidden variables and 2) how to detect abnormal behavior.
In summary, SPIRIT does the following for a large number of streams:
Related publications: [16][17][18]
Software: it is a standard-alone matlab source code Download
InteMon is a prototype monitoring and mining system for data centers. It uses the SNMP protocol to monitor a new data data using a JSP web-based frontend interface for system administrators. What sets InteMon apart from other cluster monitoring systems is its ability to automatically analyze correlations in the monitoring data in real time and alert administrators of potential anomalies. It uses efficient, state of the art stream mining methods to report broken correlations among input streams. It also uses these methods to intelligently compress historical data and avoid the need for administrators to configure threshold-based monitoring bands.
Related publications: [19][20]
Large IT service providers track service requests and their execution through problem/change tickets. It is important to classify the tickets based on the problem/change description in order to understand service quality and to optimize service processes. However, two challenges exist in solving this classification problem:
To address these challenges, we develop two seemingly independent methods 1) Discriminative Neighborhood Metric Learning (DNML) and 2) Active Learning with Median Selection (ALMS), both of which are, however, based on the same core technique: iterated representative selection. A case study on real IT service classification application is presented to demonstrate the effectiveness and efficiency of our proposed methods.