|
I am a researcher at IBM TJ Watson Research Center. My research focus is on data mining, especially, large-scale data analysis in high dimensional data such as time-series, data cubes and social networks and health care analytics. I have received ICDM best research paper in 2007 and KDD Dissertation runner-up award in 2008 and SDM best research paper in 2007. I graduated from CMU on the fall 2007. I was a PhD student in Computer science department at Carnegie Mellon University from 2003 to 2007. I feel extremely honored to study at CMU. It is a GREAT place for students. On top of that, I was very lucky to have worked with my advisor Prof. Christos Faloutsos. I spent four fruitful years there. Before that, I got my bachelor degree at the Hong Kong University of Science and Technology (HKUST) in June 2002 (CS major and Math minor). I was a recipient of the HKUST Academic Achievement Awards. After that, I was lucky to work with Prof. Dimitris Papadias on spatio-temporal databases for another year, and obtained the Master of Philosophy in Computer Science in July 2003. My publication can be found at #Publications or on my DBLP entry. My research social network is here, and my thesis summary is in the following picture:
|
| ||||||||||||
Oct 2007–Present IBM Thomas J. Watson Research Center, Hawthorne, NY, USA Research Staff Member, Service Research
Jun–Aug 2006 IBM Thomas J. Watson Research Center, Hawthorne, NY, USA Research Intern, Software Tools and Techniques
Jun-Aug 2005 PricewaterhouseCoopers, Center for advanced research, San Jose, CA Senior Research Associate
Jun-Aug 2004 PricewaterhouseCoopers, Center for advanced research, San Jose, CA Senior Research Associate
[4] Documents in rich text corpora usually contain multiple facets of information. For example, an article about a specific disease often consists of different facets such as symptom, treatment, cause, diagnosis, prognosis, and prevention. Thus, documents may have different relations based on different facets. Powerful search tools have been developed to help users locate lists of individual documents that are most related to specific keywords. However, there is a lack of effective analysis tools that reveal the multifaceted relations of documents within or cross the document clusters. In this paper, we present FacetAtlas, a multifaceted visualization technique for visually analyzing rich text corpora. FacetAtlas combines search technology with advanced visual analytical tools to convey both global and local patterns simultaneously. We describe several unique aspects of FacetAtlas, including (1) node cliques and multifaceted edges, (2) an optimized density map, and (3) automated opacity pattern enhancement for highlighting visual patterns, (4) interactive context switch between facets. In addition, we demonstrate the power of FacetAtlas through a case study that targets patient education in the health care domain. Our evaluation shows the benefits of this work, especially in support of complex multifaceted data analysis.
Effective patient similarity assessment is important for clinical decision support. It enables the capture of past experience as manifested in the collective longitudinal medical records of patients to help clinicians assess the likely outcomes resulting from their decisions and actions. However, it is challenging to devise a patient similarity metric that is clinically relevant and semantically sound. Patient similarity is highly context sensitive: it depends on factors such as the disease, the particular stage of the disease, and co-morbidities. One way to discern the semantics in a particular context is to take advantage of physicians' expert knowledge as reflected in labels assigned to some patients. In this paper we present a method that leverages localized supervised metric learning to effectively incorporate such expert knowledge to arrive at semantically sound patient similarity measures. Experiments using data obtained from the MIMIC II database demonstrate the effectiveness of this approach.
In our [6], we address the scalability becomes a key challenge of social media and data. There are two main aspects of the problems that arise: 1) data volume: how to manage and analyze huge datasets to efficiently extract patterns, 2) data understanding: how to facilitate understanding of the patterns by users?
To address both aspects of the scalability challenge, we present a hybrid approach that leverages two complementary disciplines, data mining and information visualization. In particular, we propose 1) an analytic data model for content-based networks using tensors; 2) an efficient high-order clustering framework for analyzing the data; 3) a scalable context-sensitive graph visualization to present the clusters. We evaluate the proposed methods using both synthetic and real datasets. In terms of computational efficiency, the proposed methods are an order of magnitude faster compared to the baseline. In terms of effectiveness, we present several case studies of real corporate social networks.
In our [1], we propose a Memory Efficient Tucker (MET) decomposition to do efficient Tucker decompositions for large sparse tensors. Here is the abstract of that work
One major challenge is how to deal with high-dimensional, sparse data. In other words, how do we compute decompositions of tensors where most of the entries of the tensor are zero. Specialized techniques are needed for computing the Tucker decompositions for sparse tensors because standard algorithms do not account for the sparsity of the data. As a result, a surprising phenomenon is observed by practitioners: Despite the fact that there is enough memory to store both the input tensors and the factorized output tensors, memory overflows occur during the tensor factorization process. To address this intermediate blowup problem, we propose Memory-Efficient Tucker (MET). Based on the available memory, MET adaptively selects the right execution strategy during the decomposition. We provide quantitative and qualitative evaluation of MET on real tensors. It achieves over 1000X space reduction without sacrificing speed; it also allows us to work with much larger tensors that were too big to handle before. Finally, we demonstrate a data mining case-study using MET.
In our [7], we propose multi-model tensor analysis (2-heads method). Here are the abstract of this work:
Data stream values are often associated with multiple aspects. For example, each value observed at a given time-stamp from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Time-stamp, type and location are the three aspects, which can be modeled using a tensor (high-order array). However, the time aspect is special, with a natural ordering, and with successive time-ticks having usually correlated values. Standard multiway analysis ignores this structure. To capture it, we propose 2 Heads Tensor Analysis (2-heads), which provides a qualitatively different treatment on time. Unlike most existing approaches that use a PCA-like summarization scheme for all aspects, 2-heads treats the time aspect carefully. 2-heads combines the power of classic multilinear analysis (PARAFAC, Tucker, DTA/STA [8][9], WTA [10][9]) with wavelets, leading to a powerful mining tool. Furthermore, 2-heads has several other advantages as well: (a) it can be computed incrementally in a streaming fashion, (b) it has a provable error guarantee and, (c) it achieves significant compression ratio against competitors. Finally, we show experiments on real datasets, and we illustrate how 2-heads reveals interesting trends in the data.
Another interesting extension is the "2Heads" method[7] that utilizes two different summarization schemes for different modes, e.g., wavelet for the time mode, maximal variance approach for the other modes.
Related Publications:[8][10][9][7] Note that there is a typo in Line 5 of the OTA pseudocode in [8] should be U_l instead of U_l*U_l^T.
Huge amounts of rich context social network data are generated everyday from various applications such as FaceBook and Twitter. These data involve multiple social relations which are community-driven and dynamic in nature. The complex interplay of these characteristics poses tremendous challenges on the users who try to understand the underlying patterns in the social media.
In this work [11], We introduce an exploratory analytical framework, ContexTour, which generates visual representations for exploring multiple dimensions of community activities, including relevant topics, representative users and the community-generated content, as well as their evolutions. ContexTour consists of two novel and complementary components: (1) Dynamic Relational Clustering (DRC) that efficiently tracks the community evolution and smoothly adapts to the community changes, and (2) Dynamic Network Contour-map (DNC) that visualizes the community activities and evolutions in various dimensions. In our experiments, we demonstrate ContexTour through case studies on the DBLP dataset. The visual results capture interesting and reasonable evolution in Computer Science research communities. Quantitatively, we show 85-165X performance gain of our DRC algorithm over the baseline method.
In large social networks, nodes (users, entities) are influenced by others for various reasons. For example, the colleagues have strong influence on one’s work, while the friends have strong influence on one’s daily life. How to differentiate the social influences from different angles(topics)? How to quantify the strength of those social influences? How to estimate the model on real large networks?
To address these fundamental questions, we propose Topical Affinity Propagation (TAP) [12] to model the topic-level social influence on large networks. In particular, TAP can take results of any topic modeling and the existing network structure to perform topic-level influence propagation. With the help of the influence analysis, we present several important applications on real data sets such as 1) what are the representative nodes on a given topic? 2) how to identify the social influences of neighboring nodes on a particular node? To scale to real large networks, TAP is designed with efficient distributed learning algorithms that is implemented and tested under the Map-Reduce framework. We further present the common characteristics of distributed learning algorithms for Map-Reduce. Finally, we demonstrate the effectiveness and efficiency of TAP on real large data sets.
Users’ behaviors (actions) in a social network are influenced by various factors such as personal interests, social influence, and global trends. However, few publications systematically study how social actions evolve in a dynamic social network and to what extent different factors affect the user actions. In this paper, we propose a Noise Tolerant Time-varying Factor Graph Model (NTT-FGM) for modeling and predicting social actions. NTT-FGM simultaneously models social network structure, user attributes and user action history for better prediction of the users’ future actions. More specifically, a user’s action at time t is generated by her latent state at t, which is influenced by her at- tributes, her own latent state at time t − 1 and her neighbors’ states at time t and t − 1. Based on this intuition, we formalize the social action tracking problem using the NTT-FGM model; then present an efficient algorithm to learn the model, by combining the ideas from both continuous linear system and Markov random field. Finally, we present a case study of our model on predicting future social actions. We validate the model on three different types of real-world data sets. Qualitatively, our model can discover interesting patterns of the social dynamics. Quantitatively, experimental results show that the proposed method outperforms several baseline methods for social action prediction.
This paper [14] aims at discovering community structure in rich media social networks, through analysis of time-varying, multi-relational data. Community structure represents the latent social context of user actions. It has important applications in information tasks such as search and recommendation. Social media has several unique challenges. (a) In social media, the context of user actions is constantly changing and co-evolving; hence the social context contains time-evolving multi-dimensional relations. (b) The social context is determined by the available system features and is unique in each social media website. In this paper we propose MetaFac (MetaGraph Factorization), a framework that extracts community structures from various social contexts and interactions. Our work has three key contributions: (1) metagraph, a novel relational hypergraph representation for modeling multi-relational and multi-dimensional social data; (2) an efficient factorization method for community extraction on a given metagraph; (3) an on-line method to handle time-varying relations through incremental metagraph factorization. Extensive experiments on real-world social data collected from the Digg social media website suggest that our technique is scalable and is able to extract meaningful communities based on the social media contexts. We illustrate the usefulness of our framework through prediction tasks. We outperform baseline methods (including aspect model and tensor analysis) by an order of magnitude.
In [15], we propose GraphScope, a method using parameter-free mining of time-evolving graphs. Here is the abstract.
How can we find communities in dynamic networks of social interactions, such as who calls whom, who emails whom, or who sells to whom? How can we spot discontinuity timepoints in such streams of graphs, in an on-line, any-time fashion? We propose GraphScope, that addresses both problems, using information theoretic principles. Contrary to the majority of earlier methods, it needs no user-defined parameters. Moreover, it is designed to operate on large graphs, in a streaming fashion. We demonstrate the efficiency and effectiveness of our GraphScope on real datasets from several diverse domains. In all cases it produces meaningful time-evolving patterns that agree with human intuition.
Many real applications can be modeled using bipartite graphs, such as users vs. files in a P2P system, traders vs. stocks in a financial trading system, conferences vs. authors in a scientific publication network, and so on.
In this work [16] , we introduce two operations on bipartite graphs: 1) identifying similar nodes (Neighborhood formation), and 2) finding abnormal nodes (Anomaly detection). And we propose algorithms to compute the neighborhood for each node using random walk with restarts and graph partitioning; we also propose algorithms to identify abnormal nodes, using neighborhood information. We evaluate the quality of neighborhoods based on semantics of the datasets, and we also measure the performance of the anomaly detection algorithm with manually injected anomalies. Both effectiveness and efficiency of the methods are confirmed by experiments on several real datasets.
Accurately capturing user preferences over time is a great practical challenge in recommender systems. Simple correlation over time is typically not meaningful, since users change their preferences due to different external events. User be- havior can often be determined by individual’s long-term and short-term preferences. How to represent users’ long- term and short-term preferences? How to leverage them for temporal recommendation? To address these challenges, we propose Session-based Temporal Graph (STG) which si- multaneously models users’ long-term and short-term pref- erences over time. Based on the STG model framework, we propose a novel recommendation algorithm Injected Pref- erence Fusion (IPF) and extend the personalized Random Walk for temporal recommendation. Finally, we evaluate the effectiveness of our method using two real datasets on citations and social bookmarking, in which our proposed method IPF gives 15%-34% improvement over the previous state-of-the-art.
Figure on left illustrates the idea: the top panel plots the original data streams (potentially a large number of them); the middle panel shows the hidden variables computed from the original streams; the bottom panel plots the reconstruction of original stream based on the hidden variables. The idea is that it is hard to monitor a large number of data streams (in the top panel), but it is probably OK to look at the hidden variables (in the middle panel). Furthermore, how good the hidden variables capture the original data streams is reflected in the reconstruction. The closer the original and reconstruction are, the better the hidden variables capture the streams. This is a very important observation since it gives a hint on 1) how to determine the number of hidden variables and 2) how to detect abnormal behavior.
In summary, SPIRIT does the following for a large number of streams:
Related publications: [18][19][20]
Software: it is a standard-alone matlab source code Download
InteMon is a prototype monitoring and mining system for data centers. It uses the SNMP protocol to monitor a new data data using a JSP web-based frontend interface for system administrators. What sets InteMon apart from other cluster monitoring systems is its ability to automatically analyze correlations in the monitoring data in real time and alert administrators of potential anomalies. It uses efficient, state of the art stream mining methods to report broken correlations among input streams. It also uses these methods to intelligently compress historical data and avoid the need for administrators to configure threshold-based monitoring bands.
Related publications: [21][22]
The key challenge of data mining has always been the scalability. Now with the Map-Reduce and fault-tolerant large clusters, the scalability is feasible to reach (at least in a constraint sense). Many traditional data mining algorithms can be implemented in the Map-Reduce framework (see Chu et al. NIPS'06)
In general, a distributed data mining process involves several steps: data gathering, pre-processing, analysis and post-processing,many of which involve distributed processing through either a data storage layer (such as GFS, HDFS or KFS), or a higher-level data access and job description language, such as Sawzall, Pig, or Cascading.
Our [23] gives a case-study on co-clustering application on Map-Reduce. The implementation can be downloaded from Spiros' page.
We presented a tutorial on Large-scale data mining: MapReduce and Beyond:
Here is a summary about our work on "Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification" [24].
Large IT service providers track service requests and their execution through problem/change tickets. It is important to classify the tickets based on the problem/change description in order to understand service quality and to optimize service processes. However, two challenges exist in solving this classification problem:
To address these challenges, we develop two seemingly independent methods 1) Discriminative Neighborhood Metric Learning (DNML) and 2) Active Learning with Median Selection (ALMS), both of which are, however, based on the same core technique: iterated representative selection. A case study on real IT service classification application is presented to demonstrate the effectiveness and efficiency of our proposed methods.
Given a data matrix A(e.g., document-by-term matrix), how to select the representative features(columns) or samples(rows) such that some key characteristics of A is preserved. One key characteristics of a numeric matrix A is the principal row or column subspace. Based on this key characteristics, the problem can be described as a (row or column) subset selection problem: Formally, given a N-by-M matrix A, how to select K columns C such that ||A-CC+A||.
[25] proposes a method to cluster features columns first then select one representative feature from each cluster. The performance beats many classic subset selection methods using rank-revealing QR-decomposition.
[3][26] introduce CMD, a more efficient method in terms of space and time compare to CUR decomposition. Our [27] propose an incremental computation of CUR.