Jimeng Sun

From TTXXWiki
Jump to: navigation, search

Contents

About me

I am a researcher at IBM TJ Watson Research Center. My research focus is on data mining, especially, large-scale data analysis in high dimensional data such as time series, data cubes and social networks. I have received ICDM best research paper in 2007 and KDD Dissertation runner-up award in 2008 and SDM best research paper in 2007.

I graduated from CMU on the fall 2007. I was a PhD student in Computer science department at Carnegie Mellon University from 2003 to 2007. I feel extremely honored to study at CMU. It is a GREAT place for students. On top of that, I was very lucky to have worked with my advisor Prof. Christos Faloutsos. I spent four fruitful years there.

Before that, I got my bachelor degree at the Hong Kong University of Science and Technology (HKUST) in June 2002 (CS major and Math minor). I was a recipient of the HKUST Academic Achievement Awards. After that, I was lucky to work with Prof. Dimitris Papadias on spatio-temporal databases for another year, and obtained the Master of Philosophy in Computer Science in July 2003.

My publication can be found at #Publications or on my DBLP entry.

My thesis summary is in the following picture:

At Bear Mountain NY Fall 2009
Name Jimeng Sun
Institution IBM TJ Watson
Area Data Mining
Address FIRSTNAME@cs.cmu.edu

Awards

Curriculum Vitae

My CV

Professional Services

Experiences

Oct 2007–Present IBM Thomas J. Watson Research Center, Hawthorne, NY, USA Research Staff Member, Service Research

  • Research on the next generation of IT service delivery systems.
  • Research on cooperate social networks, visualization and recommendation.

Jun–Aug 2006 IBM Thomas J. Watson Research Center, Hawthorne, NY, USA Research Intern, Software Tools and Techniques

  • Developed techniques to summarize huge volume of monitoring streams
  • Published three technical papers
  • Filed three patents

Jun-Aug 2005 PricewaterhouseCoopers, Center for advanced research, San Jose, CA Senior Research Associate

  • Worked on analysis of business transactional data
  • Developed techniques to summarize huge volume of monitoring streams

Jun-Aug 2004 PricewaterhouseCoopers, Center for advanced research, San Jose, CA Senior Research Associate

  • Worked on analytic tools for financial reporting
  • Developed visualization tool for financial data
  • Filed a patent on automatic analytic approach on financial data

Research

Parallel Data Mining using Map-Reduce

The key challenge of data mining has always been the scalability. Now with the Map-Reduce and fault-tolerant large clusters, the scalability is feasible to reach (at least in a constraint sense). Many traditional data mining algorithms can be implemented in the Map-Reduce framework (see Chu et al. NIPS'06)

In general, a distributed data mining process involves several steps: data gathering, pre-processing, analysis and post-processing,many of which involve distributed processing through either a data storage layer (such as GFS, HDFS or KFS), or a higher-level data access and job description language, such as Sawzall, Pig, or Cascading.

Our [4] gives a case-study on co-clustering application on Map-Reduce. The implementation can be downloaded from Spiros' page.

We presented a tutorial on Large-scale data mining: MapReduce and Beyond:

Unsupervised Sample/Feature Selection

Given a data matrix A(e.g., document-by-term matrix), how to select the representative features(columns) or samples(rows) such that some key characteristics of A is preserved. One key characteristics of a numeric matrix A is the principal row or column subspace. Based on this key characteristics, the problem can be described as a (row or column) subset selection problem: Formally, given a N-by-M matrix A, how to select K columns C such that ||A-CC+A||.

[5] proposes a method to cluster features columns first then select one representative feature from each cluster. The performance beats many classic subset selection methods using rank-revealing QR-decomposition.

[3][6] introduce CMD, a more efficient method in terms of space and time compare to CUR decomposition. Our [7] propose an incremental computation of CUR.

Tensor Analysis

Multi-way Visual Analysis

In our [8], we address the scalability becomes a key challenge of social media and data. There are two main aspects of the problems that arise: 1) data volume: how to manage and analyze huge datasets to efficiently extract patterns, 2) data understanding: how to facilitate understanding of the patterns by users?

To address both aspects of the scalability challenge, we present a hybrid approach that leverages two complementary disciplines, data mining and information visualization. In particular, we propose 1) an analytic data model for content-based networks using tensors; 2) an efficient high-order clustering framework for analyzing the data; 3) a scalable context-sensitive graph visualization to present the clusters. We evaluate the proposed methods using both synthetic and real datasets. In terms of computational efficiency, the proposed methods are an order of magnitude faster compared to the baseline. In terms of effectiveness, we present several case studies of real corporate social networks.

Scalable Tensor Analysis

In our [1], we propose a Memory Efficient Tucker (MET) decomposition to do efficient Tucker decompositions for large sparse tensors. Here is the abstract of that work

One major challenge is how to deal with high-dimensional, sparse data. In other words, how do we compute decompositions of tensors where most of the entries of the tensor are zero. Specialized techniques are needed for computing the Tucker decompositions for sparse tensors because standard algorithms do not account for the sparsity of the data. As a result, a surprising phenomenon is observed by practitioners: Despite the fact that there is enough memory to store both the input tensors and the factorized output tensors, memory overflows occur during the tensor factorization process. To address this intermediate blowup problem, we propose Memory-Efficient Tucker (MET). Based on the available memory, MET adaptively selects the right execution strategy during the decomposition. We provide quantitative and qualitative evaluation of MET on real tensors. It achieves over 1000X space reduction without sacrificing speed; it also allows us to work with much larger tensors that were too big to handle before. Finally, we demonstrate a data mining case-study using MET.

Multi-Model Tensor Analysis

In our [9], we propose multi-model tensor analysis (2-heads method). Here are the abstract of this work:

Data stream values are often associated with multiple aspects. For example, each value observed at a given time-stamp from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Time-stamp, type and location are the three aspects, which can be modeled using a tensor (high-order array). However, the time aspect is special, with a natural ordering, and with successive time-ticks having usually correlated values. Standard multiway analysis ignores this structure. To capture it, we propose 2 Heads Tensor Analysis (2-heads), which provides a qualitatively different treatment on time. Unlike most existing approaches that use a PCA-like summarization scheme for all aspects, 2-heads treats the time aspect carefully. 2-heads combines the power of classic multilinear analysis (PARAFAC, Tucker, DTA/STA [10][11], WTA [12][11]) with wavelets, leading to a powerful mining tool. Furthermore, 2-heads has several other advantages as well: (a) it can be computed incrementally in a streaming fashion, (b) it has a provable error guarantee and, (c) it achieves significant compression ratio against competitors. Finally, we show experiments on real datasets, and we illustrate how 2-heads reveals interesting trends in the data.

Incremental Tensor Analysis
Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How to simultaneously find patterns within and across the multiple aspects? How to do it incrementally in a streaming fashion? In this paper, all these problems are addressed through a general data model, tensor streams, and a suite of online summarization algorithms including Dynamic tensor analysis (DTA), Streaming tensor analysis (STA)[10] and window-based tensor analysis (WTA)[12]. The high level objective is to summarize the tensor stream as smaller core tensor streams in different space. A comprehensive journal paper is written to describe all of them[11].

Another interesting extension is the "2Heads" method[9] that utilizes two different summarization schemes for different modes, e.g., wavelet for the time mode, maximal variance approach for the other modes.

Related Publications:[10][12][11][9] Note that there is a typo in Line 5 of the OTA pseudocode in [10] should be U_l instead of U_l*U_l^T.

Social Network Analysis

Time-evolving community detection on graphs

In [13], we propose GraphScope, a method using parameter-free mining of time-evolving graphs. Here is the abstract.

How can we find communities in dynamic networks of social interactions, such as who calls whom, who emails whom, or who sells to whom? How can we spot discontinuity timepoints in such streams of graphs, in an on-line, any-time fashion? We propose GraphScope, that addresses both problems, using information theoretic principles. Contrary to the majority of earlier methods, it needs no user-defined parameters. Moreover, it is designed to operate on large graphs, in a streaming fashion. We demonstrate the efficiency and effectiveness of our GraphScope on real datasets from several diverse domains. In all cases it produces meaningful time-evolving patterns that agree with human intuition.


Social Influence Analysis

In large social networks, nodes (users, entities) are influenced by others for various reasons. For example, the colleagues have strong influence on one’s work, while the friends have strong influence on one’s daily life. How to differentiate the social influences from different angles(topics)? How to quantify the strength of those social influences? How to estimate the model on real large networks?

To address these fundamental questions, we propose Topical Affinity Propagation (TAP) [14] to model the topic-level social influence on large networks. In particular, TAP can take results of any topic modeling and the existing network structure to perform topic-level influence propagation. With the help of the influence analysis, we present several important applications on real data sets such as 1) what are the representative nodes on a given topic? 2) how to identify the social influences of neighboring nodes on a particular node? To scale to real large networks, TAP is designed with efficient distributed learning algorithms that is implemented and tested under the Map-Reduce framework. We further present the common characteristics of distributed learning algorithms for Map-Reduce. Finally, we demonstrate the effectiveness and efficiency of TAP on real large data sets.


Community Discovery via Relational Hypergraph Factorization

This paper [15] aims at discovering community structure in rich media social networks, through analysis of time-varying, multi-relational data. Community structure represents the latent social context of user actions. It has important applications in information tasks such as search and recommendation. Social media has several unique challenges. (a) In social media, the context of user actions is constantly changing and co-evolving; hence the social context contains time-evolving multi-dimensional relations. (b) The social context is determined by the available system features and is unique in each social media website. In this paper we propose MetaFac (MetaGraph Factorization), a framework that extracts community structures from various social contexts and interactions. Our work has three key contributions: (1) metagraph, a novel relational hypergraph representation for modeling multi-relational and multi-dimensional social data; (2) an efficient factorization method for community extraction on a given metagraph; (3) an on-line method to handle time-varying relations through incremental metagraph factorization. Extensive experiments on real-world social data collected from the Digg social media website suggest that our technique is scalable and is able to extract meaningful communities based on the social media contexts. We illustrate the usefulness of our framework through prediction tasks. We outperform baseline methods (including aspect model and tensor analysis) by an order of magnitude.

Stream Mining

Data stream is an important model for many different applications: network traffic analysis, sensor network monitoring, moving object tracking, financial data analysis. The main challenge comes from both speed requirement and space constraint. In stream monitoring applications where huge volume of real-time data streaming into the system need to be monitored for trend analysis and anomaly detection, it is crucial to detect patterns and correlations that may exist in co-evolving data streams. Streams often are inherently correlated (e.g., temperatures in the same building, host traffic in the same network, prices in the same market, etc.) and it is possible to reduce the number of streams dramatically exploiting the correlation.

Figure on left illustrates the idea: the top panel plots the original data streams (potentially a large number of them); the middle panel shows the hidden variables computed from the original streams; the bottom panel plots the reconstruction of original stream based on the hidden variables. The idea is that it is hard to monitor a large number of data streams (in the top panel), but it is probably OK to look at the hidden variables (in the middle panel). Furthermore, how good the hidden variables capture the original data streams is reflected in the reconstruction. The closer the original and reconstruction are, the better the hidden variables capture the streams. This is a very important observation since it gives a hint on 1) how to determine the number of hidden variables and 2) how to detect abnormal behavior.

In summary, SPIRIT does the following for a large number of streams:

  • automatically detect the patterns
  • adapt to the changes in real-time
  • no magic parameter for users to set
  • work in distributed environments

Related publications: [16][17][18]

Software: it is a standard-alone matlab source code Download

InteMon: Intelligent Monitoring System for Data Center

Modern data centers have a large number of components that must be monitored, including servers, switches/routers, and environmental control systems.

InteMon is a prototype monitoring and mining system for data centers. It uses the SNMP protocol to monitor a new data data using a JSP web-based frontend interface for system administrators. What sets InteMon apart from other cluster monitoring systems is its ability to automatically analyze correlations in the monitoring data in real time and alert administrators of potential anomalies. It uses efficient, state of the art stream mining methods to report broken correlations among input streams. It also uses these methods to intelligently compress historical data and avoid the need for administrators to configure threshold-based monitoring bands.

Related publications: [19][20]

Other works

Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification [21]

Large IT service providers track service requests and their execution through problem/change tickets. It is important to classify the tickets based on the problem/change description in order to understand service quality and to optimize service processes. However, two challenges exist in solving this classification problem:

  • ticket descriptions from different classes are of highly diverse characteristics, which invalidates most standard distance metrics;
  • it is very expensive to obtain high-quality labeled data.

To address these challenges, we develop two seemingly independent methods 1) Discriminative Neighborhood Metric Learning (DNML) and 2) Active Learning with Median Selection (ALMS), both of which are, however, based on the same core technique: iterated representative selection. A case study on real IT service classification application is presented to demonstrate the effectiveness and efficiency of our proposed methods.

Spatio-temporal databases

[22][23][24][25][26][27]

Publications

  1. 1.0 1.1 T. G. Kolda and J. Sun, "Scalable Tensor Decompositions for Multi-aspect Data Mining," in Proc. IEEE International Conference on Data Mining (ICDM), 2008, Bib
  2. J. SUN, "Incremental Pattern Discovery on Streams, Graphs and Tensors," PhD thesis, CMU-CS-07-149, 2007. Bib
  3. 3.0 3.1 J. Sun, Y. Xie, H. Zhang, and C. Faloutsos, "Less is More: Compact Matrix Decomposition for Large Sparse Graphs," in Proc. Proceedings of the 2007 SIAM International Conference on Data Mining (SDM), 2007, Bib
  4. S. Papadimitriou and J. Sun, "DisCo: Distributed Co-clustering with Map-Reduce," in Proc. IEEE International Conference on Data Mining (ICDM), 2008, Bib
  5. C. Boutsidis, J. Sun, and N. Anerousis, "Clustered Subset Selection and its Applications on IT Service Metric," in Proc. CIKM, 2008, Bib
  6. J. Sun, Y. Xie, H. Zhang, and C. Faloutsos, "Less is More: Sparse Graph Mining with Compact Matrix Decomposition," Stat. Anal. Data Min. vol. 1, iss. 1, p. 6, 2008. Bib
  7. H. Tong, S. Papadimitriou, J. Sun, P. Yu, and C. Faloutsos, "Colibri: Fast Mining of Large Static and Dynamic Graphs," in Proc. SIGKDD, 2008, Bib
  8. J. Sun, S. Papadimitriou, C. Lin, N. Cao, S. Liu, and W. Qian, "MultiVis: Content-based Social Network Exploration Through Multi-way Visual Analysis," in Proc. SIAM Data Mining conference, 2009, Bib
  9. 9.0 9.1 9.2 J. Sun, C. E. Tsourakakis, E. Hoke, C. Faloutsos, and T. Eliassi-Rad, "Two heads better than one: Pattern Discovery in Time-evolving Multi-Aspect Data," in Proc. PKDD, 2008, Bib
  10. 10.0 10.1 10.2 10.3 J. Sun, D. Tao, and C. Faloutsos, "Beyond streams and graphs: dynamic tensor analysis," in Proc. KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, 2006, p. 374. Bib
  11. 11.0 11.1 11.2 11.3 J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos, "Incremental Tensor Analysis: theory and applications," ACM Transactions on Knowledge Discovery from Data, 2008. Bib
  12. 12.0 12.1 12.2 J. Sun, S. Papadimitriou, and P. Yu, "Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams," in Proc. Proceedings of the International Conference on Data Mining (ICDM), 2006, Bib
  13. J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu, "GraphScope: parameter-free mining of large time-evolving graphs," in Proc. KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, 2007, p. 687. Bib
  14. J. Tang, J. Sun, C. Wang, and Z. Yang, "Social Influence Analysis in Large-scale Networks," in Proc. KDD, 2009, Bib
  15. Y. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher, "MetaFac: Community Discovery via Relational Hypergraph Factorization," in Proc. KDD, 2009, Bib
  16. S. Papadimitriou, J. Sun, and C. Faloutsos, "Streaming pattern discovery in multiple time-series," in Proc. VLDB '05: Proceedings of the 31st international conference on Very large data bases, 2005, p. 697. Bib
  17. J. Sun, S. Papadimitriou, and C. Faloutsos, "Distributed Pattern Discovery in Multiple Streams," in Proc. PAKDD, 2006, pp. 713–718. Bib
  18. J. Sun, S. Papadimitriou, and C. Faloutsos, "Online Latent Variable Detection in Sensor Networks," in Proc. Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2005, Bib
  19. E. Hoke, J. Sun, and C. Faloutsos, "InteMon: Intelligent System Monitoring on Large Clusters," in Proc. Proceedings of the Very Large Data Bases Conference (VLDB), 2006, Bib
  20. E. Hoke, J. Sun, J. D. Strunk, G. R. Ganger, and C. Faloutsos, "InteMon: continuous mining of sensor data in large-scale self-infrastructures," Operating Systems Review, vol. 40, iss. 3, pp. 38–44, 2006. Bib
  21. F. Wang, J. Sun, T. Li, and N. Anerousis, "Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification," in Proc. ICDM, 2009, Bib
  22. J. Sun, Y. Tao, D. Papadias, and G. Kollios, "Spatio-temporal Join Selectivity," Information Systems, vol. 31, iss. 8, 2006. Bib
  23. J. Sun, D. Papadias, Y. Tao, and B. Liu, "Querying about the Past, the Present, and the Future in Spatio-Temporal," in Proc. ICDE, 2004, pp. 202–213. Bib
  24. Y. Tao, J. Sun, and D. Papadias, "Selectivity Estimation for Predictive Spatio-Temporal Queries," in Proc. ICDE, 2003, pp. 417–428. Bib
  25. Y. Tao, D. Papadias, and J. Sun, "The TPR*-Tree: An Optimized Spatio-Temporal Access Method for Predictive Queries," in Proc. VLDB, 2003, pp. 790–801. Bib
  26. Y. Tao, J. Sun, and D. Papadias, "Analysis of predictive spatio-temporal queries," ACM Trans. Database Syst. vol. 28, iss. 4, pp. 295–336, 2003. Bib
  27. D. Papadias, Y. Tao, J. Zhang, N. Mamoulis, Q. Shen, and J. Sun, "Indexing and Retrieval of Historical Aggregate Information about Moving Objects," IEEE Data Eng. Bull. vol. 25, iss. 2, pp. 10–17, 2002. Bib
Personal tools