The Interactive Data Exploration and Analytics (IDEA) workshop addresses the development of data mining techniques that allow users to interactively explore their data. We focus on interactivity and the effective integration of techniques from data mining, visualization, and human-computer interaction (HCI). In other words, we explore how the best of these different but related domains can be combined such that the sum is greater than its parts. The IDEA workshops at KDD 2013 in Chicago, KDD 2014 in New York City, and KDD 2015 in Sydney were all great successes.
IDEA will be a full-day workshop on Sunday, August 14, at KDD 2016, held in the Embarcadero room of the Parc 55 hotel (just across the street from the main conference venue). Register and book hotel rooms through KDD's registration site.
In total, 17 papers have been accepted for presentation at IDEA 2016: 7 for oral presentation during the day, and 10 for interactive discussion at the poster + demo + networking session.
|8:15||Welcome to IDEA'16|
Regression Location and Scale Estimation with Application to Censoring
Jerome H. Friedman is Professor Emeritus of Statistics, Stanford University. He received both bachelor's and Ph.D. degrees in physics from the University of California, Berkeley. He was leader of the Computation Research Group at the Stanford Linear Accelerator Center from 1972 through 2006. He was Professor of Statistics, Stanford University, from 1982 through 2006, and served as Department Chair from 1988 through 1991. His primary interests center on machine learning and data mining. He has authored or coauthored over 100 papers in major statistical journals as well as three books on data mining, and has invented or co-invented several widely used machine learning and data mining procedures.
The aim of regression analysis in machine learning is to estimate the location of the distribution of an outcome variable y, given the joint values of a set of predictor variables x. This location estimate is then used as a prediction for the value of y at x. The accuracy of this prediction depends on the scale of the distribution of y at x, which, in turn, usually depends on x (heteroscedasticity). A robust procedure is presented for jointly estimating both the location and scale of the distribution of y given x, as functions of x, under no assumptions concerning the relationship between the two functions. The scale function can then be used to assess the accuracy of individual predictions, as well as to improve accuracy, especially in the presence of censoring.
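Friedman's procedure estimates both functions jointly and robustly; as a rough illustration of the general idea only (not the talk's actual method), one can estimate the location first and then fit the scale to the absolute residuals. The sketch below uses assumed toy data and simple binned estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0.0, 1.0, n)
true_loc = np.sin(2 * np.pi * x)    # location of y given x
true_scale = 0.1 + 0.5 * x          # scale grows with x: heteroscedasticity
y = true_loc + true_scale * rng.standard_normal(n)

# Stage 1: estimate the location function with a binned conditional mean.
bins = np.linspace(0.0, 1.0, 21)
idx = np.clip(np.digitize(x, bins) - 1, 0, 19)
loc_hat = np.array([y[idx == b].mean() for b in range(20)])

# Stage 2: estimate the scale function from the absolute residuals
# (for Gaussian noise this recovers the scale up to the constant
# E|Z| = sqrt(2/pi), so only its shape is compared to true_scale).
resid = np.abs(y - loc_hat[idx])
scale_hat = np.array([resid[idx == b].mean() for b in range(20)])
```

The estimated scale function can then flag regions of x where individual predictions are unreliable, which is the use the abstract describes.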
Research Talks (time allocation: 15+5 each)
University of Washington
Trifacta Co-Founder & CXO
Jeffrey Heer is an Associate Professor of Computer Science & Engineering at the University of Washington, where he directs the Interactive Data Lab and conducts research on data visualization, human-computer interaction, and social computing. The visualization tools developed by his lab (D3.js, Vega, Protovis, Prefuse) are used by researchers, companies, and thousands of data enthusiasts around the world. His group's research papers have received awards at the premier venues in human-computer interaction and information visualization (ACM CHI, ACM UIST, IEEE InfoVis, IEEE VAST, EuroVis). Other awards include MIT Technology Review's TR35 (2009), a Sloan Foundation Research Fellowship (2012), and a Moore Foundation Data-Driven Discovery Investigator award (2014). Jeff holds BS, MS, and PhD degrees in Computer Science from UC Berkeley, and taught at Stanford from 2009 to 2013. Jeff is also a co-founder of Trifacta, a provider of interactive tools for scalable data transformation.
How might we architect interactive systems that have better models of the tasks we're trying to perform, learn over time, help refine ambiguous user intents, and scale to large or repetitive workloads? In this talk I will present Predictive Interaction, a framework for interactive systems that shifts some of the burden of specification from users to algorithms, while preserving human guidance and expressive power. The central idea is to imbue software with domain-specific models of user tasks, which in turn power predictive methods to suggest a variety of possible actions. I will illustrate these concepts with examples drawn from widely-deployed systems for data transformation and visualization (with reported order-of-magnitude productivity gains) and then discuss associated design considerations and future research directions.
Better Machine Learning Through Data
Saleema Amershi is a Researcher in the Machine Teaching Group at Microsoft Research. Her research lies at the intersection of human-computer interaction and machine learning. In particular, her work involves designing and developing tools to support both end-user and practitioner interaction with interactive machine learning systems. Saleema received her Ph.D. in computer science from the University of Washington's Computer Science & Engineering department in 2012.
Machine learning is the product of both an algorithm and data. While machine learning research tends to focus on algorithmic advances, taking the data as given, machine learning practice is quite the opposite. Most of the influence practitioners have in using machine learning to build predictive models comes through interacting with data, including crafting the data used for training and examining results on new data to inform future iterations. In this talk, I will present tools and techniques we have been developing in the Machine Teaching Group at Microsoft Research to support the model building process. I will then discuss some of the open challenges and opportunities in improving the practice of machine learning.
Research Talks (time allocation: 10+5 each)
At Last! Time Series Joins, Motifs, Discords and Shapelets at Interactive Speeds
Eamonn Keogh is a Full Professor in the Computer Science & Engineering Department of the University of California, Riverside. His research areas include data mining, machine learning, and information retrieval, specializing in techniques for solving similarity and indexing problems in time-series datasets. He has authored more than 120 papers. He received the IEEE ICDM 2007 best paper award, the SIGMOD 2001 best paper award, and the runner-up best paper award at KDD 1997. He has given over two dozen well-received tutorials at the premier conferences in data mining and databases.
Given the ubiquity of time series, the last decade has seen a flurry of activity in time series data mining. Some of the most useful and frequently used primitives "reason" about the shapes of subsequences found in longer time series. Examples include time series joins, motifs, discords, and shapelets. These primitives have found significant adoption; however, they are all run in batch mode. For most non-trivial datasets, you start the process, you go to lunch (or on a short vacation!), and examine the results when you get back. What if you could solve such problems in interactive time? Well, now you can! With a new data structure called the Matrix Profile, interactive data mining of large datasets has become possible for the first time, and as we shall demonstrate, it is a game changer.
|3:30||Coffee + Posters + Interactive Demo + Networking Session|
Posters + Interactive Demo + Networking Session
We have entered the era of big data. Massive datasets, surpassing terabytes and petabytes, are now commonplace. They arise in numerous settings in science, government, and enterprises. Today, technology exists by which we can collect and store such massive amounts of information. Yet, making sense of these data remains a fundamental challenge. We lack effective means for the exploratory analysis of databases at this scale. Currently, few technologies allow us to freely "wander" around the data and make discoveries by following our intuition, or serendipity. While standard data mining aims at finding highly interesting results, it is typically computationally demanding and time consuming, and thus may not be well suited for interactive exploration of large datasets.
Interactive data mining techniques that aptly integrate human intuition, by means of visualization and intuitive human-computer interaction (HCI) techniques, with machine computation have been shown to help people gain significant insights into a wide range of problems. However, as datasets are generated in larger volumes, at higher velocity, and with greater variety, creating effective interactive data mining techniques becomes a much harder task.
Our focus is on interactivity and the effective integration of techniques from data mining, visualization, and human-computer interaction. In other words, we intend to explore how the best of these different but related domains can be combined such that the sum is greater than its parts.
|Workshop||Sun, August 14, 2016|
All papers will be peer reviewed (single-blind). We welcome many kinds of papers, including but not limited to:
Authors should clearly indicate in their abstracts which kind of submission their paper is, to help reviewers better understand its contributions. Submissions must be in PDF, written in English, no more than 10 pages long (shorter papers are welcome), and formatted according to the standard double-column ACM Proceedings style (Tighter Alternate style).
The accepted papers will be posted on the workshop website and will not appear in the KDD proceedings.
For accepted papers, at least one author must attend the workshop to present the work.