The Interactive Data Exploration and Analytics (IDEA) workshop addresses the development of data mining techniques that allow users to interactively explore their data. We focus on interactivity and the effective integration of techniques from data mining, visualization, and human-computer interaction (HCI). In other words, we explore how the best of these different but related domains can be combined such that the sum is greater than its parts. The IDEA workshops at KDD 2013 in Chicago, KDD 2014 in New York City, and KDD 2015 in Sydney were all great successes.
IDEA will be a full-day workshop on Sunday, August 14, at KDD 2016, held in the Embarcadero room of the Parc 55 hotel (just across the street from the main conference venue). Register and book hotel rooms through KDD's registration site.
In total, 17 papers have been accepted for presentation at IDEA 2016: 7 for oral presentation during the day, and 10 for interactive discussion at the poster + demo + networking session.
|8:15||Welcome to IDEA'16|
Regression Location and Scale Estimation with Application to Censoring
Jerome H. Friedman is Professor Emeritus of Statistics, Stanford University. He received both bachelor's and Ph.D. degrees in physics from the University of California, Berkeley. He was leader of the Computation Research Group at the Stanford Linear Accelerator Center from 1972 through 2006. He was Professor of Statistics, Stanford University, from 1982 through 2006, and served as Department Chair from 1988 through 1991. His primary interests center on machine learning and data mining. He has authored or coauthored over 100 papers in major statistical journals as well as three books on data mining, and has invented or co-invented several widely used machine learning and data mining procedures.
The aim of regression analysis in machine learning is to estimate the location of the distribution of an outcome variable y, given the joint values of a set of predictor variables x. This location estimate is then used as a prediction for the value of y at x. The accuracy of this prediction depends on the scale of the distribution of y at x, which, in turn, usually depends on x (heteroscedasticity). A robust procedure is presented for jointly estimating both the location and scale of the distribution of y given x, as functions of x, under no assumptions concerning the relationship between the two functions. The scale function can then be used to assess the accuracy of individual predictions, as well as to improve accuracy, especially in the presence of censoring.
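Friedman's procedure estimates both functions jointly and robustly; as a rough illustration of the general idea only (not the talk's actual method), one can estimate the location first and then fit the scale to the absolute residuals. The sketch below uses assumed toy data and simple binned estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0.0, 1.0, n)
true_loc = np.sin(2 * np.pi * x)    # location of y given x
true_scale = 0.1 + 0.5 * x          # scale grows with x: heteroscedasticity
y = true_loc + true_scale * rng.standard_normal(n)

# Stage 1: estimate the location function with a binned conditional mean.
bins = np.linspace(0.0, 1.0, 21)
idx = np.clip(np.digitize(x, bins) - 1, 0, 19)
loc_hat = np.array([y[idx == b].mean() for b in range(20)])

# Stage 2: estimate the scale function from the absolute residuals
# (for Gaussian noise this recovers the scale up to the constant
# E|Z| = sqrt(2/pi), so only its shape is compared to true_scale).
resid = np.abs(y - loc_hat[idx])
scale_hat = np.array([resid[idx == b].mean() for b in range(20)])
```

The estimated scale function can then flag regions of x where individual predictions are unreliable, which is the use the abstract describes.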
Research Talks (time allocation: 15+5 each)
University of Washington
Trifacta Co-Founder & CXO
Jeffrey Heer is an Associate Professor of Computer Science & Engineering at the University of Washington, where he directs the Interactive Data Lab and conducts research on data visualization, human-computer interaction, and social computing. The visualization tools developed by his lab (D3.js, Vega, Protovis, Prefuse) are used by researchers, companies, and thousands of data enthusiasts around the world. His group's research papers have received awards at the premier venues in human-computer interaction and information visualization (ACM CHI, ACM UIST, IEEE InfoVis, IEEE VAST, EuroVis). Other awards include MIT Technology Review's TR35 (2009), a Sloan Foundation Research Fellowship (2012), and a Moore Foundation Data-Driven Discovery Investigator award (2014). Jeff holds BS, MS, and PhD degrees in Computer Science from UC Berkeley, and taught at Stanford from 2009 to 2013. Jeff is also a co-founder of Trifacta, a provider of interactive tools for scalable data transformation.
How might we architect interactive systems that have better models of the tasks we're trying to perform, learn over time, help refine ambiguous user intents, and scale to large or repetitive workloads? In this talk I will present Predictive Interaction, a framework for interactive systems that shifts some of the burden of specification from users to algorithms, while preserving human guidance and expressive power. The central idea is to imbue software with domain-specific models of user tasks, which in turn power predictive methods to suggest a variety of possible actions. I will illustrate these concepts with examples drawn from widely-deployed systems for data transformation and visualization (with reported order-of-magnitude productivity gains) and then discuss associated design considerations and future research directions.
Better Machine Learning Through Data
Saleema Amershi is a Researcher in the Machine Teaching Group at Microsoft Research. Her research lies at the intersection of human-computer interaction and machine learning. In particular, her work involves designing and developing tools to support both end-user and practitioner interaction with interactive machine learning systems. Saleema received her Ph.D. in computer science from the University of Washington's Computer Science & Engineering department in 2012.
Machine learning is the product of both an algorithm and data. While machine learning research tends to focus on algorithmic advances, taking the data as given, machine learning practice is quite the opposite. Most of the influence practitioners have in using machine learning to build predictive models comes through interacting with data, including crafting the data used for training and examining results on new data to inform future iterations. In this talk, I will present tools and techniques we have been developing in the Machine Teaching Group at Microsoft Research to support the model building process. I will then discuss some of the open challenges and opportunities in improving the practice of machine learning.
Research Talks (time allocation: 10+5 each)
At Last! Time Series Joins, Motifs, Discords and Shapelets at Interactive Speeds
Eamonn Keogh is a Full Professor in the Computer Science & Engineering Department of the University of California, Riverside. His research areas include data mining, machine learning, and information retrieval, specializing in techniques for solving similarity and indexing problems in time-series datasets. He has authored more than 120 papers. He received the IEEE ICDM 2007 best paper award, the SIGMOD 2001 best paper award, and the runner-up best paper award at KDD 1997. He has given over two dozen well-received tutorials at the premier conferences in data mining and databases.
Given the ubiquity of time series, the last decade has seen a flurry of activity in time series data mining. Some of the most useful and frequently used primitives "reason" about the shapes of subsequences found in longer time series. Examples include time series joins, motifs, discords, and shapelets. These primitives have found significant adoption; however, they are all run in batch mode. For most non-trivial datasets, you start the process, you go to lunch (or on a short vacation!), and examine the results when you get back. What if you could solve such problems in interactive time? Well, now you can! With a new data structure called the Matrix Profile, interactive data mining of large datasets has become possible for the first time, and as we shall demonstrate, it is a game changer.
|3:30||Coffee + Posters + Interactive Demo + Networking Session|
Posters + Interactive Demo + Networking Session
We have entered the era of big data. Massive datasets, surpassing terabytes and petabytes, are now commonplace. They arise in numerous settings in science, government, and enterprises. Today, technology exists by which we can collect and store such massive amounts of information. Yet, making sense of these data remains a fundamental challenge. We lack effective means for the exploratory analysis of databases at this scale. Currently, few technologies allow us to freely "wander" around the data and make discoveries by following our intuition, or serendipity. While standard data mining aims at finding highly interesting results, it is typically computationally demanding and time consuming, and thus may not be well suited for interactive exploration of large datasets.
Interactive data mining techniques that aptly integrate human intuition, by means of visualization and intuitive human-computer interaction (HCI) techniques, with machine computation have been shown to help people gain significant insights into a wide range of problems. However, as datasets are generated in larger volumes, at higher velocity, and with greater variety, creating effective interactive data mining techniques becomes a much harder task.
Our focus is on interactivity and the effective integration of techniques from data mining, visualization, and human-computer interaction. In other words, we intend to explore how the best of these different but related domains can be combined such that the sum is greater than its parts.
|Workshop||Sun, August 14, 2016|
All papers will be peer reviewed (single-blind). We welcome many kinds of papers, including but not limited to:
Authors should clearly indicate in their abstracts which kind of submission their paper is, to help reviewers better understand its contributions. Submissions must be in PDF, written in English, no more than 10 pages long (shorter papers are welcome), and formatted according to the standard double-column ACM Proceedings style (Tighter Alternate style).
The accepted papers will be posted on the workshop website and will not appear in the KDD proceedings.
For accepted papers, at least one author must attend the workshop to present the work.