This course will introduce you to broad classes of techniques and tools for analyzing and visualizing data at scale.
It emphasizes how to combine computation and visualization to perform effective analysis.
We will cover methods from each side, and hybrid ones that combine the best of both worlds.
Students will work in small teams to complete a research project exploring novel approaches for interactive data & visual analytics.
Piazza Discussion Forum
We will use Piazza for discussion (e.g., homework, project).
Post your questions there, and the teaching staff and your fellow classmates will be able to help answer them quickly.
You can also use Piazza to find project teammates.
T-Square will only be used for submission of assignments and projects.
Office Hours
Schedule (tentative)
Video recordings of the lectures are available at http://gtcourses.gatech.edu.
| Date | Topic | Tue | Thu | Events |
|---|---|---|---|---|
| Jan 7, 9 | * Course introduction * Big data analytics building blocks, data collection, and simple storage (SQLite) | Slides | Slides | |
| Jan 14, 16 | * Data cleaning & integration * Data Mining Concepts & Tasks | Slides | Slides | HW1 out (Tue) |
| Jan 21, 23 | * Visualization fundamentals * Data visualization for the web (D3) | Slides | Slides | by Chad Stolper |
| Jan 28, 30 | Snow days! | X | X | HW1 due (Mon) |
| Feb 4, 6 | * Visualization DOs and DON'Ts; Heilmeier Questions * Graph analytics (how to build and store graphs; basics, power laws, centrality; graph statistics and how to compute them) | Slides | Slides | HW2 out (Sat) |
| Feb 11, 13 | Snow days again! | X | X | |
| Feb 18, 20 | * Graph analytics (graph algorithms; interactive tools; applications) * Scaling up (Hadoop, Pig, HBase, Hive, Pegasus) | Slides | Slides | * Form proj teams by 2/21 * HW2 due 2/21 |
| Feb 25, 27 | * Scaling up (cont'd) * Classification (techniques, visualization & interaction) | Slides | Slides | |
| Mar 4, 6 | * Clustering * Dimensionality Reduction: techniques, visualization, practitioner's guide | Slides | Slides. Guest lecture by Dr. Jaegul Choo | * Proj proposals due 3/8 (DL: due 3/15). * HW3 out (Sun). |
| Mar 11, 13 | Project proposal presentations | | | |
| Mar 18, 20 | Spring Break | X | X | |
| Mar 25, 27 | Time series: algorithms, visualization, & applications | Slides | Slides | |
| Apr 1, 3 | Text analytics: concepts, algorithms (LSI=SVD), visualization | Slides | Slides | HW3 due (Mon, 3/31) |
| Apr 8, 10 | * Ensemble Methods * Human Computation | Slides | Slides | Progress report due Wed, 4/9, 5pm; HW4 out (Mon) |
| Apr 15, 17 | * Analytics in the real world * Closing words and course overview | Guest lecture by Flavio Villanustre, VP at LexisNexis, HPCC Systems. Slides | Slides | |
| Apr 22, 24 | Project presentations | | | Final report due Fri, 4/25, 5pm (DL: 5/2); HW4 due 4/30 (DL: 5/2) |
Grading
- 40% Homework
- 50% Project
- 10% Class and Piazza participation
Late Submissions Policy
- Homework: each student has 4 slip days total. No questions asked.
- Project: each team has 3 slip days total. No questions asked.
- Each slip day equals 24 hours. E.g., if your submission is 30 hours late, that counts as 2 slip days.
- After all slip days are used up, there is a 5% deduction for every 24 hours of delay (e.g., 5 points for a 100-point homework).
- No penalties for medical reasons or emergencies. You must submit a doctor's note or an official letter explaining the emergency.
- To use slip days, specify the number of days you have used in the textbox on T-Square (when you submit your work).
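As a rough illustration of the slip-day arithmetic above, here is a minimal Python sketch. The function names are hypothetical, and it makes two assumptions that the policy does not spell out: the late days left over after slip days run out are rounded up to whole 24-hour blocks just like slip days, and the total deduction is capped at 100%.

```python
import math
from typing import Tuple

HOMEWORK_SLIP_DAYS = 4   # per student, from the policy above
PROJECT_SLIP_DAYS = 3    # per team, from the policy above
PENALTY_PER_DAY = 0.05   # 5% per 24 hours once slip days are used up


def slip_days_needed(hours_late: float) -> int:
    """Round lateness up to whole 24-hour slip days (30 hours -> 2 days)."""
    if hours_late <= 0:
        return 0
    return math.ceil(hours_late / 24)


def late_penalty(hours_late: float, slip_days_left: int) -> Tuple[int, float]:
    """Return (slip days consumed, fraction of the score deducted).

    Assumption: late days not covered by the remaining slip days are each
    charged 5%, and the deduction never exceeds 100% of the score.
    """
    days = slip_days_needed(hours_late)
    used = min(days, slip_days_left)
    penalty = min(1.0, (days - used) * PENALTY_PER_DAY)
    return used, penalty
```

For example, `late_penalty(30, HOMEWORK_SLIP_DAYS)` consumes 2 slip days with no deduction, while `late_penalty(30, 1)` consumes the last slip day and, under the assumptions above, deducts 5% for the one uncovered day.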
Homework (tentative)
Please note that while collaboration is allowed, individual collaborators *must* write up their own answers.
All GT students must observe the honor code.
- [5%] HW1: Analyzing Rotten Tomatoes Data; SQLite; D3 Warmup
- [10%] HW2: D3 Graphs and Visualization
- [15%] HW3: Pig and Hive
- [10%] HW4: Tree and Weka
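The four homework weights add up to the 40% homework share listed under Grading. As a hedged sketch of how the published weights might combine into a course score, the snippet below uses a hypothetical `course_score` helper; it assumes every component is scored on a 0-100 scale and ignores late penalties, and the actual grade computation may differ.

```python
# Weights in percent, taken from the Grading and Homework lists.
HOMEWORK_WEIGHTS = {"HW1": 5, "HW2": 10, "HW3": 15, "HW4": 10}
PROJECT_WEIGHT = 50
PARTICIPATION_WEIGHT = 10


def course_score(homework: dict, project: float, participation: float) -> float:
    """Weighted course score on a 0-100 scale.

    Assumption: each input is a percentage between 0 and 100, and
    `homework` maps every key of HOMEWORK_WEIGHTS to its score.
    """
    total = sum(weight * homework[name]
                for name, weight in HOMEWORK_WEIGHTS.items())
    total += PROJECT_WEIGHT * project
    total += PARTICIPATION_WEIGHT * participation
    return total / 100


# Illustrative scores only, not real data.
example = course_score({"HW1": 90, "HW2": 80, "HW3": 70, "HW4": 100},
                       project=85, participation=100)
```

With these made-up scores, homework contributes 33 of its 40 points, the project 42.5 of 50, and participation all 10, for a total of 85.5.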
Project
Team project: 3-4 people.
Description and grading policy (proposal + presentation, progress report, final report + presentation).
Dataset Ideas (may need API, or scraping)
- Yahoo WebScope
- Data.gov: U.S. Government's open data
- Freebase
- Yelp
- Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
- Trulia, Zillow: real estate listing sites
- Numerous graph datasets (large and small): SNAP, Konect
- Movies data: Rotten Tomatoes, IMDB
- List of lists of datasets for recommendations.
Thanks Jon!
- Million Song Dataset by Echo Nest.
It contains not only basic song information (artist, genre, year, length, etc.) but also musical features (such as tempo, pitch, key, and brightness).
Thanks Minwei!
- Dataset about soccer games, players, clubs.
No API, but easy to scrape.
For a soccer player: transfer history, performance, nationality, birth date, etc.
For a soccer club: performance, squad, etc.
Thanks Ding!
- Quandl - a dataset search engine for time-series data.
Thanks Henry!
- A collection of links to various datasets.
Thanks Vignesh!
- UCI also has a collection of links to various datasets sorted for various tasks (Classification, Regression, etc)
Thanks Vinodh!
- Amazon AWS Public Data Sets (Thanks Jonathan!)
- KDD Cup: annual competition in data mining, like Kaggle
- Academic domain: Microsoft Academic Search, DBLP
- Retrosheet: MLB statistics (Game/Play logs)
- Classification datasets
Thanks Amish!
- Various geophysical datasets for the oceans (magnetism, gravity, seismology, etc).
Thanks Ryan!
- Social trends (Thanks Jonathan!)
- Beer data (Thanks Jonathan!)
- Academic torrents (terabytes) (Thanks Vaibhav!)
- Article Search API from the New York Times (all the way back to 1851!) (Thanks Guido!)
- (Kayak: flight, hotel, car, etc.)
Auditors
Auditors must first obtain the instructor's permission, then enroll in the course.
Auditors must attend all lectures and may optionally complete the assignments.
Textbooks and reading materials
- None required.
- Highly recommended good reads:
Prerequisites
For both CSE 6242 (grad) and CX 4242 (undergrad)
Students are expected to complete significant programming assignments (homework, project) that may involve higher-level languages or scripting (e.g., Java, R, Matlab, Python, C++, etc.).
Some assignments may involve web programming and D3 (e.g., JavaScript, CSS).
Basic knowledge of algebra and probability is expected.
Additional formal prerequisites for CSE 6242
None.
Additional formal prerequisites for CX 4242
(Undergraduate Semester level MATH 2605 Minimum Grade of D or
Undergraduate Semester level MATH 2401 Minimum Grade of D or
Undergraduate Semester level MATH 24X1 Minimum Grade of D)
and
(Undergraduate Semester level MATH 3215 Minimum Grade of D or
Undergraduate Semester level MATH 3225 Minimum Grade of D or
Undergraduate Semester level ECE 3077 Minimum Grade of D or
Undergraduate Semester level ISYE 2027 Minimum Grade of D)
and
(Undergraduate Semester level CS 1371 Minimum Grade of C or
Undergraduate Semester level CS 1372 Minimum Grade of C or
Undergraduate Semester level CX 4010 Minimum Grade of C or
Undergraduate Semester level CX 4240 Minimum Grade of C)
Previous offerings
See https://poloclub.github.io/#cse6242 for all past course offerings.
Acknowledgements & Related Classes
We thank Amazon's AWS in Education grant program for providing support for Amazon Web Services.
Many thanks to my colleagues for sharing their course materials:
Prof. John Stasko - Information Visualization - Fall 2012
Prof. Jeff Heer - Research Topics in Interactive Data Analysis - Spring 2011
Prof. Christos Faloutsos - Multimedia Databases and Data Mining - Fall 2012