Polo Club of Data Science
This course has concluded. See https://poloclub.github.io/#cse6242 for all past course offerings.

This course will introduce you to broad classes of techniques and tools for analyzing and visualizing data at scale. It emphasizes how to combine computation and visualization to perform effective analysis. We will cover methods from each side, and hybrid ones that combine the best of both worlds. Students will work in small teams to complete a research project exploring novel approaches for interactive data & visual analytics.

Piazza Discussion Forum

We will use Piazza for discussion (e.g., homework, project) and for ALL announcements. Post your questions there, and the teaching staff and your fellow classmates will be able to help answer them quickly. You can also use Piazza to find project teammates.

T-square will ONLY be used for submission of assignments and projects.

Office Hours

Instructor Polo Chau Tue, 11am-12pm, Klaus 1324
TA Brian Kahng Tue, 1-2pm, Klaus 1212
TA Seungyeon Kim Wed, 3-4pm, Klaus 1315
TA Alan Zhang Wed, 1:30-2:30pm, Klaus 2126
TA Drew Wei Wed, 11am-12pm, Klaus 2126

Schedule (tentative)

Video recordings of the lectures are available at http://gtcourses.gatech.edu.

Date Topic Tue Thu Events
Aug 19, 21 * Course introduction
* Big data analytics building blocks, data collection, and simple storage (SQLite)
slides slides  
26, 28 * Data cleaning & integration
* Toward Coupling Cognition and Computation through Interactive Interfaces by Prof. Alex Endert
Guest lecture by Prof. Alex Endert slides HW1 out (Tue)
Sept 2, 4 * Visualization fundamentals
* Data visualization for the web (D3)
Slides Slides lectures by Chad Stolper
9, 11 * Data Mining Concepts & Tasks
* Visualization DOs and DON'Ts; Heilmeier Questions
Slides Slides HW1 due (Mon)
16, 18 * Deep learning
* Graph analytics
  • how to build and store graphs
  • basics; power laws; centrality
  • graph statistics and how to compute them
Guest lecture by Josh Patterson. Slides deck #1, deck #2 Slides HW2 out (Thu)
23, 25 * Graph analytics
  • graph algorithms
  • interactive tools
  • applications
Slides Same as Tue
Sept 30, Oct 2 * Scaling up (Hadoop, Pig, HBase, Hive) Slides Slides HW2 due (Wed). Form proj team (Fri)
7, 9 * Scaling up (Spark, Spark SQL, etc)
* Dimensionality Reduction: techniques, visualization, practitioner's guide
Slides Slides Proj proposal due (Fri)
14, 16 * Classification (techniques, visualization & interaction)
No class (Fall student recess) Slides  
21, 23 Project proposal presentations     HW3 out (Mon)
28, 30 * Ensemble Methods Guest lecture by Prof. Jimeng Sun Slides  
Nov 4, 6 * Clustering
* Text analytics: concepts
Slides Slides HW3 due (Wed)
11, 13 * Text analytics: algorithms (LSI=SVD)
* Time series: algorithms
Slides Slides Proj progress report due (Mon), HW4 out (Tue)
18, 20 * Time series: algorithms, visualization, & applications
* Human Computation
Slides Slides  
25, 27 Thanksgiving week Canceled TG HW4 due (Mon)
Dec 2, 4 * Closing words and course overview
* Project poster presentations
Overview Poster session: 9:30-12 Proj final report due (Fri)

Grading

Late Submissions Policy

Distance Learning Sections (Q & Q3)

A standard 3-day lag applies to all homework and project deliverables. For project presentations, a group that has a DL student member can choose to:
  1. present in class without the 3-day lag, or
  2. submit a video presentation (e.g., a screen capture) with the 3-day lag

Homework

We plan to have 4 assignments in total.

All assignments should be submitted via T-Square.

If you have questions about the assignments, the fastest way for us to help is for you to post them on Piazza. If you prefer that your question be addressed only to our TAs and the instructor, you can use the private post feature (i.e., check the "Individual Student(s) / Instructor(s)" radio box).

While collaboration is allowed, each student *must* write up their own answers. All GT students must observe the honor code.

Project

See project description. See the schedule table above for deliverable due dates.

Dataset Ideas (may need API, or scraping)

Auditors

Auditors must first obtain the instructor's permission, then enroll in the course. Auditors must attend all lectures and may optionally complete the assignments.

Textbooks and reading materials

Prerequisites

For both CSE 6242 (grad) and CX 4242 (undergrad)
Students are expected to complete significant programming assignments (homework, project) that may involve higher-level languages or scripting (e.g., Java, R, Matlab, Python, C++, etc.).

Some assignments may involve web programming and D3 (e.g., JavaScript, CSS).

You are expected to quickly learn many new things. For example, an assignment on Hadoop programming may require you to learn some basic Java and Scala quickly, which should not be too challenging if you already know another high-level language like Python or C++. Please make sure you are comfortable with this.

Please take a look at the assignments (homework and project) of the previous offerings of this course, which will give you some idea about the difficulty level of the assignments.

Basic knowledge of linear algebra and probability is expected.

Additional formal prerequisites for CSE 6242
None, but you should have taken courses similar to those listed in the next section, at Georgia Tech or at another school.
Additional formal prerequisites for CX 4242
(Undergraduate Semester level MATH 2605 Minimum Grade of D or
Undergraduate Semester level MATH 2401 Minimum Grade of D or
Undergraduate Semester level MATH 24X1 Minimum Grade of D)
and
(Undergraduate Semester level MATH 3215 Minimum Grade of D or
Undergraduate Semester level MATH 3225 Minimum Grade of D or
Undergraduate Semester level ECE 3077 Minimum Grade of D or
Undergraduate Semester level ISYE 2027 Minimum Grade of D)
and
(Undergraduate Semester level CS 1371 Minimum Grade of C or
Undergraduate Semester level CS 1372 Minimum Grade of C or
Undergraduate Semester level CX 4010 Minimum Grade of C or
Undergraduate Semester level CX 4240 Minimum Grade of C)
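
Read as a rule, the listing above appears to require one course from each of the three parenthesized groups, at or above that group's minimum grade. The sketch below encodes that reading in Python. It is an illustration only: the grade ordering, the function name `meets_cx4242_prereqs`, and the transcript format are assumptions made for this example, and the registration system remains the authority on eligibility.

```python
# Grade ordering used for the comparison (an assumption; plus/minus grades are ignored).
GRADE_RANK = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# Each entry is (minimum grade, courses); one course per group must meet the minimum.
# Course names, including "MATH 24X1", are copied as written in the listing above.
CX4242_PREREQ_GROUPS = [
    ("D", ["MATH 2605", "MATH 2401", "MATH 24X1"]),
    ("D", ["MATH 3215", "MATH 3225", "ECE 3077", "ISYE 2027"]),
    ("C", ["CS 1371", "CS 1372", "CX 4010", "CX 4240"]),
]


def meets_cx4242_prereqs(transcript):
    """Return True if every group has a course at or above its minimum grade.

    `transcript` is a hypothetical dict mapping a course name to a letter
    grade, e.g. {"CS 1371": "B"}. Missing courses count as not taken.
    """
    for min_grade, courses in CX4242_PREREQ_GROUPS:
        satisfied = any(
            GRADE_RANK.get(transcript.get(course), -1) >= GRADE_RANK[min_grade]
            for course in courses
        )
        if not satisfied:
            return False
    return True
```

For example, a transcript with MATH 2605 (D), ISYE 2027 (B), and CS 1371 (C) satisfies all three groups under this reading, while the same transcript with a D in CS 1371 does not.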

Previous offerings

See https://poloclub.github.io/#cse6242 for all past course offerings.

Acknowledgements & Related Classes

We thank Amazon's AWS in Education grant program for providing support for Amazon Web Services.
Tableau's data visualization software is provided through the Tableau for Teaching program.

Many thanks to my colleagues for sharing their course materials:
Prof. John Stasko - Information Visualization - Fall 2012
Prof. Jeff Heer - Research Topics in Interactive Data Analysis - Spring 2011
Prof. Christos Faloutsos - Multimedia Databases and Data Mining - Fall 2012