This course has concluded. See https://poloclub.github.io/#cse6242 for all past course offerings.

CSE6242 / CX4242, Fall 2016
Data and Visual Analytics

Georgia Tech, College of Computing

4:35 - 5:55pm, Clough 152, Tue & Thu
Prof. Duen Horng (Polo) Chau

This course will introduce you to broad classes of techniques and tools for analyzing and visualizing data at scale. It emphasizes how to combine computation and visualization to perform effective analysis. We will cover methods from each side, as well as hybrid ones that combine the best of both worlds. Students will work in small teams to complete a research project exploring novel approaches for interactive data & visual analytics.

Office Hours


Polo Chau: Tue, 3:30-4:00pm (+ 30min after Tue's class at Clough Starbucks), Klaus 1324
Nilaksh Das: Fri, 1-2PM, CCB common area (1st floor)
Pradeep Vairamani Rajendran: Wed, 2-3PM, Outside Klaus 3201
Yanwei Zhang: Mon, 1-2PM, Klaus 3205
Bhanu Verma: Fri, 2-3PM, CULC (3rd floor), common area near 325 (take a right from the stairs and walk a few steps; the common area is on the left)
Meghna Natraj: Tue, 2-3PM, CCB common area (1st floor)
Vishakha Singh: Wed, 11AM-12PM, Outside Klaus 3100

Schedule, Lectures, In-class Announcements

Date: Aug 23, 25
Topic:
* Course introduction
* Big data analytics building blocks
Tue: intro
Thu: building blocks

Date: Aug 30, Sept 1
Topic:
* Data Collection, and simple storage (SQLite)
* Data cleaning
Tue: collection, cleaning
Thu: cancelled
Events: HW1 out

Date: Sept 6, 8
Topic:
* Data integration: knowledge graph/database; feldspar; data reconciliation/de-duplication; similarity functions
* Heilmeier questions; group project core requirements
* Example projects:
(1) Firebird: Predicting Fire Risks in Atlanta
(2) PASSAGE: A Travel Safety Assistant
Tue: integration
Thu: project, Firebird, PASSAGE

Date: Sept 13, 15
Topic:
* Visualization 101
* Fixing common visualization issues
Tue: vis101
Thu: fix-vis
Events: HW1 due (Fri, 11:55pm)

Date: Sept 20, 22
Topic:
* Industry Talk
* Data visualization for the web (D3)
Tue: Yahoo Tech Talk + info session
Thu: d3
Events: Form project teams by Friday; HW2 out

Date: Sept 27, 29
Topic:
* Data Mining Concepts & Tasks
* Fixing visualization and presentation
Tue: hw2 walkthrough, javascript demo, data mining concepts
Thu: concepts, teamwork tips, fix-vis

Date: Oct 4, 6
Topic:
* Intro to classification: k nearest neighbor (KNN), decision trees, cross validation
* Scaling up: Hadoop, Pig, Hive
Tue: classification-intro
Thu: hadoop, pig, hive

Date: Oct 11, 13
Topic:
* Scaling up: Spark, Spark SQL
Tue: X (fall break)
Thu: spark, backup code with git/github
Events: HW2 due (Mon, 11:55pm); HW3 out

Date: Oct 18, 20
Topic: Project proposal presentations
Tue: Show time!
Thu: Show time!
Events: Project proposal & slides due (Mon, 11:55pm)

Date: Oct 25, 27
Topic:
* Scaling up: HBase
* MMap
* Graph analytics
  • how to build and store graphs
  • basics; power laws; centrality
  • (personalized) PageRank
  • interactive applications
  • evaluating apps
Tue: hbase
Thu: graphs basics

Date: Nov 1, 3
Topic: More graphs
Tue: graphs centrality and algorithms
Thu: graphs centrality and algorithms
Events: HW3 due (Fri, 11:55pm)

Date: Nov 8, 10
Topic:
* Ensemble method, bagging, random forests
* Classification (visualization & interaction)
* Memory-mapping/virtual memory to scale up algorithms
Tue: bagging, random forests
Thu: roc, auc, confusion matrix, mmap
Events: HW4 out; Project progress report due (Sun, 11:55pm EST)

Date: Nov 15, 17
Topic:
* Text analytics: concepts
* Text analytics: algorithms (LSI=SVD)
Tue: text analytics, clustering
Thu: text analytics

Date: Nov 22, 24
Topic: Thanksgiving
Tue: X
Thu: X

Date: Nov 29, Dec 1
Topic:
* Time series: algorithms, visualization, & applications
(* Dimension reduction (PCA, MDS, LDA, IsoMap))
Tue/Thu: time series linear and non-linear forecasting review
Events: HW4 due (Sun, 11:55pm)

Date: Dec 6
Topic: Project poster presentations
Tue: Poster presentation. 4:30pm to 6pm-ish. Klaus Atrium. Pizza + drinks served!
Thu: X
Events: Proj final report due (Tue, 11:55pm EST)

Announcements and Discussion on Piazza

We use Piazza for discussion and all announcements.

Post your questions there. Our teaching staff and your fellow classmates will help answer them quickly. You can also use Piazza to find project teammates.

T-square will only be used for submission of assignments and projects.

We welcome everyone to share their experiences in tackling issues and to help each other out, but please do not post your answers, as that may affect the learning experience of your fellow classmates.

Homework (50% of grade)

The fastest way to get help with homework assignments is to post your questions on Piazza. If you prefer that your question be addressed only to our TAs and the instructor, you can use the private post feature (i.e., check the "Individual Student(s) / Instructor(s)" radio box).
While collaboration is allowed for homework assignments, each student must write up their own answers. All GT students must observe the honor code. Any suspected plagiarism or academic misconduct will be reported to and directly handled by the Office of Student Integrity (OSI).
We plan to have 4 assignments in total.

Project (50% of grade)

See project description. See the schedule table above for deliverable due dates.

Late Submissions Policy

Distance Learning Sections (Q & Q3)

A standard 3-day lag applies to all homework and project deliverables. For the project presentation, a group that has a DL student member can choose to:
  1. Present in class without the 3-day lag; or
  2. Submit a video presentation (e.g., a screen capture) with the 3-day lag

Dataset Ideas (may need API, or scraping)

Reading materials & Resources

Data Science

Visualization

SQL

Prerequisites & Expectation

For both CSE 6242 (grad) and CX 4242 (undergrad)

Students are expected to complete significant programming assignments (homework, project) that may involve higher-level languages or scripting (e.g., Java, R, Matlab, Python, C++, etc.).

Some assignments may involve web programming and D3 (e.g., JavaScript, CSS).

You are expected to quickly learn many new things. For example, an assignment on Hadoop programming may require you to learn some basic Java and Scala quickly, which should not be too challenging if you already know another high-level language like Python or C++. Please make sure you are comfortable with this.

Please take a look at the assignments (homework and project) of the previous offerings of this course, which will give you some idea about the difficulty level of the assignments.

Basic knowledge of linear algebra and probability is expected.

Additional formal prerequisites for CSE 6242

None, but you should have taken courses similar to those listed in the next section, at Georgia Tech or at another school.

Additional formal prerequisites for CX 4242

(Undergraduate Semester level MATH 2605 Minimum Grade of D or
Undergraduate Semester level MATH 2401 Minimum Grade of D or
Undergraduate Semester level MATH 24X1 Minimum Grade of D)
and
(Undergraduate Semester level MATH 3215 Minimum Grade of D or
Undergraduate Semester level MATH 3225 Minimum Grade of D or
Undergraduate Semester level ECE 3077 Minimum Grade of D or
Undergraduate Semester level ISYE 2027 Minimum Grade of D)
and
(Undergraduate Semester level CS 1371 Minimum Grade of C or
Undergraduate Semester level CS 1372 Minimum Grade of C or
Undergraduate Semester level CX 4010 Minimum Grade of C or
Undergraduate Semester level CX 4240 Minimum Grade of C)

Auditing & Pass/Fail

Due to the class size, I am not offering the auditing or pass/fail options this semester.

Previous offerings

See https://poloclub.github.io/#cse6242 for all past course offerings.

Acknowledgements & Related Classes

We thank Amazon's AWS in Education grant program for providing support for Amazon Web Services.
Tableau's data visualization software is provided through the Tableau for Teaching program.

Many thanks to my colleagues for sharing their course materials:
Prof. John Stasko - Information Visualization - Fall 2012
Prof. Jeff Heer - Research Topics in Interactive Data Analysis - Spring 2011
Prof. Christos Faloutsos - Multimedia Databases and Data Mining - Fall 2012