This course has concluded. See https://poloclub.github.io/#cse6242 for all past course offerings.

CSE6242A,Q / CX4242A, Spring 2018
Data and Visual Analytics

Georgia Tech, College of Computing

4:30 - 5:45pm, Clough 152, Tue & Thu
Prof. Duen Horng (Polo) Chau

This course will introduce you to broad classes of techniques and tools for analyzing and visualizing data at scale. It emphasizes how to combine computation and visualization to perform effective analysis. We will cover methods from each side, as well as hybrid ones that combine the best of both worlds. Students will work in small teams to complete a project exploring novel approaches for interactive data & visual analytics.

Office Hours


Polo Chau: Tue, 3:30-4pm, Klaus 1324 (Polo's office), + FREE after-class coffee, at Clough Starbucks

All TA office hours are held in the open area outside Polo's office.
Neetha Ravishankar: Tue, 12-1pm
Jennifer Ma (Head TA): Mon, 1-2pm
Mansi Mathur: Mon, 1-2pm
Arathi Arivayutham: Wed, 11am-12pm
Vineet Vinayak Pasupulety: Wed, 11am-12pm
Siddharth Gulati: Tue, 12-1pm

Announcements and Discussion

We use Piazza for announcements and discussion.

Everyone must join this class's Piazza, at https://piazza.com/gatech/spring2018/cse6242aqcx4242a.

Double check that you are joining the right Piazza!

When you have questions about class, homework, project, etc., post your questions there. Our teaching staff and your fellow classmates will help answer them quickly. You can also use Piazza to find project teammates.

T-square will only be used for submission of assignments and projects.

While we welcome everyone to share their experiences in tackling issues and helping each other out, please do not post your answers, as that may affect the learning experience of your fellow classmates.

Class Schedule and Lecture Slides !!! Evolving !!!

* Publication-quality figures
Wk Dates Topics Tue Thu Events (eastern time)
1 Jan 9, 11 * Course introduction
* Big data analytics building blocks
* Data Collection
intro building blocks, buzz words, data collection  
2 16, 18 * simple storage (SQLite)
* Data cleaning
* Class Project overview; Heilmeier questions
* GT Github; one drive
SQLite, Data cleaning project overview; Github HW1 out
3   23, 25 * Example projects:
(1) Firebird: Predicting Fire Risks in Atlanta, by Shang-Tse Chen
(2) PASSAGE: A Travel Safety Assistant, by Nilaksh Das
* Data integration: knowledge graph; data reconciliation/de-duplication; similarity functions
Firebird, PASSAGE Data integration
4 Jan 30, Feb 1 * Visualization 101
* Fixing common visualization issues
vis101 vis fix HW1 due (Fri, 2/2,11:55pm)
5 6, 8 * Fixing presentation issues
* Data visualization for the web (D3)
D3 Dr. Kevin Roundy, Symantec Research Labs
pub-quality figures
Form project teams by Fri, 2/9;
HW2 out
6 13, 15 * Data analytics concepts & tasks
* Overview of project proposal and presentation
* Scaling up: Hadoop, Pig, Hive
analytics concepts, project proposal and presentation Hadoop, Pig, Hive
7   20, 22 * Scaling up: Spark, Spark SQL
* Scaling up: HBase
spark hbase
8 Feb 27, Mar 1 * Classification key concepts, k-NN, cross validation
classification cont'd HW2 due (Wed, 2/28, 11:55pm);
HW3 out
9 6, 8 Project proposal presentations Show time! Show time! Project proposal & slides due (Mon, 3/5, 11:55pm)
10 13, 15 * Ensemble method, bagging, random forests
* Classification: decision tree, vis (ROC, AUC, confusion matrix)
* Clustering: k-means, hierarchical clustering, DBSCAN
* Clustering vis
* Graph analytics
  • How to build and store graphs
  • Basics; power laws
classification vis, clustering vis, random forests graph basics
11   20, 22 Spring break X X  
12 27, 29 * Graph analytics
  • Centrality
  • Algorithms; (personalized) PageRank
  • Interactive applications; Apolo
  • Evaluating apps (user eval)
* Memory-mapping/virtual memory to scale up algorithms
graph centrality & algorithms mmap HW3 due (Fri, 3/30, 11:55pm)
HW4 out
13 Apr 3, 5 * Text analytics: concepts
text analytics canceled Proj progress report due (Fri, 4/6, 11:55pm)
14 10, 12 * Text analytics: algorithms (LSI=SVD)
* Time series: algorithms, visualization, & applications
cont'd time series: basics, linear forecast
15 17, 19 * Closing words
* Lessons learned
time series: non-linear forecast, vis course review; 10 lessons learned HW4 due (Fri, 4/20, 11:55pm)
16   24 * Project poster presentations Poster presentation. 4:30pm to 5:45pm-ish. Klaus Atrium. Pizza + drinks served! X Proj final report due (Tue, 4/24, 11:55pm)

Homework (50% of grade)

The fastest way to get help with homework assignments is to post your questions on Piazza. If you prefer to address your questions only to our TAs and the instructor, you can use the private post feature (i.e., select the "Individual Student(s) / Instructor(s)" radio button).
While collaboration is allowed for homework assignments, each student must write up their own answers. All GT students must observe the honor code. Any suspected plagiarism and academic misconduct will be reported and directly handled by the Office of Student Integrity (OSI).
We have 4 assignments in total. Tentative

Project (50% of grade)

See project description. See the schedule table above for deliverable due dates.

Late Submissions Policy

Distance Learning Sections (Q & Q3)

A standard 3-day lag applies to all homework and project deliverables. For the project presentation, a group that has a DL student member can choose to:
  1. Present in class without the 3-day lag; or
  2. Submit a video presentation (e.g., screen capture) with the 3-day lag

Dataset Ideas (may need API, or scraping)

Reading Materials & Resources

This course does NOT have any required textbook. We recommend the following books and resources.

Data science, machine learning, data mining

Visualization

SQL

Probability

Human Computation

Office of Disability Services

The Office of Disability Services offers accommodations for students with disabilities. Please contact the office should you need help.

Prerequisites & Expectation

For both CSE 6242 (grad) and CX 4242 (undergrad)

Students are expected to complete significant programming assignments (homework, project) that may involve higher-level languages or scripting (e.g., Java, R, Matlab, Python, C++).

Some assignments may involve web programming and D3 (e.g., JavaScript, CSS).

You are expected to quickly learn many new things. For example, an assignment on Hadoop programming may require you to learn some basic Java and Scala quickly, which should not be too challenging if you already know another high-level language like Python or C++. Please make sure you are comfortable with this.

Please take a look at the assignments (homework and project) of the previous offerings of this course, which will give you some idea about the difficulty level of the assignments.

Basic knowledge of linear algebra and probability is expected.

Additional formal prerequisites for CSE 6242

None, but you should have taken courses similar to those listed in the next section, at Georgia Tech or at another school.

Additional formal prerequisites for CX 4242

(Undergraduate Semester level MATH 2605 Minimum Grade of D or
Undergraduate Semester level MATH 2401 Minimum Grade of D or
Undergraduate Semester level MATH 24X1 Minimum Grade of D)
and
(Undergraduate Semester level MATH 3215 Minimum Grade of D or
Undergraduate Semester level MATH 3225 Minimum Grade of D or
Undergraduate Semester level ECE 3077 Minimum Grade of D or
Undergraduate Semester level ISYE 2027 Minimum Grade of D)
and
(Undergraduate Semester level CS 1371 Minimum Grade of C or
Undergraduate Semester level CS 1372 Minimum Grade of C or
Undergraduate Semester level CX 4010 Minimum Grade of C or
Undergraduate Semester level CX 4240 Minimum Grade of C)

Auditing & Pass/Fail

Due to the class size, I am not offering the auditing or pass/fail options this semester.

Previous offerings

See https://poloclub.github.io/#cse6242 for all past course offerings.

Acknowledgement & Related Classes

We thank Amazon's AWS in Education grant program for providing support for Amazon Web Services.
Tableau's data visualization software is provided through the Tableau for Teaching program.

Many thanks to my colleagues for sharing their course materials: