This course will introduce you to broad classes of techniques and tools for analyzing and visualizing data at scale.
It emphasizes how to combine computation and visualization to perform effective analysis.
We will cover methods from each side, and hybrid ones that combine the best of both worlds.
Students will work in small teams to complete a research project exploring novel approaches for interactive data & visual analytics.
Piazza Discussion Forum
We will use Piazza for discussion (e.g., homework, project) and for ALL announcements.
Post your questions there, and the teaching staff and your fellow classmates will be able to help answer them quickly.
You can also use Piazza to find project teammates.
T-square will ONLY be used for submission of assignments and projects.
Video recordings of the lectures are available at http://gtcourses.gatech.edu
||* Course introduction
* Big data analytics building blocks, data collection, and simple storage (SQLite)
||* Data cleaning & integration
* Toward Coupling Cognition and Computation through Interactive Interfaces by Prof. Alex Endert
|Guest lecture by Prof. Alex Endert
||HW1 out (Tue)
* Visualization fundamentals
* Data visualization for the web (D3)
||lectures by Chad Stolper
* Data Mining Concepts & Tasks
* Visualization DOs and DON'Ts; Heilmeier Questions
||HW1 due (Mon)
* Deep learning
* Graph analytics
- how to build and store graphs
- basics; power laws; centrality
- graph statistics and how to compute them
|Guest lecture by Josh Patterson. Slides deck #1, deck #2
||HW2 out (Thu)
||* Graph analytics
- graph algorithms
- interactive tools
||Same as Tue
|| * Scaling up (Hadoop, Pig, HBase, Hive)
||HW2 due (Wed). Form proj team (Fri)
||* Scaling up (Spark, Spark SQL, etc)
* Dimensionality Reduction: techniques, visualization, practitioner's guide
||Proj proposal due (Fri)
* Classification (techniques, visualization & interaction)
|No class (Fall student recess)
||Project proposal presentations
||HW3 out (Mon)
||* Ensemble Methods
Guest lecture by Prof. Jimeng Sun
* Text analytics: concepts
||HW3 due (Wed)
||* Text analytics: algorithms (LSI=SVD)
* Time series: algorithms
||Proj progress report due (Mon), HW4 out (Tue)
||* Time series: algorithms, visualization, & applications
* Human Computation
||HW4 due (Mon)
||* Closing words and course overview
* Project poster presentations
||Poster session: 9:30-12
||Proj final report due (Fri)
- 40% Homework
- 50% Project
- 10% Class and Piazza participation
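As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is first expressed as a percentage from 0 to 100; the function and dictionary names are hypothetical, and the actual grade computation may differ (e.g., rounding or curving).

```python
# Component weights in percent, taken from the grading breakdown above.
WEIGHTS = {"homework": 40, "project": 50, "participation": 10}


def course_score(scores):
    """Return the weighted overall percentage.

    `scores` maps each component name to a percentage in [0, 100].
    Assumption: a plain weighted average with no rounding or curving.
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = course_score({"homework": 90, "project": 80, "participation": 100})
print(example)
```

With these example inputs, homework contributes 36 points, the project 40, and participation 10.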
Late Submissions Policy
- Strictly enforced.
- For homework assignments (applies to HW2 and onwards): 10% deduction for every 24 hours of delay. (e.g., deduct 9 points per 24 hours, for a 90-point homework.)
- For project deliverables (excluding presentations), 5% deduction for every 24 hours of delay.
- No penalties for medical reasons or emergencies. You must submit a doctor's note or an official letter explaining the emergency.
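The deduction rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the policy does not say how a partial 24-hour period is counted, so the sketch treats any started period as a full one, and it caps the deduction at the deliverable's maximum points. The function name and parameters are hypothetical, and exceptions (medical reasons, emergencies, HW1) are not modeled.

```python
import math


def late_penalty(max_points, hours_late, percent_per_period=10):
    """Return the points deducted for a late submission.

    percent_per_period: 10 for homework, 5 for project deliverables.
    Assumption: any started 24-hour period counts as a full period.
    Assumption: the deduction never exceeds max_points.
    """
    if hours_late <= 0:
        return 0.0
    periods = math.ceil(hours_late / 24)
    deduction = periods * percent_per_period * max_points / 100
    return min(float(max_points), deduction)


# A 90-point homework submitted exactly 24 hours late loses 9 points,
# matching the example in the policy.
print(late_penalty(90, 24))
# A 100-point project deliverable 30 hours late (two started periods at 5%).
print(late_penalty(100, 30, percent_per_period=5))
```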
Distance Learning Sections (Q & Q3)
A standard 3-day lag applies to all homework and project deliverables.
For project presentations, a group that has a DL student member can choose to:
- present in class without the 3-day lag, or
- submit a video presentation (e.g., a screen capture) with the 3-day lag
We plan to have 4 assignments in total.
All assignments should be submitted via T-Square.
If you have questions about the assignments, the fastest way for us to help is for you to post them on Piazza.
If you prefer that your question be addressed only to our TAs and the instructor, you can use the private post
feature (i.e., select the "Individual Student(s) / Instructor(s)" radio button).
While collaboration is allowed, each student *must* write up their own answers.
All GT students must observe the honor code.
See project description
See the schedule table above for deliverable due dates.
Dataset Ideas (may need API, or scraping)
- Google public datasets. Thanks Revant!
- NYC Taxi data for 2013 (FOILed by Chris Wong).
2013 Trip Data (11.0GB). 2013 Fare Data (7.7GB).
Visualization for a day's trips. Thanks Jitesh.
- Large datasets publicly available. Thanks Gopi!
- Georgia Tech's campus data (has APIs): bus info, directory, building, T-square, room reservation, building facilities usage (e.g., electricity, lights, A/C, etc.), Oscar/course info/registration, etc.
- Yahoo WebScope
- Data.gov: U.S. Government's open data
- Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
- Trulia, Zillow: real estate listing sites
- Numerous graph datasets (large and small): SNAP, Konect
- Movies data: Rotten Tomatoes, IMDB
- List of lists of datasets for recommendations.
- Million song dataset by Echo Nest.
It contains not only basic information about each song (artist, genre, year, length, etc.), but also some musical features (such as tempo, pitch, key, and brightness).
- Dataset about soccer games, players, clubs.
No API, but easy to scrape.
For a soccer player: transfer history, performance, nationality, birth date, etc.
For a soccer club: performance, squad, etc.
- Quandl: a dataset search engine for time-series data.
- A collection of links to various datasets.
- UCI also has a collection of links to various datasets sorted for various tasks (classification, regression, etc.).
- Amazon AWS Public Data Sets (Thanks Jonathan!)
- KDD Cup: annual competition in data mining, like Kaggle
- Academic domain: Microsoft Academic Search, DBLP
- Retrosheet: MLB statistics (game/play logs)
- Various geophysical datasets for the oceans (magnetism, gravity, seismology, etc.).
- Social trends (Thanks Jonathan!)
- Beer data (Thanks Jonathan!)
- Academic torrents (terabytes) (Thanks Vaibhav!)
- Article Search API from the New York Times (all the way back to 1851!) (Thanks Guido!)
- (Kayak: flight, hotel, car, etc.)
Auditors must first obtain the instructor's permission,
then enroll in the course.
Auditors must attend all lectures and may optionally complete the assignments.
Textbooks and reading materials
- None required.
- Highly recommended good reads:
For both CSE 6242 (grad) and CX 4242 (undergrad)
Students are expected to complete significant programming assignments (homework, project) that may involve higher-level languages or scripting (e.g., Java, R, Matlab, Python, C++).
You are expected to quickly learn many new things. For example, an assignment on Hadoop programming may require you to learn some basic Java and Scala quickly, which should not be too challenging if you already know another high-level language like Python or C++.
Please make sure you are comfortable with this.
Please take a look at the assignments (homework and project) of the previous offerings of this course, which will give you some idea about the difficulty level of the assignments.
Basic knowledge of linear algebra and probability is expected.
Additional formal prerequisites for CSE 6242
None, but you should have taken courses similar to those listed in the next section, at Georgia Tech or at another school.
Additional formal prerequisites for CX 4242
(Undergraduate Semester level MATH 2605 Minimum Grade of D or
Undergraduate Semester level MATH 2401 Minimum Grade of D or
Undergraduate Semester level MATH 24X1 Minimum Grade of D) or
(Undergraduate Semester level MATH 3215 Minimum Grade of D or
Undergraduate Semester level MATH 3225 Minimum Grade of D or
Undergraduate Semester level ECE 3077 Minimum Grade of D or
Undergraduate Semester level ISYE 2027 Minimum Grade of D)
(Undergraduate Semester level CS 1371 Minimum Grade of C or
Undergraduate Semester level CS 1372 Minimum Grade of C or
Undergraduate Semester level CX 4010 Minimum Grade of C or
Undergraduate Semester level CX 4240 Minimum Grade of C)
for all past course offerings.
Acknowledgements & Related Classes
We thank Amazon's AWS in Education grant program for providing support for Amazon Web Services.
Tableau's data visualization software is provided through the Tableau for Teaching program.
Many thanks to my colleagues for sharing their course materials:
Prof. John Stasko - Information Visualization - Fall 2012
Prof. Jeff Heer - Research Topics in Interactive Data Analysis - Spring 2011
Prof. Christos Faloutsos - Multimedia Databases and Data Mining - Fall 2012