Broader Impacts of this Project
-
Faster I/O Operations
-
Less Overhead
-
Simpler Code
Large datasets in terabytes or petabytes are increasingly common, calling for new kinds of scalable machine learning approaches. While state-of-the-art techniques often use complex designs, specialized methods to store and work with large datasets, this project proposes a minimalist approach that forgoes such complexities, by leveraging the fundamental virtual memory capability found on all modern operating systems, to load into the virtual memory space the large datasets that are otherwise too large to fit in the computer's main memory. This main idea will allow developers to easily work with large datasets as if they were in-memory data, enabling them to create machine learning software that is significantly easier to develop and maintain, yet faster and more scalable. Developers will achieve higher work efficiency and make fewer programming errors; companies will reduce operating costs; and researchers will innovate methodology without getting bogged down by implementation details and scalability concerns. The proposed ideas could make a far-reaching impact on industry and academia, in science, education, and technology, as they face increasing challenges in applying machine learning on large datasets. The proposed ideas will also help train the next generation of scientists and engineers by allowing students to learn to work with large datasets in significantly simpler ways. As virtual memory is universally available on modern devices and operating systems, the proposed ideas will also work on mobile, low-power devices, enabling them to perform computation at unprecedented scales and speed.
Research Goals
This project investigates a fundamental, radical way to scale up machine learning algorithms based on virtual memory, one that may be easier to code and maintain, but currently under-utilized in by both single-machine and multi-machine distributed approaches. This research aims to develop deep understanding of this radical idea, its benefits and limitations, and to what extent these results apply in various settings, with respect to datasets, memory sizes, page sizes (e.g., from the default 4KB to the jumbo 2MB pages that enable terabyes of virtual memory space), and architectures (e.g., testing on distributed shared memory file systems like Lustre that support paging and virtual memory over large computer clusters). The researchers will build on their preliminary work on graph algorithms that already demonstrates significant speed-up over state-of-the-art approaches; they will extend their approach to a wide range of machine learning and data mining algorithms. They will also develop mathematical models and systematic approaches to profile and predict algorithm performance and energy usage based on extensive evaluation across platforms, datasets, and languages.
Significant speed-up over other single-machine approaches
Works for datasets as large as 200GB, at speed comparable to 8-node Spark
Publications
Most Relevant Papers
M-Flash Fast Billion-scale Graph Computation Using a Bimodal
Block Processing Model.
Hugo Gualdron, Robson Cordeiro, Jose Rodrigues, Duen Horng (Polo) Chau, Minsuk Kahng, U Kang. PKDD 2016. Sept 19–23, 2016. Riva del Garda, Italy.
M3: Scaling Up Machine Learning via Memory Mapping.
Dezhi (Andy) Fang, Duen Horng (Polo) Chau.
SIGMOD/PODS’16.
MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping.
Zhiyuan Lin, Minsuk Kahng, Kaeser Md. Sabrin, Duen Horng Chau, Ho Lee, and U Kang.
Proceedings of IEEE BigData 2014 conference.
Oct 27-30, Washington DC, USA.
Towards Scalable Graph Computation on Mobile Devices.
Yiqi Chen, Zhiyuan Lin, Robert Pienta, Minsuk Kahng, Duen Horng (Polo) Chau.
IEEE BigData 2014 Workshop on Scalable Machine Learning: Theory and Applications.
Oct 27, 2014. Washington DC, USA.
Related Papers
VISAGE: Interactive Visual Graph Querying.
Robert Pienta, Acar Tamersoy, Alex Endert, Shamkant B. Navathe, Hanghang Tong, Duen Horng (Polo) Chau
International Working Conference on Advanced Visual Interfaces (AVI 2016).
Dissertations
Robert Pienta. CSE. Spring, 2013 - present. Proposed: Apr 6, 2016.
Thesis:
Adaptive Network Exploration and Interactive Querying
Code
MMap Code (BigData'14)
We have implemented several algorithms using memory mapping. Our preliminary executable jar can be downloaded here and will be available on GitHub soon.
In the current executable jar, we have included:
- Connected Components
- PageRank
- Triangle Counting
- One-Step Neighbors
- Two-Step Neighbors
How to run the Jar:
- Please make sure Java 7 or higher is installed
- Download the Jar file using the button below and unzip it
- Carefully read README.txt contained in the folder
- Launch a command line window, change the directory that contains the Jar file
- Follow the instructions in README.txt to run algorithms
Below, we have included the links to download some large graph datasets and their binary versions (if allowed) on which the algorithms can run on.
The source code and executable jar are being distributed under MIT License
Download MMap Executable Jar
M-Flash Code (PKDD'16)
C++ code and documentation available at
M-Flash GitHub repository.
Large Graph Datasets
Below are some datasets we have used for evaluating our approach.
| #Node |
#Edge |
Source text file |
Binary edge file |
LiveJournal |
4,847,571 |
68,993,773 |
Link
|
Download |
Twitter |
41,652,230 |
1,468,365,182 |
Link
|
Download
|
YahooWeb |
1,413,511,391 |
6,636,600,779 |
Link (requires Yahoo's NDA)
|
Unavailable due to NDA |
People
This project's researchers and contributors include:
U Kang
Seoul National University, South Korea
Funding Agency and Acknowledgement
This material is based upon work supported by the National Science Foundation under Grant No. IIS-1408287. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.