Exploring the Scalability of Deep Learning on GPU Clusters

Abstract

In recent years, we have observed an unprecedented rise in the popularity of AI-powered systems. They have become ubiquitous in modern life, used by countless people every day. Many of these systems are powered, entirely or partially, by deep learning models. From language translation to image recognition, deep learning models are being used to build systems of remarkable accuracy. The primary downside is the significant time required to train these models. Fortunately, training time can be reduced substantially by using GPUs rather than CPUs. However, with model complexity ever increasing, training times even on GPUs are on the rise. One possible solution is parallelization: distributing the training of a model across a cluster of GPUs. This thesis investigates how to utilise clusters of GPU-accelerated nodes to achieve the best scalability possible, thus minimising model training times.
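The distributed training described above is commonly done with synchronous data parallelism, the scheme Horovod implements on top of TensorFlow via ring-allreduce. The following is a conceptual sketch only, not code from the thesis: workers are simulated in-process, the toy linear model and all names are illustrative assumptions, and the allreduce step is replaced by a simple in-memory average.

```python
# Conceptual sketch of synchronous data-parallel training (the scheme
# Horovod implements with ring-allreduce). Workers are simulated
# in-process here; real distributed training would run one process per
# GPU and average gradients over the network. The linear model, data,
# and hyperparameters are illustrative assumptions, not from the thesis.

def gradient(w, xs, ys):
    # Gradient of mean squared error for a 1-D linear model y = w * x.
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def allreduce_mean(grads):
    # Stand-in for the allreduce step: average the per-worker gradients.
    return sum(grads) / len(grads)

def train(xs, ys, num_workers=4, lr=0.01, steps=200):
    w = 0.0
    # Shard the dataset across workers (data parallelism).
    shards = [(xs[i::num_workers], ys[i::num_workers])
              for i in range(num_workers)]
    for _ in range(steps):
        # Each worker computes a gradient on its own shard ...
        grads = [gradient(w, sx, sy) for sx, sy in shards]
        # ... then every worker applies the same averaged update,
        # so all replicas stay in sync after each step.
        w -= lr * allreduce_mean(grads)
    return w

if __name__ == "__main__":
    xs = [float(i) for i in range(1, 9)]
    ys = [3.0 * x for x in xs]      # ground truth: w = 3
    print(train(xs, ys))            # converges toward 3.0
```

Because every worker sees only a shard of the data but applies the identical averaged update, each step is mathematically equivalent to one full-batch step, which is why this scheme can scale across many GPUs without changing the optimization trajectory.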

Author Keywords: Compute Canada, Deep Learning, Distributed Computing, Horovod, Parallel Computing, TensorFlow

    Item Description
    Contributors
    Creator (cre): Williams, Taylor Alan
    Thesis advisor (ths): McConnell, Sabine
    Degree committee member (dgc): Hurley, Richard
    Degree granting institution (dgg): Trent University
    Date Issued
    2019
    Place Published
    Peterborough, ON
    Extent
    131 pages
    Rights
    Copyright is held by the author, with all rights reserved, unless otherwise noted.
    Local Identifier
    TC-OPET-10633
    Publisher
    Trent University
    Degree