An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

doi:10.3390/electronics12143021

Sangkwon Lee, Syed Asif Raza Shah, Woojin Seok, Jeonghoon Moon, Kihyeon Kim, Syed Hasnain Raza Shah

Electrical and Electronic Engineering
Computer Networks and Communications
Hardware and Architecture
Signal Processing
Control and Systems Engineering

Deep learning is an increasingly popular technique for solving complex artificial intelligence (AI) problems. As datasets grow and deep learning models become more complex, training at large scale has become a significant challenge. For training large-scale models, the cloud can serve as a distributed HPC (high-performance computing) platform, with benefits in cost and flexibility. However, the network is one of the major performance barriers for distributed deep learning in a distributed HPC environment. Performance is often limited by heavy traffic, such as the many transfers that stochastic gradient descent requires for distributed communication. Many network studies in distributed deep learning address these problems, but most focus only on improving communication performance by applying new methods or algorithms, such as overlapping parameter synchronization to minimize communication delay, rather than considering the actual network. In this paper, we focus on the actual network, especially in a distributed HPC environment. In such an environment, if the cluster nodes performing a distributed deep learning task are assigned to different zones or regions (a zone being a set of an appropriate number of distributed HPC nodes), performance may degrade because of network delay. The proposed network optimization algorithm places distributed work in the same zone as far as possible to reduce network delay. Furthermore, nodes are scored on network metrics such as loss, delay, and throughput, gathered with network monitoring tools, to select the optimal node within the zone. Our proposal has been validated on the Kubernetes platform, an open source orchestrator for the automatic management and deployment of micro-services. The proposed scheduler improves the performance of distributed deep learning.
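The two ideas in the abstract, keeping workers in one zone where possible and ranking nodes by network metrics, can be illustrated with a minimal Python sketch. This is not the authors' scheduler: the `Node` type, the `score` and `place` functions, the equal weights, and the normalization bounds (`max_delay_ms`, `max_throughput_gbps`) are all hypothetical choices made for illustration, since the abstract does not give the scoring formula.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical per-node network measurements (not from the paper)."""
    name: str
    zone: str
    loss: float             # packet loss as a fraction in [0, 1]
    delay_ms: float         # measured delay in milliseconds
    throughput_gbps: float  # measured throughput in Gbit/s


def score(node, max_delay_ms=100.0, max_throughput_gbps=10.0,
          weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine loss, delay, and throughput into one value in [0, 1].

    Higher is better. The weights and bounds are assumed, not sourced.
    """
    w_loss, w_delay, w_tp = weights
    loss_term = 1.0 - min(node.loss, 1.0)
    delay_term = 1.0 - min(node.delay_ms / max_delay_ms, 1.0)
    tp_term = min(node.throughput_gbps / max_throughput_gbps, 1.0)
    return w_loss * loss_term + w_delay * delay_term + w_tp * tp_term


def place(nodes, workers):
    """Pick `workers` nodes, preferring a single zone.

    If some zone can host every worker, use the zone whose top-scoring
    nodes have the highest total score. Otherwise fill the largest zones
    first so the job spans as few zones as possible. Returns fewer nodes
    than requested if the cluster is too small.
    """
    by_zone = defaultdict(list)
    for node in nodes:
        by_zone[node.zone].append(node)
    for members in by_zone.values():
        members.sort(key=score, reverse=True)  # best node first

    fitting = [m for m in by_zone.values() if len(m) >= workers]
    if fitting:
        best = max(fitting, key=lambda m: sum(score(n) for n in m[:workers]))
        return best[:workers]

    chosen = []
    for members in sorted(by_zone.values(), key=len, reverse=True):
        chosen.extend(members[: workers - len(chosen)])
        if len(chosen) == workers:
            break
    return chosen
```

For example, given two nodes in zone `a` and one in zone `b`, `place(nodes, 2)` returns both zone `a` nodes, while `place(nodes, 3)` has to span both zones. In a real Kubernetes deployment, the same logic would live in a scheduler plugin or extender and would read zone membership from node labels rather than from a hand-built list.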
