Ammar Ahmad Awan
Ammar Ahmad Awan
Microsoft
Verified email at osu.edu - Homepage
Title
Cited by
Cited by
Year
S-caffe: Co-designing mpi runtimes and caffe for scalable deep learning on modern gpu clusters
AA Awan, K Hamidouche, JM Hashmi, DK Panda
ACM PPoPP '17 52 (8), 193-205, 2017
1392017
An in-depth performance characterization of CPU-and GPU-based DNN training on modern architectures
AA Awan, H Subramoni, DK Panda
Proceedings of the Machine Learning on HPC Environments, 1-8, 2017
442017
Privacy-aware searching with oblivious term matching for cloud storage
Z Pervez, AA Awan, AM Khattak, S Lee, EN Huh
The Journal of Supercomputing 63 (2), 538-560, 2013
422013
Efficient large message broadcast using NCCL and CUDA-aware MPI for deep learning
AA Awan, K Hamidouche, A Venkatesh, DK Panda
Proceedings of the 23rd European MPI Users' Group Meeting, 15-22, 2016
412016
Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?
AA Awan, CH Chu, H Subramoni, DK Panda
Proceedings of the 25th European MPI Users' Group Meeting, 1-9, 2018
362018
Scalable distributed dnn training using tensorflow and cuda-aware mpi: Characterization, designs, and performance evaluation
AA Awan, J Bédorf, CH Chu, H Subramoni, DK Panda
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid …, 2019
302019
Cuda kernel based collective reduction operations on large-scale gpu clusters
CH Chu, K Hamidouche, A Venkatesh, AA Awan, DK Panda
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid …, 2016
202016
OC-DNN: Exploiting advanced unified memory capabilities in CUDA 9 and volta GPUs for out-of-core DNN training
AA Awan, CH Chu, H Subramoni, X Lu, DK Panda
2018 IEEE 25th International Conference on High Performance Computing (HiPC …, 2018
192018
Designing non-blocking personalized collectives with near perfect overlap for rdma-enabled clusters
H Subramoni, AA Awan, K Hamidouche, D Pekurovsky, A Venkatesh, ...
International Conference on High Performance Computing, 434-453, 2015
172015
Efficient and scalable multi-source streaming broadcast on GPU clusters for deep learning
CH Chu, X Lu, AA Awan, H Subramoni, J Hashmi, B Elton, DK Panda
2017 46th International Conference on Parallel Processing (ICPP), 161-170, 2017
162017
Exploiting GPUDirect RDMA in designing high performance OpenSHMEM for NVIDIA GPU clusters
K Hamidouche, A Venkatesh, AA Awan, H Subramoni, CH Chu, ...
2015 IEEE International Conference on Cluster Computing, 78-87, 2015
142015
Intercloud message exchange middleware
MB Amin, WA Khan, AA Awan, S Lee
Proceedings of the 6th International Conference on Ubiquitous Information …, 2012
142012
Nv-group: link-efficient reduction for distributed deep learning on modern dense gpu systems
CH Chu, P Kousha, AA Awan, KS Khorassani, H Subramoni, DK Panda
Proceedings of the 34th ACM International Conference on Supercomputing, 1-12, 2020
122020
Performance characterization of dnn training using tensorflow and pytorch on modern clusters
A Jain, AA Awan, Q Anthony, H Subramoni, DKDK Panda
2019 IEEE International Conference on Cluster Computing (CLUSTER), 1-11, 2019
122019
Exploiting hardware multicast and GPUDirect RDMA for efficient broadcast
CH Chu, X Lu, AA Awan, H Subramoni, B Elton, DK Panda
IEEE Transactions on Parallel and Distributed Systems 30 (3), 575-588, 2018
102018
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects
AA Awan, A Jain, CH Chu, H Subramoni, DK Panda
IEEE Micro 40 (1), 35-43, 2019
92019
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters
K Hamidouche, A Venkatesh, AA Awan, H Subramoni, CH Chu, ...
Parallel Computing 58, 27-36, 2016
82016
GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training
A Jain, AA Awan, AM Aljuhani, JM Hashmi, QG Anthony, H Subramoni, ...
SC20: International Conference for High Performance Computing, Networking …, 2020
72020
Scaling tensorflow, pytorch, and mxnet using mvapich2 for high-performance deep learning on frontera
A Jain, AA Awan, H Subramoni, DK Panda
2019 IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS), 76-83, 2019
72019
Optimized large-message broadcast for deep learning workloads: MPI, MPI+ NCCL, or NCCL2?
AA Awan, KV Manian, CH Chu, H Subramoni, DK Panda
parallel computing 85, 141-152, 2019
72019
The system can't perform the operation now. Try again later.
Articles 1–20