Scaling Pandas with Ray and Modin + Alexa AI: Kubernetes and DeepSpeed Zero

Published on Jun 20, 2022

Talk #0: Introductions and Meetup Announcements, by Chris Fregly and Antje Barth

Talk #1: Modin - Speed up your Pandas workflows by changing a single line of code

by Alejandro Herrera, Solution Architect at Ponder

Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.

GitHub: https://github.com/modin-project/modin
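
The "single line of code" claim refers to swapping the pandas import for Modin's. A minimal sketch (assumes `pip install "modin[ray]"`; it falls back to stock pandas so the snippet still runs where Modin is not installed):

```python
# One-line change: import Modin's pandas module instead of pandas.
# The fallback below is only so this sketch runs without Modin installed.
try:
    import modin.pandas as pd  # drop-in replacement; distributes work across cores
except ImportError:
    import pandas as pd        # single-threaded fallback

df = pd.DataFrame({"a": range(1000)})
total = int(df["a"].sum())  # identical pandas API either way
```

Everything downstream of the import stays the same, which is what makes Modin a drop-in replacement.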

Talk #2: Optimizing large-scale, distributed training jobs using Nvidia GPUs, DeepSpeed ZeRO, and Kubernetes on AWS

by Justin Chiu, Software Engineer at Amazon Alexa AI

Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. These models contain billions - or even trillions - of parameters. Training them within a reasonable amount of time requires very large computing clusters, often with GPUs, and communication between the GPUs must be carefully managed to avoid performance bottlenecks.

In this talk, we will discuss techniques for optimizing large-scale training jobs on cloud-based hardware using Nvidia GPUs and Kubernetes on AWS. The following steps will be covered:

(1) [Basic infrastructure] Profile NCCL bandwidth to confirm you are getting ~100 Gbps all-reduce bandwidth on p3dn and ~350 Gbps on p4d. This confirms that your EKS-EFA setup (https://github.com/aws-samples/aws-ef...) is correct, along with other important EKS/EC2 settings such as cluster placement groups. See https://github.com/NVIDIA/nccl-tests for how to run the benchmark.
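
A small sketch of the sanity check this step describes: compare the bus bandwidth reported by nccl-tests against the rough per-instance targets quoted above. The threshold table and helper name are illustrative assumptions, not part of nccl-tests.

```python
# Hypothetical helper: check a measured all-reduce bus bandwidth
# (as reported by nccl-tests, converted to Gbps) against the rough
# targets quoted above. Instance names and tolerance are assumptions.
EXPECTED_GBPS = {"p3dn.24xlarge": 100, "p4d.24xlarge": 350}

def bandwidth_ok(instance_type, measured_gbps, tolerance=0.9):
    """True if the measurement reaches at least 90% of the target."""
    return measured_gbps >= EXPECTED_GBPS[instance_type] * tolerance

ok = bandwidth_ok("p4d.24xlarge", 340)       # within tolerance
bad = bandwidth_ok("p3dn.24xlarge", 50)      # well below target
```

Falling far below these numbers usually points at a misconfigured EFA setup or missing placement group rather than at the training code.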

(2) [Training code and DNN framework settings] Once the above is done, confirm that the training throughput, measured in TFLOPS/GPU or samples/sec, matches expectations. What to expect depends on the model size, the input batch size, and the hardware.
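
One way to form that expectation is the common rule of thumb of roughly 6 FLOPs per parameter per token for a forward plus backward pass of a dense transformer. A sketch with hypothetical numbers (the model size, token rate, and GPU count below are illustrative, not from the talk):

```python
def tflops_per_gpu(n_params, tokens_per_second, n_gpus):
    """Estimate achieved training throughput per GPU using the
    rough ~6 FLOPs per parameter per token rule of thumb
    (forward + backward pass of a dense transformer)."""
    total_flops = 6 * n_params * tokens_per_second
    return total_flops / n_gpus / 1e12

# Hypothetical example: a 13B-parameter model processing
# 100,000 tokens/sec across 64 GPUs.
est = tflops_per_gpu(13e9, 100_000, 64)  # ~122 TFLOPS/GPU
```

Comparing such an estimate against the hardware's peak (e.g. an A100's mixed-precision peak) tells you what fraction of the machine you are actually using.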

Note: If (2) is successful, you're done. If not, revisit (1) and optimize the NCCL bandwidth to help isolate the problem.

References:

https://www.amazon.science/blog/makin...
https://github.com/aws-samples/aws-ef...
https://github.com/NVIDIA/nccl-tests


RSVP Webinar: https://www.eventbrite.com/e/webinark...

Zoom link: https://us02web.zoom.us/j/82308186562

Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
