Meta PyTorch Team 2024 H2 Roadmaps - PyTorch Developer Mailing List

Excerpt

We’ve been thinking about how to share the roadmaps for the work we are doing on PyTorch here at Meta. We do planning on a half-year basis, so these are public versions of our 2024 H2 OSS plans for a number of key areas within PyTorch:

  • Compiler Core
  • Compiler Deployment
  • Distributed
  • Core Libraries
  • Core Performance
  • Developer Infrastructure
  • torchtune
  • TorchRec
  • Torchvision
  • Edge
  • DataLoading


Great Roadmap! I’m very excited about [O2] in the Distributed section.

  • From what I’ve seen so far, FSDP scales very well at the 512/1k GPU scale (following LLM Foundry’s blogs). Since Meta is training on a 24k GPU cluster, do you think FSDP will scale well out of the box on clusters (with good interconnect) outside of Meta, or will additional customization be required for distributed training, for example, to scale beyond a 1k cluster?
  • Will you also support async checkpointing natively?
  • Does PyTorch have any plans to integrate changes from transformer_engine, as was done for FSDP, TP, and PP with DeepSpeed and/or Megatron-LM?

Sorry if these features are already available and I’m unaware of them. Additionally, what is 5DParallel = HSDP2 + Async SP + CP + PP? What does HSDP2 stand for? Is there any link or documentation available?

I’m not so excited about the torchtune work. I think this is too small a problem for the PyTorch team to solve (of course, it has the most impact and clout).

Typically for >1k GPUs, we start looking at composing with model-parallelism techniques like tensor parallel and pipeline parallel to scale to a higher number of GPUs. These combinations are notationally expressed as 2D/3D parallel.
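To make the 2D/3D notation concrete, here is a minimal pure-Python sketch (not PyTorch's actual DeviceMesh API) of how a global rank decomposes into coordinates on a 3D mesh of pipeline, data-parallel, and tensor-parallel dimensions. The mesh shape and axis ordering are illustrative assumptions.

```python
from itertools import product

def rank_to_coords(rank, mesh_shape):
    """Map a global rank to (pp, dp, tp) coordinates on a 3D device mesh.

    mesh_shape is (pp, dp, tp): pipeline stages x data-parallel replicas
    x tensor-parallel shards, with tp varying fastest, so adjacent ranks
    share a tensor-parallel group (matching the common layout where TP
    stays within a node's fast interconnect).
    """
    pp, dp, tp = mesh_shape
    assert 0 <= rank < pp * dp * tp
    return (rank // (dp * tp), (rank // tp) % dp, rank % tp)

# 24 GPUs arranged as 2 pipeline stages x 3 data-parallel replicas
# x 4 tensor-parallel shards (an illustrative shape, not Meta's):
mesh = (2, 3, 4)
coords = [rank_to_coords(r, mesh) for r in range(24)]
assert coords == list(product(range(2), range(3), range(4)))
```

Each mesh dimension corresponds to one process group per slice: ranks sharing (pp, dp) form a TP group, and so on, which is what "composing" the parallelisms means in practice.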

We support native async distributed checkpointing as of the PyTorch 2.4 release. You can also look at examples in GitHub - pytorch/torchtitan: A native PyTorch Library for large model training, which is our reference framework for large-scale distributed training, including 2D and 3D parallel.
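The idea behind async checkpointing can be sketched in a few lines of plain Python (this is not the `torch.distributed.checkpoint` API, just the general staging-then-persist pattern: take a cheap in-memory snapshot synchronously, then write it to storage on a background thread so the training loop is not blocked by I/O):

```python
import copy
import os
import pickle
import tempfile
import threading

def async_checkpoint(state_dict, path):
    """Snapshot the state synchronously (cheap), then persist it on a
    background thread. The deep copy matters: the live training state
    keeps mutating while the write is in flight, so the snapshot must
    not alias it."""
    snapshot = copy.deepcopy(state_dict)

    def _persist():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=_persist)
    t.start()
    return t  # join() before exiting or starting the next checkpoint

state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
fd, path = tempfile.mkstemp(suffix=".ckpt")
os.close(fd)
handle = async_checkpoint(state, path)
state["step"] = 101            # training continues while the write runs
handle.join()
with open(path, "rb") as f:
    restored = pickle.load(f)
assert restored["step"] == 100  # snapshot taken before the update
```

The real distributed implementation additionally coordinates the staging and persist phases across ranks, but the training-loop-side contract is the same: checkpointing overlaps with compute.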

We have FSDP + native FP8 integration in torchtitan now. This also composes with 2D parallel. We don't have a plan for TE yet.

This is just a notation we are using to denote multiple ways to split a model and compose them with various data parallel and model parallel techniques.

HSDP2 is the HSDP implementation with FSDP2 (per-parameter sharding version of FSDP). torchtitan would be the best place to look at actual code in action.
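As a rough illustration of what hybrid sharding means (a pure-Python sketch, not FSDP2's actual implementation), HSDP shards parameters across a small group of ranks and replicates across such groups. The group sizes below are illustrative assumptions:

```python
def hsdp_groups(world_size, shard_size):
    """Sketch of the hybrid-sharded data parallel (HSDP) group layout:
    parameters are sharded across `shard_size` consecutive ranks
    (typically one node, keeping all-gathers on fast intra-node links)
    and replicated across world_size // shard_size such groups, so the
    gradient all-reduce is the only collective crossing nodes."""
    assert world_size % shard_size == 0
    shard_groups = [list(range(i, i + shard_size))
                    for i in range(0, world_size, shard_size)]
    replicate_groups = [list(range(i, world_size, shard_size))
                        for i in range(shard_size)]
    return shard_groups, replicate_groups

# 8 ranks, sharding within groups of 4 (e.g. one node each):
shards, replicas = hsdp_groups(world_size=8, shard_size=4)
assert shards == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert replicas == [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In FSDP2 terms, this corresponds to calling `fully_shard` with a 2D mesh whose dimensions play the replicate and shard roles above.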

The roadmaps are appreciated! Thanks, @gottbrath.

I’d be curious to learn more about, and contribute to, any upcoming scientific PyTorch features. I’m especially interested in areas like complex numbers, integration, and interpolation. I’d also love to hear about any forthcoming changes to optim or distributions.

In addition, it’d be nice if the core members (@smth, @ezyang, @albanD, @Lezcano, etc.) could articulate their current vision for scientific PyTorch. I am especially interested to learn where the core members currently delineate between features provided by PyTorch, first-party PyTorch packages (e.g., TorchAudio and TorchVision), and third-party PyTorch packages (e.g., PyTorch-Geometric and TorchMetrics), and the reasoning behind that split as it relates to infrastructural issues like compilation.

We have been depending on torchdata's datapipes in our new GNN dataloader for PyTorch, GraphBolt. It would be nice if the datapipes were not removed from torchdata, or if there were an easy way to migrate from datapipes to whatever new design is implemented.

Hey @0x00b1 great to hear from you again. What features are you interested in contributing to?

Broadly speaking, we don’t have specific goals over the next few months around areas like complex numbers or interpolation. However, we’re always excited to partner on how PyTorch can better serve the scientific community. There’s much happening at the intersection of hybrid modeling, integration, higher-order gradients, etc. My personal hope is that by driving performance, usability, and composability wins within generative AI modeling, we will help provide improved tooling to accelerate exciting sub-fields such as foundation models for multi-omic biology.

That said, I’d love to collaborate on specific features! Feel free to submit ideas via Issues or reach out on Slack.

Hi @mfbalin thanks for raising this issue.

Happy to follow up on this offline to figure out the best way forward here such that we don’t break what you’ve built!
Could you ping me on the pytorch slack and we can continue the discussion there?

@0x00b1 this is a hard question you have here!
For the scientific PyTorch part, I would defer to @jisaacso's great answer above.

My personal view on how and when we use different repos is as follows:

  • pytorch/pytorch is for features that are a) useful for almost all of our users, b) battle-tested and stable, and c) something we are ready to support for the next 5 years.
  • pytorch/* is for a variety of purposes, all driven (or once driven) by core contributors. First, extensions for a large group of users: by domain (vision, audio, rl, torchrec), use case (torchx, torchtune, torchtitan), or backend (executorch, xla, cpuinfo). Second, exploration (data, ao) and supporting infra repos (tutorials, examples, rfcs, test-infra, builder, pytorch.github.io). And many more (we have 99 repos in the org right now, haha).
  • third-party packages: basically the same as above, with the following delta: remove the “driven (or once driven) by core contributors” requirement, and add even more categories and diversity in the repos: Ecosystem | PyTorch (which lists only a small shortlist).

Defining specific rules for these boundaries is quite hard, since specific circumstances and other constraints (related to infrastructure, technical dependencies, lack of extension points, etc.) also weigh on the final decision, so it is usually made on a case-by-case basis.
Overall, though, I think we are pushing hard to keep the core components small and stable, with all the necessary extension points, and to recommend that users leverage (and participate in building) the very rich Python ecosystem instead of making PyTorch into a monorepo that must contain every feature/idea/option.

Hope that answers your question a little bit!