Model Training Foundations (Ray Train)
Learn the foundations of distributed training of machine learning models with Ray Train.
Introduction
Imports & Loading the Dataset
Defining the Model
Train Loop Per Worker
Defining the Training Loop Configuration
Configuring the Scaling Config
Defining the Model Wrapper
Building the Dataloader
Reporting Metrics and Checkpointing
Persistent Storage
Putting It All Together with TorchTrainer
Inspecting the Training Results
Inference with Your Trained Model
Full Chapter Notebook
Introduction
Train Loop Using Ray Data
Building a Ray Data-Backed Dataloader
Preparing and Loading the Dataset for Ray Data
Transformations with Ray Data
Configuring TorchTrainer and Launching Training
Full Chapter Notebook
Introduction
Checkpoint Loading for Fault Tolerance
Saving Fault Tolerant Checkpoints
Launching Fault Tolerant Training
Manually Restoring from Checkpoints
Cleaning up Cluster Storage and Conclusion
Concluding the Intro Tutorials and Next Steps
Full Chapter Notebook