Implementing DeepSpeed For Scalable Transformers: Superior Teaching With Gradient Checkpointing And Parallelism

Next Business 24

8 months ago

Implementing DeepSpeed For Scalable Transformers: Superior Teaching With Gradient Checkpointing And Parallelism

On this superior DeepSpeed tutorial, we provide a hands-on walkthrough of cutting-edge optimization methods for teaching huge language fashions successfully. By combining ZeRO optimization, mixed-precision teaching, gradient accumulation, and superior DeepSpeed configurations, the tutorial demonstrates the easiest way to maximise GPU memory utilization, in the reduction of teaching overhead, and permit scaling of transformer fashions in resource-constrained environments, much like Colab. Alongside model creation and training, it moreover covers effectivity monitoring, inference optimization, checkpointing, and benchmarking completely completely different ZeRO ranges, providing practitioners with every theoretical insights and wise code to hurry up model enchancment. Strive the FULL CODES proper right here.

import subprocess
import sys
import os
import json
import time
from pathlib import Path


def install_dependencies():
   """Arrange required packages for DeepSpeed in Colab"""
   print("🚀 Placing in DeepSpeed and dependencies...")
  
   subprocess.check_call([
       sys.executable, "-m", "pip", "install",
       "torch", "torchvision", "torchaudio", "--index-url",
       "https://download.pytorch.org/whl/cu118"
   ])
  
   subprocess.check_call([sys.executable, "-m", "pip", "install", "deepspeed"])
  
   subprocess.check_call([
       sys.executable, "-m", "pip", "install",
       "transformers", "datasets", "accelerate", "wandb"
   ])
  
   print("✅ Arrange full!")


install_dependencies()


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.info import DataLoader, Dataset
import deepspeed
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
from typing import Dict, Any
import argparse

We organize our Colab setting by placing in PyTorch with CUDA assist, DeepSpeed, and vital libraries like Transformers, Datasets, Velocity up, and Weights & Biases. We assure all of the items is ready so we’re capable of simply assemble and apply fashions with DeepSpeed. Strive the FULL CODES proper right here.

class SyntheticTextDataset(Dataset):
   """Synthetic dataset for demonstration capabilities"""
  
   def __init__(self, dimension: int = 1000, seq_length: int = 512, vocab_size: int = 50257):
       self.dimension = dimension
       self.seq_length = seq_length
       self.vocab_size = vocab_size
      
       self.info = torch.randint(0, vocab_size, (dimension, seq_length))
      
   def __len__(self):
       return self.dimension
  
   def __getitem__(self, idx):
       return {
           'input_ids': self.info[idx],
           'labels': self.info[idx].clone() 
       }

We create a SyntheticTextDataset the place we generate random token sequences to mimic precise textual content material info. We use these sequences as every inputs and labels, allowing us to quickly check out DeepSpeed teaching with out relying on an enormous exterior dataset. Strive the FULL CODES proper right here.

class AdvancedDeepSpeedTrainer:
   """Superior DeepSpeed coach with a variety of optimization methods"""
  
   def __init__(self, model_config: Dict[str, Any], ds_config: Dict[str, Any]):
       self.model_config = model_config
       self.ds_config = ds_config
       self.model = None
       self.engine = None
       self.tokenizer = None
      
   def create_model(self):
       """Create a GPT-2 style model for demonstration"""
       print("🧠 Creating model...")
      
       config = GPT2Config(
           vocab_size=self.model_config['vocab_size'],
           n_positions=self.model_config['seq_length'],
           n_embd=self.model_config['hidden_size'],
           n_layer=self.model_config['num_layers'],
           n_head=self.model_config['num_heads'],
           resid_pdrop=0.1,
           embd_pdrop=0.1,
           attn_pdrop=0.1,
       )
      
       self.model = GPT2LMHeadModel(config)
       self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
      
       self.tokenizer.pad_token = self.tokenizer.eos_token
      
       print(f"📊 Model parameters: {sum(p.numel() for p in self.model.parameters()):,}")
       return self.model
  
   def create_deepspeed_config(self):
       """Create full DeepSpeed configuration"""
       return {
           "train_batch_size": self.ds_config['train_batch_size'],
           "train_micro_batch_size_per_gpu": self.ds_config['micro_batch_size'],
           "gradient_accumulation_steps": self.ds_config['gradient_accumulation_steps'],
          
           "zero_optimization": {
               "stage": self.ds_config['zero_stage'],
               "allgather_partitions": True,
               "allgather_bucket_size": 5e8,
               "overlap_comm": True,
               "reduce_scatter": True,
               "reduce_bucket_size": 5e8,
               "contiguous_gradients": True,
               "cpu_offload": self.ds_config.get('cpu_offload', False)
           },
          
           "fp16": {
               "enabled": True,
               "loss_scale": 0,
               "loss_scale_window": 1000,
               "initial_scale_power": 16,
               "hysteresis": 2,
               "min_loss_scale": 1
           },
          
           "optimizer": {
               "variety": "AdamW",
               "params": {
                   "lr": self.ds_config['learning_rate'],
                   "betas": [0.9, 0.999],
                   "eps": 1e-8,
                   "weight_decay": 0.01
               }
           },
          
           "scheduler": {
               "variety": "WarmupLR",
               "params": {
                   "warmup_min_lr": 0,
                   "warmup_max_lr": self.ds_config['learning_rate'],
                   "warmup_num_steps": 100
               }
           },
          
           "gradient_clipping": 1.0,
          
           "wall_clock_breakdown": True,
          
           "memory_breakdown": True,
          
           "tensorboard": {
               "enabled": True,
               "output_path": "./logs/",
               "job_name": "deepspeed_advanced_tutorial"
           }
       }
  
   def initialize_deepspeed(self):
       """Initialize DeepSpeed engine"""
       print("⚡ Initializing DeepSpeed...")
      
       parser = argparse.ArgumentParser()
       parser.add_argument('--local_rank', variety=int, default=0)
       args = parser.parse_args([])
      
       self.engine, optimizer, _, lr_scheduler = deepspeed.initialize(
           args=args,
           model=self.model,
           config=self.create_deepspeed_config()
       )
      
       print(f"🎯 DeepSpeed engine initialized with ZeRO stage {self.ds_config['zero_stage']}")
       return self.engine
  
   def train_step(self, batch: Dict[str, torch.Tensor]) -> Dict[str, float]:
       """Perform a single teaching step with DeepSpeed optimizations"""
      
       input_ids = batch['input_ids'].to(self.engine.system)
       labels = batch['labels'].to(self.engine.system)
      
       outputs = self.engine(input_ids=input_ids, labels=labels)
       loss = outputs.loss
      
       self.engine.backward(loss)
      
       self.engine.step()
      
       return {
           'loss': loss.merchandise(),
           'lr': self.engine.lr_scheduler.get_last_lr()[0] if self.engine.lr_scheduler else 0
       }
  
   def apply(self, dataloader: DataLoader, num_epochs: int = 2):
       """Full teaching loop with monitoring"""
       print(f"🏋️ Starting teaching for {num_epochs} epochs...")
      
       self.engine.apply()
       total_steps = 0
      
       for epoch in fluctuate(num_epochs):
           epoch_loss = 0.0
           epoch_steps = 0
          
           print(f"n📈 Epoch {epoch + 1}/{num_epochs}")
          
           for step, batch in enumerate(dataloader):
               start_time = time.time()
              
               metrics = self.train_step(batch)
              
               epoch_loss += metrics['loss']
               epoch_steps += 1
               total_steps += 1
              
               if step % 10 == 0:
                   step_time = time.time() - start_time
                   print(f"  Step {step:4d} | Loss: {metrics['loss']:.4f} | "
                         f"LR: {metrics['lr']:.2e} | Time: {step_time:.3f}s")
              
               if step % 20 == 0 and hasattr(self.engine, 'monitor'):
                   self.log_memory_stats()
              
               if step >= 50: 
                   break
          
           avg_loss = epoch_loss / epoch_steps
           print(f"📊 Epoch {epoch + 1} completed | Frequent Loss: {avg_loss:.4f}")
      
       print("🎉 Teaching completed!")
  
   def log_memory_stats(self):
       """Log GPU memory statistics"""
       if torch.cuda.is_available():
           allotted = torch.cuda.memory_allocated() / 1024**3 
           reserved = torch.cuda.memory_reserved() / 1024**3  
           print(f"  💾 GPU Memory - Allotted: {allotted:.2f}GB | Reserved: {reserved:.2f}GB")
  
   def save_checkpoint(self, path: str):
       """Save model checkpoint using DeepSpeed"""
       print(f"💾 Saving checkpoint to {path}")
       self.engine.save_checkpoint(path)
  
   def demonstrate_inference(self, textual content material: str = "The best way ahead for AI is"):
       """Exhibit optimized inference with DeepSpeed"""
       print(f"n🔮 Working inference with fast: '{textual content material}'")
      
       inputs = self.tokenizer.encode(textual content material, return_tensors="pt").to(self.engine.system)
      
       self.engine.eval()
      
       with torch.no_grad():
           outputs = self.engine.module.generate(
               inputs,
               max_length=inputs.type[1] + 50,
               num_return_sequences=1,
               temperature=0.8,
               do_sample=True,
               pad_token_id=self.tokenizer.eos_token_id
           )
      
       generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
       print(f"📝 Generated textual content material: {generated_text}")
      
       self.engine.apply()

We assemble an end-to-end coach that creates a GPT-2 model, items a DeepSpeed config (ZeRO, FP16, AdamW, warmup scheduler, tensorboard), and initializes the engine. We then run atmosphere pleasant teaching steps with logging and memory statistics, save checkpoints, and present inference to verify optimization and know-how in a single place. Strive the FULL CODES proper right here.

def run_advanced_tutorial():
   """Important carry out to run the superior DeepSpeed tutorial"""
  
   print("🌟 Superior DeepSpeed Tutorial Starting...")
   print("=" * 60)
  
   model_config = {
       'vocab_size': 50257,
       'seq_length': 512,
       'hidden_size': 768, 
       'num_layers': 6,    
       'num_heads': 12
   }
  
   ds_config = {
       'train_batch_size': 16,
       'micro_batch_size': 4,
       'gradient_accumulation_steps': 4,
       'zero_stage': 2, 
       'learning_rate': 1e-4,
       'cpu_offload': False 
   }
  
   print("📋 Configuration:")
   print(f"  Model dimension: ~{sum(np.prod(type) for type in [[model_config['vocab_size'], model_config['hidden_size']], [model_config['hidden_size'], model_config['hidden_size']] * model_config['num_layers']]) / 1e6:.1f}M parameters")
   print(f"  ZeRO Stage: {ds_config['zero_stage']}")
   print(f"  Batch dimension: {ds_config['train_batch_size']}")
  
   coach = AdvancedDeepSpeedTrainer(model_config, ds_config)
  
   model = coach.create_model()
  
   engine = coach.initialize_deepspeed()
  
   print("n📚 Creating synthetic dataset...")
   dataset = SyntheticTextDataset(
       dimension=200,
       seq_length=model_config['seq_length'],
       vocab_size=model_config['vocab_size']
   )
  
   dataloader = DataLoader(
       dataset,
       batch_size=ds_config['micro_batch_size'],
       shuffle=True
   )
  
   print("n📊 Pre-training memory stats:")
   coach.log_memory_stats()
  
   coach.apply(dataloader, num_epochs=2)
  
   print("n📊 Publish-training memory stats:")
   coach.log_memory_stats()
  
   coach.demonstrate_inference("DeepSpeed permits atmosphere pleasant teaching of")
  
   checkpoint_path = "./deepspeed_checkpoint"
   coach.save_checkpoint(checkpoint_path)
  
   demonstrate_zero_stages()
   demonstrate_memory_optimization()
  
   print("n🎯 Tutorial completed effectively!")
   print("Key DeepSpeed choices demonstrated:")
   print("  ✅ ZeRO optimization for memory effectivity")
   print("  ✅ Mixed precision teaching (FP16)")
   print("  ✅ Gradient accumulation")
   print("  ✅ Finding out worth scheduling")
   print("  ✅ Checkpoint saving/loading")
   print("  ✅ Memory monitoring")


def demonstrate_zero_stages():
   """Exhibit completely completely different ZeRO optimization ranges"""
   print("n🔧 ZeRO Optimization Ranges Outlined:")
   print("  Stage 0: Disabled (baseline)")
   print("  Stage 1: Optimizer state partitioning (~4x memory low cost)")
   print("  Stage 2: Gradient partitioning (~8x memory low cost)")
   print("  Stage 3: Parameter partitioning (~Nx memory low cost)")
  
   zero_configs = {
       1: {"stage": 1, "reduce_bucket_size": 5e8},
       2: {"stage": 2, "allgather_partitions": True, "reduce_scatter": True},
       3: {"stage": 3, "stage3_prefetch_bucket_size": 5e8, "stage3_param_persistence_threshold": 1e6}
   }
  
   for stage, config in zero_configs.devices():
       estimated_memory_reduction = [1, 4, 8, "Nx"][stage]
       print(f"  📉 Stage {stage}: ~{estimated_memory_reduction}x memory low cost")


def demonstrate_memory_optimization():
   """Current memory optimization methods"""
   print("n🧠 Memory Optimization Strategies:")
   print("  🔄 Gradient Checkpointing: Commerce compute for memory")
   print("  📤 CPU Offloading: Switch optimizer states to CPU")
   print("  🗜️ Compression: Reduce communication overhead")
   print("  ⚡ Mixed Precision: Use FP16 for sooner teaching")

We orchestrate the full teaching run: set configs, assemble the GPT-2 model and DeepSpeed engine, create a synthetic dataset, monitor GPU memory, apply for two epochs, run inference, and save a checkpoint. We then make clear ZeRO ranges and highlight memory-optimization strategies, much like gradient checkpointing and CPU offloading, to know the trade-offs in observe. Strive the FULL CODES proper right here.

class DeepSpeedConfigGenerator:
   """Utility class to generate DeepSpeed configurations"""
  
   @staticmethod
   def generate_config(
       batch_size: int = 16,
       zero_stage: int = 2,
       use_cpu_offload: bool = False,
       learning_rate: float = 1e-4
   ) -> Dict[str, Any]:
       """Generate a whole DeepSpeed configuration"""
      
       config = {
           "train_batch_size": batch_size,
           "train_micro_batch_size_per_gpu": max(1, batch_size // 4),
           "gradient_accumulation_steps": max(1, batch_size // max(1, batch_size // 4)),
          
           "zero_optimization": {
               "stage": zero_stage,
               "allgather_partitions": True,
               "allgather_bucket_size": 5e8,
               "overlap_comm": True,
               "reduce_scatter": True,
               "reduce_bucket_size": 5e8,
               "contiguous_gradients": True
           },
          
           "fp16": {
               "enabled": True,
               "loss_scale": 0,
               "loss_scale_window": 1000,
               "initial_scale_power": 16,
               "hysteresis": 2,
               "min_loss_scale": 1
           },
          
           "optimizer": {
               "variety": "AdamW",
               "params": {
                   "lr": learning_rate,
                   "betas": [0.9, 0.999],
                   "eps": 1e-8,
                   "weight_decay": 0.01
               }
           },
          
           "scheduler": {
               "variety": "WarmupLR",
               "params": {
                   "warmup_min_lr": 0,
                   "warmup_max_lr": learning_rate,
                   "warmup_num_steps": 100
               }
           },
          
           "gradient_clipping": 1.0,
           "wall_clock_breakdown": True
       }
      
       if use_cpu_offload:
           config["zero_optimization"]["cpu_offload"] = True
           config["zero_optimization"]["pin_memory"] = True
      
       if zero_stage == 3:
           config["zero_optimization"].substitute({
               "stage3_prefetch_bucket_size": 5e8,
               "stage3_param_persistence_threshold": 1e6,
               "stage3_gather_16bit_weights_on_model_save": True
           })
      
       return config


def benchmark_zero_stages():
   """Benchmark completely completely different ZeRO ranges"""
   print("n🏁 Benchmarking ZeRO Ranges...")
  
   model_config = {
       'vocab_size': 50257,
       'seq_length': 256,
       'hidden_size': 512,
       'num_layers': 4,
       'num_heads': 8
   }
  
   outcomes = {}
  
   for stage in [1, 2]:  
       print(f"n🔬 Testing ZeRO Stage {stage}...")
      
       ds_config = {
           'train_batch_size': 8,
           'micro_batch_size': 2,
           'gradient_accumulation_steps': 4,
           'zero_stage': stage,
           'learning_rate': 1e-4
       }
      
       try:
           coach = AdvancedDeepSpeedTrainer(model_config, ds_config)
           model = coach.create_model()
           engine = coach.initialize_deepspeed()
          
           if torch.cuda.is_available():
               torch.cuda.reset_peak_memory_stats()
              
               dataset = SyntheticTextDataset(dimension=20, seq_length=model_config['seq_length'])
               dataloader = DataLoader(dataset, batch_size=ds_config['micro_batch_size'])
              
               start_time = time.time()
               for i, batch in enumerate(dataloader):
                   if i >= 5: 
                       break
                   coach.train_step(batch)
              
               end_time = time.time()
               peak_memory = torch.cuda.max_memory_allocated() / 1024**3 
              
               outcomes[stage] = {
                   'peak_memory_gb': peak_memory,
                   'time_per_step': (end_time - start_time) / 5
               }
              
               print(f"  📊 Peak Memory: {peak_memory:.2f}GB")
               print(f"  ⏱️ Time per step: {outcomes[stage]['time_per_step']:.3f}s")
          
           del coach, model, engine
           torch.cuda.empty_cache()
          
       apart from Exception as e:
           print(f"  ❌ Error with stage {stage}: {str(e)}")
  
   if len(outcomes) > 1:
       print(f"n📈 Comparability:")
       stage_1_memory = outcomes.get(1, {}).get('peak_memory_gb', 0)
       stage_2_memory = outcomes.get(2, {}).get('peak_memory_gb', 0)
      
       if stage_1_memory > 0 and stage_2_memory > 0:
           memory_reduction = (stage_1_memory - stage_2_memory) / stage_1_memory * 100
           print(f"  🎯 Memory low cost from Stage 1 to 2: {memory_reduction:.1f}%")


def demonstrate_advanced_features():
   """Exhibit further superior DeepSpeed choices"""
   print("n🚀 Superior DeepSpeed Choices:")
  
   print("  🎚️ Dynamic Loss Scaling: Robotically adjusts FP16 loss scaling")
  
   print("  🗜️ Gradient Compression: Reduces communication overhead")
  
   print("  🔄 Pipeline Parallelism: Splits model all through items")
  
   print("  🧑‍🎓 Skilled Parallelism: Setting pleasant Mixture-of-Consultants teaching")
  
   print("  📚 Curriculum Finding out: Progressive teaching strategies")


if __name__ == "__main__":
   print(f"🖥️ CUDA Obtainable: {torch.cuda.is_available()}")
   if torch.cuda.is_available():
       print(f"   GPU: {torch.cuda.get_device_name()}")
       print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f}GB")
  
   try:
       run_advanced_tutorial()
      
       benchmark_zero_stages()
      
       demonstrate_advanced_features()
      
   apart from Exception as e:
       print(f"❌ Error all through tutorial: {str(e)}")
       print("💡 Concepts for troubleshooting:")
       print("  - Assure you've got GPU runtime enabled in Colab")
       print("  - Try lowering batch_size or model dimension if coping with memory factors")
       print("  - Permit CPU offloading in ds_config if wished")

We generate reusable DeepSpeed configurations, benchmark ZeRO ranges to verify memory and tempo, and showcase superior choices much like dynamic loss scaling and pipeline/MoE parallelism. We moreover detect CUDA, run the full tutorial end-to-end, and provide clear troubleshooting concepts, allowing us to iterate confidently in Colab.

In conclusion, we obtain an entire understanding of how DeepSpeed enhances model teaching effectivity by placing a stability between effectivity and memory trade-offs. From leveraging ZeRO ranges for memory low cost to creating use of FP16 mixed precision and CPU offloading, the tutorial showcases extremely efficient strategies that make large-scale teaching accessible on modest {{hardware}}. By the highest, learners can have expert and optimized a GPT-style model, benchmarked configurations, monitored GPU sources, and explored superior choices much like pipeline parallelism and gradient compression.

Strive the FULL CODES proper right here. Be at liberty to check out our GitHub Net web page for Tutorials, Codes and Notebooks. Moreover, be at liberty to watch us on Twitter and don’t overlook to hitch our 100k+ ML SubReddit and Subscribe to our E-newsletter.

Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is devoted to harnessing the potential of Artificial Intelligence for social good. His newest endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth safety of machine learning and deep learning info that’s every technically sound and easily understandable by a big viewers. The platform boasts of over 2 million month-to-month views, illustrating its recognition amongst audiences.

Elevate your perspective with NextTech Info, the place innovation meets notion.
Uncover the most recent breakthroughs, get distinctive updates, and be a part of with a worldwide neighborhood of future-focused thinkers.
Unlock tomorrow’s tendencies proper now: be taught further, subscribe to our publication, and turn into part of the NextTech group at NextTech-news.com

Keep forward of the curve with NextBusiness 24. Discover extra tales, subscribe to our e-newsletter, and be a part of our rising neighborhood at nextbusiness24.com