
DeepSpeed fails to offload operations to the CPU the way I thought it would when the GPU runs out of memory, so I assume I have some setting wrong. When the batch size is increased, it gives an error like:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.04 GiB (GPU 1; 79.15 GiB total capacity; 68.07 GiB already allocated; 5.90 GiB free; 72.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

(This doesn't happen with smaller batch sizes.)
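For what it's worth, the max_split_size_mb hint in that message is applied through an environment variable set before CUDA is initialized; a minimal sketch (the 128 MiB value is an illustrative assumption, not something the error message recommends):

import os

# Must be set before the first CUDA allocation; caps the size of cached
# allocator blocks to reduce fragmentation. 128 MiB is an arbitrary example.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable so the allocator sees it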

I'm using the AdamW optimizer (per the config below) on an Azure VM with an AMD EPYC 7V13 64-core processor.

The DeepSpeed config is:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
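As far as I understand, ZeRO stage 2 with offload_optimizer only moves optimizer states to the CPU; the parameters and, above all, the activations created by a larger batch stay on the GPU, so nothing extra gets offloaded as the batch grows. A stage-3 variant that also offloads parameters might look like the sketch below (only the zero_optimization block is shown; the key names are standard DeepSpeed options, and the "auto" values rely on the HuggingFace integration filling them in):

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
}

Even with stage 3, activation memory is not offloaded, which would explain why the OOM scales with batch size; gradient checkpointing (sketched further down) is the usual lever for that.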

Training is done with the HuggingFace Trainer, and the DeepSpeed config is used by passing the config dict to TrainingArguments:

import json

from transformers import TrainingArguments

# Load the ZeRO stage-2 config and hand it to the Trainer arguments.
with open("./Multi_Modal_Model/zero_config/stage_2_config.json") as f:
    z_optimiser = json.load(f)

training_args = TrainingArguments(
    ...
    deepspeed=z_optimiser,
    ...
)
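As an aside, TrainingArguments also accepts the path to the JSON config file directly, which makes the json.load step optional:

training_args = TrainingArguments(
    output_dir="./results",  # hypothetical output directory, for illustration
    deepspeed="./Multi_Modal_Model/zero_config/stage_2_config.json",
)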

I'm using PyTorch 1.13 and trying to train a HuggingFace CLIP model.
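For reference, gradient checkpointing is the usual way to cut activation memory, since offload doesn't cover activations; with a HuggingFace CLIP model it can be enabled like this (a minimal sketch, and the checkpoint name is just an example):

from transformers import CLIPModel

# Example checkpoint; substitute whatever model is actually being trained.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Recompute activations during the backward pass instead of storing them,
# trading extra compute for a large cut in activation memory.
model.gradient_checkpointing_enable()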

Anyone know what I'm doing wrong?

• Were you able to find out what the problem was? I am seeing the same issue. – Commented Jun 26, 2024 at 16:29
