
DeepSpeed fails to offload operations to the CPU the way I thought it would when the GPU runs out of memory, so I assume I have some setting wrong. When the batch size is increased, it gives an error like:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.04 GiB (GPU 1; 79.15 GiB total capacity; 68.07 GiB already allocated; 5.90 GiB free; 72.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

(This doesn't happen with smaller batch sizes.)
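For what it's worth, the max_split_size_mb hint in that message is applied through an environment variable set before CUDA is initialized; a minimal sketch (the 128 MiB value is an illustrative assumption, not something the error message recommends):

import os

# Must be set before the first CUDA allocation; caps the size of cached
# allocator blocks to reduce fragmentation. 128 MiB is an arbitrary example.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable so the allocator sees it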

I'm using the AdamW optimizer (per the config below) on an Azure VM with an AMD EPYC 7V13 64-core processor.

The DeepSpeed config is:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
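As far as I understand, ZeRO stage 2 with offload_optimizer only moves optimizer states to the CPU; the parameters and, above all, the activations created by a larger batch stay on the GPU, so nothing extra gets offloaded as the batch grows. A stage-3 variant that also offloads parameters might look like the sketch below (only the zero_optimization block is shown; the key names are standard DeepSpeed options, and the "auto" values rely on the HuggingFace integration filling them in):

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
}

Even with stage 3, activation memory is not offloaded, which would explain why the OOM scales with batch size; gradient checkpointing (sketched further down) is the usual lever for that.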

Training is done with the HuggingFace Trainer, and the DeepSpeed config is used by passing the config dict to TrainingArguments:

import json

from transformers import TrainingArguments

# Load the ZeRO stage-2 config and hand it to the Trainer arguments.
with open("./Multi_Modal_Model/zero_config/stage_2_config.json") as f:
    z_optimiser = json.load(f)

training_args = TrainingArguments(
    ...
    deepspeed=z_optimiser,
    ...
)
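As an aside, TrainingArguments also accepts the path to the JSON config file directly, which makes the json.load step optional:

training_args = TrainingArguments(
    output_dir="./results",  # hypothetical output directory, for illustration
    deepspeed="./Multi_Modal_Model/zero_config/stage_2_config.json",
)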

I'm using PyTorch 1.13 and trying to train a HuggingFace CLIP model.
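For reference, gradient checkpointing is the usual way to cut activation memory, since offload doesn't cover activations; with a HuggingFace CLIP model it can be enabled like this (a minimal sketch, and the checkpoint name is just an example):

from transformers import CLIPModel

# Example checkpoint; substitute whatever model is actually being trained.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Recompute activations during the backward pass instead of storing them,
# trading extra compute for a large cut in activation memory.
model.gradient_checkpointing_enable()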

Anyone know what I'm doing wrong?

• Were you able to find out what the problem was? I am seeing the same issue. – Commented Jun 26, 2024 at 16:29
