I'm training my model with the accelerate package, which uses DeepSpeed internally, but I can't understand the gradient_accumulation_steps param in its configuration.
To my knowledge, gradient_accumulation_steps usually means the number of mini-batches accumulated into a single optimizer step during training. For example, if this value is 5, then we need to call accelerator.backward(loss) 5 times before calling optimizer.step(). But I can already do this in my training script, so why do I need to configure it in the JSON file? I tried setting gradient_accumulation_steps to a value greater than 1, and the cross-entropy loss was stuck at 11.76 for some reason.
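Concretely, this is the kind of manual accumulation I mean by "achieving this in my training script" (a simplified sketch with 5 steps, not my actual code; model, opti and train_loader stand in for my real objects):

accum_steps = 5  # what I understand gradient_accumulation_steps to mean

for step, data in enumerate(train_loader):
    loss = model(data) / accum_steps      # scale so the accumulated gradient matches one big batch
    accelerator.backward(loss)            # gradients add up across the 5 calls
    if (step + 1) % accum_steps == 0:
        opti.step()                       # one optimizer update per 5 mini-batches
        opti.zero_grad()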
Additionally, the train_micro_batch_size_per_gpu param is also confusing. I ran some experiments changing this value, but it had no effect on VRAM usage, even though I expected it would. DeepSpeed's official documentation is not helpful here. What do these two options actually mean, and how should they be set in practice?
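From what I can piece together, the batch-size-related options are supposed to be related like this, but that still doesn't tell me how to choose them (my own reading; the numbers below are just hypothetical):

# my understanding of how the three DeepSpeed options relate (hypothetical numbers)
micro_batch_per_gpu = 2      # train_micro_batch_size_per_gpu
grad_accum_steps = 1         # gradient_accumulation_steps
num_gpus = 4                 # world size, just an example
train_batch_size = micro_batch_per_gpu * grad_accum_steps * num_gpus  # -> 8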
Here is my configuration:
{
    "fp16": {
        "enabled": true,
        "auto_cast": true,
        "loss_scale": 1.0
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": 2,
    "wall_clock_breakdown": false,
    "zero_allow_untested_optimizer": true
}
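In case the wiring matters: I understand the JSON above can also be handed to accelerate directly in code, roughly like this (a sketch of the DeepSpeedPlugin route, not my actual launcher setup; the file name is illustrative):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# "ds_config.json" is the file shown above; the "auto" fields are, as I understand it,
# filled in by accelerate based on the training setup
ds_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=ds_plugin)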
And here is my training script:
import accelerate
from accelerate.utils import DataLoaderConfiguration
from torch.optim import RAdam

dataloader_config = DataLoaderConfiguration(dispatch_batches=True, split_batches=False)
accelerator = accelerate.Accelerator(project_config=proj_config, log_with="tensorboard", dataloader_config=dataloader_config)  # proj_config is set up earlier (omitted)
...
opti = RAdam(model.parameters(), lr=1e-3)
model, opti, train_loader = accelerator.prepare(model, opti, train_loader)
mini_batch_idx = 0
for idx, data in enumerate(train_loader):
    loss = model(data)
    accelerator.backward(loss)      # gradients accumulate across iterations
    mini_batch_idx += 1
    if mini_batch_idx > 31:         # one optimizer step every 32 mini-batches
        opti.step()
        opti.zero_grad()            # reset the accumulated gradients after stepping
        mini_batch_idx = 0
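For what it's worth, I'm aware accelerate also has its own accumulation helper. If the JSON setting is supposed to replace the manual counting above, is the intended pattern something like this instead? (A sketch based on the accumulate() context manager; I haven't verified how it interacts with the DeepSpeed JSON.)

accelerator = accelerate.Accelerator(gradient_accumulation_steps=32)  # instead of counting to 32 by hand
model, opti, train_loader = accelerator.prepare(model, opti, train_loader)

for data in train_loader:
    with accelerator.accumulate(model):   # accelerate decides when a real optimizer step happens
        loss = model(data)
        accelerator.backward(loss)
        opti.step()
        opti.zero_grad()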