
I'm training my model with the accelerate package, which uses DeepSpeed internally, but I can't understand the gradient_accumulation_steps param in its configuration.

To my knowledge, gradient_accumulation_steps usually means the number of mini-batches accumulated before each optimizer step in the actual training. For example, if this value is 5, then we need to call accelerator.backward(loss) five times before calling optimizer.step(). But I can already do this in my training script, so why do I need to configure it in the JSON file? I tried setting gradient_accumulation_steps to a value greater than 1, and as a result the cross-entropy loss was stuck at 11.76 for some reason.
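
To be concrete, here is a minimal sketch of the manual accumulation I mean (ACCUM_STEPS, model, optimizer, and train_loader are placeholders, not my real training objects):

ACCUM_STEPS = 5

for idx, data in enumerate(train_loader):
    loss = model(data) / ACCUM_STEPS  # scale so the accumulated gradient averages out
    accelerator.backward(loss)
    if (idx + 1) % ACCUM_STEPS == 0:
        optimizer.step()       # update weights once per ACCUM_STEPS micro-batches
        optimizer.zero_grad()  # clear the accumulated gradients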

Additionally, the train_micro_batch_size_per_gpu param is also very confusing. I ran some experiments changing this value, but it had no effect on VRAM usage as I had hoped. DeepSpeed's official documentation is not helpful at all. What do these two options mean, and how should I configure them in practice?
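
For reference, my current understanding from the docs (which I may be misreading) is that DeepSpeed ties the batch-related options together like this:

# world_size = number of GPUs participating in training
train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
# e.g. micro batch 2, 4 accumulation steps, 2 GPUs: 2 * 4 * 2 = 16 samples per optimizer step

If that is right, I still don't see how it interacts with the accumulation I do by hand in my script.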

Here is my configuration:

{
    "fp16": {
        "enabled": true,
        "auto_cast": true,
        "loss_scale": 1.0
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": 2,
    "wall_clock_breakdown": false,
    "zero_allow_untested_optimizer": true
}

And here is my training script:

import accelerate
from accelerate.utils import DataLoaderConfiguration
from torch.optim import RAdam

dataloader_config = DataLoaderConfiguration(dispatch_batches=True, split_batches=False)
accelerator = accelerate.Accelerator(project_config=proj_config, log_with="tensorboard", dataloader_config=dataloader_config)
...

opti = RAdam(model.parameters(), lr=1e-3)
model, opti, train_loader = accelerator.prepare(model, opti, train_loader)
mini_batch_idx = 0

for idx, data in enumerate(train_loader):
    loss = model(data)
    accelerator.backward(loss)
    mini_batch_idx += 1
    if mini_batch_idx > 31:  # step once every 32 micro-batches
        opti.step()
        opti.zero_grad()  # reset accumulated gradients before the next cycle
        mini_batch_idx = 0
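
For comparison, I'm aware that Accelerate has its own accumulation helper; a minimal sketch based on my reading of the Accelerate docs (the value 32 mirrors my loop above) would be:

accelerator = accelerate.Accelerator(gradient_accumulation_steps=32)
model, opti, train_loader = accelerator.prepare(model, opti, train_loader)

for data in train_loader:
    with accelerator.accumulate(model):  # skips gradient sync until the boundary step
        loss = model(data)
        accelerator.backward(loss)
        opti.step()       # Accelerate only applies the update every 32 micro-batches
        opti.zero_grad()

I'm not sure whether this is supposed to replace the JSON setting, or whether both must agree.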
