I'm training my model with the accelerate package, which uses DeepSpeed internally, but I can't understand the gradient_accumulation_steps param in its configuration.
To my knowledge, gradient_accumulation_steps usually means the number of mini-batches accumulated into a single optimizer step during training. For example, if this value is 5, then we need to call accelerator.backward(loss) 5 times before calling optimizer.step(). But I can already do this in my training script, so why do I need to configure it in the JSON file? I tried setting gradient_accumulation_steps to a value greater than 1, and the cross-entropy loss was stuck at 11.76 for some reason.
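Concretely, this is the kind of manual accumulation I mean by "achieving this in my training script" (a simplified sketch with 5 steps, not my actual code; model, opti and train_loader stand in for my real objects):

accum_steps = 5  # what I understand gradient_accumulation_steps to mean

for step, data in enumerate(train_loader):
    loss = model(data) / accum_steps      # scale so the accumulated gradient matches one big batch
    accelerator.backward(loss)            # gradients add up across the 5 calls
    if (step + 1) % accum_steps == 0:
        opti.step()                       # one optimizer update per 5 mini-batches
        opti.zero_grad()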
Additionally, the train_micro_batch_size_per_gpu param is also confusing. I ran some experiments changing this value, but it had no effect on VRAM usage, even though I expected it would. DeepSpeed's official documentation is not helpful here. What do these two options actually mean, and how should they be set in practice?
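From what I can piece together, the batch-size-related options are supposed to be related like this, but that still doesn't tell me how to choose them (my own reading; the numbers below are just hypothetical):

# my understanding of how the three DeepSpeed options relate (hypothetical numbers)
micro_batch_per_gpu = 2      # train_micro_batch_size_per_gpu
grad_accum_steps = 1         # gradient_accumulation_steps
num_gpus = 4                 # world size, just an example
train_batch_size = micro_batch_per_gpu * grad_accum_steps * num_gpus  # -> 8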
Here is my configuration:
{
    "fp16": {
        "enabled": true,
        "auto_cast": true,
        "loss_scale": 1.0
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": 2,
    "wall_clock_breakdown": false,
    "zero_allow_untested_optimizer": true
}
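In case the wiring matters: I understand the JSON above can also be handed to accelerate directly in code, roughly like this (a sketch of the DeepSpeedPlugin route, not my actual launcher setup; the file name is illustrative):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# "ds_config.json" is the file shown above; the "auto" fields are, as I understand it,
# filled in by accelerate based on the training setup
ds_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=ds_plugin)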
And here is my training script:
import accelerate
from accelerate.utils import DataLoaderConfiguration
from torch.optim import RAdam

dataloader_config = DataLoaderConfiguration(dispatch_batches=True, split_batches=False)
accelerator = accelerate.Accelerator(project_config=proj_config, log_with="tensorboard", dataloader_config=dataloader_config)  # proj_config is set up earlier (omitted)
...
opti = RAdam(model.parameters(), lr=1e-3)
model, opti, train_loader = accelerator.prepare(model, opti, train_loader)
mini_batch_idx = 0
for idx, data in enumerate(train_loader):
    loss = model(data)
    accelerator.backward(loss)      # gradients accumulate across iterations
    mini_batch_idx += 1
    if mini_batch_idx > 31:         # one optimizer step every 32 mini-batches
        opti.step()
        opti.zero_grad()            # reset the accumulated gradients after stepping
        mini_batch_idx = 0
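For what it's worth, I'm aware accelerate also has its own accumulation helper. If the JSON setting is supposed to replace the manual counting above, is the intended pattern something like this instead? (A sketch based on the accumulate() context manager; I haven't verified how it interacts with the DeepSpeed JSON.)

accelerator = accelerate.Accelerator(gradient_accumulation_steps=32)  # instead of counting to 32 by hand
model, opti, train_loader = accelerator.prepare(model, opti, train_loader)

for data in train_loader:
    with accelerator.accumulate(model):   # accelerate decides when a real optimizer step happens
        loss = model(data)
        accelerator.backward(loss)
        opti.step()
        opti.zero_grad()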