1 vote
1 answer
715 views

I want to use DeepSpeed for training LLMs along with the Hugging Face Trainer. But when I use DeepSpeed with the Trainer I get the error "AttributeError: 'DummyOptim' object has no attribute 'step'"...
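For context on the DummyOptim error above: Accelerate substitutes placeholder DummyOptim/DummyScheduler objects when the DeepSpeed JSON config declares its own "optimizer"/"scheduler" sections, so calling .step() on them directly fails. A minimal sketch of a DeepSpeed config of that shape (all values are illustrative placeholders, not taken from the question):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "zero_optimization": { "stage": 2 },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "weight_decay": "auto" }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": { "warmup_num_steps": "auto" }
  }
}
```

With a config like this, the optimizer step has to go through the Trainer/Accelerate machinery rather than being called on the optimizer object directly; alternatively, dropping the "optimizer" section lets the Trainer create a real optimizer itself.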
0 votes
3 answers
5k views

If I use the following Dockerfile: FROM python:3.11-bullseye ENV APP_HOME /app WORKDIR $APP_HOME COPY requirements.txt /app RUN pip install uv && uv pip install --system --no-cache -r ...
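Laid out on separate lines, the Dockerfile from the excerpt above reads as follows (the truncated final argument to -r is an assumption based on the COPY line, not quoted from the question):

```dockerfile
FROM python:3.11-bullseye

ENV APP_HOME /app
WORKDIR $APP_HOME

COPY requirements.txt /app

# Install uv, then use it to install the dependencies system-wide.
# "requirements.txt" here is assumed from the COPY line above.
RUN pip install uv && \
    uv pip install --system --no-cache -r requirements.txt
```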
1 vote
1 answer
70 views

When I tried to accelerate model training with DeepSpeed, a problem occurred when evaluating the model on the validation dataset. Here is the problem code snippet: def evaluate(self, ...
0 votes
0 answers
158 views

I'm trying to train a small LLM on my local computer, which has a single GPU with 16 GB of VRAM. I kept encountering VRAM OOM errors, so I was looking for a way to reduce VRAM use. DeepSpeed seemed interesting, so ...
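For the single-GPU OOM question above: a common DeepSpeed approach on a 16 GB card is ZeRO stage 2 with optimizer-state offload to CPU, combined with a small micro-batch and gradient accumulation. A minimal sketch of such a ds_config.json (all numbers are illustrative, not from the question):

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true }
}
```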
0 votes
0 answers
93 views

I want to log my model's accuracy after each epoch and its final accuracy at the end but I cannot find a simple way of doing this. I am following this tutorial: https://www.youtube.com/watch?v=...
1 vote
0 answers
492 views

I'm training my model with the accelerate package, which uses DeepSpeed internally. But I can't understand the gradient_accumulation_steps param in its configuration. To my knowledge, ...
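As background for the gradient_accumulation_steps question: under accumulation, gradients from several micro-batches are combined before each optimizer step, so the effective batch size multiplies up. A tiny sketch of the arithmetic (function names are illustrative, not part of the accelerate/DeepSpeed API):

```python
# Gradients from several micro-batches are summed before a single
# optimizer step, so the effective batch size is the product below.

def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_processes: int) -> int:
    return micro_batch_size * gradient_accumulation_steps * num_processes

def optimizer_steps(num_micro_batches: int,
                    gradient_accumulation_steps: int) -> int:
    # One optimizer step per `gradient_accumulation_steps` micro-batches.
    return num_micro_batches // gradient_accumulation_steps

print(effective_batch_size(4, 8, 2))  # 64
print(optimizer_steps(100, 8))        # 12
```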
3 votes
1 answer
6k views

Currently, I am trying to fine-tune the Korean Llama model (13B) on a private dataset through DeepSpeed, Flash Attention 2, and the TRL SFTTrainer. I am using 2 * A100 80G GPUs for the fine-tuning; however, ...
0 votes
0 answers
2k views

When I try to install the deepspeed library in a conda virtual environment, the following error occurs: Collecting deepspeed Using cached deepspeed-0.12.6.tar.gz (1.2 MB) Preparing metadata (...
2 votes
0 answers
863 views

I am using accelerate launch with DeepSpeed ZeRO stage 2 for multi-GPU training and inference and am struggling to free up GPU memory. Basically, my program has three parts: Load first model... -...
2 votes
0 answers
342 views

This is my first time writing on this platform; I apologise if there is any issue with the way the question is asked. I am trying to run a Python file with certain DeepSpeed configurations, such ...
2 votes
1 answer
640 views

I am training dolly2.0. When I do so, I get the following output from the terminal: If I use DeepSpeed to perform this training, I note that the learning rate didn't improve: Why didn't the learning ...
0 votes
1 answer
136 views

I want to come up with a very simple Lightning example using DeepSpeed, but it refuses to parallelize layers even when set to stage 3. I'm just blowing up the model by adding FC layers in the hope ...
1 vote
1 answer
530 views

I am training the Llama model in a multi-node environment using huggingface/accelerate, and if I run it as follows to profile it, the program dies due to a problem with the SSH connection to ...
0 votes
1 answer
3k views

I have installed a package (the llava model from GitHub) as python install -e . In my conda env, I have loaded llava as: >>python >>import llava I put the import in a .py file; when I used "...
1 vote
1 answer
3k views

Is there any way to load a Hugging Face model in multi GPUs and use those GPUs for inferences as well? Like, there is this model which can be loaded on a single GPU (default cuda:0) and run for ...
1 vote
0 answers
849 views

DeepSpeed fails to offload operations to the CPU, as I thought it should do when it runs out of GPU memory. I guess I have some setting wrong. When the batch size is increased, it gives an error like ...
2 votes
0 answers
226 views

Hi, I am trying to train the dolly-v2-12b model (or any of the Dolly models) on a custom dataset using an A10 GPU. I am coding in PyCharm on Windows. The task is similar to Q&A. I am trying to use this ...
1 vote
1 answer
454 views

I am wondering if Vertex AI Training can be used for distributed training with the Hugging Face Trainer and DeepSpeed. All I have seen are examples with the native torch distribution strategy. It would be ...
1 vote
0 answers
530 views

I tried to use DeepSpeed to run tensor parallelism on StarCoder, as I have multiple small GPUs, none of which can hold the whole model on its own. from transformers import AutoModelForCausalLM, ...
1 vote
0 answers
118 views

The example provided in Memory Requirements - DeepSpeed 0.10.1 documentation is as follows: python -c 'from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold; \...
1 vote
0 answers
103 views

I am new to DeepSpeed and have some experience in deep learning. I want to know how to set the maximum GPU memory to use for each device when using DeepSpeed. I have done nothing. I have no thoughts my ...
1 vote
0 answers
69 views

When I try to use a DeepSpeed example to finetune an OPT 1.3b model on my local machine, I get an unexpected error related to the following code snippet: template <typename T> __global__ ...