22 questions
1
vote
1
answer
70
views
model.eval() returns a NoneType object when using deepspeed
When I wanted to accelerate model training by using deepspeed, a problem occurred when I tried to evaluate the model on the validation dataset. Here is the problematic code snippet:
def evaluate(self, ...
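A likely cause, offered as an assumption since the snippet above is truncated: plain `torch.nn.Module.eval()` returns the module itself, so the pattern `model = model.eval()` is common, but DeepSpeed's engine wrapper reportedly returns `None` from `eval()`, silently replacing the model. A minimal stand-in class (not DeepSpeed code) reproducing that failure mode:

```python
class EngineLike:
    """Toy stand-in for a wrapper whose eval() returns None.

    This models an assumed failure mode; it is not DeepSpeed's
    implementation.
    """

    def __init__(self):
        self.training = True

    def eval(self):
        self.training = False  # switches mode but returns None


engine = EngineLike()
model = engine.eval()   # bug: model is now None, not the engine
print(model)            # None

engine.eval()           # safe: call eval() as a statement instead
print(engine.training)  # False
```

The safe pattern is to call `eval()` as a statement and keep using the original reference, rather than rebinding the model to its return value.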
0
votes
0
answers
158
views
DeepSpeed model initialization memory overhead fix?
I'm trying to train a small LLM on my local computer, which has a single GPU with 16 GB VRAM. I kept encountering VRAM OOM errors, so I was looking for a way to reduce VRAM use. DeepSpeed seemed interesting, so ...
0
votes
0
answers
93
views
How can I log accuracy with wandb and deepspeed
I want to log my model's accuracy after each epoch, and its final accuracy at the end, but I cannot find a simple way of doing this.
I am following this tutorial: https://www.youtube.com/watch?v=...
1
vote
0
answers
492
views
What does "gradient_accumulation_steps" do in deepspeed?
I'm training my model with the accelerate package, which uses deepspeed internally. But I can't understand the gradient_accumulation_steps param in its configuration.
To my knowledge, ...
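As background for this question: `gradient_accumulation_steps` is the number of micro-batches whose gradients are summed before each optimizer step, and DeepSpeed ties its three batch-size fields together by a simple identity. A back-of-envelope sketch of that arithmetic (the numbers are illustrative assumptions, not from the question):

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """DeepSpeed expects:
    train_batch_size ==
        train_micro_batch_size_per_gpu
        * gradient_accumulation_steps
        * number of GPUs (world size).
    """
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Hypothetical setup: 2 GPUs, micro-batch 4 per GPU, gradients
# accumulated over 8 forward/backward passes per optimizer step.
print(effective_batch_size(4, 8, 2))  # 64
```

So raising `gradient_accumulation_steps` grows the effective batch size (and slows optimizer steps) without increasing per-step activation memory.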
1
vote
1
answer
715
views
Deepspeed : AttributeError: 'DummyOptim' object has no attribute 'step'
I want to use deepspeed for training LLMs along with the Huggingface Trainer. But when I use deepspeed along with the Trainer I get the error "AttributeError: 'DummyOptim' object has no attribute 'step'"...
2
votes
0
answers
863
views
How do I free up GPU memory when using Accelerate with Deepspeed
I am using accelerate launch with DeepSpeed ZeRO stage 2 for multi-GPU training and inference, and am struggling to free up GPU memory.
Basically, my programme has three parts:
Load first model...
-...
2
votes
0
answers
342
views
Deepspeed JSON config file being ignored
This is my first time writing on this platform; I apologise if there is any issue with the way the question is asked.
I am trying to run a python file with certain deepspeed configurations such ...
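For reference when debugging a config that appears to be ignored: a minimal DeepSpeed JSON config of the kind passed via `--deepspeed ds_config.json` might look like this (all field values here are illustrative assumptions, not the asker's settings):

```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}
```

Note that launchers such as the Huggingface Trainer may override fields from their own command-line arguments, which can make the file appear to be ignored.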
0
votes
1
answer
136
views
DeepSpeed Lightning refusing to parallelize layers even when setting to stage 3
I want to put together a very simple Lightning example using DeepSpeed, but it refuses to parallelize layers even when set to stage 3.
I'm just blowing up the model by adding FC layers in the hope ...
0
votes
3
answers
5k
views
Using uv to install packages in the bitnami/deepspeed:0.14.0 Docker image fails with 'uv: command not found'
If I use the following Dockerfile:
FROM python:3.11-bullseye
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY requirements.txt /app
RUN pip install uv && uv pip install --system --no-cache -r ...
1
vote
1
answer
530
views
Problems when profiling LLM training using "huggingface/accelerate" with Nsight Systems
I am training the Llama model in a multi-node environment using huggingface/accelerate, and when I run it as follows to profile it, the program dies due to a problem with the SSH connection to ...
0
votes
0
answers
2k
views
pip install deepspeed ERROR: error: subprocess-exited-with-error/error: metadata-generation-failed
When I try to install the deepspeed library in the conda virtual environment, the following error occurs
Collecting deepspeed
Using cached deepspeed-0.12.6.tar.gz (1.2 MB)
Preparing metadata (...
0
votes
1
answer
3k
views
LLaVA: deepspeed cannot detect an editable-installed python package/module
I have installed a package (the LLaVA model from GitHub) with pip install -e .
In my conda env, I can load llava as:
>> python
>>> import llava
I put the import in a .py file, and when I used "...
3
votes
1
answer
6k
views
DeepSpeed multi-GPU finetuning does not work
Currently, I am trying to fine-tune the Korean Llama model (13B) on a private dataset with DeepSpeed, Flash Attention 2, and the TRL SFTTrainer. I am using 2 * A100 80G GPUs for the fine-tuning; however, ...
1
vote
0
answers
849
views
Deepspeed not offloading to CPU
Deepspeed fails to offload operations to the CPU, as I thought it should when it runs out of GPU memory. I guess I have a setting wrong. When the batch size is increased, it gives an error like
...
1
vote
1
answer
3k
views
Loading a HF Model in Multiple GPUs and Run Inferences in those GPUs (Not Training or Finetuning)
Is there any way to load a Hugging Face model across multiple GPUs and use those GPUs for inference as well?
For example, there is this model, which can be loaded on a single GPU (default cuda:0) and run for ...
1
vote
1
answer
454
views
Does Vertex AI Training for Distributed Training Across Multi-Nodes Work With HuggingFace Trainer + Deepspeed?
I am wondering if Vertex AI Training can be used for distributed training with the Huggingface Trainer and deepspeed? All I have seen are examples with the native torch distribution strategy.
It would be ...
1
vote
0
answers
530
views
Deepspeed tensor parallelism hits a tensor-alignment problem when using the tokenizer
I tried to use deepspeed to run tensor parallelism on StarCoder, as I have multiple small GPUs, none of which can hold the whole model on its own.
from transformers import AutoModelForCausalLM, ...
1
vote
0
answers
118
views
Why does the DeepSpeed `estimate_zero2_model_states_mem_needs_…` API report the same memory per CPU with different `offload_optimizer` option values?
The example provided in Memory Requirements - DeepSpeed 0.10.1 documentation is as follows:
python -c 'from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold; \...
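As background for interpreting that estimator's output, here is a back-of-envelope sketch of where per-GPU memory goes under ZeRO-2 with fp16 Adam training, and what `offload_optimizer` moves to the CPU. The byte counts are rough assumptions for illustration, not DeepSpeed's exact estimator constants:

```python
def rough_zero2_gpu_mem_gb(num_params: float, num_gpus: int,
                           offload_optimizer: bool) -> float:
    """Back-of-envelope per-GPU memory for ZeRO-2 + fp16 Adam.

    Assumed byte counts (illustrative, not DeepSpeed's estimator):
      fp16 weights: 2 B/param and fp16 grads: 2 B/param, kept on GPU;
      optimizer states (fp32 master copy + Adam moments): 12 B/param,
      sharded across num_gpus and movable to CPU when offloading.
    """
    gpu_bytes = 2 * num_params + 2 * num_params   # weights + grads
    if not offload_optimizer:
        gpu_bytes += 12 * num_params / num_gpus   # sharded optimizer states
    return gpu_bytes / 1024**3


# A hypothetical 1.3B-parameter model on 4 GPUs:
print(rough_zero2_gpu_mem_gb(1.3e9, 4, offload_optimizer=False))
print(rough_zero2_gpu_mem_gb(1.3e9, 4, offload_optimizer=True))
```

Under this sketch, offloading changes GPU memory (the optimizer-state term moves to host RAM) but the optimizer states themselves are the same size either way, which is one plausible reading of why a CPU-side estimate could look unchanged across `offload_optimizer` values.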
2
votes
0
answers
226
views
Training time for dolly-v2-12b on a custom dataset with an A10 gpu
Hi, I am trying to train dolly-v2-12b, or any of the Dolly models, on a custom dataset using an A10 GPU. I am coding in PyCharm on Windows. The task is similar to Q&A. I am trying to use this ...
1
vote
0
answers
103
views
how to set max gpu memory use for each device when using deepspeed for distributed training?
I am new to deepspeed and have some experience in deep learning. I want to know how to set the max GPU memory to use for each device when using deepspeed.
I have done nothing yet and have no ideas.
my ...
2
votes
1
answer
640
views
How can I use decaying learning rate in DeepSpeed?
I am training dolly2.0.
When I do so, I get the following output from the terminal:
If I use DeepSpeed to perform this training, I notice that the learning rate doesn't decay:
Why didn't the learning ...
1
vote
0
answers
69
views
DeepSpeed: no operator matches operands error
When I try to use a DeepSpeed example to finetune an OPT 1.3b model on my local machine, I get an unexpected error related to the following code snippet:
template <typename T>
__global__ ...