22 questions
1
vote
1
answer
70
views
model.eval() returns a NoneType object when using deepspeed
When I wanted to accelerate model training by using deepspeed, a problem occurred when I tried to evaluate the model on the validation dataset. Here is the problematic code snippet:
def evaluate(self, ...
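A likely cause, offered as an assumption since the snippet above is truncated: plain `torch.nn.Module.eval()` returns the module itself, so the pattern `model = model.eval()` is common, but DeepSpeed's engine wrapper reportedly returns `None` from `eval()`, silently replacing the model. A minimal stand-in class (not DeepSpeed code) reproducing that failure mode:

```python
class EngineLike:
    """Toy stand-in for a wrapper whose eval() returns None.

    This models an assumed failure mode; it is not DeepSpeed's
    implementation.
    """

    def __init__(self):
        self.training = True

    def eval(self):
        self.training = False  # switches mode but returns None


engine = EngineLike()
model = engine.eval()   # bug: model is now None, not the engine
print(model)            # None

engine.eval()           # safe: call eval() as a statement instead
print(engine.training)  # False
```

The safe pattern is to call `eval()` as a statement and keep using the original reference, rather than rebinding the model to its return value.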
0
votes
0
answers
158
views
DeepSpeed model initialization memory overhead fix?
I'm trying to train a small LLM on my local computer, which has a single GPU with 16 GB VRAM. I kept encountering VRAM OOM errors, so I was looking for a way to reduce VRAM use. DeepSpeed seemed interesting, so ...
0
votes
0
answers
93
views
How can I log accuracy with wandb and deepspeed
I want to log my model's accuracy after each epoch, and its final accuracy at the end, but I cannot find a simple way of doing this.
I am following this tutorial: https://www.youtube.com/watch?v=...
1
vote
0
answers
492
views
What does "gradient_accumulation_steps" do in deepspeed?
I'm training my model with the accelerate package, which uses deepspeed internally. But I can't understand the gradient_accumulation_steps param in its configuration.
To my knowledge, ...
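As background for this question: `gradient_accumulation_steps` is the number of micro-batches whose gradients are summed before each optimizer step, and DeepSpeed ties its three batch-size fields together by a simple identity. A back-of-envelope sketch of that arithmetic (the numbers are illustrative assumptions, not from the question):

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """DeepSpeed expects:
    train_batch_size ==
        train_micro_batch_size_per_gpu
        * gradient_accumulation_steps
        * number of GPUs (world size).
    """
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Hypothetical setup: 2 GPUs, micro-batch 4 per GPU, gradients
# accumulated over 8 forward/backward passes per optimizer step.
print(effective_batch_size(4, 8, 2))  # 64
```

So raising `gradient_accumulation_steps` grows the effective batch size (and slows optimizer steps) without increasing per-step activation memory.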
1
vote
1
answer
715
views
Deepspeed : AttributeError: 'DummyOptim' object has no attribute 'step'
I want to use deepspeed for training LLMs along with the Huggingface Trainer. But when I use deepspeed along with the Trainer I get the error "AttributeError: 'DummyOptim' object has no attribute 'step'"...
2
votes
0
answers
863
views
How do I free up GPU memory when using Accelerate with Deepspeed
I am using accelerate launch with DeepSpeed ZeRO stage 2 for multi-GPU training and inference, and am struggling to free up GPU memory.
Basically, my programme has three parts:
Load first model...
-...
2
votes
0
answers
342
views
Deepspeed JSON config file being ignored
This is my first time writing on this platform; I apologise if there is any issue with the way the question is asked.
I am trying to run a python file with certain deepspeed configurations such ...
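For reference when debugging a config that appears to be ignored: a minimal DeepSpeed JSON config of the kind passed via `--deepspeed ds_config.json` might look like this (all field values here are illustrative assumptions, not the asker's settings):

```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}
```

Note that launchers such as the Huggingface Trainer may override fields from their own command-line arguments, which can make the file appear to be ignored.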
0
votes
1
answer
136
views
DeepSpeed Lightning refusing to parallelize layers even when setting to stage 3
I want to put together a very simple Lightning example using DeepSpeed, but it refuses to parallelize layers even when set to stage 3.
I'm just blowing up the model by adding FC layers in the hope ...
0
votes
3
answers
5k
views
Using uv to install packages in the bitnami/deepspeed:0.14.0 Docker image fails with 'uv: command not found'
If I use the following Dockerfile:
FROM python:3.11-bullseye
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY requirements.txt /app
RUN pip install uv && uv pip install --system --no-cache -r ...
1
vote
1
answer
530
views
Problems when profiling LLM training using "huggingface/accelerate" with Nsight Systems
I am training the Llama model in a multi-node environment using huggingface/accelerate, and when I run it as follows to profile it, the program dies due to a problem with the SSH connection to ...
0
votes
0
answers
2k
views
pip install deepspeed ERROR: error: subprocess-exited-with-error/error: metadata-generation-failed
When I try to install the deepspeed library in the conda virtual environment, the following error occurs
Collecting deepspeed
Using cached deepspeed-0.12.6.tar.gz (1.2 MB)
Preparing metadata (...
0
votes
1
answer
3k
views
LLaVA: deepspeed cannot detect an editable-installed python package/module
I have installed a package (the LLaVA model from GitHub) with pip install -e .
In my conda env, I can load llava as:
>> python
>>> import llava
I put the import in a .py file, and when I used "...
3
votes
1
answer
6k
views
DeepSpeed multi-GPU finetuning does not work
Currently, I am trying to fine-tune the Korean Llama model (13B) on a private dataset with DeepSpeed, Flash Attention 2, and the TRL SFTTrainer. I am using 2 * A100 80G GPUs for the fine-tuning; however, ...
1
vote
0
answers
849
views
Deepspeed not offloading to CPU
Deepspeed fails to offload operations to the CPU, as I thought it should when it runs out of GPU memory. I guess I have a setting wrong. When the batch size is increased, it gives an error like
...
1
vote
1
answer
3k
views
Loading a HF Model in Multiple GPUs and Run Inferences in those GPUs (Not Training or Finetuning)
Is there any way to load a Hugging Face model across multiple GPUs and use those GPUs for inference as well?
For example, there is this model, which can be loaded on a single GPU (default cuda:0) and run for ...
1
vote
1
answer
454
views
Does Vertex AI Training for Distributed Training Across Multi-Nodes Work With HuggingFace Trainer + Deepspeed?
I am wondering if Vertex AI Training can be used for distributed training with the Huggingface Trainer and deepspeed? All I have seen are examples with the native torch distribution strategy.
It would be ...
1
vote
0
answers
530
views
Deepspeed tensor parallelism hits a tensor-alignment problem when using the tokenizer
I tried to use deepspeed to run tensor parallelism on StarCoder, as I have multiple small GPUs, none of which can hold the whole model on its own.
from transformers import AutoModelForCausalLM, ...
1
vote
0
answers
118
views
Why does the DeepSpeed `estimate_zero2_model_states_mem_needs_…` API report the same memory per CPU with different `offload_optimizer` option values?
The example provided in Memory Requirements - DeepSpeed 0.10.1 documentation is as follows:
python -c 'from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold; \...
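As background for interpreting that estimator's output, here is a back-of-envelope sketch of where per-GPU memory goes under ZeRO-2 with fp16 Adam training, and what `offload_optimizer` moves to the CPU. The byte counts are rough assumptions for illustration, not DeepSpeed's exact estimator constants:

```python
def rough_zero2_gpu_mem_gb(num_params: float, num_gpus: int,
                           offload_optimizer: bool) -> float:
    """Back-of-envelope per-GPU memory for ZeRO-2 + fp16 Adam.

    Assumed byte counts (illustrative, not DeepSpeed's estimator):
      fp16 weights: 2 B/param and fp16 grads: 2 B/param, kept on GPU;
      optimizer states (fp32 master copy + Adam moments): 12 B/param,
      sharded across num_gpus and movable to CPU when offloading.
    """
    gpu_bytes = 2 * num_params + 2 * num_params   # weights + grads
    if not offload_optimizer:
        gpu_bytes += 12 * num_params / num_gpus   # sharded optimizer states
    return gpu_bytes / 1024**3


# A hypothetical 1.3B-parameter model on 4 GPUs:
print(rough_zero2_gpu_mem_gb(1.3e9, 4, offload_optimizer=False))
print(rough_zero2_gpu_mem_gb(1.3e9, 4, offload_optimizer=True))
```

Under this sketch, offloading changes GPU memory (the optimizer-state term moves to host RAM) but the optimizer states themselves are the same size either way, which is one plausible reading of why a CPU-side estimate could look unchanged across `offload_optimizer` values.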
2
votes
0
answers
226
views
Training time for dolly-v2-12b on a custom dataset with an A10 gpu
Hi, I am trying to train dolly-v2-12b, or any of the Dolly models, on a custom dataset using an A10 GPU. I am coding in PyCharm on Windows. The task is similar to Q&A. I am trying to use this ...
1
vote
0
answers
103
views
how to set max gpu memory use for each device when using deepspeed for distributed training?
I am new to deepspeed and have some experience in deep learning. I want to know how to set the max GPU memory to use for each device when using deepspeed.
I have done nothing yet and have no ideas.
my ...
2
votes
1
answer
640
views
How can I use decaying learning rate in DeepSpeed?
I am training dolly2.0.
When I do so, I get the following output from the terminal:
If I use DeepSpeed to perform this training, I notice that the learning rate doesn't decay:
Why didn't the learning ...
1
vote
0
answers
69
views
DeepSpeed: no operator matches operands error
When I try to use a DeepSpeed example to finetune an OPT 1.3b model on my local machine, I get an unexpected error related to the following code snippet:
template <typename T>
__global__ ...