
[Tracker] [bnb] Supporting device_map containing GPU and CPU devices #19090

Closed
younesbelkada opened this issue Sep 17, 2022 · 20 comments

@younesbelkada
Contributor

Feature request

We should be able to provide a custom device_map when using 8-bit models with bitsandbytes. This would give users more control over which modules they want to quantize.

Linked issue: bitsandbytes-foundation/bitsandbytes#40

Motivation

Users should be able to pass their own custom device_map and choose which modules should be quantized and which should not.
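
For illustration, a rough sketch of the intended usage (hypothetical for now, since mixed CPU/GPU maps are not yet supported with 8-bit loading), assuming a GPT-Neo checkpoint:

from transformers import AutoModelForCausalLM

# Hypothetical target usage: some modules on GPU 0, some offloaded to the CPU,
# while still requesting 8-bit loading for the GPU modules.
device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.h": 0,
    "transformer.ln_f": "cpu",
    "lm_head": 0,
}

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-125M",
    device_map=device_map,
    load_in_8bit=True,
)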

Your contribution

Try coding this enhancement!

@z80maniac

z80maniac commented Sep 18, 2022

UPDATE (for future readers): the title was changed.


I think that the title of this issue is a little bit misleading. Technically, a custom device_map is already supported for bitsandbytes, as long as all the layers are on GPU.

For example, in the linked issue, this device_map works correctly:

    device_map = {
        "transformer.wte": 0,
        "transformer.wpe": 0,
        "transformer.ln_f": 0,
        "lm_head": 0,
        "transformer.h.0": 0,
        "transformer.h.1": 0,
        "transformer.h.2": 0,
        "transformer.h.3": 0,
        "transformer.h.4": 0,
        "transformer.h.5": 0,
        "transformer.h.6": 0,
        "transformer.h.7": 0,
        "transformer.h.8": 0,
        "transformer.h.9": 0,
        "transformer.h.10": 0,
        "transformer.h.11": 0
    }

And I believe there would be no problem using 1 instead of 0 for any transformer.* layer if you have more than one GPU (though I may be mistaken; I didn't find any specific info in the docs about using bitsandbytes with multiple GPUs). I suppose replacing all 0s with 1s would also work. So I think users can already customize the device map, as long as it doesn't put anything on the CPU. A hypothetical two-GPU split is sketched below.
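
For illustration only, a two-GPU variant of the map above might look like this (untested here, and assumes two visible GPUs):

# Hypothetical two-GPU map: first half of the blocks on GPU 0,
# second half plus ln_f and lm_head on GPU 1.
device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    **{f"transformer.h.{i}": 0 for i in range(6)},
    **{f"transformer.h.{i}": 1 for i in range(6, 12)},
    "transformer.ln_f": 1,
    "lm_head": 1,
}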

The original issue was not about a custom map. It was about supporting the load_in_8bit flag for models that are split between CPU and GPU.

@younesbelkada younesbelkada changed the title [Tracker] [bnb] Supporting custom device_map [Tracker] [bnb] Supporting device_map containing GPU and CPU devices Sep 18, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@z80maniac

If you think this still needs to be addressed please comment on this thread.

unstale

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@z80maniac

If you think this still needs to be addressed please comment on this thread.

unstale

I guess this will be my monthly routine...

@younesbelkada
Contributor Author

Hi,
The PR #20281 will not be merged until a fix is found on the bitsandbytes side.
Could you please check out that PR if you want to use this feature for now? Thanks.

@z80maniac

I've just tested that PR and it works. Thank you!

I tested it with a 13B model on a GTX 3060. Without load_in_8bit, only 10 layers fit on the GPU. With that patch and load_in_8bit=True, 19 layers fit on the GPU, which gives a 30% inference speedup in my case.

For some reason when I test it on my initial example, it gives this warning:

/home/user/test/bnb-test/transformers/src/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(

However, I was not able to reproduce it in my other more complex program.

In the PR's discussion it was said:

this will result in weights offloaded on the CPU to not be converted in int8 at all

I expected this much, but I think it's still better than nothing.

Though, are there any gotchas due to the fact that CPU layers are not converted to 8-bit?

Also, not sure how to proceed next. You said:

we should probably wait until bitsandbytes supports weights offloading in 8-bit to add this feature

So I suppose this issue should remain open? I will then add more info to my initial issue at the bitsandbytes repo.

@younesbelkada
Contributor Author

Thank you very much for your feedback, and I am happy it worked for your use case!

For some reason when I test it on my initial example, it gives this warning:

This is because your input_ids are on the CPU when you run inference! Make sure to move input_ids to the device of the first layers (so here, I guess, your GPU) before running generate.
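
For reference, when calling .generate() directly (outside of pipeline) that would look roughly like this, assuming model and tokenizer are already loaded and the first layers sit on GPU 0:

# Minimal sketch: move the prompt to the GPU that holds the first layers
# before generating (device 0 here is an assumption).
inputs = tokenizer("It was", return_tensors="pt")
input_ids = inputs.input_ids.to(0)
output_ids = model.generate(input_ids, max_length=32)
print(tokenizer.decode(output_ids[0]))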

Though, are there some gotchas in the fact that CPU layers are not converted to 8bit?

I did not quite get your question here, but CPU layers are indeed kept in their native dtype, which can be quite confusing. For example, you could provide a device_map that contains only CPU layers and still load your model with load_in_8bit - users would think they are loading their model in 8-bit on their CPU when that is actually not the case.
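
If you want to double-check what actually got converted, here is a quick sanity-check sketch (assuming model is the already-loaded 8-bit model):

import torch
import bitsandbytes as bnb

# List which linear layers were converted to int8 (bnb.nn.Linear8bitLt)
# and which stayed as regular nn.Linear in their native dtype (e.g. on the CPU).
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        print(name, "-> int8")
    elif isinstance(module, torch.nn.Linear):
        print(name, "->", module.weight.dtype, "on", module.weight.device)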

So I suppose this issue should remain open? I will then add more info to my initial issue at the bitsandbytes repo.

Yes, it can remain open. But feel free to also jump into PR #20281 to give your opinion and stress that you find this feature useful. You can add more information on the bitsandbytes repo as well!

@z80maniac

This is because your input_ids are on the CPU when you run inference! Make sure to move input_ids to the device of the first layers (so here, I guess, your GPU) before running generate.

I use the following code:

from transformers import pipeline

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    model_kwargs={
        "device_map": device_map,
        "load_in_8bit": load_in_8bit
    }
)

print("\n", pipe("It was")[0]["generated_text"])

Not sure where I am supposed to set input_ids here.

I did not quite get your question here

I mean, purely from a technical standpoint, are there some downsides to mixing 8bit and 16/32bit layers?

@younesbelkada
Contributor Author

Not sure where I am supposed to set input_ids here.

Thanks for sharing the code! It's clearer for me now. Can you try adding device=0 as follows:

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    device=0,
    model_kwargs={
        "device_map": device_map,
        "load_in_8bit": load_in_8bit
    }
)

I mean, purely from a technical standpoint, are there some downsides to mixing 8bit and 16/32bit layers?

Indeed, from a technical standpoint I don't see any downside.

@z80maniac

When I add device=0 I get this:

Traceback (most recent call last):
  File "/home/user/test/bnb-test/main.py", line 28, in <module>
    pipe = pipeline(
  File "/home/user/test/bnb-test/transformers/src/transformers/pipelines/__init__.py", line 870, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/home/user/test/bnb-test/transformers/src/transformers/pipelines/text_generation.py", line 64, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/test/bnb-test/transformers/src/transformers/pipelines/base.py", line 778, in __init__
    self.model = self.model.to(self.device)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in to
    return self._apply(convert)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

The full code for clarity:

from transformers import pipeline

auto_map = False
load_in_8bit = True

if auto_map:
    device_map = "auto"
else:
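    # Manual map: embeddings, lm_head and transformer block 0 on GPU 0; ln_f and blocks 1-11 on the CPU.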
    device_map = {
        "transformer.wte": 0,
        "transformer.wpe": 0,
        "transformer.ln_f": "cpu",
        "lm_head": 0,
        "transformer.h.0": 0,
        "transformer.h.1": "cpu",
        "transformer.h.2": "cpu",
        "transformer.h.3": "cpu",
        "transformer.h.4": "cpu",
        "transformer.h.5": "cpu",
        "transformer.h.6": "cpu",
        "transformer.h.7": "cpu",
        "transformer.h.8": "cpu",
        "transformer.h.9": "cpu",
        "transformer.h.10": "cpu",
        "transformer.h.11": "cpu"
    }

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    device=0,
    max_length=32,
    model_kwargs={
        "device_map": device_map,
        "load_in_8bit": load_in_8bit
    }
)

print("\n", pipe("It was")[0]["generated_text"])

The error occurs even when load_in_8bit = False.

Also, in any case, the original warning is pretty confusing. It says You are calling .generate() with the input_ids, but I never call .generate() myself.

@younesbelkada
Contributor Author

younesbelkada commented Nov 18, 2022

Thanks for sharing! I think it is fine; for now I would say you can leave the pipeline without device=0. I expect only a small speedup from it anyway, since accelerate copies the input_ids created on the CPU to the model's device at the beginning, and copies the result back to the CPU. Let me get back to you on this to see if I can find a solution.

the reason it says generate() is because pipeline calls .generate() under the hood here

@z80maniac

the reason it says generate() is because pipeline calls .generate() under the hood here

I know, but to an end user it still won't be immediately clear what the problem is just from reading that error message. It also says how to fix it:

Please make sure that you have put input_ids to the correct device
by calling for example input_ids = input_ids.to('cuda') before running .generate()

But that advice is not applicable at all in this situation, which adds even more confusion. Maybe the pipeline call should produce a different error message?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@z80maniac

unstale

Also, I added some comments in the PR discussion:
#20281 (comment)
#20281 (comment)

@github-actions

github-actions bot commented Jan 6, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@z80maniac

unstale

Technically, I personally don't need this fix anymore, since in my project I applied the hack described in the PR.
Though it would be nice to have it properly integrated into transformers.

@huggingface deleted a comment from github-actions bot Feb 2, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@younesbelkada
Contributor Author

This should be solved by the introduction of BitsAndBytesConfig in #21579

@z80maniac

Yes, indeed it works. Thank you, @younesbelkada!

For completeness' sake, here's the final working version:

import torch
from transformers import BitsAndBytesConfig, pipeline

device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.ln_f": "cpu",
    "lm_head": 0,
    "transformer.h.0": 0,
    "transformer.h.1": "cpu",
    "transformer.h.2": "cpu",
    "transformer.h.3": "cpu",
    "transformer.h.4": "cpu",
    "transformer.h.5": "cpu",
    "transformer.h.6": "cpu",
    "transformer.h.7": "cpu",
    "transformer.h.8": "cpu",
    "transformer.h.9": "cpu",
    "transformer.h.10": "cpu",
    "transformer.h.11": "cpu"
}


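# Quantization settings: load in 8-bit, keep CPU-offloaded modules in fp32,
# and skip quantizing lm_head.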
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    llm_int8_skip_modules=["lm_head"]
)

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    torch_dtype=torch.float16,
    model_kwargs={
        "device_map": device_map,
        "quantization_config": quantization_config
    }
)

print("\n", pipe("It was")[0]["generated_text"])
