
Flux.1 Schnell, memory issue on AMD Rocm #4341

Open
grigio opened this issue Aug 13, 2024 · 17 comments
Labels
AMD Issue related to AMD driver support. Potential Bug User is reporting a bug. This should be tested.

Comments


grigio commented Aug 13, 2024

Expected Behavior

I expect it should work because I can run SDXL with AMD Ryzen 7 7700 on Linux

Actual Behavior

[image: screenshot of the HIP out-of-memory error]

Steps to Reproduce

I installed ComfyUI via Docker with this environment:

    environment:
      CLI_ARGS: "--lowvram --use-split-cross-attention"
      HSA_OVERRIDE_GFX_VERSION: "10.3.0"
      HIP_VISIBLE_DEVICES: 0
      ROCR_VISIBLE_DEVICES: 0
      PYTORCH_HIP_ALLOC_CONF: "garbage_collection_threshold:0.6,max_split_size_mb:6144"
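For context, a compose-file sketch of where this environment block typically sits (the service name, image, and port mapping are assumptions; the `/dev/kfd` and `/dev/dri` device mappings are what ROCm containers generally require):

```yaml
services:
  comfyui-rocm:
    image: comfyui-rocm:latest   # assumed image name
    devices:
      - /dev/kfd                 # ROCm compute interface
      - /dev/dri                 # GPU render nodes
    group_add:
      - video
    environment:
      CLI_ARGS: "--lowvram --use-split-cross-attention"
      HSA_OVERRIDE_GFX_VERSION: "10.3.0"
      HIP_VISIBLE_DEVICES: 0
      ROCR_VISIBLE_DEVICES: 0
      PYTORCH_HIP_ALLOC_CONF: "garbage_collection_threshold:0.6,max_split_size_mb:6144"
    ports:
      - "8188:8188"
```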

Debug Logs

Error occurred when executing SamplerCustomAdvanced:

HIP out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacty of 512.00 MiB of which 16.00 MiB is free. Of the allocated memory 204.22 MiB is allocated by PyTorch, and 177.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

  File "/root/ComfyUI/execution.py", line 152, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "/root/ComfyUI/execution.py", line 82, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "/root/ComfyUI/execution.py", line 75, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "/root/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 612, in sample
    samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
  File "/root/ComfyUI/comfy/samplers.py", line 716, in sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "/root/ComfyUI/comfy/samplers.py", line 695, in inner_sample
    samples = sampler.sample(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
  File "/root/ComfyUI/comfy/samplers.py", line 600, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
  File "/usr/local/lib64/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/ComfyUI/comfy/k_diffusion/sampling.py", line 143, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
  File "/root/ComfyUI/comfy/samplers.py", line 299, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
  File "/root/ComfyUI/comfy/samplers.py", line 682, in __call__
    return self.predict_noise(*args, **kwargs)
  File "/root/ComfyUI/comfy/samplers.py", line 685, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
  File "/root/ComfyUI/comfy/samplers.py", line 279, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
  File "/root/ComfyUI/comfy/samplers.py", line 228, in calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
  File "/root/ComfyUI/custom_nodes/ComfyUI-Advanced-ControlNet/adv_control/utils.py", line 68, in apply_model_uncond_cleanup_wrapper
    return orig_apply_model(self, *args, **kwargs)
  File "/root/ComfyUI/comfy/model_base.py", line 145, in apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
  File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/ComfyUI/comfy/ldm/flux/model.py", line 150, in forward
    out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control)
  File "/root/ComfyUI/comfy/ldm/flux/model.py", line 129, in forward_orig
    img = block(img, vec=vec, pe=pe)
  File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/ComfyUI/comfy/ldm/flux/layers.py", line 233, in forward
    output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
  File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/ComfyUI/comfy/ops.py", line 63, in forward
    return self.forward_comfy_cast_weights(*args, **kwargs)
  File "/root/ComfyUI/comfy/ops.py", line 58, in forward_comfy_cast_weights
    weight, bias = cast_bias_weight(self, input)
  File "/root/ComfyUI/comfy/ops.py", line 42, in cast_bias_weight
    weight = cast_to(s.weight, dtype, device, non_blocking=non_blocking)
  File "/root/ComfyUI/comfy/ops.py", line 24, in cast_to
    return weight.to(device=device, dtype=dtype, non_blocking=non_blocking)
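A quick arithmetic sketch of the figures in the OOM message (values in MiB exactly as printed; the attribution of the remainder to display and other processes is an assumption):

```python
# Back-of-envelope on the OOM message above (all values in MiB, as printed).
total = 512.00            # total capacity reported for GPU 0
free = 16.00              # free memory at the time of the failure
allocated = 204.22        # actively allocated by PyTorch
reserved_unused = 177.78  # held by the caching allocator but unallocated

held_by_pytorch = allocated + reserved_unused          # 382.00 MiB
other_consumers = total - free - held_by_pytorch       # ~114 MiB (display, other processes)

print(f"PyTorch holds {held_by_pytorch:.2f} MiB; other consumers ~{other_consumers:.2f} MiB")
# The 177.78 MiB of cached-but-unused memory exceeds the failed 90 MiB
# request, so fragmentation is plausible, but the real constraint is the
# 512 MiB carve-out itself, far too small for any Flux variant.
```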


### Other

comfyui-rocm | [INFO] Running set-proxy script...
comfyui-rocm | [INFO] Continue without proxy.
comfyui-rocm | [INFO] Running pre-start script...
comfyui-rocm | [INFO] Continue without pre-start script.
comfyui-rocm | ########################################
comfyui-rocm | [INFO] Starting ComfyUI...
comfyui-rocm | ########################################
comfyui-rocm | [ComfyUI-Manager] 'distutils' package not found. Activating fallback mode for compatibility.
comfyui-rocm | [START] Security scan
comfyui-rocm | [DONE] Security scan
comfyui-rocm | ## ComfyUI-Manager: installing dependencies done.
comfyui-rocm | ** ComfyUI startup time: 2024-08-13 19:33:29.232847
comfyui-rocm | ** Platform: Linux
comfyui-rocm | ** Python version: 3.10.14 (main, Mar 21 2024, 16:45:28) [GCC]
comfyui-rocm | ** Python executable: /usr/bin/python3
comfyui-rocm | ** ComfyUI Path: /root/ComfyUI
comfyui-rocm | ** Log path: /root/comfyui.log
comfyui-rocm |
comfyui-rocm | Prestartup times for custom nodes:
comfyui-rocm | 0.5 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Manager
comfyui-rocm |
comfyui-rocm | Total VRAM 512 MB, total RAM 47379 MB
comfyui-rocm | pytorch version: 2.1.2+rocm6.1.3
comfyui-rocm | Set vram state to: LOW_VRAM
comfyui-rocm | Device: cuda:0 AMD Radeon Graphics : native
comfyui-rocm | Using split optimization for cross attention
comfyui-rocm | [Prompt Server] web root: /root/ComfyUI/web
comfyui-rocm | ### Loading: ComfyUI-Manager (V2.48.7)
comfyui-rocm | ### ComfyUI Revision: 174 [39fb74c] | Released on '2024-08-13'
comfyui-rocm | [Crystools INFO] Crystools version: 1.16.6
comfyui-rocm | [Crystools INFO] CPU: AMD Ryzen 7 7700 8-Core Processor - Arch: x86_64 - OS: Linux 6.1.0-18-amd64
comfyui-rocm | [Crystools ERROR] Could not init pynvml (Nvidia).NVML Shared Library Not Found
comfyui-rocm | [Crystools WARNING] No GPU with CUDA detected.
Efficiency Nodes: Attempting to add Control Net options to the 'HiRes-Fix Script' Node (comfyui_controlnet_aux add-on)...Success!
comfyui-rocm | ### Loading: ComfyUI-Impact-Pack (V6.2)
comfyui-rocm | ### Loading: ComfyUI-Impact-Pack (Subpack: V0.6)
comfyui-rocm | [Impact Pack] Wildcards loading done.
comfyui-rocm | ### Loading: ComfyUI-Inspire-Pack (V0.83)
comfyui-rocm | [AnimateDiffEvo] - ERROR - No motion models found. Please download one and place in: ['/root/ComfyUI/custom_nodes/ComfyUI-AnimateDiff-Evolved/models', '/root/ComfyUI/models/animatediff_models']
comfyui-rocm |
comfyui-rocm | Import times for custom nodes:
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/websocket_image_save.py
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Crystools-save
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/sdxl_prompt_styler
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/cg-use-everywhere
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/comfyui_controlnet_aux
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/AIGODLIKE-ComfyUI-Translation
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI_IPAdapter_plus
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Custom-Scripts
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Frame-Interpolation
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Advanced-ControlNet
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Manager
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI_essentials
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI-AnimateDiff-Evolved
comfyui-rocm | 0.0 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Inspire-Pack
comfyui-rocm | 0.1 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Crystools
comfyui-rocm | 0.1 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Impact-Pack
comfyui-rocm | 0.3 seconds: /root/ComfyUI/custom_nodes/ComfyUI-VideoHelperSuite
comfyui-rocm | 0.4 seconds: /root/ComfyUI/custom_nodes/efficiency-nodes-comfyui
comfyui-rocm |
comfyui-rocm | Starting server
comfyui-rocm |
comfyui-rocm | To see the GUI go to: http://0.0.0.0:8188
comfyui-rocm | FETCH DATA from: /root/ComfyUI/custom_nodes/ComfyUI-Manager/extension-node-map.json [DONE]
comfyui-rocm | got prompt
comfyui-rocm | model weight dtype torch.bfloat16, manual cast: None
comfyui-rocm | model_type FLOW
comfyui-rocm | /usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
comfyui-rocm | warnings.warn(
comfyui-rocm | Requested to load FluxClipModel_
comfyui-rocm | Loading 1 new model
comfyui-rocm | clip missing: ['text_projection.weight']
comfyui-rocm | Requested to load Flux
comfyui-rocm | Loading 1 new model
comfyui-rocm | loaded partially 64.0 60.7852783203125 0
0%| | 0/4 [01:11<?, ?it/s]
comfyui-rocm | !!! Exception during processing!!! HIP out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacty of 512.00 MiB of which 16.00 MiB is free. Of the allocated memory 204.22 MiB is allocated by PyTorch, and 177.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
comfyui-rocm | Traceback (most recent call last):
comfyui-rocm | File "/root/ComfyUI/execution.py", line 152, in recursive_execute
comfyui-rocm | output_data, output_ui = get_output_data(obj, input_data_all)
comfyui-rocm | File "/root/ComfyUI/execution.py", line 82, in get_output_data
comfyui-rocm | return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
comfyui-rocm | File "/root/ComfyUI/execution.py", line 75, in map_node_over_list
comfyui-rocm | results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
comfyui-rocm | File "/root/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 612, in sample
comfyui-rocm | samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 716, in sample
comfyui-rocm | output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 695, in inner_sample
comfyui-rocm | samples = sampler.sample(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 600, in sample
comfyui-rocm | samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
comfyui-rocm | File "/usr/local/lib64/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
comfyui-rocm | return func(*args, **kwargs)
comfyui-rocm | File "/root/ComfyUI/comfy/k_diffusion/sampling.py", line 143, in sample_euler
comfyui-rocm | denoised = model(x, sigma_hat * s_in, **extra_args)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 299, in __call__
comfyui-rocm | out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 682, in __call__
comfyui-rocm | return self.predict_noise(*args, **kwargs)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 685, in predict_noise
comfyui-rocm | return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 279, in sampling_function
comfyui-rocm | out = calc_cond_batch(model, conds, x, timestep, model_options)
comfyui-rocm | File "/root/ComfyUI/comfy/samplers.py", line 228, in calc_cond_batch
comfyui-rocm | output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
comfyui-rocm | File "/root/ComfyUI/custom_nodes/ComfyUI-Advanced-ControlNet/adv_control/utils.py", line 68, in apply_model_uncond_cleanup_wrapper
comfyui-rocm | return orig_apply_model(self, *args, **kwargs)
comfyui-rocm | File "/root/ComfyUI/comfy/model_base.py", line 145, in apply_model
comfyui-rocm | model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
comfyui-rocm | File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
comfyui-rocm | return self._call_impl(*args, **kwargs)
comfyui-rocm | File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
comfyui-rocm | return forward_call(*args, **kwargs)
comfyui-rocm | File "/root/ComfyUI/comfy/ldm/flux/model.py", line 150, in forward
comfyui-rocm | out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control)
comfyui-rocm | File "/root/ComfyUI/comfy/ldm/flux/model.py", line 129, in forward_orig
comfyui-rocm | img = block(img, vec=vec, pe=pe)
comfyui-rocm | File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
comfyui-rocm | return self._call_impl(*args, **kwargs)
comfyui-rocm | File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
comfyui-rocm | return forward_call(*args, **kwargs)
comfyui-rocm | File "/root/ComfyUI/comfy/ldm/flux/layers.py", line 233, in forward
comfyui-rocm | output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
comfyui-rocm | File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
comfyui-rocm | return self._call_impl(*args, **kwargs)
comfyui-rocm | File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
comfyui-rocm | return forward_call(*args, **kwargs)
comfyui-rocm | File "/root/ComfyUI/comfy/ops.py", line 63, in forward
comfyui-rocm | return self.forward_comfy_cast_weights(*args, **kwargs)
comfyui-rocm | File "/root/ComfyUI/comfy/ops.py", line 58, in forward_comfy_cast_weights
comfyui-rocm | weight, bias = cast_bias_weight(self, input)
comfyui-rocm | File "/root/ComfyUI/comfy/ops.py", line 42, in cast_bias_weight
comfyui-rocm | weight = cast_to(s.weight, dtype, device, non_blocking=non_blocking)
comfyui-rocm | File "/root/ComfyUI/comfy/ops.py", line 24, in cast_to
comfyui-rocm | return weight.to(device=device, dtype=dtype, non_blocking=non_blocking)
comfyui-rocm | torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacty of 512.00 MiB of which 16.00 MiB is free. Of the allocated memory 204.22 MiB is allocated by PyTorch, and 177.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
comfyui-rocm |
comfyui-rocm | Got an OOM, unloading all loaded models.
comfyui-rocm | Prompt executed in 98.52 seconds

@grigio grigio added the Potential Bug User is reporting a bug. This should be tested. label Aug 13, 2024

jslegers commented Aug 14, 2024

I noticed that UNETLoader.load_unet takes a lot more memory since the most recent changes when loading a FLUX transformer unet with weight_dtype fp8_e4m3fn.

Before the changes I could stay under 12 GB total VRAM usage when loading an fp8_e4m3fn version of flux1-schnell after first loading the t5xxl text encoder (given a minor tweak to unet_offload_device - see #4319).

After the changes, I run into the 16 GB memory limit when the FLUX transformer unet is loaded.

See also #4343, #4318 & #4338


grigio commented Aug 22, 2024

I tried running the smallest Flux.1 Schnell GGUF and I still hit this issue:

Error occurred when executing KSampler:

HIP out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacty of 512.00 MiB of which 17179869183.96 GiB is free. Of the allocated memory 292.04 MiB is allocated by PyTorch, and 137.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
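The absurd "17179869183.96 GiB is free" figure looks like an unsigned 64-bit underflow: a slightly negative free-byte count wrapped around 2^64. A sketch reproducing the number (the ~40 MiB deficit is an assumption chosen to match the printed value):

```python
# 2**64 bytes expressed in GiB is the wrap-around point of a uint64 counter.
GIB = 2**30
wrap_gib = 2**64 / GIB           # 17179869184.0 GiB

# A deficit of ~40 MiB below zero, wrapped to unsigned 64-bit:
deficit = 40 * 2**20
reported = (2**64 - deficit) % 2**64 / GIB
print(f"{reported:.2f} GiB")     # matches the "17179869183.96 GiB" in the log
```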


grigio commented Aug 23, 2024

This is the rocminfo output:

/opt/rocm/bin/rocminfo 
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          NO

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 7700 8-Core Processor  
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 7700 8-Core Processor  
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3800                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    15865028(0xf214c4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    15865028(0xf214c4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    15865028(0xf214c4) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      256(0x100) KB                      
  Chip ID:                 5710(0x164e)                       
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   3840                               
  Internal Node ID:        1                                  
  Compute Unit:            2                                  
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 18                                 
  SDMA engine uCode::      1                                  
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***      

@arch-user-france1

The log says "GPU 0 has a total capacty of 512.00 MiB". You won't be able to run FLUX.1 with less than one gigabyte of VRAM, @grigio.


grigio commented Aug 23, 2024

@arch-user-france1 But I could run Stable Diffusion on ROCm on this hardware by loading the model into RAM; I think a quantized Flux.1 Schnell should work too.

@arch-user-france1

Flux needs about 32 GB in bfloat16 and I do not expect this to reduce to 1 GB. Are you really running a quantized model, and if so, what dtype is it in?
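A rough back-of-envelope for that "~32 GB in bfloat16" figure (the parameter counts of ~12B for the Flux transformer and ~4.7B for the T5-XXL text encoder are approximations, not from this thread):

```python
# Weights-only memory estimate at 2 bytes per parameter (bfloat16).
BYTES_PER_PARAM = 2
flux_gb = 12e9 * BYTES_PER_PARAM / 1e9   # ~24 GB for the Flux transformer
t5_gb = 4.7e9 * BYTES_PER_PARAM / 1e9    # ~9.4 GB for T5-XXL

print(f"~{flux_gb + t5_gb:.1f} GB before activations, CLIP, and the VAE")
```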


grigio commented Aug 23, 2024

With Stable Diffusion I could load the model into RAM (I have 48 GB) and it worked. flux1-schnell-Q2_K.gguf is 4.01 GB, and I don't think the missing 90 MiB is related to that.

@arch-user-france1

It looks like the GGUF model is already larger than 1 GB. Are you able to enable CPU offloading somewhere? Note that DRAM has no bearing on the model size your device can handle; you need more VRAM.

@grigio

grigio commented Aug 23, 2024

I think the issue is a regression in the latest versions of pytorch / rocm

huggingface/autotrain-advanced#737
ROCm/ROCm#3580

@hartmark

> I think the issue is a regression in the latest versions of pytorch / rocm
>
> huggingface/autotrain-advanced#737 ROCm/ROCm#3580

AMD have added my issue to their internal tracker so hopefully they can reproduce and get a fix out.


arch-user-france1 commented Aug 24, 2024

You stated: "I have AMD Radeon RX 7800 XT 16 GB, I couldn't select it in the list." What list do you mean? Your driver installation might be corrupt, as I've seen similar issues on my end on ROCm 5 with DKMS.

Ah okay, there's no option for that in the AMD issue tracker. Still, you might consider reinstalling your driver if you haven't done that yet, specifically without DKMS, which has been working well for me.

@hartmark

> You stated: "I have AMD Radeon RX 7800 XT 16 GB, I couldn't select it in the list." What list do you mean? Your driver installation might be corrupt, as I've seen similar issues on my end on ROCm 5 with DKMS.
>
> Ah okay, there's no option for that in the AMD issue tracker. Still, you might consider reinstalling your driver if you haven't done that yet, specifically without DKMS, which has been working well for me.

Try creating an issue on their GitHub page; that's the list I meant.

@unclemusclez

lol every day it is something new. i kind of enjoy it at this point.

@AlexX1999

Try setting HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES to 1 in your Docker setup. Right now the selected device only has 512 MB of memory, which means it is likely the iGPU.

@unclemusclez

> Try setting HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES to 1 in your Docker setup. Right now the selected device only has 512 MB of memory, which means it is likely the iGPU.

yes you would think i would know this by now


grigio commented Sep 2, 2024

> Try setting HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES to 1 in your Docker setup. Right now the selected device only has 512 MB of memory, which means it is likely the iGPU.

I want to use the iGPU (it worked with the Stable Diffusion web UI). With those env variables I get:

[+] Running 1/0
 ✔ Container comfyui-rocm  Recreated                                                                                                0.1s 
Attaching to comfyui-rocm
comfyui-rocm  | [INFO] Running set-proxy script...
comfyui-rocm  | [INFO] Continue without proxy.
comfyui-rocm  | [INFO] Running pre-start script...
comfyui-rocm  | [INFO] Continue without pre-start script.
comfyui-rocm  | ########################################
comfyui-rocm  | [INFO] Starting ComfyUI...
comfyui-rocm  | ########################################
comfyui-rocm  | [START] Security scan
comfyui-rocm  | [DONE] Security scan
comfyui-rocm  | ## ComfyUI-Manager: installing dependencies done.
comfyui-rocm  | ** ComfyUI startup time: 2024-09-02 08:16:34.651789
comfyui-rocm  | ** Platform: Linux
comfyui-rocm  | ** Python version: 3.10.14 (main, Mar 21 2024, 16:45:28) [GCC]
comfyui-rocm  | ** Python executable: /usr/bin/python3
comfyui-rocm  | ** ComfyUI Path: /root/ComfyUI
comfyui-rocm  | ** Log path: /root/comfyui.log
comfyui-rocm  | 
comfyui-rocm  | Prestartup times for custom nodes:
comfyui-rocm  |    0.5 seconds: /root/ComfyUI/custom_nodes/ComfyUI-Manager
comfyui-rocm  | 
comfyui-rocm  | Traceback (most recent call last):
comfyui-rocm  |   File "/root/./ComfyUI/main.py", line 90, in <module>
comfyui-rocm  |     import execution
comfyui-rocm  |   File "/root/ComfyUI/execution.py", line 13, in <module>
comfyui-rocm  |     import nodes
comfyui-rocm  |   File "/root/ComfyUI/nodes.py", line 21, in <module>
comfyui-rocm  |     import comfy.diffusers_load
comfyui-rocm  |   File "/root/ComfyUI/comfy/diffusers_load.py", line 3, in <module>
comfyui-rocm  |     import comfy.sd
comfyui-rocm  |   File "/root/ComfyUI/comfy/sd.py", line 5, in <module>
comfyui-rocm  |     from comfy import model_management
comfyui-rocm  |   File "/root/ComfyUI/comfy/model_management.py", line 143, in <module>
comfyui-rocm  |     total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
comfyui-rocm  |   File "/root/ComfyUI/comfy/model_management.py", line 112, in get_torch_device
comfyui-rocm  |     return torch.device(torch.cuda.current_device())
comfyui-rocm  |   File "/usr/local/lib64/python3.10/site-packages/torch/cuda/__init__.py", line 769, in current_device
comfyui-rocm  |     _lazy_init()
comfyui-rocm  |   File "/usr/local/lib64/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
comfyui-rocm  |     torch._C._cuda_init()
comfyui-rocm  | RuntimeError: No HIP GPUs are available
comfyui-rocm exited with code 1
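The "No HIP GPUs are available" failure is consistent with the rocminfo output earlier in the thread, which lists a single GPU agent (the gfx1030 iGPU): the visibility masks select by index, so index 1 selects nothing. A sketch of that selection logic (not the actual HIP implementation):

```python
# Why HIP_VISIBLE_DEVICES=1 hides everything here: the mask indexes into
# the list of GPU agents, and rocminfo shows only one (index 0).
agents = ["gfx1030"]   # GPU agents from rocminfo: one iGPU
mask = "1"             # HIP_VISIBLE_DEVICES=1

visible = [agents[int(i)] for i in mask.split(",") if int(i) < len(agents)]
print(visible)  # [] -> no visible devices, hence the RuntimeError
```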


brcisna commented Sep 25, 2024

@grigio

Using ROCm with an AMD GPU, you really have to leverage xformers to run any FLUX.1 models. It is a bear to get set up with ROCm 6.1, but ROCm 6.2 is much more manageable. I'm learning all this as well, as I just got an old AMD Radeon Pro W6600 to mess with.

@huchenlei huchenlei added the AMD Issue related to AMD driver support. label Dec 16, 2024