We have received several requests (#112, #110, #97) to run SPHINX inference on GPUs with smaller memory. We also believe that fitting it within a 24GB memory budget benefits a broad range of users who would like to run the model locally on commodity GPUs such as the 3090 or 4090.
With the latest update (#113), NF4 quantization now runs on SPHINX without errors (i.e., resolving #97). Memory usage is slightly below 23GB, so the model should fit on a single 24GB GPU (3090, 4090, or A5000) even with ECC enabled.
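For anyone curious how NF4 quantization works in general, below is a minimal sketch of the common pattern: swapping each `nn.Linear` for a 4-bit NF4 layer from `bitsandbytes`. This is illustrative only and not the actual code from #113; the `quantize_linear_to_nf4` helper and the idea of applying it to a loaded SPHINX model are assumptions for the example.

```python
# Minimal sketch: convert a model's Linear layers to 4-bit NF4 (NormalFloat4)
# using bitsandbytes. Not the exact implementation from #113.
import torch
import torch.nn as nn
import bitsandbytes as bnb


def quantize_linear_to_nf4(model: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with a bitsandbytes NF4 layer."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            qlinear = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=torch.bfloat16,  # matmuls still run in bf16
                quant_type="nf4",              # NormalFloat4 data type
            )
            # Wrap the fp weights; actual 4-bit quantization happens
            # when the parameter is moved to the GPU (.cuda()/.to("cuda")).
            qlinear.weight = bnb.nn.Params4bit(
                child.weight.data, requires_grad=False, quant_type="nf4"
            )
            if child.bias is not None:
                qlinear.bias = child.bias
            setattr(model, name, qlinear)
        else:
            quantize_linear_to_nf4(child)  # recurse into submodules
    return model


# Usage (model loading shown abstractly; SPHINX has its own loading path):
# model = load_sphinx_fp16(...)           # hypothetical loader
# model = quantize_linear_to_nf4(model).cuda()
```

NF4 stores weights in 4 bits, which is what brings the footprint of a model of this size down under the 24GB mark, while compute is still performed in a higher-precision dtype (bf16 here).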
We are still running a complete benchmark of the quantized model and will post the latest results under this issue. In the meantime, questions are welcome :)