WIP: end-to-end ONNX export and inference for stable diffusion #730
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
The garbage output with CPUExecutionProvider is due to a bug in FusedConv in ONNX Runtime, tracked in microsoft/onnxruntime#14500.
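Not part of this PR, but as a minimal diagnostic sketch: since FusedConv is produced by ONNX Runtime's graph optimizations, disabling them for a CPU session is one way to check whether the fusion is what corrupts the output (the model path below is a placeholder, and whether this actually sidesteps the bug is untested here):

```python
import onnxruntime as ort

# Disable graph optimizations so the Conv + activation fusion (FusedConv)
# is not applied. Heavy-handed, but useful to confirm whether the fusion
# is responsible for the garbage CPUExecutionProvider output.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

session = ort.InferenceSession(
    "stable_diffusion_e2e.onnx",  # placeholder path
    sess_options=sess_options,
    providers=["CPUExecutionProvider"],
)
```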
Tracked here: microsoft/onnxruntime#14526. ONNX Runtime seems to be very greedy compared to PyTorch when it comes to GPU memory. From what I tried, CUDAExecutionProvider is basically unusable for stable diffusion currently.
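For reference only: ONNX Runtime exposes per-provider options that constrain the CUDA memory arena, which may or may not help with this issue. A hedged sketch (the model path and the 8 GiB limit are placeholder values, not tested against this PR):

```python
import onnxruntime as ort

# CUDAExecutionProvider options controlling the memory arena:
# "gpu_mem_limit" caps the arena size in bytes, and
# "arena_extend_strategy" avoids the default power-of-two growth.
cuda_provider = (
    "CUDAExecutionProvider",
    {
        "device_id": 0,
        "gpu_mem_limit": 8 * 1024 * 1024 * 1024,  # 8 GiB, placeholder value
        "arena_extend_strategy": "kSameAsRequested",
    },
)

session = ort.InferenceSession(
    "stable_diffusion_e2e.onnx",  # placeholder path
    providers=[cuda_provider, "CPUExecutionProvider"],
)
```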
Force-pushed from 5f49c60 to 7f8685a.
When I run the code on this branch, I can generate the .pt and .onnx files, but running the ONNX model multiple times with the same input gives different outputs. Why is this? Are there any random parameters that need to be fixed? `python run_ort.py --gpu`
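Not an authoritative answer, but a common source of this nondeterminism in stable diffusion is the randomly sampled initial latents. A hedged sketch of pinning them (the model path, input names, and latent shape are assumptions for illustration, not necessarily what `run_ort.py` uses):

```python
import numpy as np
import onnxruntime as ort

# Seed the RNG so the initial latent noise is identical across runs.
rng = np.random.default_rng(seed=0)
latents = rng.standard_normal((1, 4, 64, 64), dtype=np.float32)  # assumed latent shape

session = ort.InferenceSession(
    "stable_diffusion_e2e.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Feed the same latents (and the same other inputs) on every run. If the
# exported graph samples noise internally via a RandomNormal node, outputs
# can still differ unless that node's seed attribute is fixed at export time.
outputs = session.run(None, {
    "latents": latents,  # assumed input name; other required inputs omitted here
})
```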
A work-in-progress extension of the stable diffusion deployment through the ONNX + ONNX Runtime path, allowing the end-to-end pipeline to run through a single `InferenceSession`, with the exception of tokenization and timestep generation, which are done ahead of time (a usage sketch of this single-session setup is included at the end of this description). From my preliminary tests, the ONNX export and inference through `CUDAExecutionProvider` work nicely.

Major remaining issues:

* `CPUExecutionProvider` yields garbage.
* GPU memory usage with `CUDAExecutionProvider` is huge, up to 20 GB, although the model is < GB and PyTorch inference takes ~5 GB in the same setting.

From the two major issues above, it's clear that this POC is not usable as is. On top of fixing them, there would remain quite a bit of work:

* `TensorrtExecutionProvider` ==> I expect to have issues there as well, as in my experience the Loop / If support from TensorRT can be buggy: "Model run with `TensorrtExecutionProvider` outputs different results compared to `CPUExecutionProvider` / `CUDAExecutionProvider` when the ONNX `Loop` operator is used" (onnx/onnx-tensorrt#891). I'm not sure how much NVIDIA folks are interested in TensorrtExecutionProvider, to be honest.
* Support `width` and `height` as inputs, or alternatively, have them as constants that can be modified.
* Support `guidance_scale` as a model input.
* Support `num_inference_steps != 50`.

Longer term goals once all of this is tested:

* Integration in `optimum.exporters`.
* An `ORTStableDiffusionEndToEndPipeline`, or something like this, that is PyTorch-free.
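As an illustration only, here is a minimal sketch of what calling such a single-session pipeline could look like, with tokenization and timestep generation done ahead of time. The model path, input/output names, latent shape, and timestep schedule are all assumptions for the sake of the example, not the actual interface of this PR:

```python
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizer

# Tokenization happens ahead of time, outside the ONNX graph.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt_ids = tokenizer(
    "an astronaut riding a horse",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="np",
).input_ids.astype(np.int64)

# Timesteps are also generated ahead of time (50 evenly spaced steps over
# 1000 training steps, purely illustrative).
timesteps = np.arange(0, 1000, 1000 // 50, dtype=np.int64)[::-1].copy()

# Initial latent noise, seeded for reproducibility.
latents = np.random.default_rng(0).standard_normal((1, 4, 64, 64), dtype=np.float32)

session = ort.InferenceSession(
    "stable_diffusion_e2e.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Input and output names are hypothetical. The denoising loop runs inside
# the graph (ONNX Loop), so a single run() call produces the final image.
(image,) = session.run(None, {
    "input_ids": prompt_ids,
    "timesteps": timesteps,
    "latents": latents,
})
```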