
Performance Benchmark: Windows vs. Linux for Stable Diffusion WebUI

Overview

According to the wiki, Stable Diffusion WebUI seems to run faster on Linux than on Windows or WSL2.

So I tested whether a speed difference really exists even though the programs themselves are exactly the same.

Test environment

  • CPU: Intel Core i7-13700K
  • CPU cooler: Scythe Big Shuriken 3 RGB
  • Motherboard: ASRock Z690M-ITX/ax
  • SSD (M2_2, chipset side): ARKINE NVMe Gen3 SSD 256GB
  • SSD (M2_1, CPU side): none
  • USB enclosure: aluminum NVMe-USB3.2 Gen2x2 case BLM20C with SUNEAST SE900NVG3-256G (not used this time)
  • Power supply: Corsair SFX 750W SF750
  • Memory: G.Skill Trident Z DDR4-3600 (OC) 16GB × 2 = 32GB
  • Case: QDIY 0040-*PCJMK6-ITX (testbed)
  • GPUs: ZOTAC GeForce RTX 3060 Twin Edge and ZOTAC GeForce RTX 4070 Ti 12GB
  • Operating systems: Windows 11 (fully updated) and Lubuntu 22.04 LTS "Jammy Jellyfish" (fully updated)
  • Drivers: official NVIDIA Game Ready driver on Windows 11, official NVIDIA 545 driver on Linux

This time we did not use a USB SSD; we tested with the 256GB NVMe Gen3 SSD connected as the main drive.

Now let’s look at the results.

Stable Diffusion WebUI generation benchmark

For Windows we used the portable version distributed on this site, and for Linux we cloned the repository from GitHub.

Python is 3.10.6 on Windows and 3.10.12 on Linux.

Torch is 2.0.1+cu118.

The launch options used were --autolaunch and --opt-sdp-attention.
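
For reference, a quick sanity check like the following (a minimal sketch; run it inside each WebUI virtual environment) can confirm that both operating systems really are on the same stack:

    import sys

    import torch

    # Confirm that both machines run the same interpreter and PyTorch build.
    print("Python:", sys.version.split()[0])    # e.g. 3.10.6
    print("torch:", torch.__version__)          # e.g. 2.0.1+cu118
    print("CUDA runtime:", torch.version.cuda)  # e.g. 11.8

    # Confirm the GPU is visible before running any benchmark.
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))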

Verification method

  • Hello Asuka benchmark: batch size 1, batch count 10
  • Hello Asuka benchmark 768: the same settings at 768×768

We measured each of these three times with cuDNN at its default version (the one built into Torch) and again with cuDNN replaced by the latest version, and calculated the averages.
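
For reference, here is roughly how you can confirm which cuDNN build PyTorch actually loaded before and after the swap; this assumes a standard pip install of torch, which bundles its cuDNN binaries in torch's own lib directory:

    import os

    import torch

    # Version of the cuDNN library PyTorch actually loaded, e.g. 8700 for 8.7.0.
    print("cuDNN:", torch.backends.cudnn.version())

    # The pip wheel bundles its cuDNN binaries next to torch itself; this is
    # the directory whose libraries get replaced when swapping in a newer cuDNN.
    lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
    for name in sorted(os.listdir(lib_dir)):
        if "cudnn" in name.lower():
            print(name)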

Fewer seconds means better performance, and more processing steps per second means better performance.

Hello Asuka benchmark: 512×512 / 28 steps / 10 images

Hello Asuka benchmark 768: 768×768 / 28 steps / 10 images
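
These runs can be scripted against a WebUI instance started with the extra --api flag; the sketch below is a rough harness along those lines (the prompt and seed are placeholders, not the exact Hello Asuka settings) that times the 10-image run three times and reports the average and iterations per second:

    import time

    import requests

    URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"  # default local WebUI address

    payload = {
        "prompt": "masterpiece, best quality, asuka",  # placeholder prompt
        "seed": 1,                                     # placeholder seed
        "steps": 28,
        "width": 512,   # 768 for the 768 variant
        "height": 512,  # 768 for the 768 variant
        "batch_size": 1,
        "n_iter": 10,   # batch count 10
    }

    times = []
    for _ in range(3):  # three runs, averaged as in the article
        start = time.perf_counter()
        requests.post(URL, json=payload, timeout=600).raise_for_status()
        times.append(time.perf_counter() - start)

    avg = sum(times) / len(times)
    print(f"avg: {avg:.1f} s, {28 * 10 / avg:.2f} it/s")  # 28 steps x 10 images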

In terms of performance, Windows basically has nothing over Linux, and the difference is especially noticeable with the newer GPU, the RTX 4070 Ti.

The RTX 4070 Ti on Linux is stable and fast regardless of the cuDNN version.

On the other hand, the RTX 4070 Ti on Windows is clearly slower than on Linux.

The performance difference seems small at 512×512, but the 768×768 results show that the gap widens as the workload grows.

Meanwhile, with the previous-generation RTX 3060, we confirmed that raising the cuDNN version narrowed the gap with Linux.

From this it appears that Windows drivers and libraries lag behind Linux in optimization.

Even so, Windows does not outperform Linux in any area, so there must be a clear bottleneck somewhere.

kohya_ss GUI benchmark (training)

We generated a LoRA using the frog images distributed on this site for operation and performance checks.

We compared the number of processing steps per second with the AdamW8bit, Lion8bit, AdamW, and Lion optimizers.

Since the number of steps processed is huge, the numbers barely change between runs, so this was measured only once.

The higher the number, the better the performance.
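
As a rough illustration of what is being compared (not the actual kohya_ss pipeline), a toy micro-benchmark like this times bare optimizer steps on a dummy layer; it assumes a CUDA GPU and a recent bitsandbytes build that includes the Lion variants:

    import time

    import bitsandbytes as bnb
    import torch

    def steps_per_second(opt_cls, n=200):
        # A dummy trainable layer stands in for the actual LoRA weights.
        layer = torch.nn.Linear(1024, 1024).cuda()
        opt = opt_cls(layer.parameters(), lr=1e-4)
        x = torch.randn(64, 1024, device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n):
            opt.zero_grad()
            layer(x).sum().backward()
            opt.step()
        torch.cuda.synchronize()  # wait for queued GPU work before timing
        return n / (time.perf_counter() - start)

    for name, cls in [("AdamW", torch.optim.AdamW),
                      ("AdamW8bit", bnb.optim.AdamW8bit),
                      ("Lion", bnb.optim.Lion),
                      ("Lion8bit", bnb.optim.Lion8bit)]:
        print(f"{name}: {steps_per_second(cls):.1f} steps/s")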

Now let’s look at the training results.

For inference the difference wasn’t that big (although it was quite noticeable on Ada Lovelace), but for training there was a gap that couldn’t be closed no matter how hard I tried.

For the RTX 4070 Ti the difference is almost double, and even for the previous-generation RTX 3060, which seems better optimized, there is a difference of about 30%.

Python, PyTorch, and the like were originally developed on Linux and ported to Windows, so taken together with the inference results, there seems to be some kind of bottleneck on Windows.

In short

What follows is a harsh conclusion for Windows users.

Windows has no advantage as an operating system for AI image generation.

Especially for those using the latest-generation GPUs, free Linux delivers performance a full tier above Windows.

Furthermore, for training, I think it is safe to say the performance is two or three tiers higher.

Although not related to Stable Diffusion WebUI, TensorFlow (which uses CUDA for its GPU computation) has already stopped shipping GPU-enabled Windows binaries, and given such trends it is becoming difficult to do AI generation on Windows at all.

On top of that, Linux supports ROCm, so Radeon cards are an option as well.

This widens your options and lets you put together an environment at a relatively low cost.
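
For what it’s worth, the same PyTorch code runs on both backends: a ROCm build reports itself through torch.version.hip while still exposing the Radeon card under the familiar torch.cuda namespace, roughly like this:

    import torch

    # On an NVIDIA (CUDA) build, torch.version.cuda is set and torch.version.hip
    # is None; on a Radeon (ROCm) build it is the other way around.
    if torch.version.hip is not None:
        print("ROCm build:", torch.version.hip)
    elif torch.version.cuda is not None:
        print("CUDA build:", torch.version.cuda)

    # Either way, the device is addressed through the torch.cuda API.
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))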

By the way, the two points above are the real reason I put this article together.

GPUs used for this verification

ZOTAC GeForce RTX 3060 Twin Edge: ¥43,979 (Amazon, as of 2024-02-10 19:04)

ZOTAC GeForce RTX 4070 Ti 12GB: ¥136,498 (Amazon, as of 2024-02-10 19:04)

Currently, the RTX 4070 Super is also a good deal:

Kuroutoshikou (玄人志向) GeForce RTX 4070 Super: ¥101,714 (Amazon, as of 2024-02-10 19:05)
