Incorporating 3D Gaussian Splats into the graphics pipeline

3D Gaussian splatting is the emerging rendering technique that is overtaking NeRFs. Since it is centered around point primitives, it is more compatible with traditional graphics pipelines that already support point rendering.

Gaussian splats essentially enhance the concept of point rendering by converting the point primitive into a 3D ellipsoid, which is then projected into 2D during the rendering process. This concept was initially described in 2002 [3], but the technique of extending Structure from Motion scans in this way was only detailed more recently [1].

In this post, I explore how to integrate Gaussian splats into the traditional graphics pipeline. This allows them to be used alongside triangle-based primitives and interact with them through the depth buffer for occlusion (see header image). This approach also simplifies deployment by eliminating the need for CUDA.

Storage

The original implementation uses .ply files as its checkpoint format, which preserves training-relevant data structures at the expense of storage efficiency and leads to large files.

For example, it stores the covariance as a scaling vector and a rotation quaternion, which requires reconstructing the matrix during rendering. A more efficient approach is to exploit the symmetry of the covariance matrix and store only its diagonal and upper triangular entries (six values), which eliminates the reconstruction and reduces storage requirements.
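As a minimal sketch of the reconstruction that this avoids at render time: assuming the quaternion is stored as (w, x, y, z) and the covariance is built as Σ = R S Sᵀ Rᵀ as described in [1], the six unique entries can be computed once at conversion time. The function name and the ordering of the returned entries are my own choices for illustration.

```python
import math


def covariance_upper(scale, quat):
    """Return the 6 unique entries (xx, xy, xz, yy, yz, zz) of R S S^T R^T.

    scale: (sx, sy, sz), quat: (w, x, y, z) -- the quaternion order is an assumption.
    """
    w, x, y, z = quat
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    # Rotation matrix of the normalized quaternion
    rot = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
    # M = R * diag(scale), i.e. every column of R is scaled
    m = [[rot[i][j] * scale[j] for j in range(3)] for i in range(3)]

    def row_dot(a, b):
        # Entry (a, b) of M * M^T
        return sum(m[a][k] * m[b][k] for k in range(3))

    return (row_dot(0, 0), row_dot(0, 1), row_dot(0, 2),
            row_dot(1, 1), row_dot(1, 2), row_dot(2, 2))
```

Because the matrix is symmetric, the three lower triangular entries are just mirrored copies and need not be stored.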

Further analysis of the storage usage for each attribute shows that the spherical harmonics of orders 1-3 are the main contributors to the file size. However, according to the ablation study in the original publication [1], these harmonics only lead to a modest PSNR improvement of 0.5 dB.

Therefore, the most straightforward way to decrease storage is by discarding the higher-order spherical harmonics. Additionally, the level 0 spherical harmonics can be converted into a diffuse color and merged with opacity to form a single RGBA value. These simple yet effective methods were implemented in one of the early WebGL implementations, resulting in the .splat format. As an added benefit, this format can be easily interpreted by viewers unaware of Gaussian splats as a simple colored point cloud:
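The conversion can be sketched as follows. This assumes that the checkpoint stores the degree-0 coefficients and the opacity before its sigmoid activation, as the reference implementation of [1] does; the function names are hypothetical.

```python
import math

# Value of the constant degree-0 spherical harmonics basis function
SH_C0 = 0.28209479177387814


def sigmoid(x):
    # Numerically stable logistic function
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def quantize(value):
    # Clamp to [0, 1] and round to an 8-bit channel
    return int(min(max(value, 0.0), 1.0) * 255.0 + 0.5)


def to_rgba8(f_dc, raw_opacity):
    """Merge degree-0 SH coefficients and raw opacity into one RGBA8 tuple."""
    rgb = [quantize(0.5 + SH_C0 * c) for c in f_dc]
    return (*rgb, quantize(sigmoid(raw_opacity)))
```

All view-dependent effects are lost in this step, which is the trade-off for dropping the higher-order coefficients.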

Results using a non Gaussian-splat aware renderer

By directly storing the covariance as previously mentioned, we can reduce its precision from float32 to float16, thereby halving the storage needed for that data. Furthermore, since most splats have limited spatial extents, we can also use float16 for the position data, yielding additional storage savings.

With these changes, we achieve a storage requirement of 22 bytes per splat, in contrast to the 44 bytes needed by the .splat format and 236 bytes in the original implementation. Thus, we have attained a 10x reduction in storage compared to the original implementation simply by using more suitable data types.
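To make the 22 bytes concrete, here is a minimal sketch of such a record using Python's struct module. The field order (position, covariance, color) is an assumption for illustration and not necessarily the layout used in practice.

```python
import struct

# Hypothetical layout: 3 x float16 position, 6 x float16 covariance, 4 x uint8 RGBA
SPLAT_STRUCT = struct.Struct("<3e6e4B")


def pack_splat(position, cov_upper, rgba):
    """Pack one splat into a compact little-endian record."""
    return SPLAT_STRUCT.pack(*position, *cov_upper, *rgba)


def unpack_splat(data):
    """Inverse of pack_splat; returns (position, cov_upper, rgba) tuples."""
    values = SPLAT_STRUCT.unpack(data)
    return values[0:3], values[3:9], values[9:13]


record = pack_splat((1.0, -2.5, 0.25), (1.0, 0.0, 0.0, 4.0, 0.0, 9.0), (128, 128, 128, 255))
print(len(record), "bytes per splat")
```

Note that float16 only covers values up to 65504 with about three significant decimal digits, so this only works for scenes whose coordinates stay within a limited range.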

Blending

The image formation model presented in the original paper [1] follows that of NeRF rendering, which the paper compares against. It involves casting a ray and accumulating the splats it intersects, which leads to front-to-back blending. This is precisely the approach taken by the provided CUDA implementation.

Blending remains a component of the fixed-function unit within the graphics pipeline, which can be set up for front-to-back blending [2] by using the factors (one_minus_dest_alpha, one) and by multiplying color and alpha in the shader as color.rgb * color.a. This results in the following equation:

\begin{aligned}C_{dst} &= (1 - \alpha_{dst}) \cdot \alpha_{src} C_{src} &+ C_{dst}\\ \alpha_{dst} &= (1 - \alpha_{dst})\cdot\alpha_{src} &+ \alpha_{dst}\end{aligned}

However, this method requires the framebuffer alpha value to be zero before rendering the splats, which is not typically the case as any previous render pass could have written an arbitrary alpha value.

A simple solution is to switch to back-to-front sorting and use the standard alpha blending factors (src_alpha, one_minus_src_alpha) for the following blending equation:

C_{dst} = \alpha_{src} \cdot C_{src} + (1 - \alpha_{src}) \cdot C_{dst}

This allows us to regard Gaussian splats as a special type of particles that can be rendered together with other transparent elements within a scene.
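The two blending modes can be compared with a small single-channel simulation. This is only a sketch of the two equations above on scalar colors, not of any actual GPU state; the function names are made up for illustration.

```python
def blend_front_to_back(splats, dst_color=0.0, dst_alpha=0.0):
    """Blend factors (one_minus_dest_alpha, one); splats are ordered nearest first."""
    for color, alpha in splats:
        src_color = color * alpha  # premultiplied in the shader
        # Both updates use the alpha that was in the framebuffer before this splat
        dst_color, dst_alpha = (
            (1.0 - dst_alpha) * src_color + dst_color,
            (1.0 - dst_alpha) * alpha + dst_alpha,
        )
    return dst_color, dst_alpha


def blend_back_to_front(splats, dst_color=0.0):
    """Blend factors (src_alpha, one_minus_src_alpha); splats are ordered farthest first."""
    for color, alpha in splats:
        dst_color = alpha * color + (1.0 - alpha) * dst_color
    return dst_color


# (color, alpha) pairs, ordered from nearest to farthest
splats = [(1.0, 0.5), (0.5, 0.5), (0.2, 1.0)]

print(blend_front_to_back(splats)[0])
print(blend_back_to_front(list(reversed(splats))))
# A framebuffer alpha of 1 left by an earlier pass hides all splats
print(blend_front_to_back(splats, dst_color=0.0, dst_alpha=1.0)[0])
```

With a cleared framebuffer both orders produce the same color, while a leftover alpha of one makes the front-to-back variant discard every splat, which is the failure case described above.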

References

  1. Kerbl, Bernhard, et al. “3d gaussian splatting for real-time radiance field rendering.” ACM Transactions on Graphics 42.4 (2023): 1-14.
  2. Green, Simon. “Volumetric particle shadows.” NVIDIA Developer Zone (2008).
  3. Zwicker, Matthias, et al. “EWA splatting.” IEEE Transactions on Visualization and Computer Graphics 8.3 (2002): 223-238.

stb_image_resize2.h – performance

Recently there was a large rework of the STB single-file image_resize library (STBIR), bumping it to 2.0. While v1 was really slow and merely usable if you needed to quickly get some code running, the 2.0 rewrite claims to be more considerate of performance by using SIMD. So let's put it to the test.

As references, I chose the moderately optimized C only implementation of Ogre3D and the highly optimized SIMD implementation in OpenCV.

Below you find time to scale a 1024x1024px byte image to 512x512px. All libraries were set to linear interpolation. The time is the accumulated time for 200 runs.

Library       | RGB    | RGBA
Ogre3D 14.1.2 | 660 ms | 668 ms
STBIR 2.01    | 632 ms | 690 ms
OpenCV 4.8    | 245 ms | 254 ms

For the RGBA test, STBIR was set to the STBIR_4CHANNEL pixel layout. All libraries were compiled with -O2 -msse. Additionally, OpenCV could dispatch AVX2 code. Enabling AVX2 with STBIR actually decreased performance.

Note that while STBIR has no performance advantage over a C only implementation for the simple resizing case, it offers some neat features if you want to handle sRGB data or non-premultiplied alpha.

Do not fall for the Synology Hardware SCAM

I recently needed some NAS and went with the “Synology RS1221+” barebone system. The system is competitively priced when compared to the similar “QNAP TS-873AeU-4G”.

Synology HDD

For storage, the sweet spot between price and capacity was at 18TB. Let's look at some options:

Toshiba MG09ACA 18TB | 270€
Seagate Exos X X18   | 280€
Synology HAT5310-18T | 700€

Depending on the benchmark, sometimes the Toshiba comes out on top and sometimes the Seagate. Both are similarly priced, so that's fine.
However, speaking of price, the Synology HDD stands out by asking a 150% premium.
You might now wonder whether you also get better performance or other features in return. Well.. guess which is the only 18TB HDD that is verified by Synology for the RS1221+?

The scammy part here, however, is that the HAT5300 series drives are just rebranded Toshiba drives with a different firmware. So the HAT5310 likely is just the MG09ACA and the main difference is the profit margin.
Note that the different firmware does not result in any noticeable difference in performance.

I went with the unverified Seagate drives and – as one might expect – there are zero issues with doing so.

Synology RAM

At this point you might say, well Synology just did not get to test more 18TB drives.
Well.. I found the 4GB RAM rather tight and wanted to upgrade to 32GB as RAM is currently quite cheap anyway.

The options here are

Kingston KSM26SED8/16HD | 50€
Synology D4ECSO-2666-16G | 350€

I think there appears to be a pattern here. Again, both options have the same specs, i.e. DDR4 2666, ECC SO-DIMM. Maybe Synology even rebranded the Kingston modules too, but I did not verify this.

While the DiskStation Manager did not complain about the Seagate HDD, there is a warning when going with the Kingston RAM. I guess this is because it matters even less.

To conclude, I first want to emphasize that both the Synology NAS hardware and their DiskStation Manager software work great with non-Synology hardware – just as one would expect of a standard x86 platform.

It is just a pity that they try to FUD you into buying their overpriced HDD and RAM.
Basically this is the same game as with printer vendors predicting ravages and annihilation when using 3rd party ink.

Logitech M720 Triathlon mouse – long-term review

In this post I want to take a look at the Logitech M720 mouse after having used it for 2.5 years.

Specs and durability

The specs are pretty common for a mouse you get today, so let's start with the special features:

  • There are side buttons, which I find pretty handy for navigating back and forward in the browser or a file manager
  • It can be paired with up to 3 devices at the same time, which makes it easy to use with your PC, Laptop and Tablet
  • It supports both Bluetooth LE and the Logitech Wireless Receiver
  • It is powered by a single, replaceable AA battery

Especially the last two points make this seem to be a future-proof product that you can use for a long time.

Logitech is currently replacing their Wireless Receiver dongles with Logitech Bolt, so in the near future the Wireless Receivers will go away. But thanks to the Bluetooth support you will still be able to use the mouse without having to occupy a USB port just for using it.

Then, using standard AA batteries means that you can just use some nice rechargeable ones. This means that you will never have to wait for the mouse to charge and that the mouse can outlive the battery. As you are probably aware from using your phone, built-in rechargeable batteries wear out over time until the device cannot be properly used any more.

So we finally got a mouse for the years to come? Well..

Built-in obsolescence

Unfortunately, Logitech made some design decisions that drastically shorten the life-span of the device, even though they must have known better.

Rubber coating

The most obvious one is likely the rubber coating of the mouse.

Note how the plastic buttons look still perfectly fine in comparison

I took the images for this post after cleaning the mouse. So the dirt you see there is not the skin from my greasy hands, but rather said rubber coating disintegrating.
This is caused by your sweat, which is slightly acidic and thus attacks the rubber.
There is a reason that gamepads do not have such a coating, even though having good grip is even more important there.
Also, the way the coating is used here, all it does is make the mouse look greasy after some time.

Bad switches

The less obvious issue is the switches used, i.e. the things that perform the clicks.
Did you ever notice that after some time your mouse does incorrect double clicks or releases the click on its own while dragging and dropping? Well, that means the switch is starting to wear out.

The mouse uses OMRON D2FC-F-7N micro-switches in a cheap variant that is only rated for 10 million clicks (10M). While this sounds like a lot, it works out to roughly 6850 clicks per day for 4 years, which is not all that much if you think about playing a shooter or using Photoshop.
The crazy part is that going for the 20M rated variant (2x the durability) only costs 50 ct more (pack of 5 on Amazon). This would make the mouse merely 1€ more expensive – probably way less even, as Logitech can negotiate bulk discounts on these things.
Given that the mouse is priced at 50€, I do not think we can pass this off as cost optimization.
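The click budget is simple arithmetic. As a quick sketch, assuming the rating is spread evenly over the years and ignoring leap days:

```python
def clicks_per_day(rated_clicks, years):
    """Daily click budget if the rating is used up evenly over the given years."""
    return rated_clicks / (years * 365)


print(round(clicks_per_day(10_000_000, 4)))  # the 10M rated switch
print(round(clicks_per_day(20_000_000, 4)))  # the 20M rated variant
```

Doubling the rating doubles either the daily budget or the number of years, whichever you prefer to spend it on.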

Note that even more expensive Logitech mice, like the MX Master, have the rubber coating issue and use the same cheap 10M rated switches.

Introducing ODRS Browser

GNOME Open Desktop Ratings is the service that enables user ratings in various Linux app stores like the Snap Store, GNOME Software and KDE Discover.

While it nowadays works for users by providing a mostly useful star rating, from an application developer perspective the story is very grim.

Basically one only gets the user's view, which provides an average rating and some reviews in the current locale.
This means you might see something like “2 Stars from 80 Reviews” – but the 3 reviews in your current locale are all 4-5 Star.
To see something else you have to change the locale and restart the app store – which is inconvenient and confusing.
As a developer, seeing the negative reviews is crucial, as people often just post bug reports there and this is the only way to find out why the app did not work for them.

Therefore I quickly hacked together a web-based browser for the ODRS service, skillfully named ODRS Browser.

This allows accessing the ODRS service from the web and shows the reviews from multiple locales at once. The idea here is that people often write reviews in English – regardless of their current locale. Currently, ODRS has no logic to detect that.

Also, if your app is packaged in different formats like snap, flatpak and deb, you can see the reviews of all variants in the overview.

Unfortunately, ODRS currently does not set the CORS header, which prevents browsers from accessing it directly. The data that you see right now was scraped with a Python script. But once this issue is fixed, the ODRS Browser will be able to use live data.

Debugging Python with GDB on Ubuntu

Let's say you want to debug a Python process that is either already running or crashing in native code. Python's PDB is of no help here and you will have to use the low-level GDB debugger. Fortunately, it comes with support for debugging high-level Python scripts.

However, while the actual python-gdb commands are nicely described here, that page lacks important details on how to get python-gdb in the first place. We are merely told that a python-gdb.py is needed.

On Ubuntu/ Debian, this file is included in the python3-dbg package:

sudo apt install python3.10-dbg

Installing that is sufficient, if you use the matching python3 package. You can go ahead and connect to some running python process via:

gdb -p <PID>
# verify that the script is loaded
(gdb) info auto-load
# get a python backtrace
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/lib/python3.10/socketserver.py", line 232, in serve_forever
...

In case Ubuntu is merely a host and you use conda, you can still use the host python-gdb.py – even if the Python versions don't match. You will have to load the script manually though, like:

(gdb) source /usr/share/gdb/auto-load/usr/bin/python3.10-gdb.py

Fix Steam Deck Input in Desktop Mode

While older SteamOS releases used to map the right trigger to the left mouse button by default, in current SteamOS you can only click by using the touchpad. However, due to the way you hold the device, this is really fiddly – especially if you try to drag and drop something.

Fortunately, there is a way to fix this via a setting in Steam. For this you need to launch Steam while in Desktop Mode. There, switch to Big Picture mode and go to

Settings > Base configuration > Desktop Configuration

In this view you can configure the inputs to your liking.

I suggest going with the following setup:

  • Right trigger for left click (sounds counter-intuitive, but works well)
  • Left trigger for right click
  • Left touchpad for moving the mouse (doh)
  • Right touchpad for scroll wheel

With this configuration you can use the desktop mostly pain-free.

Using Docker with SLURM

The SLURM documentation provides you with the basic information that you can use Docker within SLURM – as long as you use rootless Docker. However, some crucial pieces are missing.

The issue that you will immediately run into is that the SLURM resource allocation is not propagated to docker at all. E.g. if you start your job with srun --gpus 1 docker ... all GPUs will be available to docker nevertheless.

The issue here is that Docker uses a manager daemon that the docker CLI communicates with. And that daemon does not know anything about SLURM or any resources it allocated for the job.

The solution is to start a daemon per job (instead of per user) as one user might want to run different jobs with different allocations on the same machine. The docker documentation gives you an idea on how to do that.

You will need to set at least the following parameters to make the daemon fully job-specific:

# dockerd-rootless.sh requires XDG_RUNTIME_DIR; export it so the daemon sees it
export XDG_RUNTIME_DIR=/somewhere/including/$SLURM_JOB_ID
# the directory must exist and be writable
mkdir -p "$XDG_RUNTIME_DIR"
# export, so docker client sees it later on
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
dockerd-rootless.sh --host=$DOCKER_HOST --data-root=... --exec-root=...

Here, exporting DOCKER_HOST makes the docker CLI use the correct daemon.

The drawback of this method is that each job needs to pull the container again due to the separate data-root paths. Switching to podman might solve that.
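If the job is launched from a Python wrapper, the per-job settings could be derived as in the following sketch. The directory layout below the base directory and the function name are hypothetical; only the three dockerd flags and the two environment variables are taken from the setup above.

```python
import posixpath


def dockerd_invocation(job_id, base_dir):
    """Build environment and argv for a job-specific rootless Docker daemon.

    The run/data/exec sub-directories are a hypothetical layout.
    """
    job_dir = posixpath.join(base_dir, str(job_id))
    runtime_dir = posixpath.join(job_dir, "run")
    docker_host = f"unix://{runtime_dir}/docker.sock"
    env = {
        "XDG_RUNTIME_DIR": runtime_dir,  # required by dockerd-rootless.sh
        "DOCKER_HOST": docker_host,      # makes the docker CLI find this daemon
    }
    argv = [
        "dockerd-rootless.sh",
        f"--host={docker_host}",
        f"--data-root={posixpath.join(job_dir, 'data')}",
        f"--exec-root={posixpath.join(job_dir, 'exec')}",
    ]
    return env, argv


env, argv = dockerd_invocation("12345", "/tmp/docker-jobs")
print(env["DOCKER_HOST"])
print(" ".join(argv))
```

In a real job the id would come from the SLURM_JOB_ID environment variable, and the returned values would be merged into os.environ and passed to subprocess.Popen after creating the directories.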

Steam Deck SSD Upgrade

If you, like me, went with the entry level Steam Deck option with only 64 GB of internal storage, you likely realized quite soon that some games won't fit on it.

One option is to use the microSD expansion card slot. For current-gen games the throughput of only about 150 MB/s does not seem to degrade loading performance compared to an NVMe SSD.
However, given that the internal storage is upgradable, the only logical choice for keeping your PC master race status is to cram the fastest NVMe SSD into that thing.

Specifically, you will need a single-sided SSD in the M.2 2230 form factor so it fits the space inside the Steam Deck.
I went with the KIOXIA Client-SSD BG5 512GB. Kioxia is the Toshiba spin-off for SSD drives, if you wonder about the brand. Although it is a PCIe 4.0 drive, its peak read throughput of 3.5 GB/s is within the practical limits of the PCIe 3.0 interface of the Steam Deck.
Also, the active power consumption of 4.1W is quite close to the 3.8W drawn by the custom PHISON PS5013 E13 SSD that Valve uses.

You can follow the iFixit Guide for the steps to actually swap the SSD. Make sure to transfer the ESD shielding wrap to the new SSD.

To get SteamOS on the new drive, follow the official recovery instructions and select the “Re-image Steam Deck” script.
This will install SteamOS on the blank SSD – similar to how you would install Ubuntu from a live USB.

Benchmarking results

Next, I wanted to actually compare the speed of the upgraded NVMe SSD with the one of the stock eMMC memory. To this end I used KDiskMark – an open-source alternative to CrystalDiskMark that runs on Linux natively.

The tests were performed on SteamOS 3.3.1 using KDiskMark 2.3.0.


In short, the NVMe SSD offers roughly one order of magnitude higher throughput than the eMMC.
Whether you feel this in-game highly depends on the given game. For older titles, even the eMMC is so fast that you cannot read the hints on the loading screen. However, for something like Flight Simulator 2020 that shuffles huge assets around, it will surely be noticeable.

Finally, the peak read performance of 3.5 GB/s is not reached. This might be due to the PCIe 3.0 bottleneck – I did not bother putting the drive in a PCIe 4.0 device. Still, there is a significant advantage in write performance over the older Kioxia BG4 series, which only does 1.4 GB/s.

Computing replaygain for your Music library

TL;DR: command at the end of the post

If you want equal loudness across your music library, the go-to solution and de-facto standard is ReplayGain.
If you are using a music streaming service, the provider is typically taking care of that for you – but maybe you want to migrate towards your own streaming solution.

ReplayGain analyses your audio files and stores their deviation from the baseline loudness as a tag. A compatible audio player can then read the tag and correct the playback volume so all your tracks have the same loudness.

Of course things get messy once you look at details like what the baseline loudness should be and how to determine loudness in the first place. Therefore we set the baseline once and for all as 89 dB and consider even tracks of the same album individually. If you disagree, feel free to branch off and read up on the details now.

The next issue is that ReplayGain was born in a time when mp3 was synonymous with digital music, hence the algorithm was first implemented as the mp3gain CLI tool. Nowadays you also need aacgain and vorbisgain to cover all your formats, which is cumbersome to automate.

The larger issue with ReplayGain is that it defines loudness of a track by its peak volume. While a sane choice in theory, in practice the music and advertising industries raced to increase the perceived loudness without raising the peak volume. As broadcasters also used peak volume normalization, one could blow your eardrum with that very special advertisement.
Therefore the EBU R 128 was proposed which at its core is RMS based, meaning it is considering the average volume of the track.
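The difference between peak and average level can be illustrated with two synthetic signals. Note that this sketch uses a plain RMS; the actual R 128 measurement additionally applies a frequency weighting and gating, which are left out here.

```python
import math


def peak_db(samples):
    """Peak level in dB relative to full scale."""
    return 20 * math.log10(max(abs(s) for s in samples))


def rms_db(samples):
    """Average (RMS) level in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square)


rate = 48000
# One second of a 440 Hz sine at full scale
sine = [math.sin(2 * math.pi * 440 * i / rate) for i in range(rate)]
# The same tone squashed into a square wave: identical peak, more energy
square = [1.0 if s >= 0 else -1.0 for s in sine]

for name, signal in (("sine", sine), ("square", square)):
    print(f"{name}: peak {peak_db(signal):.2f} dB, rms {rms_db(signal):.2f} dB")
```

Both signals peak at full scale, yet the squashed one is about 3 dB louder on average, which is exactly the headroom that peak normalization leaves open.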

Remember that ReplayGain merely adds a correction value to the tracks? This allows us to compute that correction value based on the R128 algorithm for a better normalization, which is exactly what the r128gain tool (https://github.com/desbma/r128gain) does.
Being a modern tool, r128gain also processes all audio formats that ffmpeg supports by hooking into ffmpeg as a filter.

So, without further ado, this is the command to normalize your Music library:

# pip3 install r128gain
r128gain -p -r Music/

This preserves the file timestamps (-p) and recursively (-r) processes all files in the given directory.

Troubleshooting

Note that if you previously used mp3gain, your files might contain non-standard lower-case replaygain_* tags, while r128gain will only write REPLAYGAIN_* tags.
To avoid confusing players with different values, you should remove the non-standard tags. This can be automated with eyeD3:

eyeD3 -Q --remove-frame RGAD --preserve-file-times --user-text-frame=replaygain_track_gain: --user-text-frame=replaygain_track_peak: --user-text-frame=replaygain_album_gain: --user-text-frame=replaygain_album_peak: Music/

Refer to its documentation for the meaning of the parameters. For RGAD see here.

Header Image: “volume” by christina rutz (CC-BY-2.0)