Spectral analysis of Phone SSB model — presence band attenuation


Post by HB9VQQ »

Hi all,

I've done some controlled spectral analysis of the RMNoise "Phone web client version 824e959" model. I wanted to quantify a subjective impression that the denoised audio sounds "dull" compared to the original — and the data confirms it. The same behavior is observed both through my ka9q-web SDR integration and through the official RMNoise web client, so this is a model characteristic, not an integration issue.

Test setup:
- RX888 MkII → ka9q-radio (radiod) → ka9q-web (HB9VQQ fork)
- 80m SSB signals, strong stations (S7-S8)
- Recordings taken back-to-back on same signal: RMNoise Off, 60% mix, 90% mix
- Compressor and all EQ OFF for all recordings
- Audio resampled to 8 kHz before sending to RMNoise (matching the documented spec)
- Confirmed same behavior on official RMNoise web client
- Analysis: Welch PSD (4096-point FFT, Hann window, 50% overlap)
- Multiple sample sets recorded to confirm repeatability
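
The analysis step in the list above could be reproduced roughly as follows. This is a minimal sketch, not the script actually used for the measurements: the function name `band_change_db`, the `BANDS` table, and the choice to compare summed PSD per band are assumptions made for illustration. It uses `scipy.signal.welch` with the 4096-point Hann window and 50% overlap stated above, and assumes the recordings are sampled at more than 6 kHz so the top band is not empty.

```python
import numpy as np
from scipy.signal import welch

# Band edges in Hz, taken from the results table; None means "up to Nyquist".
BANDS = {
    "sub-voice": (0, 300),
    "low voice": (300, 1000),
    "mid voice": (1000, 2000),
    "presence": (2000, 3000),
    "brilliance": (3000, None),
}

def band_change_db(original, processed, fs, nfft=4096):
    """Per-band power change (dB) of `processed` relative to `original`."""
    # Welch settings from the test setup: Hann window, 4096 points, 50% overlap.
    kwargs = dict(fs=fs, window="hann", nperseg=nfft, noverlap=nfft // 2)
    freqs, psd_orig = welch(original, **kwargs)
    _, psd_proc = welch(processed, **kwargs)
    changes = {}
    for name, (lo, hi) in BANDS.items():
        hi = fs / 2 if hi is None else hi
        mask = (freqs >= lo) & (freqs < hi)
        # Bin width is constant, so the ratio of summed PSD is the power ratio.
        changes[name] = 10 * np.log10(psd_proc[mask].sum() / psd_orig[mask].sum())
    return changes
```

Here `original` would be the "RMNoise Off" recording and `processed` the 60% or 90% mix recording, both as float arrays at the same sample rate. Because the recordings are back-to-back rather than simultaneous, the result compares long-term average spectra, which is why averaging over several sample sets matters.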

Results (averaged across multiple samples):

Band                     | 60% mix        | 90% mix
─────────────────────────┼────────────────┼──────────────
0-300 Hz   (sub-voice)   |   ~0 dB        |   ~0 dB
300-1000 Hz (low voice)  |  -5 to -7 dB   |  -1 to -3 dB
1000-2000 Hz (mid voice) |  -3 to -6 dB   |  -2 to -3 dB
2000-3000 Hz (presence)  |  -4 to -7 dB   |  -5 to -7 dB
3000+ Hz   (brilliance)  | -13 to -17 dB  | -15 to -17 dB

The key finding is the consistent 4-7 dB attenuation in the 2000-3000 Hz presence band. This is the frequency range responsible for voice clarity, consonant intelligibility, and the "crispness" that makes speech cut through noise. Losing 5+ dB there is very audible — it shifts the perceived timbre from clear to muffled/warm.

I also tested whether my send-path LPF cutoff was contributing. Lowering it from 3000 Hz to 2800 Hz (matching the documented bandwidth spec) actually made the presence loss worse (-6.9 dB at 2800 Hz vs -5.7 dB at 3000 Hz, both at 90% mix), confirming the attenuation is inherent to the AI model, not the input filtering.

Additional observations:
- The mix slider doesn't scale the effect linearly across all bands. At 90% the low-voice band (300-1000 Hz) is barely touched while presence is heavily cut. At 60% the lows are cut more but presence stays similar. The model's spectral shaping appears signal-dependent.
- The model appears to treat upper voice harmonics in the 2-3 kHz range as noise. This is where consonants (s, t, f, th) live and where the human ear is most sensitive to speech intelligibility.
- Sub-voice (0-300 Hz) is essentially untouched at any mix level.
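
To make the first observation concrete, here is what a plain linear crossfade would predict. This is only a sketch under two loud assumptions that the thread does not confirm: that the mix slider blends amplitudes as `(1 - mix) * original + mix * denoised`, and that the denoised audio in a band is simply an in-phase, scaled copy of the original. The function name `blend_gain_db` is made up for this example.

```python
import math

def blend_gain_db(mix, model_gain_db):
    """Net band gain (dB) of a coherent linear blend.

    Assumes output = (1 - mix) * original + mix * denoised, where the
    denoised band is the original scaled by `model_gain_db` (hypothetical).
    """
    g = 10 ** (model_gain_db / 20)  # model gain as a linear amplitude factor
    return 20 * math.log10((1 - mix) + mix * g)

# Example: if the model alone cut a band by 10 dB, a linear blend would give
# roughly -4.6 dB at 60% mix and roughly -8.3 dB at 90% mix.
for mix in (0.6, 0.9):
    print(mix, round(blend_gain_db(mix, -10.0), 1))
```

Under these assumptions the cut always deepens as the mix goes up. The measured low-voice band does the opposite (more cut at 60% than at 90%), so a fixed per-band model gain with a linear blend cannot explain the table, which fits the impression that the shaping is signal-dependent.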

Would it be possible to train a model variant that preserves more energy in the 2-3 kHz presence band? Even 2-3 dB less attenuation there would significantly improve perceived voice clarity for HF SSB use.

Spectral comparison plots attached.

73, Roland HB9VQQ
https://rx888.hb9vqq.ch:8081
Attachments:
- rmnoise_3000hz.png
- rmnoise_2800hz.png

Post by RandyW »

Roland,
HB9VQQ wrote: "The model appears to treat upper voice harmonics in the 2-3 kHz range as noise. This is where consonants (s, t, f, th) live and where the human ear is most sensitive to speech intelligibility."
Thank you for this excellent analysis. I agree with your observations.

I think the denoising difference between the lower audio frequencies and the higher ones is related to the characteristics of speech. The lower fundamental frequencies of speech are simply more powerful than the higher fundamentals plus the harmonics. If you assume the noise has the same power across all audio frequencies, then this leads to very different SNR values between the lower and higher frequencies.

Here is the pure speech spectrum of both a man and a woman, talking for several seconds:
[Image: spec-man-woman2.jpg]


Here is the spectrum of a random noise recording:
[Image: spec-noise-N7QJP.jpg]

Using this example, the speech power at 2500 Hz is roughly 24 dB lower than at 500 Hz (48 dB - 72 dB), and I will assume the noise power difference is 0 dB between those two frequencies. If you mixed them together such that the audio at 500 Hz was +15 dB SNR (a strong signal), the SNR at 2500 Hz would be around -9 dB.
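
The arithmetic behind that -9 dB figure can be written out as a small sketch. The function name `snr_db_at` is hypothetical; the numbers are the ones from the example above, with the noise assumed flat (0 dB difference) between the two frequencies.

```python
def snr_db_at(snr_ref_db, speech_delta_db, noise_delta_db=0.0):
    """SNR at a second frequency, given the SNR at a reference frequency.

    speech_delta_db: speech level at the second frequency minus the reference.
    noise_delta_db:  noise level at the second frequency minus the reference
                     (0 dB when the noise is assumed flat).
    """
    return snr_ref_db + speech_delta_db - noise_delta_db

# +15 dB SNR at 500 Hz, speech 24 dB weaker at 2500 Hz, flat noise.
print(snr_db_at(15.0, -24.0))  # -9.0
```

In dB terms the speech rolloff subtracts directly from the SNR, so any band where speech is much weaker than at 500 Hz ends up with the noise dominating even on a strong signal.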


Indeed, I have read about this concern in research papers, and I have observed other people's commented-out attempts to address this issue in their code. I'm open to suggestions, and I've personally tried a few things to improve this:
  • I changed the model to include substantially more 'neurons' that were in line with only the higher frequencies
  • I changed the objective function to additionally penalize the model for mistakes at higher frequencies
  • I created a secondary model to generatively recover the missing high frequency information
(Out of this list, the generative method was the most interesting. Currently RM Noise uses a lowpass filter at around 2800Hz, but I configured the generative method to recover the audio up to 8kHz. My subjective observation was that it recovered high quality audio in some cases, but damaged the audio in others...)
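
For readers unfamiliar with the second item in the list above (penalizing high-frequency mistakes more heavily), a frequency-weighted loss might look something like the sketch below. This is not the objective function RM Noise uses, which is not shown in this thread. The function name, the 2000 Hz threshold, and the 4x weight are all assumptions chosen only to illustrate the idea.

```python
def weighted_spectral_mse(pred_mag, target_mag, freqs_hz,
                          boost_above_hz=2000.0, hf_weight=4.0):
    """Weighted mean squared error over spectral magnitude bins.

    Bins at or above `boost_above_hz` count `hf_weight` times as much as the
    others, so errors in the upper voice band cost the model more.
    All parameter values here are illustrative, not taken from RM Noise.
    """
    if not (len(pred_mag) == len(target_mag) == len(freqs_hz)):
        raise ValueError("all inputs must have the same number of bins")
    weights = [hf_weight if f >= boost_above_hz else 1.0 for f in freqs_hz]
    total = sum(w * (p - t) ** 2
                for w, p, t in zip(weights, pred_mag, target_mag))
    return total / sum(weights)

# The same size of error costs 4x more at 2500 Hz than at 500 Hz.
freqs = [500.0, 2500.0]
print(weighted_spectral_mse([1.0, 0.0], [0.0, 0.0], freqs))  # error in the 500 Hz bin
print(weighted_spectral_mse([0.0, 1.0], [0.0, 0.0], freqs))  # error in the 2500 Hz bin
```

The trade-off with any such weighting is that it also raises the cost of leaving noise in the high bins, so it does not by itself tell the model to keep weak speech there.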

As always, I reserve the right to be completely wrong... 73
Randy Williams K5QE

Post by HB9VQQ »

Hi Randy,

I compared the pure speech spectrum (from your example) with my received SSB audio (RMNoise OFF):

Pure speech:   500 Hz = -47 dB, 2500 Hz = -73 dB → rolloff = ~26 dB
Received SSB:  500 Hz = -65 dB, 2500 Hz = -67 dB → rolloff = ~2 dB

The received SSB spectrum is essentially flat across 300-2800 Hz. The steep rolloff of pure speech doesn't survive the SSB transmit chain.

73, Roland HB9VQQ

Post by RandyW »

HB9VQQ wrote: "The received SSB spectrum is essentially flat across 300-2800 Hz. The steep rolloff of pure speech doesn't survive the SSB transmit chain."
I saw the spectrum in your graphs, and it seems consistent, since the speech is mixed with noise that is likely flat across that range.

The "RMNoise OFF" (speech + noise) line of your chart also seemed consistent and shows more power at lower frequencies. The signal + noise power at 500 Hz is around 9 dB higher than at 2500 Hz. At 100 Hz (a typical fundamental frequency for a male voice), the signal + noise power is around 14 dB higher than at 2500 Hz.
[Image: spec-markup2.jpg]


From Google AI:
In typical human speech, there is significantly less power at 2500 Hz compared to 500 Hz. The energy at 2500 Hz is generally about 15 dB to 20 dB less than at 500 Hz.


I was already researching a deterministic algorithm to boost either a) the high frequency part of the training target or b) the neural network's output. Alternatively, it might be fruitful to boost the high frequency part of the original audio [to be mixed], although all of these options might be tricky because of phase issues...
Randy Williams K5QE

Post by HB9VQQ »

Hi Randy,

You're right, looking more carefully at the full range from 100 Hz to 2500 Hz the rolloff is more than what I measured between just 500-2500 Hz. Fair point.

The deterministic high-frequency boost approach sounds very promising. Option (b) — boosting the neural network's output above ~1.5 kHz before mixing — might be the least risky since it wouldn't require retraining the model. Even a simple +3 dB shelf above 1.5 kHz on the denoised output could make a noticeable difference to perceived clarity before blending with the original.
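
As a rough illustration of option (b), a +3 dB high shelf around 1.5 kHz could be built as a single biquad. This is a sketch, not anything RM Noise does today: it uses the widely published Audio EQ Cookbook high-shelf formulas, assumes the 8 kHz sample rate from the test setup, and the function names are invented for this example. In this formulation `f0` is the shelf midpoint, so the boost reaches about half its dB value at 1.5 kHz and the full +3 dB only toward the top of the band.

```python
import cmath
import math

def high_shelf_coeffs(fs, f0, gain_db, slope=1.0):
    """Biquad high-shelf coefficients (Audio EQ Cookbook), normalized to a0 = 1."""
    a = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / 2 * math.sqrt((a + 1 / a) * (1 / slope - 1) + 2)
    k = 2 * math.sqrt(a) * alpha
    b0 = a * ((a + 1) + (a - 1) * cos_w0 + k)
    b1 = -2 * a * ((a - 1) + (a + 1) * cos_w0)
    b2 = a * ((a + 1) + (a - 1) * cos_w0 - k)
    a0 = (a + 1) - (a - 1) * cos_w0 + k
    a1 = 2 * ((a - 1) - (a + 1) * cos_w0)
    a2 = (a + 1) - (a - 1) * cos_w0 - k
    return [b0 / a0, b1 / a0, b2 / a0], [1.0, a1 / a0, a2 / a0]

def gain_db_at(b, a, freq, fs):
    """Magnitude response of the biquad at `freq`, in dB."""
    z1 = cmath.exp(-2j * math.pi * freq / fs)  # z^-1 on the unit circle
    h = (b[0] + b[1] * z1 + b[2] * z1 * z1) / (a[0] + a[1] * z1 + a[2] * z1 * z1)
    return 20 * math.log10(abs(h))

def apply_biquad(samples, b, a):
    """Filter a sequence of samples with the biquad (direct form I)."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

# Shelf applied to the denoised output only, before blending with the original.
b, a = high_shelf_coeffs(fs=8000, f0=1500, gain_db=3.0)
for f in (300, 1500, 2500, 3500):
    print(f, round(gain_db_at(b, a, f, 8000), 2))
```

A minimum-phase shelf like this shifts phase around the transition, so blending the boosted output with the unprocessed original could cause some partial cancellation near 1.5 kHz, which is the phase concern raised earlier in the thread.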

Looking forward to hearing how your experiments go.

73, Roland HB9VQQ