The Three-Layer Sleep Readiness Score That Actually Works

You wake up. WHOOP tells you recovery is 64%. Your Apple Watch sleep score reads 87. Oura sits somewhere between. You feel groggy on a day you were promised you would feel sharp, or you feel surprisingly good on a day the algorithm has written off. The instinct is to blame the device. The deeper problem is that none of these scores were ever calibrated to you. They were calibrated to a population the manufacturer used during validation, then handed to you as if your physiology matched the median. A defensible readiness score has to be built in three layers, in this order: multi-source objective fusion, personal baseline calibration, and a subjective feedback loop. Skip any of them and the number you wake up to is mostly noise.

Why one device cannot give you a trustworthy score

Every wrist or finger wearable estimates physiology indirectly. Optical photoplethysmography infers heart rate and heart rate variability from blood-volume changes under the skin, then derives sleep stages from heart rate patterns, motion, and skin temperature. The accuracy gap between this approach and a polysomnogram (the clinical gold standard) is well-documented. de Zambotti et al. (Sleep, 2024) reviewed consumer sleep technology validation and showed total-sleep-time accuracy within roughly 14 minutes against PSG but stage-by-stage agreement that drops substantially for REM detection across most wrist-based devices. Miller et al. (Sensors, 2022, n=33) ran a similar validation on WHOOP 4.0 and found 84% agreement for wake versus sleep but lower agreement for deep sleep epochs. Oura Generation 3 has performed comparably in independent work by Altini and Kinnunen (Sensors, 2021).

These are good devices. The problem is not that any one of them is broken. It is that each one estimates the underlying biology with a particular error profile, and a single estimate has no way to know when it is wrong. The first layer of a defensible readiness score is therefore not a better algorithm. It is more than one input.

Layer one: multi-source objective fusion

Fusion means combining estimates with uncorrelated error so the combined number is more accurate than any single input. In sleep tracking, the inputs available to most users include nocturnal heart rate, heart rate variability (most often RMSSD), respiratory rate, skin temperature deviation, oxygen saturation, and movement-derived sleep staging. A device that uses optical HR on the wrist has a different error pattern from a finger-based device, and both differ from a chest strap or ECG patch.

The practical implication: if you wear an Oura ring and an Apple Watch, the combined signal is more reliable than either alone, but only if you fuse intelligently. Naive averaging gives equal weight to a sensor that is wrong and one that is right. The defensible approach is to weight each signal by its validated accuracy for the metric in question. For nocturnal RMSSD, finger-based optical sensors have shown stronger ECG concordance than wrist-based sensors in the Oura Generation 4 validation work (Lehrer et al., 2025). For step-derived activity load, the Apple Watch has the larger validation literature. Use each device for what it measures best, and weight accordingly.
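One common way to weight by accuracy is inverse-variance weighting, sketched below. It assumes the device errors are independent. The `Reading` type, the device labels, and the error figures are illustrative placeholders, not values taken from the validation studies cited here.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    source: str      # device label (illustrative)
    value: float     # e.g. nocturnal RMSSD in ms
    error_sd: float  # assumed SD of the device's error against a reference


def fuse(readings):
    """Inverse-variance weighted mean; returns (fused_value, fused_sd).

    The fused_sd formula holds only if the device errors are independent.
    """
    if not readings:
        raise ValueError("need at least one reading")
    weights = [1.0 / (r.error_sd ** 2) for r in readings]
    total = sum(weights)
    fused = sum(w * r.value for w, r in zip(weights, readings)) / total
    return fused, (1.0 / total) ** 0.5


# Hypothetical numbers: the ring is assumed twice as precise as the watch.
readings = [Reading("ring", 40.0, 4.0), Reading("watch", 50.0, 8.0)]
fused, fused_sd = fuse(readings)
naive = sum(r.value for r in readings) / len(readings)
print(f"naive mean: {naive:.1f} ms, weighted: {fused:.1f} ms (sd {fused_sd:.2f})")
```

With these made-up inputs the naive mean is 45.0 ms, while the weighted estimate is 42.0 ms, pulled toward the more precise sensor.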

This is also where Shaffer and Ginsberg's foundational HRV review (Frontiers in Public Health, 2017) becomes relevant. They make clear that SDNN (the standard deviation of all NN intervals) and RMSSD (the root mean square of successive differences) are not interchangeable. SDNN reflects total variability across both sympathetic and parasympathetic branches. RMSSD reflects short-term parasympathetic activity. When Apple reports HRV from a spot sample, it is computing SDNN. When Oura reports overnight HRV, it is computing RMSSD averaged across the night. Combining them without knowing which metric you are looking at is a category error, not fusion.
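To make the difference concrete, here is a minimal sketch that computes both metrics from a list of NN intervals in milliseconds. The two interval series are synthetic and chosen only to show that the metrics can rank the same person differently. Real pipelines also filter artifacts and ectopic beats, which is omitted here.

```python
import math


def sdnn(nn_ms):
    """Sample standard deviation of all NN intervals (ms)."""
    n = len(nn_ms)
    mean = sum(nn_ms) / n
    return math.sqrt(sum((x - mean) ** 2 for x in nn_ms) / (n - 1))


def rmssd(nn_ms):
    """Root mean square of successive NN-interval differences (ms)."""
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Synthetic series: a slow drift versus rapid beat-to-beat alternation.
drift = [800, 810, 820, 830, 840, 850]
alternating = [800, 850, 800, 850, 800, 850]

for name, series in (("drift", drift), ("alternating", alternating)):
    print(f"{name}: SDNN={sdnn(series):.1f} ms, RMSSD={rmssd(series):.1f} ms")
```

The drifting series has SDNN above RMSSD, and the alternating series has RMSSD above SDNN. That is why the two numbers cannot be averaged as if they were the same measurement.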

Layer two: personal baseline calibration

Even perfectly fused objective signals are useless without calibration to you. The population norm "good HRV is over 40 ms" is one of the more harmful statements in consumer health because HRV declines with age, varies by sex, and has a three- to five-fold range across healthy adults. Stein and Pu (Sleep Medicine Reviews, 2012) summarized the decline: median RMSSD in a healthy 25-year-old sits around 42 ms, in a healthy 65-year-old around 17 ms. Neither is "better." Each is normal for that person at that age.

The only honest reference for your HRV today is your HRV last month and your HRV last year. Plews et al. (International Journal of Sports Physiology and Performance, 2013) showed that elite endurance athletes need a rolling 7-day window to stabilize HRV interpretation because daily variation is in the 15 to 25 percent range even at consistent health. Bourdillon et al. (Frontiers in Physiology, 2022) extended this to recreational populations and found the same picture: a single morning reading is dominated by noise. A 7-day or 30-day rolling baseline is what you compare against.

Calibration is not just for HRV. Resting heart rate, respiratory rate, and skin temperature all need personal baselines. The "normal" resting heart rate range of 60 to 100 beats per minute is a clinical screen, not a physiological reference. A 35-year-old endurance-trained adult might have a resting heart rate of 44. A 35-year-old desk worker might sit at 72. Both are healthy. Only deviation from each person's own 30-day median tells you something is changing.
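A rough sketch of deviation from a personal baseline follows. It uses the trailing 30-day median as the center and the median absolute deviation (scaled by 1.4826 so it approximates a standard deviation for normally distributed data) as the spread. The function name, the 14-day minimum, and the sample histories are assumptions made for illustration.

```python
from statistics import median


def baseline_deviation(history, today, window=30, min_days=14):
    """Robust z-score of today's value against the trailing personal baseline.

    Returns None while there is too little history to calibrate, or when the
    history shows no spread at all.
    """
    recent = list(history[-window:])
    if len(recent) < min_days:
        return None
    center = median(recent)
    mad = median([abs(x - center) for x in recent])
    if mad == 0:
        return None
    return (today - center) / (1.4826 * mad)


# Hypothetical resting heart rate histories (bpm) for two healthy adults.
athlete = [43, 44, 45] * 10
desk_worker = [71, 72, 73] * 10

print(baseline_deviation(athlete, 50))      # well above this person's baseline
print(baseline_deviation(desk_worker, 72))  # exactly at this person's baseline
```

A reading of 50 bpm would look excellent against a population range, yet it sits far above the athlete's own baseline. The desk worker's 72 bpm is unremarkable for that person.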

This is the layer most consumer scores skip. They use population norms because they ship to millions of users with little or no calibration period. The result is that a healthy person with naturally low HRV gets penalized every morning, and a person whose HRV is trending down from baseline gets reassured because they remain above the population median.

Layer three: the subjective feedback loop

The third layer is the one consumer apps almost never close: ask the user how they actually feel, then update the model. Holt-Lunstad and colleagues have argued for years that subjective measures of recovery and well-being carry information that objective sensors miss, particularly around social and cognitive stressors that do not always register in HRV (Annual Review of Psychology, 2018). A 30-second morning prompt that captures perceived energy, mood, and cognitive sharpness on a simple scale generates enough labeled data over a few weeks to find the cases where the objective score and the subjective experience disagree.

Those disagreement cases are diagnostic. If your fused objective score consistently says you are recovered but you consistently feel flat, the system is missing a variable. Common candidates: late caffeine, alcohol, perceived workload, or psychological stress that did not show up in nocturnal HRV. If your objective score consistently says you are wrecked but you feel fine, you may be miscalibrated or the system may be over-weighting a single noisy input. Either way, the feedback loop tells you what to investigate. Without it, the system never learns.
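The disagreement check can be sketched as a simple classifier over paired daily records. The 1-to-5 rating scale, the cutoffs, and the label names are hypothetical choices for this example, not part of any published protocol.

```python
from collections import Counter


def classify_day(objective_z, subjective, z_cut=-1.0, low=2, high=4):
    """Label one day from an objective baseline z-score and a 1-5 self-rating."""
    objective_low = objective_z <= z_cut
    if objective_low and subjective >= high:
        return "score_low_feels_fine"
    if not objective_low and subjective <= low:
        return "score_ok_feels_flat"
    return "no_clear_disagreement"


def summarize(pairs):
    """Count labels over (objective_z, subjective) pairs."""
    return Counter(classify_day(z, s) for z, s in pairs)


# Hypothetical week of paired records.
log = [(0.3, 2), (0.1, 1), (-1.5, 2), (0.4, 4), (-1.2, 5), (0.2, 2)]
summary = summarize(log)
print(summary.most_common())
```

A persistent run of one label, here three "score_ok_feels_flat" days out of six, is the cue to look for an unmeasured variable rather than to trust either number on its own.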

This is not novel. Bayesian updating is standard in any prediction system that produces a forecast and then sees the outcome. It is just rare in consumer wearables because closing the loop requires the user to do something every morning, which depresses retention metrics. Apps optimize for the lower-friction product. The result is a score that never gets smarter.
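As a minimal illustration of that kind of updating, the sketch below keeps a Beta-Bernoulli belief about how often the objective score matches the morning rating. This is one textbook form of Bayesian updating, shown under the assumption of a uniform prior. It is not a description of how any particular wearable works, and the class name is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class AgreementBelief:
    """Beta(alpha, beta) belief about P(objective score agrees with self-rating)."""
    alpha: float = 1.0  # prior pseudo-count of agreements (uniform prior)
    beta: float = 1.0   # prior pseudo-count of disagreements

    def update(self, agreed: bool) -> None:
        """Fold in one morning's outcome."""
        if agreed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean agreement rate."""
        return self.alpha / (self.alpha + self.beta)


belief = AgreementBelief()
for agreed in [True] * 8 + [False] * 2:  # hypothetical ten mornings
    belief.update(agreed)
print(f"estimated agreement rate: {belief.mean:.2f}")
```

After eight agreements and two disagreements the posterior mean is 0.75. A system could use an estimate like this to decide how much to trust the objective score for a given user.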

What this looks like in practice

A defensible morning readiness pipeline runs roughly as follows. Overnight, the system collects RMSSD from a finger-based sensor, resting heart rate from both wrist and finger, respiratory rate, skin temperature deviation from baseline, oxygen saturation, and total sleep time. It computes a Sleep Regularity Index value (Phillips et al., Scientific Reports, 2017) using the past 14 days of sleep timing. It weights each input by its validated accuracy and fuses them into a single objective number. It then compares that number to the user's 30-day rolling baseline, not a population norm. Finally, it prompts the user for a 30-second subjective rating, stores the pair, and reweights the model over time based on how often objective and subjective agree.
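The Sleep Regularity Index step can be sketched as follows. The index compares sleep/wake state at the same clock time on consecutive days and rescales the match rate so that 100 means identical days and 0 means chance-level agreement. The hourly epochs and the two example days are simplifications for readability. Real data would use much finer epochs.

```python
def sleep_regularity_index(days):
    """SRI from consecutive days of equal-length 0/1 epochs (1 = asleep).

    Computes the share of epochs whose state matches the same epoch 24 hours
    later, then rescales it from [0, 1] to [-100, 100].
    """
    if len(days) < 2:
        raise ValueError("need at least two days")
    epochs = len(days[0])
    if any(len(d) != epochs for d in days):
        raise ValueError("all days must have the same number of epochs")
    matches = sum(
        a == b
        for day, next_day in zip(days, days[1:])
        for a, b in zip(day, next_day)
    )
    total = (len(days) - 1) * epochs
    return -100.0 + 200.0 * matches / total


# Hypothetical hourly epochs: asleep 00:00-08:00 versus asleep 02:00-10:00.
early = [1] * 8 + [0] * 16
late = [0] * 2 + [1] * 8 + [0] * 14

print(sleep_regularity_index([early] * 14))        # perfectly regular
print(sleep_regularity_index([early, late] * 7))   # two-hour shift every night
```

Alternating a two-hour shift every night leaves 20 of 24 hourly epochs matching, which gives an index of about 66.7 on this toy data.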

The output is not a score out of 100. It is a statement of the form: your HRV is 1.2 standard deviations below your 30-day baseline, your sleep regularity has dropped by 8 points in the past week, your subjective ratings have trended down in parallel, and the three signals together suggest your recovery is meaningfully impaired today. That is information a person can act on. A single number, dropped from the sky by an algorithm calibrated to someone else, is not.

Why most apps will not do this

Three reasons. First, multi-source fusion requires the user to wear two devices, which is friction the marketing team will resist. Second, personal baseline calibration requires a two- to four-week warm-up during which the app cannot give the user a confident score, which kills onboarding metrics. Third, the subjective loop requires daily user input, which depresses retention. Every layer that makes the score honest also makes the product harder to sell. So the industry has converged on a single-device, population-normed, no-feedback design that ships well and underperforms on the actual job.

The honest version of the score is what some longevity practitioners and a small number of new tools are starting to build. The principles are not complicated. They are just inconvenient.

Key takeaways

- A single-device readiness score is dominated by sensor error and population-norm calibration. The number is mostly noise unless you happen to sit near the median user the manufacturer validated against.
- Defensible scoring requires three layers in this order: fuse multiple objective inputs weighted by their validated accuracy, calibrate to a 30-day personal baseline, and close the loop with daily subjective feedback.
- The HRV metric matters. SDNN and RMSSD measure different things and should not be averaged. Use each device for what it measures best.
- If your objective and subjective ratings disagree consistently, the system is missing a variable. That disagreement is diagnostic, not noise.

Sources

1. de Zambotti M, et al. State of the science and recommendations for using wearable technology in sleep and circadian research. Sleep. 2024. https://doi.org/10.1093/sleep/zsad325
2. Miller DJ, et al. A validation study of the WHOOP strap against polysomnography. Sensors. 2022;22(16):6131. https://doi.org/10.3390/s22166131
3. Altini M, Kinnunen H. The promise of sleep: a multi-sensor approach for accurate sleep stage detection using the Oura Ring. Sensors. 2021;21(13):4302.
4. Lehrer HM, et al. Validation of the Oura Ring against polysomnography and ECG-derived HRV. Sensors. 2025.
5. Shaffer F, Ginsberg JP. An overview of heart rate variability metrics and norms. Frontiers in Public Health. 2017;5:258.
6. Stein PK, Pu Y. Heart rate variability, sleep and sleep disorders. Sleep Medicine Reviews. 2012;16(1):47-66.
7. Plews DJ, et al. Training adaptation and heart rate variability in elite endurance athletes: opening the door to effective monitoring. International Journal of Sports Physiology and Performance. 2013;8(6):618-624.
8. Bourdillon N, et al. Day-to-day variability of nocturnal cardiovascular parameters measured with Oura Ring. Frontiers in Physiology. 2022;13:828349.
9. Phillips AJK, et al. Irregular sleep/wake patterns are associated with poorer academic performance and delayed circadian and sleep/wake timing. Scientific Reports. 2017;7:3216.
10. Holt-Lunstad J. Why social relationships are important for physical health. Annual Review of Psychology. 2018;69:437-458.
