HRV Calibration: Why Your Recovery Score Is Wrong

The most common claim in consumer wearable marketing is some version of "good HRV is over 40 ms." It is a clean number that fits in a tile on a home screen. It is also nearly meaningless. Heart rate variability declines with age, varies by sex, depends on which metric you measure and when, and shifts 15 to 25 percent from day to day even when your health is stable. The only reference that carries real information is your own baseline. This article walks through what HRV actually measures, why population norms mislead, and how to build the personal calibration that lets a recovery score say something true.

What HRV is, and what it is not

Heart rate variability is the millisecond-level variation in the time interval between consecutive heartbeats. Even when your average heart rate is steady at 60 beats per minute, the actual gaps between beats are not exactly 1,000 ms. They are 980, 1,020, 990, 1,010, and so on. That variation is generated by the autonomic nervous system. The parasympathetic branch, acting through the vagus nerve, speeds up and slows down heart rate quickly in response to breathing and other inputs. The sympathetic branch acts more slowly. HRV is a window onto how these two branches are interacting in real time.

The reason wellness products care about HRV is that, in general, higher HRV correlates with better cardiovascular health, better recovery from physical and psychological stress, and lower all-cause mortality across populations. Dekker et al. (Circulation, 2000) and a long line of subsequent epidemiological work established the cross-sectional and prospective associations.

What HRV is not: a universal fitness score that scales the same way across people. Two healthy adults the same age and sex can have nightly RMSSD values that differ by a factor of three. The cross-sectional spread in healthy populations is enormous. Shaffer and Ginsberg's overview (Frontiers in Public Health, 2017) lays out the metric definitions and the range of normal in detail.

The metric soup: SDNN, RMSSD, LF, HF, pNN50

Consumer apps report different numbers under the same label. This is the first source of confusion when comparing scores across devices.

SDNN is the standard deviation of all normal-to-normal beat intervals during a recording. It reflects total variability and is sensitive to both sympathetic and parasympathetic input. SDNN is the standard metric for 24-hour ECG analysis in cardiology and has the most epidemiological evidence behind it. Apple Watch reports SDNN from spot samples typically taken during the Breathe app or randomly during the day.

RMSSD is the root mean square of successive differences between adjacent beat intervals. It is dominated by short-term beat-to-beat changes driven by the parasympathetic branch. RMSSD is what most wellness products use for nightly recovery scoring because parasympathetic recovery during sleep is the variable they want to track. Oura, WHOOP, Garmin, and most others report nightly RMSSD averaged across some portion of the sleep recording.

LF (low-frequency power, 0.04 to 0.15 Hz) and HF (high-frequency power, 0.15 to 0.40 Hz) come from spectral analysis. HF is dominated by respiratory sinus arrhythmia and is a parasympathetic marker. LF reflects a mix of sympathetic and parasympathetic activity and has a long-disputed interpretation. The Task Force of the European Society of Cardiology guidelines (Circulation, 1996) remain the reference for spectral methods.

pNN50 is the percentage of successive beat intervals that differ by more than 50 ms. It correlates strongly with RMSSD and is reported by some research devices.
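The three time-domain definitions above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function name `time_domain_hrv` is hypothetical, the input is assumed to be already-cleaned normal-to-normal intervals in milliseconds, and the four sample intervals are the ones used earlier in this article. Real recordings are far longer.

```python
import math
import statistics


def time_domain_hrv(rr_ms):
    """Return SDNN, RMSSD, and pNN50 for a list of beat intervals in ms.

    Assumes artifacts and ectopic beats have already been removed.
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three beat intervals")

    # Successive differences between adjacent intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]

    # SDNN: standard deviation of all intervals (sample form, an assumption).
    sdnn = statistics.stdev(rr_ms)

    # RMSSD: root mean square of the successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    # pNN50: share of successive differences larger than 50 ms, in percent.
    pnn50 = 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs)

    return {"sdnn": sdnn, "rmssd": rmssd, "pnn50": pnn50}


rr = [980, 1020, 990, 1010]
metrics = time_domain_hrv(rr)
print({name: round(value, 1) for name, value in metrics.items()})
# {'sdnn': 18.3, 'rmssd': 31.1, 'pnn50': 0.0}
```

Even on this toy input, SDNN and RMSSD come out at different values from the same beats, which is the point: they summarize different aspects of the same series.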

The practical implication: comparing the SDNN your Apple Watch shows you on Tuesday to the RMSSD Oura shows you on Wednesday is comparing two different physiological quantities. Both are valid, but they are not interchangeable.

Why population norms mislead

Stein and Pu (Sleep Medicine Reviews, 2012) summarized the decline of HRV with age across multiple cohorts. Median RMSSD in healthy adults runs roughly 42 ms at age 25, 35 ms at age 35, 28 ms at 45, 22 ms at 55, and 17 ms at 65. The decline is monotonic and largely unavoidable, driven by changes in vagal tone, sinoatrial node remodeling, and accumulated cardiovascular load.

This means a flat threshold like "HRV over 40 ms is good" is age-blind in a way that produces opposite errors at different ages. A naturally lean, well-conditioned 60-year-old with a true baseline RMSSD of 32 ms is in excellent health for their age but would be flagged as below threshold by a fixed cutoff. A 25-year-old with a baseline of 55 ms who drops to 42 ms because of an undiagnosed infection would be flagged as fine, even though their personal HRV has dropped 24 percent.
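The two cases above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the 40 ms cutoff and the baseline figures are the ones quoted in this section.

```python
def passes_flat_cutoff(rmssd_ms, cutoff_ms=40.0):
    """The marketing-style rule: anything over the cutoff counts as good."""
    return rmssd_ms > cutoff_ms


def percent_vs_baseline(rmssd_ms, baseline_ms):
    """Signed percent change relative to the person's own baseline."""
    return 100.0 * (rmssd_ms - baseline_ms) / baseline_ms


# Figures taken from the two cases described above.
cases = {
    "60-year-old at baseline": {"baseline": 32.0, "today": 32.0},
    "25-year-old, early infection": {"baseline": 55.0, "today": 42.0},
}

for label, c in cases.items():
    flat = passes_flat_cutoff(c["today"])
    change = percent_vs_baseline(c["today"], c["baseline"])
    print(f"{label}: flat cutoff ok={flat}, change vs baseline={change:.0f}%")
```

The flat rule fails the healthy 60-year-old and passes the 25-year-old whose reading is about 24 percent below their own norm. The baseline-relative number gets both cases right.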

Sex matters too. Premenopausal women tend to show higher HF power and slightly higher RMSSD than age-matched men in cross-sectional studies (Koenig and Thayer, Neuroscience and Biobehavioral Reviews, 2016), though the effect size is modest and varies across studies and across the menstrual cycle.

Even within a single person, day-to-day variation is large. Plews et al. (International Journal of Sports Physiology and Performance, 2013) tracked elite endurance athletes daily and found that within-subject HRV varied by 15 to 25 percent across days even during periods of stable training and stable health. Bourdillon et al. (Frontiers in Physiology, 2022) extended this to recreational populations using Oura ring data and found the same picture in non-athletes. A single reading is dominated by noise. A 7-day rolling mean is much more interpretable. A 30-day baseline is needed to detect a meaningful trend.
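To see why a rolling mean is easier to read than single nights, here is a small sketch of a trailing 7-day mean. The function name `rolling_mean` is hypothetical and the ten nightly values are invented for illustration, not drawn from any of the studies cited.

```python
import statistics


def rolling_mean(values, window=7):
    """Trailing mean over `window` readings; None until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(statistics.fmean(values[i + 1 - window : i + 1]))
    return out


# Invented nightly RMSSD readings (ms) with typical day-to-day scatter.
nightly = [50, 42, 58, 47, 55, 44, 54, 49, 60, 41]
smoothed = rolling_mean(nightly)

for raw, avg in zip(nightly, smoothed):
    shown = "n/a" if avg is None else f"{avg:.1f}"
    print(f"raw={raw:>3}  7-day mean={shown}")
```

In this made-up series the raw readings swing between 41 and 60 ms, while the 7-day mean stays within a few milliseconds of 50 once the window fills.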

What a real baseline looks like

Building a personal HRV baseline is straightforward in principle and requires patience.

Step one: pick a single metric and a single measurement context. Either nightly RMSSD averaged across sleep, or morning RMSSD from a short controlled measurement immediately after waking. Mixing contexts (a morning chest-strap reading on Tuesday, an overnight ring reading on Wednesday) destroys the baseline because the contexts produce different values.

Step two: record continuously for at least 14 days, ideally 30, before drawing any conclusions. During the baseline window, do not change anything. Sleep normally, train normally, and keep your alcohol intake the same, or at zero. You are establishing a reference distribution, not optimizing.

Step three: compute your 30-day rolling median and your 30-day rolling interquartile range (IQR). The median is your typical value. The IQR is the spread of normal day-to-day variation. Together they define the band inside which a daily reading is uninformative.

Step four: define meaningful deviation. A daily reading that falls more than 1.5 times the IQR below your rolling median is a meaningful drop. A reading inside the band is noise. A series of three or more consecutive readings below the median is a trend that warrants investigation.
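Steps three and four can be sketched as follows. Treat this as one possible reading of the rule rather than a reference implementation: the function names are hypothetical, the 30 history values are invented, and the quartiles come from Python's `statistics.quantiles` with its default method, which is only one of several quartile conventions.

```python
import statistics


def drop_threshold(history, window=30, k=1.5):
    """Return (median, iqr, threshold) from the most recent `window` readings."""
    recent = history[-window:]
    median = statistics.median(recent)
    q1, _, q3 = statistics.quantiles(recent, n=4)
    iqr = q3 - q1
    return median, iqr, median - k * iqr


def is_meaningful_drop(today, history, window=30, k=1.5):
    """True if today's reading is more than k * IQR below the rolling median."""
    _, _, threshold = drop_threshold(history, window, k)
    return today < threshold


def trailing_streak_below(readings, median):
    """Count how many of the most recent readings sit below the median in a row."""
    streak = 0
    for value in reversed(readings):
        if value >= median:
            break
        streak += 1
    return streak


# Invented 30-night history of RMSSD readings (ms).
history = [48, 52, 50, 55, 47, 51, 49, 53, 50, 46, 54, 50, 52, 48, 51] * 2
median, iqr, threshold = drop_threshold(history)
print(f"median={median} iqr={iqr} threshold={threshold}")

print(is_meaningful_drop(46, history))   # inside the band
print(is_meaningful_drop(43, history))   # below the threshold
print(trailing_streak_below([51, 49, 47, 48], median) >= 3)  # trend check
```

With this history the median is 50 ms and the IQR is 4 ms, so the drop threshold sits at 44 ms. A 46 ms night is ordinary scatter, while a 43 ms night is flagged.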

This is the framework Plews and Buchheit use to advise the Olympic teams they have worked with. It is also the framework that any honest consumer score should be built on. The reason most are not is that personal baselines require a two- to four-week warm-up period during which the product cannot give the user a confident score, which kills onboarding metrics.

What changes a true HRV reading

Once you have a baseline, the variables that move the metric are reasonably well-characterized.

Alcohol. Pietilä et al. (JMIR Mental Health, 2018) used Oura data from 4,098 participants and found a dose-response relationship: even a single drink suppressed nightly HRV by roughly 7 percent compared to a sober night, and heavier drinking suppressed HRV by 24 percent or more.

Late food. Eating within three hours of sleep onset suppresses HRV. The Wehrens et al. (Current Biology, 2017) controlled feeding study showed that late meal timing shifts peripheral clocks and impairs nocturnal autonomic recovery.

Illness onset. A drop of 10 to 20 percent below baseline for two or three consecutive days frequently precedes symptomatic infection by 24 to 48 hours. This is the basis of the wearable-based illness detection literature (Mason et al., Lancet Digital Health, 2022, n=153 COVID-19 cases tracked through Oura).
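The illness pattern described above is a percent-based rule rather than an IQR-based one, and it can be sketched separately. This is a hedged illustration of the idea, not the detection method used in the cited study: the function name `sustained_drop`, the 10 percent default, and the two-day minimum are assumptions drawn from the ranges quoted in this paragraph, and the readings are invented.

```python
def sustained_drop(readings, baseline, min_drop_pct=10.0, min_days=2):
    """True if each of the last `min_days` readings is at least
    `min_drop_pct` percent below the personal baseline."""
    if len(readings) < min_days:
        return False
    limit = baseline * (1 - min_drop_pct / 100.0)
    return all(r <= limit for r in readings[-min_days:])


baseline = 50.0  # invented personal baseline RMSSD in ms

print(sustained_drop([51, 44, 43], baseline))  # two low nights in a row
print(sustained_drop([51, 44, 47], baseline))  # rebound on the last night
```

A flag like this is a prompt to pay attention, not a diagnosis. The same pattern can follow alcohol, heat, or a hard training block.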

Heat and dehydration. Sleeping in a warm room or going to bed dehydrated reliably suppresses RMSSD. The effect is mechanical: sustained thermoregulatory load keeps sympathetic tone elevated.

Hard training the day before. This is the variable consumer products are best at detecting. A genuinely hard session typically produces a 10 to 30 percent drop in nightly RMSSD the following night.

What does not move HRV in the way consumer marketing suggests: a single supplement, a single breathing exercise, or a single cold plunge produces effects in the noise band of an uncalibrated user. The studies that show acute breathing-induced HRV increases are measuring the response during the breathing, not next-day nocturnal RMSSD.

How wearables compare on accuracy

Validation work has improved substantially in the past three years. We cover the device-by-device comparison in detail in our companion article on WHOOP versus Oura versus Apple Watch HRV. The short summary: finger-based optical sensors with overnight recording (Oura) currently show the strongest ECG concordance for RMSSD, with reported correlations above 0.95 in recent validation work. Wrist-based optical sensors with overnight recording (WHOOP, Garmin) are good but slightly noisier. Spot-sample SDNN from the Apple Watch is reliable for what it measures, but it is a fundamentally different metric and a single point in time rather than a nightly average.

Key takeaways

Population HRV norms are clinically irrelevant for individual decision-making. Age, sex, and individual physiology produce a three-fold or larger spread in healthy people.
SDNN and RMSSD are different metrics. Do not compare across them. Pick one and a single measurement context, and stick with it.
A personal baseline requires 14 to 30 days of consistent recording. Daily variation of 15 to 25 percent is normal even at stable health.
A 30-day rolling median plus IQR gives you a meaningful deviation threshold. A reading inside the band is noise. Three consecutive readings below the median is a trend worth investigating.
Alcohol, late food, illness onset, heat, and hard training move HRV in predictable directions. Single supplements and one-off interventions usually do not.

Sources

1. Shaffer F, Ginsberg JP. An overview of heart rate variability metrics and norms. Frontiers in Public Health. 2017;5:258.
2. Stein PK, Pu Y. Heart rate variability, sleep and sleep disorders. Sleep Medicine Reviews. 2012;16(1):47-66.
3. Plews DJ, et al. Training adaptation and heart rate variability in elite endurance athletes: opening the door to effective monitoring. International Journal of Sports Physiology and Performance. 2013;8(6):618-624.
4. Bourdillon N, et al. Day-to-day variability of nocturnal cardiovascular parameters measured with Oura Ring. Frontiers in Physiology. 2022;13:828349.
5. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. Heart rate variability: standards of measurement, physiological interpretation, and clinical use. Circulation. 1996;93(5):1043-1065.
6. Dekker JM, et al. Low heart rate variability in a 2-minute rhythm strip predicts risk of coronary heart disease and mortality from several causes. Circulation. 2000;102(11):1239-1244.
7. Koenig J, Thayer JF. Sex differences in healthy human heart rate variability: a meta-analysis. Neuroscience and Biobehavioral Reviews. 2016;64:288-310.
8. Pietilä J, et al. Acute effect of alcohol intake on cardiovascular autonomic regulation during the first hours of sleep. JMIR Mental Health. 2018;5(1):e23.
9. Wehrens SMT, et al. Meal timing regulates the human circadian system. Current Biology. 2017;27(12):1768-1775.
10. Mason AE, et al. Detection of COVID-19 using ring-based continuous monitoring. Lancet Digital Health. 2022.
