| Title: | Classify Missing Data as MCAR, MAR, or MNAR |
|---|---|
| Description: | Classify missing data as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). This step is required before handling missing data (e.g. mean imputation) so that bias is not introduced. See Little (1988) <doi:10.1080/01621459.1988.10478722> for the statistical rationale for the methods used. |
| Authors: | Noah William Trelawny Hellen [aut, cre, cph] |
| Maintainer: | Noah William Trelawny Hellen <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.1.9000 |
| Built: | 2026-06-01 09:22:40 UTC |
| Source: | https://github.com/noahhellen/missr |
A toy dataset with heart rate data for various animals.
animalhealth
A 200 x 2 data frame:
The animal of interest
The corresponding heart rate of the animal (bpm)
A toy dataset with typical company metrics across various firms.
companydata
A 500 x 5 data frame:
Sales in the last fiscal year (USD, million)
Marketing spend in last fiscal year (USD, million)
Average rating across all products
Total employee count in last fiscal year
Gross profit in last fiscal year (USD, million)
A toy dataset with typical health check-up metrics for various individuals.
healthcheck
A 200 x 5 data frame:
Bone mass of individual (kg)
Body fat percentage of individual
Height of individual (cm)
Age of individual
Red blood cell count of individual (million/mm^3)
mar() performs multiple logistic regressions to test for MAR.
The null hypothesis for each is that the data are not MAR.
mar(data, debug = FALSE)
data |
A data frame. |
debug |
A logical value used only for unit testing. |
In the following, each column of M with missing data is regressed on
D_obs. Each regression produces a vector of p-values (one for each
variable in D_obs). The smallest p-value is the most important. This
is because missing data need only be dependent on one observed variable
for the data to be MAR. If each reported smallest p-value is significant,
the data is MAR. See vignette("background") for definitions of M and
D_obs.
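The regression step described above can be sketched in base R as follows. This is an illustration only, not the package's implementation: the helper name `smallest_p_values()` is hypothetical, D_obs is assumed to be the set of columns with no missing values, all predictors are assumed numeric, and the p-values are the Wald p-values reported by `glm()`.

```r
# Sketch only: not the code used by mar().
# Assumption: D_obs is taken to be the columns that have no missing values.
smallest_p_values <- function(data) {
  has_na <- vapply(data, anyNA, logical(1))
  observed <- names(data)[!has_na]

  rows <- lapply(names(data)[has_na], function(col) {
    # Missingness indicator for this column: 1 = missing, 0 = observed.
    frame <- data[observed]
    frame$.is_missing <- as.integer(is.na(data[[col]]))

    # Logistic regression of the indicator on every observed column.
    fit <- glm(.is_missing ~ ., data = frame, family = binomial())

    # Coefficient table without the intercept row.
    tab <- coef(summary(fit))[-1, , drop = FALSE]
    best <- which.min(tab[, "Pr(>|z|)"])

    data.frame(
      missing = col,
      p_value = unname(tab[best, "Pr(>|z|)"]),
      explanatory = rownames(tab)[best]
    )
  })
  do.call(rbind, rows)
}

# Simulated data in which missingness in y depends on x but not on z.
set.seed(1)
toy <- data.frame(x = rnorm(200), z = rnorm(200), y = rnorm(200))
toy$y[runif(200) < plogis(2 * toy$x)] <- NA

result <- smallest_p_values(toy)
print(result)
```

Because missingness in `y` was generated from `x`, the smallest p-value should belong to `x`, which is the pattern the text describes as evidence for MAR.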
missing |
Column of M with missing data |
p_value |
Smallest p-value of the logistic regressions |
explanatory |
Variable corresponding to p_value |
p_values |
The p-values of the logistic regressions |
variables |
Variables corresponding to p_values |
combined |
Paired p_values and variables |
mar(healthcheck)
mcar() performs Little's MCAR test to test for MCAR.
The null hypothesis is that the data is MCAR.
mcar(data, debug = FALSE)
data |
A data frame. |
debug |
A logical value used only for unit testing. |
This function reproduces the d^2 statistic in equation (5) from [1].
This statistic is used to test for MCAR. Comments reference variables
from vignette("background") (in brackets) to improve readability and
traceability.
statistic |
The d^2 statistic |
degrees_freedom |
Degrees of freedom of chi-squared distribution |
p_val |
P-value of the test |
missing_patterns |
Number of missing patterns |
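Two of the returned quantities can be illustrated without computing d^2 itself. The sketch below counts the distinct missingness patterns and derives the chi-squared degrees of freedom used in Little's test (the number of observed variables summed over patterns, minus the total number of variables). The helper names `pattern_summary()` and `little_p_value()` are hypothetical, and the list element names simply mirror the output names above; this is not the package's code.

```r
# Sketch only: summarises missingness patterns; it does not compute d^2.
pattern_summary <- function(data) {
  observed <- !is.na(data)        # logical matrix, TRUE where a value is present
  patterns <- unique(observed)    # one row per distinct missingness pattern

  list(
    missing_patterns = nrow(patterns),
    # Observed variables summed over patterns, minus the number of variables.
    degrees_freedom = sum(patterns) - ncol(data)
  )
}

# Upper-tail chi-squared p-value for a given statistic and degrees of freedom.
little_p_value <- function(statistic, degrees_freedom) {
  pchisq(statistic, df = degrees_freedom, lower.tail = FALSE)
}

# Rows 1 and 4 are complete, row 2 lacks a, row 3 lacks b: three patterns.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, NA, 4),
  c = c(1, 2, 3, 4)
)
summary_toy <- pattern_summary(toy)
print(summary_toy)
```

For the toy data the patterns observe 3, 2, and 2 variables, so the degrees of freedom are 7 - 3 = 4.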
Code is adapted from mcar_test() from the naniar package
using base R instead of the tidyverse.
[1] Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association. 1988;83(404):1198-202.
mcar(pollutionlevels)
mnar() presents the statistics from mar() and mcar(). If at least one
p-value in mar() is not significant and the p-value in mcar() is
significant, the data is MNAR.
mnar(data)
data |
A data frame |
There exists no formal test for MNAR data. This function therefore
presents the statistics for the tests in mar() and mcar(). If the
results suggest the data is neither MAR nor MCAR, one can use process of
elimination to deduce that the data is MNAR.
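The process of elimination can be written as a small decision rule. The sketch below is an assumption-laden illustration, not part of the package: `classify_missingness()` is a hypothetical helper, the 0.05 significance level is an assumed default, and checking MCAR before MAR is an ordering the documentation does not specify. It takes the p-value from Little's test and the vector of smallest p-values from the MAR regressions as plain numbers.

```r
# Sketch only: hypothetical helper, not exported by the package.
# Assumptions: alpha = 0.05, and MCAR is checked before MAR.
classify_missingness <- function(mcar_p, mar_p, alpha = 0.05) {
  # Little's test not rejected: the results are consistent with MCAR.
  if (mcar_p >= alpha) {
    return("MCAR")
  }
  # Every smallest p-value significant: the results suggest MAR.
  if (all(mar_p < alpha)) {
    return("MAR")
  }
  # Neither MCAR nor MAR is supported, so MNAR remains by elimination.
  "MNAR"
}

classify_missingness(0.40, c(0.01, 0.02))   # "MCAR"
classify_missingness(0.001, c(0.01, 0.02))  # "MAR"
classify_missingness(0.001, c(0.01, 0.30))  # "MNAR"
```

The returned label is a suggestion drawn from two tests rather than a formal conclusion, since no direct test for MNAR exists.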
A list:
mcar |
Results of Little's MCAR test |
mar |
Results of MAR test |
mnar(companydata)
A toy dataset with typical pollution level metrics for various settlements.
pollutionlevels
A 200 x 4 data frame:
Light pollution of settlement (mag/arcsec^2)
Visual pollution of settlement (VPI)
Noise pollution of settlement (dB)
Air pollution of settlement (AQI)
A toy dataset with test scores of various students.
testscores
A 200 x 2 data frame:
The ID of the student
The student's score in the test