mlr-org · BjarkeHautop · Jul 10, 2025 · Jul 10, 2025 · Jul 10, 2025 · Jul 22, 2025
diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -129,18 +129,21 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 ## Survival Analysis {#sec-survival}
 
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
-This predictive problem is unique as survival models are trained and tested on data that may include 'censoring', which occurs when the event of interest does *not* take place.
+This predictive problem is unique because survival models are trained and tested on data that may include 'censoring', which occurs when the exact event time is *not* observed for some subjects.
+The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends, either because of a fixed study cutoff (*administrative censoring*) or because individuals are lost to follow-up (*random censoring*).
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
-However, if the event of interest does not take place (e.g., the marathon runner gives up and does not finish the race), they are said to be censored.
-Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the 'status' of an observation.
-So in our example, we might write the runner's outcome as $(4, 1)$ if they finish the race at four hours, otherwise, if they give up at two hours we would write $(2, 0)$.
+However, not all finish times may be observed.
+For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored.
+Alternatively, we might lose track of a runner partway through the race, for instance if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed, resulting in *random* censoring.
+Instead of discarding such incomplete observations, survival analysis incorporates a status variable to reflect whether the event was observed.
+In our example, we might record a runner's outcome as $(3, 1)$ if we observe them finishing the race in three hours, as $(4, 0)$ if they are still running at four hours when observation ends (administrative censoring), or as $(2.5, 0)$ if their tracking device fails and we lose contact at 2.5 hours (random censoring).
 
 The key to modeling in survival analysis is that we assume there exists a hypothetical time the marathon runner would have finished if they had not been censored, it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are *not* censored (see @fig-censoring).
 Mathematically, this is represented by the hypothetical event time, $Y$, the hypothetical censoring time, $C$, the observed outcome time, $T = \min(Y, C)$, the event indicator $\Delta := (T = Y)$, and as usual some features, $X$.
 Learners are trained on $(T, \Delta)$ but, critically, make predictions of $Y$ from previously unseen features.
 This means that unlike classification and regression, learners are trained on two variables, $(T, \Delta)$, which, in R, is often captured in a `r ref("survival::Surv")` object.
-Relating to our example above, the runner's outcome would then be $(T = 4, \Delta = 1)$ or $(T = 2, \Delta = 0)$.
+Relating to our example above, the runner's outcome would then be represented as $(T = 3, \Delta = 1)$ if they finish in three hours, as $(T = 4, \Delta = 0)$ if they are still running when observation ends at four hours, or as $(T = 2.5, \Delta = 0)$ if we lose contact with them at 2.5 hours.
 Another example is in the code below, where we randomly generate six survival times and six event indicators, an outcome with a `+` indicates the outcome is censored, otherwise, the event of interest occurred.
 
 ```{r beyond_regression_and_classification-006}