From 7b99570151ba0cb2e19c93ba953497af7e34956a Mon Sep 17 00:00:00 2001
From: Bjarke Hautop
Date: Thu, 10 Jul 2025 17:07:47 +0200
Subject: [PATCH 01/12] Update intial survival analysis example and clarify censoring used in the book refers to right censoring

---
 .../chapter13/beyond_regression_and_classification.qmd | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index 58919c52c..0b05555c8 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -128,12 +128,13 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 ## Survival Analysis {#sec-survival}
 
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
-This predictive problem is unique as survival models are trained and tested on data that may include 'censoring', which occurs when the event of interest does *not* take place.
+This predictive problem is unique because survival models are trained and tested on data that may include censoring, which occurs when the exact event time is not observed.
+The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends. For the rest of this section when we write censoring we refer to right censoring unless otherwise stated.
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
-However, if the event of interest does not take place (e.g., the marathon runner gives up and does not finish the race), they are said to be censored.
+However, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be censored.
 Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the 'status' of an observation.
-So in our example, we might write the runner's outcome as $(4, 1)$ if they finish the race at four hours, otherwise, if they give up at two hours we would write $(2, 0)$.
+In our example, we might record a runner’s outcome as $(3, 1)$ if they finish the race at three hours and we observe it, and as $(4, 0)$ if they are still running at four hours when we stop observing.
 
 The key to modeling in survival analysis is that we assume there exists a hypothetical time the marathon runner would have finished if they had not been censored, it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are *not* censored (see @fig-censoring).
 Mathematically, this is represented by the hypothetical event time, $Y$, the hypothetical censoring time, $C$, the observed outcome time, $T = \min(Y, C)$, the event indicator $\Delta := (T = Y)$, and as usual some features, $X$.
From ad75a4717f091a1a7d3262c1b6b8b6c59acf0211 Mon Sep 17 00:00:00 2001
From: Lars Kotthoff
Date: Thu, 10 Jul 2025 12:33:05 -0600
Subject: [PATCH 02/12] Update beyond_regression_and_classification.qmd

---
 .../chapters/chapter13/beyond_regression_and_classification.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index 0b05555c8..e73252abd 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -134,7 +134,7 @@ Survival analysis can be hard to explain in the abstract, so as a working exampl
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
 However, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be censored.
 Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the 'status' of an observation.
-In our example, we might record a runner’s outcome as $(3, 1)$ if they finish the race at three hours and we observe it, and as $(4, 0)$ if they are still running at four hours when we stop observing.
+In our example, we might record a runner's outcome as $(3, 1)$ if they finish the race at three hours and we observe it, and as $(4, 0)$ if they are still running at four hours when we stop observing.
 
 The key to modeling in survival analysis is that we assume there exists a hypothetical time the marathon runner would have finished if they had not been censored, it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are *not* censored (see @fig-censoring).
 Mathematically, this is represented by the hypothetical event time, $Y$, the hypothetical censoring time, $C$, the observed outcome time, $T = \min(Y, C)$, the event indicator $\Delta := (T = Y)$, and as usual some features, $X$.
From 0a81f2e5756a83af3dd4d88d67c498f2607e3544 Mon Sep 17 00:00:00 2001
From: Lars Kotthoff
Date: Thu, 10 Jul 2025 12:34:46 -0600
Subject: [PATCH 03/12] Update beyond_regression_and_classification.qmd

---
 .../chapter13/beyond_regression_and_classification.qmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index e73252abd..a0e9ed7f1 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -129,7 +129,8 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
 This predictive problem is unique because survival models are trained and tested on data that may include censoring, which occurs when the exact event time is not observed.
-The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends. For the rest of this section when we write censoring we refer to right censoring unless otherwise stated.
+The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends.
+For the rest of this section, censoring means right censoring unless otherwise stated.
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
 However, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be censored.
From 34b21c7060fc7c4bcb026108f18c8d3d435d9253 Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 16:58:58 +0200
Subject: [PATCH 04/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 .../chapters/chapter13/beyond_regression_and_classification.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index a0e9ed7f1..567fb6c3d 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -128,7 +128,7 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 ## Survival Analysis {#sec-survival}
 
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
-This predictive problem is unique because survival models are trained and tested on data that may include censoring, which occurs when the exact event time is not observed.
+This predictive problem is unique because survival models are trained and tested on data that may include censoring, which occurs when the exact event time is not observed for some subjects.
 The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends.
 For the rest of this section, censoring means right censoring unless otherwise stated.
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
From 30cd077f8bcf26c2aab216224f214a070dbbbfc9 Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 16:59:08 +0200
Subject: [PATCH 05/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 .../chapters/chapter13/beyond_regression_and_classification.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index 567fb6c3d..943e63190 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -129,7 +129,7 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
 This predictive problem is unique because survival models are trained and tested on data that may include censoring, which occurs when the exact event time is not observed for some subjects.
-The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends.
+The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends — either due to a fixed study cutoff (*administrative censoring*) or because individuals are lost to follow-up (*random censoring*).
 For the rest of this section, censoring means right censoring unless otherwise stated.
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
From 2b171cc71536b46dceaa92b1a87897eca108fd04 Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 16:59:29 +0200
Subject: [PATCH 06/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 .../chapter13/beyond_regression_and_classification.qmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index 943e63190..4dedd9bd8 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -133,7 +133,8 @@ The most common type of censoring is 'right censoring', which happens when the e
 For the rest of this section, censoring means right censoring unless otherwise stated.
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
-However, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be censored.
+However, not all finish times may be observed.
+For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored. Alternatively, a runner might drop out of the race unexpectedly—for instance, if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed—resulting in *random* censoring.
 Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the 'status' of an observation.
 In our example, we might record a runner's outcome as $(3, 1)$ if they finish the race at three hours and we observe it, and as $(4, 0)$ if they are still running at four hours when we stop observing.
 
From 59fa6d42e7912fb7c3a04c72deb9a6dd24ac957d Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 16:59:38 +0200
Subject: [PATCH 07/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 .../chapters/chapter13/beyond_regression_and_classification.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index 4dedd9bd8..c06f6fe6c 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -135,7 +135,7 @@ Survival analysis can be hard to explain in the abstract, so as a working exampl
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
 However, not all finish times may be observed.
 For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored. Alternatively, a runner might drop out of the race unexpectedly—for instance, if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed—resulting in *random* censoring.
-Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the 'status' of an observation.
+Instead of discarding such incomplete observations, survival analysis incorporates a status variable to reflect whether the event was observed.
 In our example, we might record a runner's outcome as $(3, 1)$ if they finish the race at three hours and we observe it, and as $(4, 0)$ if they are still running at four hours when we stop observing.
 
 The key to modeling in survival analysis is that we assume there exists a hypothetical time the marathon runner would have finished if they had not been censored, it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are *not* censored (see @fig-censoring).
From fe41c66d082531d348a3a114b2cbf6c6bc23f2c6 Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 16:59:49 +0200
Subject: [PATCH 08/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 .../chapters/chapter13/beyond_regression_and_classification.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index c06f6fe6c..049f38c70 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -136,7 +136,7 @@ Here the 'survival problem' is trying to predict the time when the marathon runn
 However, not all finish times may be observed.
 For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored. Alternatively, a runner might drop out of the race unexpectedly—for instance, if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed—resulting in *random* censoring.
 Instead of discarding such incomplete observations, survival analysis incorporates a status variable to reflect whether the event was observed.
-In our example, we might record a runner's outcome as $(3, 1)$ if they finish the race at three hours and we observe it, and as $(4, 0)$ if they are still running at four hours when we stop observing.
+In our example, we might record a runner’s outcome as $(3, 1)$ if they finish the race in three hours and we observe it, as $(4, 0)$ if they are still running at four hours when observation ends (administrative censoring), or as $(2.5, 0)$ if their tracking device fails and we lose contact at 2.5 hours (random censoring).
 
 The key to modeling in survival analysis is that we assume there exists a hypothetical time the marathon runner would have finished if they had not been censored, it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are *not* censored (see @fig-censoring).
 Mathematically, this is represented by the hypothetical event time, $Y$, the hypothetical censoring time, $C$, the observed outcome time, $T = \min(Y, C)$, the event indicator $\Delta := (T = Y)$, and as usual some features, $X$.
From 5b8ff98a07fb38793e8c022847b6e72ee302a7d6 Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 17:03:52 +0200
Subject: [PATCH 09/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 book/chapters/chapter13/beyond_regression_and_classification.qmd | 1 -
 1 file changed, 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index 049f38c70..c4c82d105 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -130,7 +130,6 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
 This predictive problem is unique because survival models are trained and tested on data that may include censoring, which occurs when the exact event time is not observed for some subjects.
 The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends — either due to a fixed study cutoff (*administrative censoring*) or because individuals are lost to follow-up (*random censoring*).
-For the rest of this section, censoring means right censoring unless otherwise stated.
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
 However, not all finish times may be observed.
From 7a9dcffbc5f687c77af5c04c589ee3dc13a15f7b Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 17:04:12 +0200
Subject: [PATCH 10/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 .../chapters/chapter13/beyond_regression_and_classification.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index c4c82d105..44d9b38aa 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -128,7 +128,7 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 ## Survival Analysis {#sec-survival}
 
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
-This predictive problem is unique because survival models are trained and tested on data that may include censoring, which occurs when the exact event time is not observed for some subjects.
+This predictive problem is unique because survival models are trained and tested on data that may include 'censoring', which occurs when the exact event time is *not* observed for some subjects.
 The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends — either due to a fixed study cutoff (*administrative censoring*) or because individuals are lost to follow-up (*random censoring*).
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
From 048bea732f4adcc4e8bed5a7c52af028b0f1ab22 Mon Sep 17 00:00:00 2001
From: John Zobolas
Date: Tue, 22 Jul 2025 17:04:26 +0200
Subject: [PATCH 11/12] Update book/chapters/chapter13/beyond_regression_and_classification.qmd

---
 .../chapter13/beyond_regression_and_classification.qmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index 44d9b38aa..af99019eb 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -133,7 +133,8 @@ The most common type of censoring is 'right censoring', which happens when the e
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
 However, not all finish times may be observed.
-For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored. Alternatively, a runner might drop out of the race unexpectedly—for instance, if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed—resulting in *random* censoring.
+For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored.
+Alternatively, a runner might drop out of the race unexpectedly—for instance, if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed—resulting in *random* censoring.
 Instead of discarding such incomplete observations, survival analysis incorporates a status variable to reflect whether the event was observed.
 In our example, we might record a runner’s outcome as $(3, 1)$ if they finish the race in three hours and we observe it, as $(4, 0)$ if they are still running at four hours when observation ends (administrative censoring), or as $(2.5, 0)$ if their tracking device fails and we lose contact at 2.5 hours (random censoring).
 
From 0e4e00151fc74cd137e54b436d7f393716e0a951 Mon Sep 17 00:00:00 2001
From: john
Date: Tue, 22 Jul 2025 17:13:04 +0200
Subject: [PATCH 12/12] refine one sentence more

---
 .../chapters/chapter13/beyond_regression_and_classification.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index af99019eb..5e476b9d9 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -142,7 +142,7 @@ The key to modeling in survival analysis is that we assume there exists a hypoth
 Mathematically, this is represented by the hypothetical event time, $Y$, the hypothetical censoring time, $C$, the observed outcome time, $T = \min(Y, C)$, the event indicator $\Delta := (T = Y)$, and as usual some features, $X$.
 Learners are trained on $(T, \Delta)$ but, critically, make predictions of $Y$ from previously unseen features.
 This means that unlike classification and regression, learners are trained on two variables, $(T, \Delta)$, which, in R, is often captured in a `r ref("survival::Surv")` object.
-Relating to our example above, the runner's outcome would then be $(T = 4, \Delta = 1)$ or $(T = 2, \Delta = 0)$.
+Relating to our example above, the runner's outcome would then be represented as $(T = 3, \Delta = 1)$ if they finish in three hours, or as $(T = 4, \Delta = 0)$ if they are still running when the race clock ends, or as $(T = 2.5, \Delta = 0)$ if we lose contact with them partway through.
 Another example is in the code below, where we randomly generate six survival times and six event indicators, an outcome with a `+` indicates the outcome is censored, otherwise, the event of interest occurred.
 
 ```{r beyond_regression_and_classification-006}