13 changes: 8 additions & 5 deletions book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -129,18 +129,21 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
## Survival Analysis {#sec-survival}

`r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
- This predictive problem is unique as survival models are trained and tested on data that may include 'censoring', which occurs when the event of interest does *not* take place.
+ This predictive problem is unique because survival models are trained and tested on data that may include 'censoring', which occurs when the exact event time is *not* observed for some subjects.
+ The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends, either due to a fixed study cutoff (*administrative censoring*) or because individuals are lost to follow-up (*random censoring*).
Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
- However, if the event of interest does not take place (e.g., the marathon runner gives up and does not finish the race), they are said to be censored.
- Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the 'status' of an observation.
- So in our example, we might write the runner's outcome as $(4, 1)$ if they finish the race at four hours, otherwise, if they give up at two hours we would write $(2, 0)$.
+ However, not all finish times may be observed.
+ For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored.
+ Alternatively, a runner might drop out of the race unexpectedly (for instance, if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed), resulting in *random* censoring.
+ Instead of discarding such incomplete observations, survival analysis incorporates a status variable to reflect whether the event was observed.
+ In our example, we might record a runner’s outcome as $(3, 1)$ if they finish the race in three hours and we observe it, as $(4, 0)$ if they are still running at four hours when observation ends (administrative censoring), or as $(2.5, 0)$ if their tracking device fails and we lose contact at 2.5 hours (random censoring).

The key to modeling in survival analysis is that we assume there exists a hypothetical time at which the marathon runner would have finished if they had not been censored. It is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are *not* censored (see @fig-censoring).
Mathematically, this is represented by the hypothetical event time, $Y$, the hypothetical censoring time, $C$, the observed outcome time, $T = \min(Y, C)$, the event indicator $\Delta := (T = Y)$, and as usual some features, $X$.
Learners are trained on $(T, \Delta)$ but, critically, make predictions of $Y$ from previously unseen features.
This means that unlike classification and regression, learners are trained on two variables, $(T, \Delta)$, which, in R, is often captured in a `r ref("survival::Surv")` object.
- Relating to our example above, the runner's outcome would then be $(T = 4, \Delta = 1)$ or $(T = 2, \Delta = 0)$.
+ Relating to our example above, the runner's outcome would then be represented as $(T = 3, \Delta = 1)$ if they finish in three hours, or as $(T = 4, \Delta = 0)$ if they are still running when the race clock ends, or as $(T = 2.5, \Delta = 0)$ if we lose contact with them partway through.
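
As a minimal sketch of these definitions, the three runners can be encoded by hand. The hypothetical event times `Y` and censoring times `C` below are made-up values chosen to reproduce the outcomes above; in real data `Y` is not observed for censored runners. The name `T_obs` stands in for $T$ only because `T` is shorthand for `TRUE` in R.

```r
library(survival)

# Assumed values for illustration only: hypothetical finish times Y and
# censoring times C (in hours) for the three runners described above.
Y <- c(3, 5, 4.2)   # when each runner would finish if fully observed
C <- c(4, 4, 2.5)   # when observation of each runner stops

# Observed outcome time: T = min(Y, C)
T_obs <- pmin(Y, C)

# Event indicator: Delta = 1 if the finish was observed (T = Y), else 0
delta <- as.integer(T_obs == Y)

# Combine both into a single survival outcome object
outcome <- Surv(time = T_obs, event = delta)
outcome
```

Only `T_obs` and `delta` would be available to a learner; `Y` and `C` appear here purely to show how the observed pair arises.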
Another example is in the code below, where we randomly generate six survival times and six event indicators; an outcome marked with a `+` is censored, otherwise the event of interest occurred.

```{r beyond_regression_and_classification-006}