fix: Add fine grained interruptible task restarts in watch mode #11135
Conversation
@johnpyp is attempting to deploy a commit to the Vercel Team on Vercel. A member of the Team first needs to authorize it.
Force-pushed from 0475181 to c091360
Force-pushed from c091360 to de049c1
    .filter(|c| predicate(c))
    .cloned()
    .collect();

lock.children.retain(|c| !predicate(c));
Between the evaluation in the .filter() and the later .retain(), a child's state might change. We should do something more like:
let (matching, remaining): (Vec<_>, Vec<_>) = lock.children
    .drain(..)
    .partition(|c| predicate(c));
lock.children = remaining;
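For illustration, here's a minimal, self-contained sketch of that idea. The ChildState enum and take_matching helper are invented for the example and are not the PR's actual types. Because partition() consumes each element exactly once in a single pass under the lock, the removed set and the kept set can never disagree the way a separate .filter() and .retain() pass could.

use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
enum ChildState {
    Running,
    Exited,
}

// Take all children matching the predicate out of the shared list in one pass.
fn take_matching(
    children: &Mutex<Vec<ChildState>>,
    predicate: impl Fn(&ChildState) -> bool,
) -> Vec<ChildState> {
    let mut lock = children.lock().unwrap();
    // Each child is evaluated exactly once; with no second pass there is no
    // window for its state to change between "select" and "remove".
    let (matching, remaining): (Vec<_>, Vec<_>) =
        lock.drain(..).partition(|c| predicate(c));
    *lock = remaining;
    matching
}

fn main() {
    let children = Mutex::new(vec![ChildState::Running, ChildState::Exited]);
    let exited = take_matching(&children, |c| *c == ChildState::Exited);
    assert_eq!(exited, vec![ChildState::Exited]);
    assert_eq!(*children.lock().unwrap(), vec![ChildState::Running]);
}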
// process manager is closing. If it is closing, it means we are shutting down
// the entire run. If it is not closing, it means we are restarting specific
// tasks.
if self.manager.is_closing() {
This is potentially dangerous due to race conditions and dependent-task behavior. What tests can we write to validate that we're not creating bugs here?
Example race:

Thread 1 (exec.rs)              Thread 2 (manager)
------------------              ------------------
child exits
is_closing() → false
                                close() called
                                is_closing = true
return Restarted ❌

That should've been Shutdown.
If we introduce some atomics, this concern would get addressed.
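For what it's worth, here is a rough sketch of that direction (names invented for the example; this is not the PR's actual ProcessManager). The atomic alone isn't enough; what closes the window is the ordering guarantee that close() publishes is_closing before it stops any child, while the exec side only reads the flag after it has observed the child exit.

use std::sync::atomic::{AtomicBool, Ordering};

// Sketch only: a stand-in for the real ProcessManager.
struct ProcessManager {
    is_closing: AtomicBool,
}

impl ProcessManager {
    fn close(&self) {
        // 1. Publish the shutdown intent first...
        self.is_closing.store(true, Ordering::SeqCst);
        // 2. ...and only then stop children (kill/wait elided in this sketch).
    }

    fn is_closing(&self) -> bool {
        self.is_closing.load(Ordering::SeqCst)
    }
}

fn main() {
    let manager = ProcessManager { is_closing: AtomicBool::new(false) };
    manager.close();
    // Any child exit caused by close() is observed after the store above,
    // so the exec side can no longer see is_closing() == false for it.
    assert!(manager.is_closing());
}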
pub struct ChildCommandChannel(mpsc::Sender<ChildCommand>);

pub struct ChildCommandChannel {
    sender: mpsc::Sender<ChildCommand>,
    task_id: Option<TaskId<'static>>,
This is a little rough, because now the child process has to be aware of task_ids. Ideally, these channels shouldn't be aware of "business logic", to preserve encapsulation.
I'm also worried about the lifetime implications. The TaskId is going to live for the entire program, but line 433 has a .clone() that can end up referencing tasks that are non-'static. That could result in panics.
Another small, small worry I have, but a worry nonetheless, is the memory overhead associated with this strategy. TaskIds aren't dropped until the Child is dropped, so it's possible to see some memory growth in high-concurrency scenarios. (I haven't taken the time to validate this, but based on my other comments, I'd love to see an architectural change where this is no longer a concern.)
Hmm, fair enough. Is there a better identifier, or a way of getting the "set of processes mapped to tasks", that would allow the process code to not care about tasks at all?
That was the primary thing I wasn't confident about going into this change, since I'm not too familiar with the codebase.
After sleeping on this, I'm thinking the best way (still not great) we can go about this is for ProcessManager to keep indices into its Vec<Child>. Some pseudocode:
#[derive(Debug)]
struct ProcessManagerInner {
    is_closing: bool,
    // None = inactive/dead process slot
    children: Vec<Option<Child>>,
    // Maps task_id -> indices in the children vec.
    // We keep dead indices here and filter them on access.
    task_index: HashMap<TaskId<'static>, Vec<usize>>,
    size: Option<PtySize>,
}

impl ProcessManagerInner {
    fn is_index_active(&self, idx: usize) -> bool {
        self.children.get(idx)
            .and_then(|opt| opt.as_ref())
            .is_some()
    }

    fn get_active_children_for_task(&self, task_id: &TaskId) -> Vec<Child> {
        self.task_index
            .get(task_id)
            .map(|indices| {
                indices.iter()
                    .filter_map(|&idx| {
                        self.children.get(idx)
                            .and_then(|opt| opt.as_ref())
                            .cloned()
                    })
                    .collect()
            })
            .unwrap_or_default()
    }
}
Now when a process exits, we don't need to update task_index immediately; dead entries get filtered out on access.
// Mark index as dead - just set to None
lock.children[3] = None;
Stopping a task now looks like:
pub async fn stop_tasks(&self, task_ids: &[TaskId<'static>]) {
    let children_to_stop = {
        let mut lock = self.state.lock().unwrap();
        let mut children = Vec::new();
        for task_id in task_ids {
            // Clone the index list so we can mutate lock.children below
            // without still holding a borrow of lock.task_index.
            if let Some(indices) = lock.task_index.get(task_id).cloned() {
                for idx in indices {
                    // Check if still active
                    if let Some(Some(child)) = lock.children.get(idx) {
                        children.push(child.clone());
                        // Mark as inactive
                        lock.children[idx] = None;
                    }
                }
            }
            // Remove task from index
            lock.task_index.remove(task_id);
        }
        children
    };
    // Stop outside the lock
    for mut child in children_to_stop {
        child.stop().await;
    }
}
What do you think? Is that enough to have another go with, or do you need more detail?
anthonyshew left a comment
Hi, thank you for this PR. I've put some comments into the review, but, as a top-level review comment, I'd love to see an approach that is minimally invasive to the codebase outside of watch.rs. In truth, turbo watch deserves a large refactor, so I'd love to keep the blast radius of fix PRs on the existing code small.
I completely understand that what I'm asking may border on impossible. This code is quite tricky and has no test coverage, but I'd love to keep exploring what's possible here.
Description
Fixes #9421
This PR refactors watch mode task restarts to be "fine-grained", such that when a package changes, only the active tasks that depend on that package's output are killed and restarted.
(Similar to #10846, but taking a slightly different approach which I think is cleaner).
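As a rough, hypothetical illustration of what "fine-grained" means here (the task and package names and the tasks_to_restart helper are invented for the example and are not the PR's actual types), the idea is to select only the tasks whose dependency set contains the changed package:

use std::collections::{HashMap, HashSet};

// Given each task's package dependencies, pick only the tasks affected by a change.
fn tasks_to_restart<'a>(
    deps: &HashMap<&'a str, HashSet<&'a str>>,
    changed_package: &str,
) -> Vec<&'a str> {
    deps.iter()
        .filter(|(_task, packages)| packages.contains(changed_package))
        .map(|(task, _packages)| *task)
        .collect()
}

fn main() {
    let mut deps = HashMap::new();
    deps.insert("web#dev", HashSet::from(["web", "ui"]));
    deps.insert("docs#dev", HashSet::from(["docs"]));

    // Only the task that depends on the changed "ui" package is restarted;
    // the unrelated "docs#dev" task keeps running.
    assert_eq!(tasks_to_restart(&deps, "ui"), vec!["web#dev"]);
}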
Testing Instructions
Manual testing confirms the issue is fixed with this change. There aren't any existing automated tests for watch and the like, I assume because it's a long-running process.