这是indexloc提供的服务,不要输入任何密码
Skip to content

Fix race between k8s and provider when deleting pod #908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 16, 2021

Conversation

cwdsuzhou
Copy link
Contributor

@cwdsuzhou cwdsuzhou commented Nov 12, 2020

Fixes #867
Fixes #907

Copy link
Contributor

@sargun sargun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the wrong approach because it still allows for a race The deletePodHandler can read from the knownPods map, and get the pod UID / (and resource version?) from there.

In that, it can populate DeleteField's Precondition:

https://godoc.org/k8s.io/apimachinery/pkg/apis/meta/v1#DeleteOptions

In that, you can specify a UID precondition:

https://godoc.org/k8s.io/apimachinery/pkg/apis/meta/v1#Preconditions

Thanks for addressing this work!

@sargun
Copy link
Contributor

sargun commented Nov 12, 2020

P.P.S.

There is a slight problem introduced, in that in AddFunc we get a duplicate pod in known pods potentially because a pod with the same key has been added. Perhaps we introduce another syncMap of deleting pods that things are transitioned to? Even at that, I can imagine this being confusing.

@cwdsuzhou
Copy link
Contributor Author

cwdsuzhou commented Nov 12, 2020

I believe this is the wrong approach because it still allows for a race The deletePodHandler can read from the knownPods map, and get the pod UID / (and resource version?) from there.

In that, it can populate DeleteField's Precondition:

https://godoc.org/k8s.io/apimachinery/pkg/apis/meta/v1#DeleteOptions

In that, you can specify a UID precondition:

https://godoc.org/k8s.io/apimachinery/pkg/apis/meta/v1#Preconditions

Thanks for addressing this work!

Thanks for your suggestions.

Done ! PTAL

@cwdsuzhou cwdsuzhou force-pushed the race_delete branch 2 times, most recently from 1b7cf97 to 19de33e Compare November 12, 2020 10:38
@cwdsuzhou cwdsuzhou requested a review from sargun November 12, 2020 10:39
@sargun
Copy link
Contributor

sargun commented Nov 18, 2020

@cwdsuzhou I made a comment on how I think this can be simpler.

@cwdsuzhou
Copy link
Contributor Author

@cpuguy83 PTAL thanks

Copy link
Member

@pires pires left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, implement at least one e2e test that proves this fix.

@cwdsuzhou
Copy link
Contributor Author

cwdsuzhou commented Jan 18, 2021

Please, implement at least one e2e test that proves this fix.

Thanks! IMO this may be hard to add e2e tests since there are chances that the pod controller works correctly.

Will have a try

@cwdsuzhou
Copy link
Contributor Author

Please, implement at least one e2e test that proves this fix.

Thanks! IMO this may be hard to add e2e tests since there are chances that the pod controller works correctly.

Will have a try

Add some unit tests

@cwdsuzhou cwdsuzhou requested a review from pires January 18, 2021 13:13
@cwdsuzhou cwdsuzhou force-pushed the race_delete branch 2 times, most recently from 0da1d55 to 7549cb8 Compare January 18, 2021 14:16
Copy link
Contributor

@sargun sargun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the mechanism of adding metadata to the queue item to dedupe it, this looks okay to me apart from some minor nits. I yield to @pires or @cpuguy83 since they originally built the WQ based mechanism

@cwdsuzhou
Copy link
Contributor Author

@sargun PR addressed

@sargun
Copy link
Contributor

sargun commented Jan 29, 2021

@cpuguy83 ^

@pires
Copy link
Member

pires commented Jan 29, 2021

I plan to review this over the weekend.

@pires pires added the bug label Jan 29, 2021
@pires
Copy link
Member

pires commented Feb 1, 2021

I didn't get the chance to review, I'm sorry. Will try again during the week or weekend.

@cwdsuzhou
Copy link
Contributor Author

I didn't get the chance to review, I'm sorry. Will try again during the week or weekend.

Hi @pires could you help review this PR this week? thanks!

Copy link
Member

@pires pires left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @sargun initial review and I ask for the removal of code that injects/reads Pod UID into queue keys and instead simply pass the appropriate DeleteOptions with a Precondition that ensures the Pod UID is set.

UPDATE: this review is flawed because there's no way to actually get the Pod UID.

Copy link
Member

@pires pires left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like most of this change but it does seems to fix this particular issue, and, right now, I can't think of any solution that doesn't require some deeper thinking into how we handle Pod events. And I do think that it will take some time to do that, including coming up with a design that covers as many scenarios as we can come up with, before any execution is taken on.

@pires
Copy link
Member

pires commented Feb 8, 2021

@sargun @cpuguy83 is there something you want to add?

@cwdsuzhou
Copy link
Contributor Author

I don't like most of this change but it does seems to fix this particular issue, and, right now, I can't think of any solution that doesn't require some deeper thinking into how we handle Pod events. And I do think that it will take some time to do that, including coming up with a design that covers as many scenarios as we can come up with, before any execution is taken on.

Thanks, this is not the best solution, but this can solve the issue without changing the vk designation too much.

@cwdsuzhou
Copy link
Contributor Author

@sargun @pires @cpuguy83
do you have any other suggestions ? or could you help approve this PR?

Thanks!

@pires
Copy link
Member

pires commented Feb 16, 2021

@sargun implicitly LGTM'ed and I have done it explicitly so I'm merging.

@pires pires merged commit d11968a into virtual-kubelet:master Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race issue between k8s and provider when deleting pod Pod recreated during pod.DeletionGracePeriodSeconds will be forcibly deleted
3 participants