[SPARK-32097] Enable Spark History Server to read from multiple directories #29630
Conversation
Change-Id: Ie3e2a6cc08b4c0e8770417c3b66f2f747bebefda
Can one of the admins verify this patch?
I'm not sure your PR really deals with reading from multiple directories. The change is listing -> glob with `*`. Could you please elaborate on the difference? The change also doesn't have any new unit tests verifying it.

As a general comment on the idea: having multiple root directories is still possible, but it's probably better as a static list (IMHO) rather than a regex, as listing with a glob pattern is known to be very slow. One thing I'm afraid of with multiple root directories is that SHS is already very complicated from a thread-safety point of view even with a single root directory, and this may make things more complicated. I'm on the fence on doing this until we are clear that it won't make SHS more complicated.
Thanks for your response! By multiple directories I meant that a regex could potentially match more than one directory. In the case of an external file system, a glob pattern might be better since we only have to make one over-the-network call. It is also easier for the user to specify a single setting instead of multiple values. What do you think? I will add the unit tests; thanks for pointing that out. The SHS will function only as a read-only server. Can thread-safety be an issue in that case?
SHS isn't read-only; there's also a cleanup phase, and the load and cleanup phases run in parallel. Please take a careful look at the SHS code in the master branch.
```diff
-val updated = Option(fs.listStatus(new Path(logDir))).map(_.toSeq).getOrElse(Nil)
+val updated = Option(fs.globStatus(new Path(logDir + "/*"))).map(_.toSeq).getOrElse(Nil)
   .filter { entry => isAccessible(entry.getPath) }
   .filter { entry => !isProcessing(entry.getPath) }
```
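The semantic difference between the two calls can be sketched outside of Hadoop. The snippet below is a minimal illustration using Python's stdlib `glob` (the directory layout and names are hypothetical, chosen to mirror the PR's use case); Hadoop's `FileSystem.globStatus` follows similar wildcard semantics.

```python
import glob
import os
import tempfile

# Hypothetical layout mirroring the PR's use case: one event-log
# directory per short-lived cluster under a single root.
root = tempfile.mkdtemp()
for cluster in ("cluster-a", "cluster-b"):
    log_dir = os.path.join(root, cluster, "eventlogs")
    os.makedirs(log_dir)
    open(os.path.join(log_dir, "app-0001"), "w").close()

# A plain listing (analogue of fs.listStatus) only sees the immediate
# children of a single directory.
listed = sorted(os.listdir(root))

# A glob (analogue of fs.globStatus with logDir + "/*") can fan out
# across every directory matched by a wildcard in a single call.
matched = glob.glob(os.path.join(root, "cluster-*", "eventlogs", "*"))

print(listed)        # ['cluster-a', 'cluster-b']
print(len(matched))  # 2
```

This is why a single glob-valued `logDir` can pick up logs from several per-cluster directories, whereas a plain listing cannot.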
This is playing the role of a "lock" for the log file. You can see where the lock flag is accessed and/or modified.
```diff
-testRetry("provider reports error after FS leaves safe mode")
```
Please don't remove the existing test unless you have a strong reason to do so. If your change breaks an existing test, you need to explain why discarding it is necessary.
```diff
 logDebug(s"Scanning $logDir with lastScanTime==$lastScanTime")

-val updated = Option(fs.listStatus(new Path(logDir))).map(_.toSeq).getOrElse(Nil)
+val updated = Option(fs.globStatus(new Path(logDir + "/*"))).map(_.toSeq).getOrElse(Nil)
```
Yeah, I don't quite get this: you are requiring all the directories to be under a single directory. I guess that makes the logic easier, but why this restriction instead of a list? If we are going to support multiple directories and make sure it works, I don't see the reason for the restriction. What if people have multiple clusters writing to different HDFS filesystems, for instance?

I agree with @HeartSaVioR that if we are going to support multiple directories we need to take a thorough look at all the logic here to make sure there are no other problems. I guess in this case you are using a single filesystem?

I think we need to flesh out more of the overall goals and design first.
Yes, we are using a single file system in this case. This feature is useful when using external file systems for log data because then multiple directories will correspond to multiple clusters.
We could have a list too. Looking into that option.
The reason we chose to go with a glob pattern is that our service creates short-lived YARN clusters on which Spark applications run. Log data goes to directories in a remote file system. Since the number of these clusters can be very large, a glob pattern fits our use case better than a static list would.
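Concretely, under this proposal the existing `spark.history.fs.logDirectory` setting could carry a glob instead of a single path. The paths below are illustrative, not taken from the PR:

```
# spark-defaults.conf -- values are hypothetical

# Today: a single event-log directory
spark.history.fs.logDirectory   hdfs://namenode/shared/spark-events

# With this change: one directory per short-lived cluster, matched by a glob
spark.history.fs.logDirectory   hdfs://namenode/clusters/cluster-*/spark-events
```

A static-list alternative would instead enumerate each cluster's directory explicitly, which is hard to maintain when clusters are created and torn down dynamically.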
Thanks. I will go over those phases and ensure there aren't any concurrency concerns.
I don't think Spark has the concept of a "cluster", unless you use standalone mode. More specifically, there's no strong relation between applications, and there's no control plane on the Spark side to manage all applications in a cluster; the cluster is really the resource scheduler's cluster.

If the rationale behind SPARK-32097 and SPARK-32135 is to make SHS cluster-wise, then the concept of "cluster" probably needs to be defined and introduced, instead of applying workaround fixes. Everyone has a different view of being "cluster-wise": e.g. if SHS were cluster-wise and supported multiple clusters, I would prefer an "isolated view" per cluster, selecting the cluster first and then seeing a filtered view for it. I wouldn't prefer listing all applications from all clusters in the same list, which is the opposite of the view you're proposing in SPARK-32135. So this is not a trivial thing; it warrants discussion, including whether we really want to make SHS "cluster-wise".
Hey, I have updated the JIRA with a more elaborate description of our use case. Initially I had tried to keep it general, but I noticed that caused a lot of unintended confusion; apologies for that. Could you review our use case again so we can discuss what you think would be a better way to support it?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Change-Id: Ie3e2a6cc08b4c0e8770417c3b66f2f747bebefda
What changes were proposed in this pull request?
Currently, logDir refers to just one directory. We would like to add the capability for the History Server UI to read from multiple directories.
Why are the changes needed?
Our service dynamically creates short-lived YARN clusters in the cloud, and Spark applications run on these clusters. We want a static instance of the Spark History Server to view information on the jobs that ran on them.
Does this PR introduce any user-facing change?
No
How was this patch tested?
By running existing test suites.