[SPARK-32097] Enable Spark History Server to read from multiple directories #29630
Conversation
Change-Id: Ie3e2a6cc08b4c0e8770417c3b66f2f747bebefda
Can one of the admins verify this patch?
I'm not sure your PR really deals with reading from multiple directories. The change is listing -> glob with `*`. Could you please elaborate on the difference? The change also doesn't have any new unit tests verifying it.

As a general comment on the idea: having multiple root directories is still possible, but it's probably better as a static list (IMHO) rather than a regex, as listing with a glob pattern is known to be very slow. One thing I'm afraid of with multiple root directories is that SHS is already very complicated from a thread-safety point of view even with a single root directory, and this may make things more complicated. I'm on the fence on doing this until we are clear that it won't make SHS more complicated.
Thanks for your response! By multiple directories I meant that a regex could potentially match more than one directory. In the case of an external file system, a glob pattern might be better since we only have to make one over-the-network call. It is also easier for the user to specify a single setting instead of multiple values. What do you think? I will add the unit tests; thanks for pointing that out. The SHS will function only as a read-only server. Can thread-safety be an issue in that case?
SHS isn't read-only; there's also a cleanup phase, and the load and cleanup phases run in parallel. Please take a careful look at the SHS code in the master branch.
```diff
-val updated = Option(fs.listStatus(new Path(logDir))).map(_.toSeq).getOrElse(Nil)
+val updated = Option(fs.globStatus(new Path(logDir + "/*"))).map(_.toSeq).getOrElse(Nil)
   .filter { entry => isAccessible(entry.getPath) }
   .filter { entry => !isProcessing(entry.getPath) }
```
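The semantic difference between the two calls can be sketched outside of Hadoop. The snippet below is a minimal illustration using Python's stdlib `glob` (the directory layout and names are hypothetical, chosen to mirror the PR's use case); Hadoop's `FileSystem.globStatus` follows similar wildcard semantics.

```python
import glob
import os
import tempfile

# Hypothetical layout mirroring the PR's use case: one event-log
# directory per short-lived cluster under a single root.
root = tempfile.mkdtemp()
for cluster in ("cluster-a", "cluster-b"):
    log_dir = os.path.join(root, cluster, "eventlogs")
    os.makedirs(log_dir)
    open(os.path.join(log_dir, "app-0001"), "w").close()

# A plain listing (analogue of fs.listStatus) only sees the immediate
# children of a single directory.
listed = sorted(os.listdir(root))

# A glob (analogue of fs.globStatus with logDir + "/*") can fan out
# across every directory matched by a wildcard in a single call.
matched = glob.glob(os.path.join(root, "cluster-*", "eventlogs", "*"))

print(listed)        # ['cluster-a', 'cluster-b']
print(len(matched))  # 2
```

This is why a single glob-valued `logDir` can pick up logs from several per-cluster directories, whereas a plain listing cannot.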
This is playing the role of a "lock" for the log file. You can see where the lock flag is accessed and/or modified.
```diff
-testRetry("provider reports error after FS leaves safe mode")
```
Please don't remove the existing test unless you have a strong reason to do so. If your change breaks an existing test, you need to explain why discarding it is necessary.
```diff
 logDebug(s"Scanning $logDir with lastScanTime==$lastScanTime")

-val updated = Option(fs.listStatus(new Path(logDir))).map(_.toSeq).getOrElse(Nil)
+val updated = Option(fs.globStatus(new Path(logDir + "/*"))).map(_.toSeq).getOrElse(Nil)
```
Yeah, I don't quite get this: you are requiring all the directories to be under a single directory. I guess that makes the logic easier, but why this restriction instead of a list? If we are going to support multiple directories and make sure it works, I don't see the reason for the restriction. What if people have multiple clusters writing to different HDFS filesystems, for instance?

I agree with @HeartSaVioR that if we are going to support multiple directories we need to take a thorough look at all the logic here to make sure there are no other problems. I guess in this case you are using a single filesystem?

I think we need to flesh out more of the overall goals and design first.
Yes, we are using a single file system in this case. This feature is useful when using external file systems for log data because then multiple directories will correspond to multiple clusters.
We could have a list too. Looking into that option.
The reason we chose to go with a glob pattern is that our service creates short-lived YARN clusters on which Spark applications run. Log data goes to directories in a remote file system. Since the number of these clusters can be very large, a glob pattern fits our use case better than a static list would.
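Concretely, under this proposal the existing `spark.history.fs.logDirectory` setting could carry a glob instead of a single path. The paths below are illustrative, not taken from the PR:

```
# spark-defaults.conf -- values are hypothetical

# Today: a single event-log directory
spark.history.fs.logDirectory   hdfs://namenode/shared/spark-events

# With this change: one directory per short-lived cluster, matched by a glob
spark.history.fs.logDirectory   hdfs://namenode/clusters/cluster-*/spark-events
```

A static-list alternative would instead enumerate each cluster's directory explicitly, which is hard to maintain when clusters are created and torn down dynamically.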
Thanks. I will go over those phases and ensure there aren't any concurrency concerns.
I don't think Spark has the concept of a "cluster", unless you use standalone mode. More specifically, there's no strong relation between applications, and there's no control plane on the Spark side to manage all applications in a cluster; the cluster is really the resource scheduler's cluster.

If the rationale behind SPARK-32097 and SPARK-32135 is to make SHS cluster-wise, then the concept of "cluster" probably needs to be defined and introduced, instead of applying workaround fixes. Everyone has a different view of being "cluster-wise": e.g. if SHS were cluster-wise and supported multiple clusters, I would prefer an "isolated view" per cluster, selecting the cluster first and then seeing a filtered view for it. I wouldn't prefer listing all applications from all clusters in the same list, which is the opposite of the view you're proposing in SPARK-32135. So this is not a trivial thing; it warrants discussion, including whether we really want to make SHS "cluster-wise".
Hey, I have updated the JIRA with a more elaborate description of our use case. Initially I had tried to keep it general, but I noticed that caused a lot of unintended confusion; apologies for that. Could you review our use case again so we can discuss what you think would be a better way to support it?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Change-Id: Ie3e2a6cc08b4c0e8770417c3b66f2f747bebefda
What changes were proposed in this pull request?
Currently, logDir refers to just one directory. We would like to add the capability for the History Server UI to read from multiple directories.
Why are the changes needed?
Our service dynamically creates short-lived YARN clusters in the cloud, and Spark applications run on these clusters. We want a static instance of the Spark History Server to view information on the jobs that ran on them.
Does this PR introduce any user-facing change?
No
How was this patch tested?
By running existing test suites.