Horovod Spark example "keras_spark3_rossmann.py" fails with FileNotFoundError #3213
aakash-sharma asked this question in Q&A · Unanswered · 0 replies
Environment:
- Framework: TensorFlow, Keras
- Framework version: 2.4.3
- Horovod version: 0.23.0
- CUDA version: 11.0
- Python version: 3.6.9
- Spark / PySpark version: 3.1.2
- Ray version:
- OS and version: Ubuntu 18
- GCC version: 7.5.0
- CMake version: 3.21.2
When I run `keras_spark3_rossmann.py` or `keras_spark_mnist.py`, both scripts fail at model training. I start the script with the following command:

```
python keras_spark3_rossmann.py --num-proc 2
```
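For context, the failing step is the `horovod.spark.run` launch at the end of the example; a minimal sketch of that path (my own simplification, with the training body elided) looks like this:

```python
# Simplified sketch of what keras_spark3_rossmann.py does at the failing step
# (my reduction of the example; the real train_fn builds and fits the Keras model).
import horovod.spark

def train_fn():
    import horovod.tensorflow.keras as hvd
    hvd.init()
    # ... build the Keras model and call model.fit() ...

# horovod.spark.run starts `num_proc` training tasks on Spark executors
# (via Gloo in my runs); this is the call that produces the traces below.
horovod.spark.run(train_fn, num_proc=2, verbose=2)
```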
The stack trace I get during model training is:
```
Exception in thread Thread-16:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/spark/task/task_service.py", line 76, in _run_command
    prefix_output_with_timestamp)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/common/service/task_service.py", line 133, in _run_command
    events=[event])
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/common/util/safe_shell_exec.py", line 242, in execute
    stderr_fwd = in_thread(target=prefix_connection, args=(stderr_r, stderr, 'stderr', index, prefix_output_with_timestamp))
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/util/threads.py", line 119, in in_thread
    bg.start()
  File "/usr/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```

(The same trace is printed a second time in the output.)
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
```

(Both spawned child processes print this traceback; their lines are interleaved in the raw output.)

```
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/spark/runner.py", line 141, in run_spark
    result = procs.mapPartitionsWithIndex(mapper).collect()
  File "/usr/local/lib/python3.6/dist-packages/pyspark/rdd.py", line 949, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job 0 cancelled part of cancelled job group horovod.spark.run.0
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:2154)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleJobGroupCancelled$4(DAGScheduler.scala:1048)
	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.spark.scheduler.DAGScheduler.handleJobGroupCancelled(DAGScheduler.scala:1047)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2407)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```
```
Traceback (most recent call last):
  File "keras_spark3_rossmann.py", line 549, in <module>
    prefix_output_with_timestamp=True)[0]
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/spark/runner.py", line 287, in run
    _launch_job(use_mpi, use_gloo, settings, driver, env, stdout, stderr, executable)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/spark/runner.py", line 157, in _launch_job
    settings.verbose)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/launch.py", line 684, in run_controller
    gloo_run()
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/spark/runner.py", line 154, in <lambda>
    run_controller(use_gloo, lambda: gloo_run(executable, settings, nics, driver, env, stdout, stderr),
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/spark/gloo_run.py", line 80, in gloo_run
    launch_gloo(command, exec_command, settings, nics, {}, server_ip)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/gloo_run.py", line 285, in launch_gloo
    .format(name=name, code=exit_code))
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 1
Exit code: None
```
```
21/10/10 08:03:27 ERROR TaskContextImpl: Error in TaskCompletionListener
java.lang.IllegalStateException: Block broadcast_0 not found
	at org.apache.spark.storage.BlockInfoManager.$anonfun$unlock$3(BlockInfoManager.scala:293)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:293)
	at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:1196)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$releaseBlockManagerLock$1(TorrentBroadcast.scala:287)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$releaseBlockManagerLock$1$adapted(TorrentBroadcast.scala:287)
	at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:125)
	at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124)
	at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124)
	at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137)
	at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135)
	at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
21/10/10 08:03:27 ERROR Utils: Uncaught exception in thread Executor task launch worker for task 1.0 in stage 0.0 (TID 1)
java.lang.NullPointerException
	at org.apache.spark.scheduler.Task.$anonfun$run$2(Task.scala:152)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1419)
	at org.apache.spark.scheduler.Task.run(Task.scala:150)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
Running the same script in YARN mode with:

```
spark-submit keras_spark3_rossmann.py --processing-master=yarn --num-proc=2 --epochs=1
```

gave a similar stack trace:
```
Exception in thread Thread-16:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/spark/task/task_service.py", line 76, in _run_command
    prefix_output_with_timestamp)
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/common/service/task_service.py", line 133, in _run_command
    events=[event])
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/common/util/safe_shell_exec.py", line 242, in execute
    stderr_fwd = in_thread(target=prefix_connection, args=(stderr_r, stderr, 'stderr', index, prefix_output_with_timestamp))
  File "/home/cc/.local/lib/python3.6/site-packages/horovod/runner/util/threads.py", line 119, in in_thread
    bg.start()
  File "/usr/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```

(The same trace is printed a second time in the output.)
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
```

(This traceback is printed twice, once per spawned child process.)
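As far as I can tell, the child tracebacks match the following `multiprocessing` mechanism (an illustrative sketch of my own, not code from the example):

```python
# With the "spawn" start method, multiprocessing pickles the parent's objects
# and the child rebuilds them while unpickling (spawn_main -> _main, as in the
# traces above). A Lock is backed by a named POSIX semaphore; if that semaphore
# is already gone when the child unpickles it (e.g. because the parent worker
# died first), SemLock._rebuild raises FileNotFoundError: [Errno 2].
import multiprocessing as mp

def worker(lock):
    with lock:
        print("child acquired the lock")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    lock = mp.Lock()  # creates the named semaphore
    p = mp.Process(target=worker, args=(lock,))
    p.start()         # the child unpickles `lock` at this point
    p.join()
```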
The bottom-most error is:

```
FileNotFoundError: [Errno 2] No such file or directory
```

It is raised in the `multiprocessing` spawn path (`multiprocessing/synchronize.py`, while rebuilding a `SemLock`), right after the `RuntimeError: can't start new thread` from `threading.py`. Can somebody please help?
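For what it's worth, since the first failure in every run is `RuntimeError: can't start new thread`, my hedged guess is that the executor host is hitting a per-user thread/process limit and everything after that is fallout. A quick diagnostic I can run on the executor node (my own sketch, not from the example):

```python
# Check whether the node is near its per-user process/thread limit
# (RLIMIT_NPROC caps threads too, since Linux threads count as tasks).
import resource
import threading

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")
print(f"active threads in this process: {threading.active_count()}")
```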