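Full-parameter fine-tuning fails with multi-GPU training but works fine on a single GPU. What is going on? (Multi-GPU pretraining works without problems.) #462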

@cqcracked

Description

Epoch:1/3 loss:1.696 lr:0.000000606225 epoch_Time:136.0min:
Epoch:1/3 loss:1.829 lr:0.000000606056 epoch_Time:135.0min:
Epoch:1/3 loss:1.716 lr:0.000000605887 epoch_Time:135.0min:
W0712 21:41:30.388961 61592 site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGHUP death signal, shutting down workers
W0712 21:41:30.391399 61592 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 61660 closing signal SIGHUP
W0712 21:41:30.391720 61592 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 61661 closing signal SIGHUP
Epoch:1/3 loss:1.959 lr:0.000000605717 epoch_Time:134.0min:
Epoch:1/3 loss:1.793 lr:0.000000605547 epoch_Time:133.0min:
W0712 21:42:00.392091 61592 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 61660 via Signals.SIGHUP, forcefully exiting via Signals.SIGKILL
W0712 21:42:00.999498 61592 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 61661 via Signals.SIGHUP, forcefully exiting via Signals.SIGKILL
Traceback (most recent call last):
  File "/public/home/a/anaconda3/envs/rag/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/public/home//anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run
    time.sleep(monitor_interval)
  File "/public/home/a/anaconda3/envs/rag/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 61592 got signal: 1
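
A note on reading the last line of the log: the number after "got signal:" is a POSIX signal number, and the earlier warning lines already name it as SIGHUP. The sketch below only decodes that number with Python's standard `signal` module. The helper name `describe_signal` is made up for illustration and is not part of PyTorch, and the result assumes a Linux host, where signal 1 is SIGHUP.

```python
import re
import signal


def describe_signal(message: str) -> str:
    """Return the name of the signal number found in a 'got signal: N' message.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    match = re.search(r"got signal: (\d+)", message)
    if match is None:
        raise ValueError("no signal number found in message")
    # signal.Signals maps a platform signal number to its enum member.
    return signal.Signals(int(match.group(1))).name


# Last line of the log above, copied verbatim.
last_line = (
    "torch.distributed.elastic.multiprocessing.api.SignalException: "
    "Process 61592 got signal: 1"
)
print(describe_signal(last_line))
```

The log shows the signal arriving at the torchrun launcher process (PID 61592), which then shut down the two workers while they were still printing loss values. SIGHUP is commonly delivered when the controlling terminal or SSH session closes, so one possible explanation, not confirmed by this log, is that the session ended during the multi-GPU run. If so, starting the job under `nohup`, `tmux`, or `screen` is a common way to keep it alive.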
