Closed
Labels: bug (Something isn't working)
Description
# Pretraining command
deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
# Fine-tuning command
torchrun --nproc_per_node 2 3-full_sft.py
Fine-tuning fails with the following error:
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING]
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] *****************************************
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] *****************************************
Total LLM parameters: 26.878 million
Epoch:[0/19](0/24681) loss:8.882 lr:0.00020000 epoch_Time:351.0min:
Epoch:[0/19](100/24681) loss:5.368 lr:0.00020000 epoch_Time:82.0min:
Traceback (most recent call last):
File "/data/minimind/3-full_sft.py", line 212, in <module>
train_epoch(epoch)
File "/data/minimind/3-full_sft.py", line 48, in train_epoch
for step, (X, Y, loss_mask) in enumerate(train_loader):
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/_utils.py", line 694, in reraise
raise exception
NameError: Caught NameError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/minimind/model/dataset.py", line 74, in __getitem__
history = eval(sample['history'])
File "<string>", line 1, in <module>
NameError: name '加完班回到家窝在沙发里' is not defined
[2024-09-14 13:32:12,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 157780 closing signal SIGTERM
[2024-09-14 13:32:12,221] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 157779) of binary: /home/nlp/anaconda3/envs/minimind/bin/python
Traceback (most recent call last):
File "/home/nlp/anaconda3/envs/minimind/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
3-full_sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-14_13:32:12
host : nlp-Z790-UD
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 157779)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
So am I not allowed to '加完班回到家窝在沙发里' (come home after working overtime and curl up on the sofa)?
jiaohuix
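The traceback shows `eval(sample['history'])` in `model/dataset.py` raising `NameError`. This means the `history` field of at least one sample holds bare text rather than a Python literal such as `[]`, so `eval` treats the text as a variable name. One possible workaround, sketched below, is to parse the field with `ast.literal_eval` and fall back when the value is not a literal. The helper name `parse_history`, the choice to wrap unparseable text in a one-element list, and the NaN handling (which assumes the data may be loaded through pandas) are assumptions for illustration, not the repository's actual fix.

```python
import ast


def parse_history(raw):
    """Parse a 'history' field into a list without executing it as code.

    Hypothetical helper: the fallback behaviour is an assumption and may
    need to change depending on how the dataset encodes missing history.
    """
    # Already a list (e.g. parsed upstream): return it unchanged.
    if isinstance(raw, list):
        return raw
    # Missing values: None, or float NaN as pandas uses for empty cells.
    if raw is None or (isinstance(raw, float) and raw != raw):
        return []
    text = str(raw).strip()
    if not text:
        return []
    try:
        # literal_eval accepts only Python literals, so bare text such as
        # 加完班回到家窝在沙发里 raises ValueError instead of NameError.
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Assumption: keep the unparseable text as a single history entry.
        return [text]
    return list(value) if isinstance(value, (list, tuple)) else [value]
```

With such a helper, the failing line could become `history = parse_history(sample['history'])`. Returning `[]` in the fallback branch instead would drop the malformed entry rather than keep it. Which is appropriate depends on the dataset.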