Pretrained for two and a half epochs, then force-stopped; fine-tuning fails with: NameError: name '加完班回到家窝在沙发里' is not defined #20

@ipfgao

# Pretraining command
deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
# Fine-tuning command
torchrun --nproc_per_node 2 3-full_sft.py

Fine-tuning fails with the following error:

[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] 
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] *****************************************
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] *****************************************
LLM总参数量:26.878 百万
Epoch:[0/19](0/24681) loss:8.882 lr:0.00020000 epoch_Time:351.0min:
Epoch:[0/19](100/24681) loss:5.368 lr:0.00020000 epoch_Time:82.0min:
Traceback (most recent call last):
  File "/data/minimind/3-full_sft.py", line 212, in <module>
    train_epoch(epoch)
  File "/data/minimind/3-full_sft.py", line 48, in train_epoch
    for step, (X, Y, loss_mask) in enumerate(train_loader):
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
NameError: Caught NameError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/minimind/model/dataset.py", line 74, in __getitem__
    history = eval(sample['history'])
  File "<string>", line 1, in <module>
NameError: name '加完班回到家窝在沙发里' is not defined

[2024-09-14 13:32:12,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 157780 closing signal SIGTERM
[2024-09-14 13:32:12,221] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 157779) of binary: /home/nlp/anaconda3/envs/minimind/bin/python
Traceback (most recent call last):
  File "/home/nlp/anaconda3/envs/minimind/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
3-full_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-14_13:32:12
  host      : nlp-Z790-UD
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 157779)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

So am I not allowed to get home after working overtime and curl up on the sofa (加完班回到家窝在沙发里)?
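
A possible reading of the traceback, not confirmed by the maintainers: `dataset.py` line 74 calls `eval(sample['history'])`, and for at least one row the `history` field appears to hold bare text rather than a Python list literal. Python accepts Unicode identifiers, so `eval` treats that text as a variable name and raises `NameError`. The sketch below reproduces the error and shows one way to parse the field without executing it. `parse_history` is a hypothetical helper, not part of minimind, and falling back to an empty history is an assumption about what the training code can tolerate.

```python
import ast


def parse_history(raw):
    """Parse a `history` field into a list without executing it as code.

    Hypothetical helper (not from the minimind repo). Returns [] when the
    field is missing, empty, or not a valid Python list literal.
    """
    # Missing values (None, or NaN floats from a CSV reader) are not strings.
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Bare text such as 加完班回到家窝在沙发里 is not a literal.
        return []
    return value if isinstance(value, list) else []


# Reproduce the failure from the traceback: eval() looks up bare text as a name.
try:
    eval("加完班回到家窝在沙发里")
except NameError as exc:
    print(type(exc).__name__, exc)

# A well-formed field parses; a malformed one falls back to an empty history.
print(parse_history("[['hello', 'hi there']]"))
print(parse_history("加完班回到家窝在沙发里"))
```

With a helper like this, the line in `__getitem__` could become `history = parse_history(sample['history'])`. Silently dropping malformed history may hide data problems, so logging or counting the affected rows, or cleaning the SFT data file itself, may be the better fix.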

Labels: bug