这是indexloc提供的服务,不要输入任何密码
Skip to content

W&B: add artifacts support #441

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Sep 24, 2021
Merged

W&B: add artifacts support #441

merged 31 commits into from
Sep 24, 2021

Conversation

AyushExel
Copy link
Contributor

@AyushExel AyushExel commented Sep 7, 2021

  • I have marked all applicable categories:
    • exception-raising fix
    • algorithm implementation fix
    • documentation modification
    • new feature
  • I have reformatted the code using make format (required)
  • I have checked the code using make commit-checks (required)
  • If applicable, I have mentioned the relevant/related issue(s)
  • If applicable, I have listed every items in this Pull Request below

This PR fixes existing bugs in WandbLogger and adds support for artifacts.
Screenshot (131)

When using WandbLogger, you can now resume your runs from any device. Example usage is in the examples/atari/atari_dqn_wandb.py:

cd examples/atari
python atari_dqn_wandb.py
# terminate run. The run is executable on any device via
python atari_dqn_wandb.py --resume_id {your run id}

Let me know if I missed something. This is still a WIP. I'll add some more advanced visualization features.

Bug fix

It seems like the atari_dqn task is passing floats to logger.write method which crashes the script. I've handled the use case but it is hacky. We might need to modify the trainer to always pass dicts

else:
eps = args.eps_train_final
policy.set_eps(eps)
logger.write('train/eps', env_step, eps)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the root course of that bug, I'll fix that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codecov-commenter
Copy link

codecov-commenter commented Sep 7, 2021

Codecov Report

Merging #441 (dac8958) into master (e8f8cdf) will decrease coverage by 0.50%.
The diff coverage is 38.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #441      +/-   ##
==========================================
- Coverage   94.71%   94.21%   -0.51%     
==========================================
  Files          60       60              
  Lines        3821     3853      +32     
==========================================
+ Hits         3619     3630      +11     
- Misses        202      223      +21     
Flag Coverage Δ
unittests 94.21% <38.23%> (-0.51%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tianshou/utils/logger/wandb.py 47.72% <36.36%> (-10.61%) ⬇️
tianshou/utils/__init__.py 100.00% <100.00%> (ø)
tianshou/utils/logger/base.py 91.66% <0.00%> (-4.17%) ⬇️
tianshou/policy/modelfree/npg.py 97.70% <0.00%> (-1.15%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8f8cdf...dac8958. Read the comment docs.

@Trinkle23897
Copy link
Collaborator

I'm wondering if we can add a unit test of wandb under test/base/, what do you think?

@AyushExel
Copy link
Contributor Author

AyushExel commented Sep 7, 2021

I'm wondering if we can add a unit test of wandb under test/base/, what do you think?

@Trinkle23897 yes I'll do that before marking this ready for review. I think the easiest way to add tests would be to set up a dummy account that we can use for CI tests.

Co-authored-by: Jiayi Weng <trinkle23897@qq.com>
@Trinkle23897
Copy link
Collaborator

I've already set up https://wandb.ai/tianshou a week before, maybe we can use a sub-namespace under this account?

@AyushExel
Copy link
Contributor Author

I've already set up https://wandb.ai/tianshou a week before, maybe we can use a sub-namespace under this account?

Okay. One thing to note here would be that the API key of the account used for CI will be exposed. So, it'll be better to use an account that doesn't have any important experiments.

@Trinkle23897
Copy link
Collaborator

But I think we can store the API key inside environment secrets:
2021-09-07 08-58-28屏幕截图

@AyushExel
Copy link
Contributor Author

@Trinkle23897 Ohh I wasn't aware of that. Yes, that would work!

@Trinkle23897
Copy link
Collaborator

Trinkle23897 commented Sep 7, 2021

Just configure secret as follows:

2021-09-07 09-02-23屏幕截图
2021-09-07 09-02-34屏幕截图

let me know if you need further changes (also you can experiment this feature in your own fork)

@AyushExel
Copy link
Contributor Author

This is very helpful. Thanks!

@Trinkle23897
Copy link
Collaborator

my bad. should be this one:

2021-09-07 09-06-18屏幕截图

@AyushExel
Copy link
Contributor Author

@Trinkle23897 I was trying to log gym visualizations over time to show training progress something like this. To do that we need to monitor the videos from the gym. So I added env = gym.wrappers.Monitor(env, "videos", force=True) after this line . But now it throws an error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jupyter/repos/tianshou/tianshou/env/worker/subproc.py", line 95, in _worker
    obs = env.reset()
  File "/home/jupyter/repos/tianshou/examples/atari/atari_wrapper.py", line 196, in reset
    obs = self.env.reset()
  File "/opt/conda/lib/python3.7/site-packages/gym/core.py", line 278, in reset
    observation = self.env.reset(**kwargs)
  File "/home/jupyter/repos/tianshou/examples/atari/atari_wrapper.py", line 115, in reset
    self.env.reset()
  File "/opt/conda/lib/python3.7/site-packages/gym/core.py", line 251, in reset
    return self.env.reset(**kwargs)
  File "/home/jupyter/repos/tianshou/examples/atari/atari_wrapper.py", line 26, in reset
    self.env.reset()
  File "/opt/conda/lib/python3.7/site-packages/gym/wrappers/monitor.py", line 52, in reset
    self._before_reset()
  File "/opt/conda/lib/python3.7/site-packages/gym/wrappers/monitor.py", line 230, in _before_reset
    self.stats_recorder.before_reset()
  File "/opt/conda/lib/python3.7/site-packages/gym/wrappers/monitoring/stats_recorder.py", line 82, in before_reset
    self.env_id
gym.error.Error: Tried to reset environment which is not done. While the monitor is active for PongNoFrameskip-v4, you cannot call reset() unless the episode is over.
Traceback (most recent call last):
  File "atari_dqn_wandb.py", line 216, in <module>
    test_dqn(get_args())
  File "atari_dqn_wandb.py", line 208, in test_dqn
    save_checkpoint_fn=save_checkpoint_fn
  File "/home/jupyter/repos/tianshou/tianshou/trainer/offpolicy.py", line 97, in offpolicy_trainer
    env_step, reward_metric
  File "/home/jupyter/repos/tianshou/tianshou/trainer/utils.py", line 22, in test_episode
    collector.reset_env()
  File "/home/jupyter/repos/tianshou/tianshou/data/collector.py", line 118, in reset_env
    obs = self.env.reset()
  File "/home/jupyter/repos/tianshou/tianshou/env/venvs.py", line 173, in reset
    obs_list = [self.workers[i].reset() for i in id]
  File "/home/jupyter/repos/tianshou/tianshou/env/venvs.py", line 173, in <listcomp>
    obs_list = [self.workers[i].reset() for i in id]
  File "/home/jupyter/repos/tianshou/tianshou/env/worker/subproc.py", line 165, in reset
    obs = self.parent_remote.recv()
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

@Trinkle23897
Copy link
Collaborator

it may because env.reset() after env.reset() inside the initialization process. a quick fix is to add another wrapper that can handle this consecutive reset into a single one

Copy link
Contributor

@drozzy drozzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do it, but I don't really know how to modify an in-progress PR.

@AyushExel
Copy link
Contributor Author

@Trinkle23897 @drozzy the psrl test is failing. I think the latest gym doesn't include gym_toytext. I installed it using pip install gym-legacy-toytext and imported it using import gym_toytext and the psrl test worked. See the logging test

@AyushExel AyushExel marked this pull request as ready for review September 16, 2021 14:06
@Trinkle23897
Copy link
Collaborator

@AyushExel
Copy link
Contributor Author

@Trinkle23897 ok I've updated the gym req. based on that link

@AyushExel
Copy link
Contributor Author

@Trinkle23897 Do you know how the github env secret can be accessed in python script? For the wandb test to pass we need to login in the script like wandb.login(api_key=secret_api_key)

@Trinkle23897
Copy link
Collaborator

wandb.login(api_key=os.environ["WANDB_API_KEY"])

@AyushExel
Copy link
Contributor Author

@Trinkle23897 Ok now it's working

@AyushExel
Copy link
Contributor Author

@Trinkle23897 Hi. This is ready to be merged from my end. Do you have any changes that you'd like me to address.
Once this is merged, I can start with updating docs and a getting started colab

@Trinkle23897
Copy link
Collaborator

Could you please wait for me to finish one assignment for 2 days? I'll polish it.

@AyushExel
Copy link
Contributor Author

@Trinkle23897 sure. Take your time :)

@AyushExel
Copy link
Contributor Author

@Trinkle23897 Would you like to add the instructions on how to use this logger in the docs or readme?

@Trinkle23897
Copy link
Collaborator

yep

@AyushExel
Copy link
Contributor Author

Awesome

@Trinkle23897 Trinkle23897 changed the title [WIP] W&B: add artifacts support W&B: add artifacts support Sep 23, 2021
Comment on lines 78 to 84
if args.logger == "wandb":
logger = WandbLogger(save_interval=1, project='psrl', name='wandb_test')
elif args.logger == "tensorboard":
log_path = os.path.join(args.logdir, args.task, 'psrl')
writer = SummaryWriter(log_path)
writer.add_text("args", str(args))
logger = TensorboardLogger(writer)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to log all args into wandb at beginning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I'll add it now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I've added configuration logging. Hopefully tests succeed


Creates three panels with plots: train, test, and update.
This logger creates three panels with plots: train, test, and update.
Make sure to select the correct access for each panel in weights and biases:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can here show where the example script is?

@@ -41,6 +41,13 @@ def get_args():
)
parser.add_argument('--frames-stack', type=int, default=4)
parser.add_argument('--resume-path', type=str, default=None)
parser.add_argument('--resume-id', type=str, default=None)
parser.add_argument(
'--logger',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maybe we can add some instructions on how to use wandb (including resume) in examples/atari/README.md instead of tensorboard?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the instructions in doc also? I can make a separate PR tomorrow for adding the detailed instructions in examples/atari/README.md as well as docs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just append to this pr

@Trinkle23897
Copy link
Collaborator

Trinkle23897 commented Sep 23, 2021

Great work @AyushExel, please check if my modification is correct or not (I haven't run the atari_dqn script for resume training and see if wandb still work). I change loading checkpoint under args.resume_id to args.resume_path because users can specify which model to load. This is a trade-off: it may cause inconsistency when loading model and logger, but gives more flexibility when they don't want to resume logger (this is the common case).

@AyushExel
Copy link
Contributor Author

AyushExel commented Sep 23, 2021

@Trinkle23897 Thanks. Yeah It makes sense. I'm testing the resume functionality on atari now.
Edit: It's working :)

@AyushExel
Copy link
Contributor Author

@Trinkle23897 alright I think this is good to go. I can add instructions and details in a separate PR.

@Trinkle23897
Copy link
Collaborator

sure!

@Trinkle23897 Trinkle23897 merged commit 22d7bf3 into thu-ml:master Sep 24, 2021
BFAnas pushed a commit to BFAnas/tianshou that referenced this pull request May 5, 2024
- rename WandBLogger -> WandbLogger
- add save_data and restore_data
- allow more input arguments for wandb init
- integrate wandb into test/modelbase/test_psrl.py and examples/atari/atari_dqn.py
- documentation update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants