
Conversation

@Innixma (Contributor) commented Aug 13, 2024

Issue #, if available:

Resolves #4375

Description of changes:

Example

For example, in mainline the following will raise an exception:

import pandas as pd
from autogluon.tabular import TabularPredictor


if __name__ == '__main__':
    label = 'class'
    train_data = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
    subsample_size = 1000  # subsample subset of data for faster demo, try setting this to much larger values
    if subsample_size is not None and subsample_size < len(train_data):
        train_data = train_data.sample(n=subsample_size, random_state=0)
    test_data = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
    hyperparameters = {'FASTAI': {}, 'NN_TORCH': {}, 'CAT': {}, 'GBM': {}}
    predictor = TabularPredictor(
        label=label,
    )

    predictor = predictor.fit(
        train_data=train_data,
        hyperparameters=hyperparameters,
        dynamic_stacking=False,
        presets="best_quality",
        time_limit=45,
        ag_args_fit={"num_cpus": 1},
        hyperparameter_tune_kwargs='auto',
    )

    predictor.leaderboard(test_data, display=True)

This results in the following exception:

Hyperparameter tuning model: LightGBM_BAG_L1 ... Tuning model for up to 10.11s of the 44.95s of remaining time.
/opt/conda/envs/ag-111-310v2/lib/python3.10/site-packages/torch/cuda/__init__.py:628: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Warning: Exception caused LightGBM_BAG_L1 to fail during hyperparameter tuning... Skipping this model.
Traceback (most recent call last):
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 2278, in _train_single_full
    hpo_models, hpo_results = model.hyperparameter_tune(
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/models/abstract/abstract_model.py", line 1538, in hyperparameter_tune
    hpo_executor.register_resources(self, **kwargs)
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/hpo/executors.py", line 501, in register_resources
    super().register_resources(initialized_model, **kwargs)
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/hpo/executors.py", line 168, in register_resources
    gpu_per_trial = user_specified_fold_num_gpus * min(k_fold, num_folds_in_parallel)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Similarly, replacing ag_args_fit={"num_cpus": 1} with ag_args_fit={"num_gpus": 0} causes an exception:

Hyperparameter tuning model: LightGBM_BAG_L1 ... Tuning model for up to 10.11s of the 44.95s of remaining time.
/opt/conda/envs/ag-111-310v2/lib/python3.10/site-packages/torch/cuda/__init__.py:628: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Warning: Exception caused LightGBM_BAG_L1 to fail during hyperparameter tuning... Skipping this model.
Traceback (most recent call last):
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 2278, in _train_single_full
    hpo_models, hpo_results = model.hyperparameter_tune(
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/models/abstract/abstract_model.py", line 1538, in hyperparameter_tune
    hpo_executor.register_resources(self, **kwargs)
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/hpo/executors.py", line 501, in register_resources
    super().register_resources(initialized_model, **kwargs)
  File "/home/ubuntu/workspace/code/autogluon/core/src/autogluon/core/hpo/executors.py", line 167, in register_resources
    cpu_per_trial = user_specified_fold_num_cpus * min(k_fold, num_folds_in_parallel)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
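
Both tracebacks share the same root cause: register_resources multiplies the user-specified per-fold resource count by the number of parallel folds, but when ag_args_fit specifies only one of num_cpus / num_gpus, the unspecified counterpart is left as None. A minimal sketch of the failing arithmetic, using the variable names from the traceback (the fold counts here are hypothetical values for illustration):

user_specified_fold_num_cpus = 1      # from ag_args_fit={"num_cpus": 1}
user_specified_fold_num_gpus = None   # num_gpus was never specified, so it stays None

k_fold = 8                  # hypothetical fold counts for illustration
num_folds_in_parallel = 8

cpu_per_trial = user_specified_fold_num_cpus * min(k_fold, num_folds_in_parallel)  # fine: 8
gpu_per_trial = user_specified_fold_num_gpus * min(k_fold, num_folds_in_parallel)  # TypeError: 'NoneType' * 'int'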

This PR fixes these exceptions and allows the models to fit as intended.
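
The exact diff is not reproduced here, so treat the following as a sketch rather than the actual change in commit fc14548: one way to resolve both exceptions is to fall back to a default per-fold resource count when the user specified only one of num_cpus / num_gpus. The function name and default values below are hypothetical:

def resolve_per_trial_resources(user_specified_fold_num_cpus, user_specified_fold_num_gpus,
                                k_fold, num_folds_in_parallel,
                                default_fold_num_cpus=1, default_fold_num_gpus=0):
    """Compute per-trial CPU/GPU counts, guarding against unspecified (None) values."""
    folds = min(k_fold, num_folds_in_parallel)
    # Guard: ag_args_fit may specify only one of num_cpus / num_gpus,
    # leaving the other as None (hypothetical defaults used as fallback).
    if user_specified_fold_num_cpus is None:
        user_specified_fold_num_cpus = default_fold_num_cpus
    if user_specified_fold_num_gpus is None:
        user_specified_fold_num_gpus = default_fold_num_gpus
    return user_specified_fold_num_cpus * folds, user_specified_fold_num_gpus * folds

With a guard of this kind in place, ag_args_fit={"num_cpus": 1} and ag_args_fit={"num_gpus": 0} no longer trip the NoneType multiplication.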

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma added this to the 1.2 Release milestone on Aug 13, 2024
@Innixma requested a review from @prateekdesai04 on Aug 13, 2024 at 02:37
@Innixma added the bug (Something isn't working), module: tabular, feature: hpo (Related to Hyper-parameter Optimization), and priority: 1 (High priority) labels on Aug 13, 2024
@github-actions commented:

Job PR-4384-a266065 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-4384/a266065/index.html

@prateekdesai04 (Contributor) left a comment:

LGTM!

@Innixma merged commit fc14548 into autogluon:master on Aug 14, 2024
@Innixma deleted the tabular_fix_hpo_resources branch on April 16, 2025 at 21:12

Development

Successfully merging this pull request may close these issues.

[BUG] Issues with FastAI and PyTorch when using best_quality preset + HPO params
