
Adding support for CAMELS IND in the datasetzoo #250


Open

saurabh-3110 wants to merge 8 commits into master
Conversation


@saurabh-3110 saurabh-3110 commented Jun 26, 2025

I have added support for the CAMELS-IND dataset following the structure and conventions used for other CAMELS datasets in the repo.

@saurabh-3110 saurabh-3110 requested a review from gauchm as a code owner June 26, 2025 11:39
Comment on lines 1 to 9
import os
import pandas as pd
from pathlib import Path
from tqdm import tqdm
from functools import reduce
from typing import List, Dict, Union
import xarray
from neuralhydrology.datasetzoo.basedataset import BaseDataset
from neuralhydrology.utils.config import Config
Contributor

The imports are not correctly sorted and in general it seems like the file was not formatted. It would be great if you could run yapf over this file. If you have the default neuralhydrology environment installed, you can either configure your IDE to autoformat on save (e.g. possible in Visual Studio Code) by pointing it to the .style.yapf file, or run yapf on the file manually from the terminal with something like

/path/to/yapf -i /path/to/camelsind.py --style /path/to/.style.yapf

The path to the yapf binary can be found by running which yapf if you have the conda environment activated. The .style.yapf file is in the root directory of this repository.

Contributor

@kratzert kratzert left a comment

Hi, thanks for opening this PR and for adding support for the CAMELS-IND dataset. Besides the line-by-line comments, a few additional points:

  • for multi-basin neural network training, you usually want to work with discharge in mm/day and not in cms. I am not sure how the data is provided in CAMELS-IND, but a few of our other classes (e.g. check CAMELSUS) have a conversion from cms to mm/day implemented. You might want to check whether that is needed or not.
  • You haven't added your class to the docs, i.e. it would not appear in our online documentation. Can you make sure to add the class to docs/source/api? Just check any of the other dataset classes there for what to do; it will be mostly copy and pasting. Besides creating a neuralhydrology.datasetzoo.camelsind.rst file, you also want to add it to https://github.com/neuralhydrology/neuralhydrology/blob/master/docs/source/api/neuralhydrology.datasetzoo.rst
  • Please run the yapf code formatter over the file (as stated in one comment)

Also: I haven't checked the conversion code in detail and trust you that it works. The functions are rather long, but I currently don't have much time to see if they could be better structured. Also probably not the most important thing.
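For reference, the cms-to-mm/day conversion mentioned in the first bullet could be sketched like this (a minimal standalone example, not the actual CAMELS-US implementation; in practice the basin area in km² would come from the attributes file):

```python
def discharge_cms_to_mmday(q_cms: float, area_km2: float) -> float:
    """Convert discharge from m^3/s to mm/day for a basin of the given area."""
    # 86400 s/day * 1000 mm/m / (1e6 m^2 per km^2) = 86.4
    return q_cms * 86.4 / area_km2
```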

from neuralhydrology.datasetzoo.basedataset import BaseDataset
from neuralhydrology.utils.config import Config

# This class remains unchanged, but is included for completeness.
Contributor

What does this comment mean?

id_to_int=id_to_int,
scaler=scaler)

def _load_basin_data(self, basin: str) -> pd.DataFrame:
Contributor

Can you, similarly to the other dataset classes, move the logic of this method into a public function called load_camels_ind_timeseries(), and do the same for the attribute function below? The class method then simply calls this function. This is helpful because it allows others to use these functions outside of the neuralhydrology classes to work with the dataset.
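A minimal sketch of the suggested refactoring (the per-basin file layout and column names here are assumptions for illustration, not the actual CAMELS-IND layout):

```python
from pathlib import Path

import pandas as pd


def load_camels_ind_timeseries(data_dir: Path, basin: str) -> pd.DataFrame:
    """Load the preprocessed time series of a single CAMELS-IND basin.

    The per-basin file layout used here is an assumption for illustration.
    """
    basin_file = data_dir / "preprocessed" / f"{basin}.csv"
    return pd.read_csv(basin_file, index_col="date", parse_dates=["date"])


# The class method then becomes a thin wrapper:
# def _load_basin_data(self, basin: str) -> pd.DataFrame:
#     return load_camels_ind_timeseries(self.cfg.data_dir, basin)
```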

-------
pd.DataFrame
Time-indexed DataFrame, containing the time series data (forcings + discharge) data.

Contributor

Remove empty line

Suggested change

additional_features=additional_features,
id_to_int=id_to_int,
scaler=scaler)

Contributor

Maybe it is worth adding a check to the init that verifies cfg.data_dir is in the correct format, with a verbose error message that tells the user to first preprocess the dataset? WDYT?
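Such a check might look roughly like this (the expected layout of attributes.csv plus a preprocessed/ folder is an assumption based on the surrounding comments, and the error wording is illustrative):

```python
from pathlib import Path


def check_data_dir(data_dir: Path) -> None:
    """Raise a verbose error if data_dir does not look like a preprocessed dataset.

    The expected layout (attributes.csv plus a preprocessed/ folder) is an
    assumption for illustration.
    """
    if not (data_dir / "attributes.csv").is_file() or not (data_dir / "preprocessed").is_dir():
        raise FileNotFoundError(
            f"'{data_dir}' does not contain a preprocessed CAMELS-IND dataset. "
            "Please run the preprocessing function first; it creates "
            "attributes.csv and a preprocessed/ folder with per-basin files.")
```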

Comment on lines 103 to 110
for var in self.cfg.dynamic_inputs:
if var not in df.columns:
raise ValueError(f"Dynamic input '{var}' from config not found in columns of {basin_file}. "
f"Available columns: {df.columns.to_list()}")
for var in self.cfg.target_variables:
if var not in df.columns:
raise ValueError(f"Target variable '{var}' from config not found in columns of {basin_file}. "
f"Available columns: {df.columns.to_list()}")
Contributor

For both of these checks, I think it would be better to first get a list of dynamic_inputs/targets that are missing from df.columns and then, if there are any, raise an error that includes all missing vars. Otherwise this error would be raised on the first missing var and wouldn't tell the user that other vars are also missing.

Something like (untested)

missing_vars = [x for x in self.cfg.dynamic_inputs + self.cfg.target_variables if x not in df.columns]
if missing_vars:
  raise ValueError(...)  # error message here with all missing_vars
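Filled out as a standalone function, the suggested check might look like this (the error message wording is illustrative, not from the repository):

```python
def check_missing_variables(columns, dynamic_inputs, target_variables):
    """Collect all missing variables first, then raise once with the full list."""
    required = list(dynamic_inputs) + list(target_variables)
    missing_vars = [x for x in required if x not in columns]
    if missing_vars:
        raise ValueError(
            f"The following variables are missing from the basin file: {missing_vars}")
```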

Comment on lines 86 to 89
if hasattr(self.cfg, 'timeseries_dir'):
timeseries_dir_name = self.cfg.timeseries_dir
else:
timeseries_dir_name = 'preprocessed'
Contributor

The config has no timeseries_dir. Maybe you used this in a hacky way with debug=True? Please make sure to only use cfg.data_dir and hardcode the rest of the nested folder structure.
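Concretely, the hasattr fallback could be replaced by a hardcoded layout rooted at cfg.data_dir (the folder name 'preprocessed' is taken from the fallback in the quoted code):

```python
from pathlib import Path


def basin_file_path(data_dir: Path, basin: str) -> Path:
    # Only data_dir is configurable; the nested folder structure is fixed.
    return data_dir / "preprocessed" / f"{basin}.csv"
```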

Comment on lines 130 to 135
if hasattr(self.cfg, 'attributes_file'):
attribute_file_name = self.cfg.attributes_file
else:
attribute_file_name = 'attributes.csv'

attributes_file = self.cfg.data_dir / attribute_file_name
Contributor

Same as above. cfg.attributes_file does not exist, please only use cfg.data_dir with a hardcoded path from there to the attributes.

Comment on lines 155 to 158
"""
Handles the aggregation of time series data (forcings, streamflow) and
splits it into per-basin files, then prints a summary of available features.
"""
Contributor

Not a correctly formatted doc-string. Missing the one-line summary sentence, the call argument section, and the return type annotation.
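A docstring in the numpydoc style requested here would look roughly like the following (function and parameter names are illustrative, based on the quoted description):

```python
def preprocess_timeseries(data_dir, output_dir):
    """Aggregate the raw time series and write one CSV file per basin.

    Parameters
    ----------
    data_dir : Path
        Path to the raw CAMELS-IND download.
    output_dir : Path
        Directory into which the per-basin CSV files are written.
    """
```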

Comment on lines 257 to 259
"""
Handles the merging of all static attribute files into a single CSV.
"""
Contributor

Same as above

Comment on lines 286 to 288
"""
Orchestrates the full preprocessing of a CAMELS-IND dataset. The dataset can be downloaded from
<https://zenodo.org/records/14999580>
Contributor

Missing the one-line doc-string summary. Also, I think the link will not render correctly in the Sphinx docs. Maybe use the same way you reference the dataset in the doc-string of the class?

Author

saurabh-3110 commented Jul 4, 2025

@kratzert thanks for reviewing my code. I have made changes to the code according to your comments.

  1. The code now converts the streamflow to mm/day, which was previously in m3/s
  2. Modified the docs
  3. Correctly formatted the doc-strings for the functions and corrected the other formatting errors
  4. Hardcoded the paths wherever necessary
  5. Converted the loading functions into public functions
  6. Added a check to make sure that cfg.data_dir is in the correct format, with an error telling the user if it first needs preprocessing
  7. Removed the check for input variables and attributes as I think it was not necessary in the loading function. Modified the check for basins according to your suggestion.

I hope these changes are enough. Let me know if any more changes are required.

@saurabh-3110 saurabh-3110 requested a review from kratzert July 4, 2025 08:44
omriporat1 pushed a commit to omriporat1/neuralhydrology that referenced this pull request Jul 8, 2025
omriporat1 pushed a commit to omriporat1/neuralhydrology that referenced this pull request Jul 8, 2025
Co-authored-by: Frederik Kratzert <kratzert@users.noreply.github.com>
Contributor

@kratzert kratzert left a comment

There are still a few issues with the doc strings that prevent the Sphinx docs from being built. See lines 99-102 in this test https://github.com/neuralhydrology/neuralhydrology/actions/runs/16467592880/job/46548803896?pr=250

I made a guess in my line-by-line comments about what the issue could be, but I am not 100% sure. To iterate more quickly, I suggest you locally install the doc dependencies from https://github.com/neuralhydrology/neuralhydrology/blob/master/environments/rtd_requirements.txt so that you can try building the docs locally to make sure they work.

If you have a neuralhydrology environment, the only things missing are:

  • sphinx>=3.2.1
  • sphinx-rtd-theme>=0.5.0
  • nbsphinx>=0.8.0
  • nbsphinx-link>=1.3.0

Once they are installed, and assuming you have a Linux environment, you can go into the docs/ directory and run make html. You will then see a build/ directory with an html/ subdirectory. If the make html build process finishes without errors, we should be good to go. If you want to inspect the docs locally, just open the index.html inside the build/html directory once they finish building.

Let me know if the instructions are unclear or if I can help you with anything.

Comment on lines 49 to 50
CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India.
Earth System Science Data, 17(2), 461-491.
Contributor

I think these two lines need to be indented to the same level as e.g. the explanation of the input Parameters.

Suggested change

CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India.
Earth System Science Data, 17(2), 461-491.
    CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India.
    Earth System Science Data, 17(2), 461-491.

Comment on lines 169 to 174

Returns
-------
None
This function does not return a value but saves files to disk and
prints summary information to the console.
Contributor

I think the entire returns section needs to be removed.

Comment on lines 325 to 330

Returns
-------
None
This function does not return a value but saves a file to disk and
prints progress and summary information to the console.
Contributor

I think the entire returns section needs to be removed.

print(f"Consolidated attributes file saved to: {attributes_output_file}")


# --- DOCSTRING UPDATED ---
Contributor

Remove

Comment on lines 359 to 360
This function performs two main tasks:
1. Processes time series data (forcings, streamflow) and saves one CSV
Contributor

You might need to have a blank line before this enumeration for the list to render correctly.
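In reStructuredText, an enumerated list inside a docstring needs a blank line before it to render; a sketch of the fix (function name assumed from context):

```python
def preprocess_camels_ind(data_dir, output_dir):
    """Preprocess the CAMELS-IND dataset.

    This function performs two main tasks:

    1. Processes time series data (forcings, streamflow) and saves one CSV
       file per basin.
    2. Merges all static attribute files into a single CSV.
    """
```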

Comment on lines 368 to 373
output_dir/
├── attributes.csv
└── preprocessed/
├── [gauge_id_1].csv
├── [gauge_id_2].csv
└── ...
Contributor

I am not sure if this works. Sphinx (the docs engine) is complaining about unexpected indentation. You might need to wrap this somehow, maybe as a code block.
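One way to satisfy Sphinx is to introduce the tree with a double colon, which turns the indented block into a literal block (function name assumed from context):

```python
def preprocess_camels_ind(output_dir):
    """Preprocess CAMELS-IND into the expected directory layout.

    The resulting structure is::

        output_dir/
        ├── attributes.csv
        └── preprocessed/
            ├── <gauge_id_1>.csv
            └── ...
    """
```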

@saurabh-3110
Author

@kratzert Thanks for taking the time to review and helping me out. I have made the necessary changes and the code should now pass the docs check and be good to go.
