Adding support for CAMELS IND in the datasetzoo #250
Conversation
import os
import pandas as pd
from pathlib import Path
from tqdm import tqdm
from functools import reduce
from typing import List, Dict, Union
import xarray
from neuralhydrology.datasetzoo.basedataset import BaseDataset
from neuralhydrology.utils.config import Config
The imports are not correctly sorted, and in general it seems like the file was not formatted. It would be great if you could run yapf over this file. If you have the default neuralhydrology environment installed, you can either configure your IDE to autoformat on save (e.g. possible in Visual Studio Code) by pointing it to the .style.yapf file, or run yapf on the file manually from the terminal with something like
/path/to/yapf -i /path/to/camelsind.py --style /path/to/.style.yapf
The path to the yapf binary can be found with which yapf if you have the conda environment activated. The .style.yapf file is in the root directory of this repository.
Hi, thanks for opening this PR and for adding support for the CAMELS-IND dataset. Besides the line-by-line comments, a few additional points:
- For multi-basin neural network training, you usually want to work with discharge in mm/day rather than cms. I am not sure how the data is provided in CAMELS-IND, but a few of our other classes (e.g. check CAMELSUS) have a conversion from cms to mm/day implemented. You might want to check whether that is needed here; see the sketch after this comment for the general idea.
- You haven't added your class to the docs, i.e. it would not appear in our online documentation. Can you make sure to add the class to docs/source/api? Just check any of the other dataset classes there for what to do; it will be mostly copy and paste. Besides creating a neuralhydrology.datasetzoo.camelsind.rst file, you also need to add it to https://github.com/neuralhydrology/neuralhydrology/blob/master/docs/source/api/neuralhydrology.datasetzoo.rst
- Please run the yapf code formatter over the file (as stated in the other comment).
Also: I haven't checked the conversion code in detail and trust you that it works. The functions are rather long, but I currently don't have much time to see whether they could be better structured. That is probably not the most important thing anyway.
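For reference, a minimal sketch of such an area-based conversion (the function, column, and attribute names here are placeholders, not the actual CAMELS-IND or neuralhydrology names; it assumes the basin area is available in km² from the attributes):

import pandas as pd

def cms_to_mmd(discharge_cms: pd.Series, area_km2: float) -> pd.Series:
    # m^3/s -> m^3/day (* 86400), normalize by catchment area in m^2 (km^2 * 1e6), m -> mm (* 1000)
    return discharge_cms * 86400.0 * 1000.0 / (area_km2 * 1e6)

# hypothetical usage: df['discharge_mmd'] = cms_to_mmd(df['discharge_cms'], basin_area_km2)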
from neuralhydrology.datasetzoo.basedataset import BaseDataset
from neuralhydrology.utils.config import Config


# This class remains unchanged, but is included for completeness.
What does this comment mean?
id_to_int=id_to_int,
scaler=scaler)

def _load_basin_data(self, basin: str) -> pd.DataFrame:
Can you, similarly to the other dataset classes, move the logic of this method into a public function called load_camels_ind_timeseries(), and do the same for the attribute function below? The class method then simply calls this function. This is helpful because it allows others to also use these functions outside of the neuralhydrology classes to work with the dataset.
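For illustration, a rough sketch of that pattern (the folder layout and csv column names below are assumptions, not the actual CAMELS-IND structure; imports as at the top of the module):

def load_camels_ind_timeseries(data_dir: Path, basin: str) -> pd.DataFrame:
    """Load the preprocessed time series data of a single CAMELS-IND basin (sketch)."""
    basin_file = data_dir / 'preprocessed' / f'{basin}.csv'
    # assumes a 'date' column in the per-basin csv files; adapt to the actual layout
    return pd.read_csv(basin_file, parse_dates=['date'], index_col='date')

# the class method then becomes a thin wrapper:
def _load_basin_data(self, basin: str) -> pd.DataFrame:
    return load_camels_ind_timeseries(self.cfg.data_dir, basin)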
-------
pd.DataFrame
    Time-indexed DataFrame, containing the time series data (forcings + discharge).

Remove empty line
additional_features=additional_features,
id_to_int=id_to_int,
scaler=scaler)
Maybe it is worth adding a check to the init whether cfg.data_dir has the expected structure, with a verbose error message that tells the user to first preprocess the dataset? WDYT?
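For illustration only, a minimal sketch of such a check in the init, assuming the preprocessed per-basin files live in a 'preprocessed' subfolder of cfg.data_dir (untested):

preprocessed_dir = cfg.data_dir / 'preprocessed'
if not preprocessed_dir.is_dir():
    raise FileNotFoundError(f"No 'preprocessed' folder found in {cfg.data_dir}. "
                            "Please preprocess the CAMELS-IND dataset first (see the class docstring).")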
for var in self.cfg.dynamic_inputs:
    if var not in df.columns:
        raise ValueError(f"Dynamic input '{var}' from config not found in columns of {basin_file}. "
                         f"Available columns: {df.columns.to_list()}")
for var in self.cfg.target_variables:
    if var not in df.columns:
        raise ValueError(f"Target variable '{var}' from config not found in columns of {basin_file}. "
                         f"Available columns: {df.columns.to_list()}")
For both of these checks, I think it would be better to first collect a list of the dynamic_inputs/targets that are missing from df.columns and then, if there are any, raise an error that includes all missing vars. Otherwise the error is raised on the first missing var and doesn't tell the user that other vars are also missing.
Something like (untested):
missing_vars = [x for x in self.cfg.dynamic_inputs + self.cfg.target_variables if x not in df.columns]
if missing_vars:
    raise ValueError(f"The following variables from the config are missing in {basin_file}: {missing_vars}. "
                     f"Available columns: {df.columns.to_list()}")
if hasattr(self.cfg, 'timeseries_dir'):
    timeseries_dir_name = self.cfg.timeseries_dir
else:
    timeseries_dir_name = 'preprocessed'
The config has no timeseries_dir. Maybe you used this in a hacky way with debug=True? Please make sure to only use cfg.data_dir and hardcode the rest of the nested folder structure.
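I.e., something along the lines of this sketch:

timeseries_dir = self.cfg.data_dir / 'preprocessed'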
if hasattr(self.cfg, 'attributes_file'):
    attribute_file_name = self.cfg.attributes_file
else:
    attribute_file_name = 'attributes.csv'

attributes_file = self.cfg.data_dir / attribute_file_name
Same as above. cfg.attributes_file does not exist, please only use cfg.data_dir with a hardcoded path from there to the attributes.
""" | ||
Handles the aggregation of time series data (forcings, streamflow) and | ||
splits it into per-basin files, then prints a summary of available features. | ||
""" |
Not a correctly formatted docstring: it is missing the one-line summary sentence, a Parameters section, and the return type annotation.
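For illustration, the expected shape is roughly the following numpy-style layout (the function name and parameters here are placeholders, not the names used in this PR):

def _preprocess_timeseries(data_dir: Path, output_dir: Path) -> None:
    """Aggregate the CAMELS-IND time series and split them into per-basin files.

    Longer description of what the function does.

    Parameters
    ----------
    data_dir : Path
        Path to the raw CAMELS-IND dataset.
    output_dir : Path
        Directory to which the per-basin csv files are written.
    """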
""" | ||
Handles the merging of all static attribute files into a single CSV. | ||
""" |
Same as above
""" | ||
Orchestrates the full preprocessing of a CAMELS-IND dataset. The dataset can be downloaded from | ||
<https://zenodo.org/records/14999580> |
Missing one-line docstring summary. Also, I think the link will not render correctly in the Sphinx docs. Maybe reference the dataset the same way you do in the docstring of the class?
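For example (an assumption based on how the other dataset classes cite their data, not checked against this PR), a References section in the docstring usually renders fine:

References
----------
.. [#] CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments
    in Peninsular India. Earth System Science Data, 17(2), 461-491.
    https://zenodo.org/records/14999580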
@kratzert thanks for reviewing my code. I have made changes to the code according to your comments.
I hope these changes are enough. Let me know if any more changes are required.
Co-authored-by: Frederik Kratzert <kratzert@users.noreply.github.com>
There still are a few issues with the docstrings that prevent the Sphinx docs from being built. See lines 99-102 in this test: https://github.com/neuralhydrology/neuralhydrology/actions/runs/16467592880/job/46548803896?pr=250
I made a guess in my line-by-line comments what the issue could be, but I am not 100% sure. To iterate more quickly, I suggest you locally install the doc dependencies from https://github.com/neuralhydrology/neuralhydrology/blob/master/environments/rtd_requirements.txt so that you can build the docs locally and make sure they work.
If you have a neuralhydrology environment, the only things missing are:
- sphinx>=3.2.1
- sphinx-rtd-theme>=0.5.0
- nbsphinx>=0.8.0
- nbsphinx-link>=1.3.0
Once they are installed, and assuming you have a Linux environment, you can go into the docs/ directory and run make html. You will then see a build/ directory with an html/ subdirectory. If the make html build process finishes without errors, we should be good to go. If you want to inspect the docs locally, just open the index.html inside the build/html directory once it finishes building.
Let me know if the instructions are unclear or if I can help you with anything.
CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India.
Earth System Science Data, 17(2), 461-491.
I think these two lines need to be indented to the same level as e.g. the explanation of the input Parameters.
Suggested change:
    CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India.
    Earth System Science Data, 17(2), 461-491.
Returns
-------
None
    This function does not return a value but saves files to disk and
    prints summary information to the console.
I think the entire returns section needs to be removed.
Returns
-------
None
    This function does not return a value but saves a file to disk and
    prints progress and summary information to the console.
I think the entire returns section needs to be removed.
print(f"Consolidated attributes file saved to: {attributes_output_file}") | ||
|
||
|
||
# --- DOCSTRING UPDATED --- |
Remove
This function performs two main tasks:
1. Processes time series data (forcings, streamflow) and saves one CSV
You might need a blank line before this enumeration for the list to render correctly.
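I.e., roughly this (items abbreviated):

This function performs two main tasks:

1. Processes time series data (forcings, streamflow) and saves one CSV ...
2. ...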
output_dir/
├── attributes.csv
└── preprocessed/
    ├── [gauge_id_1].csv
    ├── [gauge_id_2].csv
    └── ...
I am not sure if this works. Sphinx (the docs engine) is complaining about unexpected indentation. You might need to wrap this somehow, maybe as a code block.
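For example (untested), marking the tree as a reST literal block with a double colon might avoid the unexpected-indentation warning:

The following folder structure is created::

    output_dir/
    ├── attributes.csv
    └── preprocessed/
        ├── [gauge_id_1].csv
        ├── [gauge_id_2].csv
        └── ...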
@kratzert Thanks for taking the time to review and helping me out. I have made the necessary changes and the code should now pass the docs check and be good to go.
I have added support for the CAMELS-IND dataset following the structure and conventions used for other CAMELS datasets in the repo.