Adding support for CAMELS IND in the datasetzoo #250
Conversation
import os
import pandas as pd
from pathlib import Path
from tqdm import tqdm
from functools import reduce
from typing import List, Dict, Union
import xarray
from neuralhydrology.datasetzoo.basedataset import BaseDataset
from neuralhydrology.utils.config import Config
The imports are not correctly sorted, and in general it seems like the file was not formatted. It would be great if you could run yapf over this file. If you have the default neuralhydrology environment installed, you can either configure your IDE to autoformat on save (e.g. possible in Visual Studio Code) by pointing it to the .style.yapf file, or run yapf on the file manually from the terminal with something like
/path/to/yapf -i /path/to/camelsind.py --style /path/to/.style.yapf
The path to the yapf binary can be found with which yapf if you have the conda environment activated. The .style.yapf file is in the root directory of this repository.
Hi, thanks for opening this PR and for adding support for the CAMELS-IND dataset. Besides the line-by-line comments, a few additional points:
- For multi-basin neural network training, you usually want to work with discharge in mm/day rather than cms. I am not sure how the data is provided in CAMELS-IND, but a few of our other classes (e.g. check CAMELSUS) have a conversion from cms to mm/day implemented. You might want to check whether that is needed here; see the sketch after this comment for the general idea.
- You haven't added your class to the docs, i.e. it would not appear in our online documentation. Can you make sure to add the class to docs/source/api? Just check any of the other dataset classes there for what to do; it will be mostly copy and paste. Besides creating a neuralhydrology.datasetzoo.camelsind.rst file, you also need to add it to https://github.com/neuralhydrology/neuralhydrology/blob/master/docs/source/api/neuralhydrology.datasetzoo.rst
- Please run the yapf code formatter over the file (as stated in the other comment).
Also: I haven't checked the conversion code in detail and trust you that it works. The functions are rather long, but I currently don't have much time to see whether they could be better structured. That is probably not the most important thing anyway.
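For reference, a minimal sketch of such an area-based conversion (the function, column, and attribute names here are placeholders, not the actual CAMELS-IND or neuralhydrology names; it assumes the basin area is available in km² from the attributes):

import pandas as pd

def cms_to_mmd(discharge_cms: pd.Series, area_km2: float) -> pd.Series:
    # m^3/s -> m^3/day (* 86400), normalize by catchment area in m^2 (km^2 * 1e6), m -> mm (* 1000)
    return discharge_cms * 86400.0 * 1000.0 / (area_km2 * 1e6)

# hypothetical usage: df['discharge_mmd'] = cms_to_mmd(df['discharge_cms'], basin_area_km2)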
from neuralhydrology.datasetzoo.basedataset import BaseDataset
from neuralhydrology.utils.config import Config


# This class remains unchanged, but is included for completeness.
What does this comment mean?
id_to_int=id_to_int,
scaler=scaler)

def _load_basin_data(self, basin: str) -> pd.DataFrame:
Can you, similarly to the other dataset classes, move the logic of this method into a public function called load_camels_ind_timeseries(), and do the same for the attribute function below? The class method then simply calls this function. This is helpful because it allows others to also use these functions outside of the neuralhydrology classes to work with the dataset.
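For illustration, a rough sketch of that pattern (the folder layout and csv column names below are assumptions, not the actual CAMELS-IND structure; imports as at the top of the module):

def load_camels_ind_timeseries(data_dir: Path, basin: str) -> pd.DataFrame:
    """Load the preprocessed time series data of a single CAMELS-IND basin (sketch)."""
    basin_file = data_dir / 'preprocessed' / f'{basin}.csv'
    # assumes a 'date' column in the per-basin csv files; adapt to the actual layout
    return pd.read_csv(basin_file, parse_dates=['date'], index_col='date')

# the class method then becomes a thin wrapper:
def _load_basin_data(self, basin: str) -> pd.DataFrame:
    return load_camels_ind_timeseries(self.cfg.data_dir, basin)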
-------
pd.DataFrame
    Time-indexed DataFrame, containing the time series data (forcings + discharge).

Remove empty line
additional_features=additional_features,
id_to_int=id_to_int,
scaler=scaler)
Maybe it is worth adding a check to the init whether cfg.data_dir has the expected structure, with a verbose error message that tells the user to first preprocess the dataset? WDYT?
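For illustration only, a minimal sketch of such a check in the init, assuming the preprocessed per-basin files live in a 'preprocessed' subfolder of cfg.data_dir (untested):

preprocessed_dir = cfg.data_dir / 'preprocessed'
if not preprocessed_dir.is_dir():
    raise FileNotFoundError(f"No 'preprocessed' folder found in {cfg.data_dir}. "
                            "Please preprocess the CAMELS-IND dataset first (see the class docstring).")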
for var in self.cfg.dynamic_inputs:
    if var not in df.columns:
        raise ValueError(f"Dynamic input '{var}' from config not found in columns of {basin_file}. "
                         f"Available columns: {df.columns.to_list()}")
for var in self.cfg.target_variables:
    if var not in df.columns:
        raise ValueError(f"Target variable '{var}' from config not found in columns of {basin_file}. "
                         f"Available columns: {df.columns.to_list()}")
For both of these checks, I think it would be better to first collect a list of the dynamic_inputs/targets that are missing from df.columns and then, if there are any, raise an error that includes all missing vars. Otherwise the error is raised on the first missing var and doesn't tell the user that other vars are also missing.
Something like (untested):
missing_vars = [x for x in self.cfg.dynamic_inputs + self.cfg.target_variables if x not in df.columns]
if missing_vars:
    raise ValueError(f"The following variables from the config are missing in {basin_file}: {missing_vars}. "
                     f"Available columns: {df.columns.to_list()}")
if hasattr(self.cfg, 'timeseries_dir'):
    timeseries_dir_name = self.cfg.timeseries_dir
else:
    timeseries_dir_name = 'preprocessed'
The config has no timeseries_dir. Maybe you used this in a hacky way with debug=True? Please make sure to only use cfg.data_dir and hardcode the rest of the nested folder structure.
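I.e., something along the lines of this sketch:

timeseries_dir = self.cfg.data_dir / 'preprocessed'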
if hasattr(self.cfg, 'attributes_file'):
    attribute_file_name = self.cfg.attributes_file
else:
    attribute_file_name = 'attributes.csv'

attributes_file = self.cfg.data_dir / attribute_file_name
Same as above. cfg.attributes_file does not exist, please only use cfg.data_dir with a hardcoded path from there to the attributes.
""" | ||
Handles the aggregation of time series data (forcings, streamflow) and | ||
splits it into per-basin files, then prints a summary of available features. | ||
""" |
Not a correctly formatted docstring: it is missing the one-line summary sentence, a Parameters section, and the return type annotation.
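For illustration, the expected shape is roughly the following numpy-style layout (the function name and parameters here are placeholders, not the names used in this PR):

def _preprocess_timeseries(data_dir: Path, output_dir: Path) -> None:
    """Aggregate the CAMELS-IND time series and split them into per-basin files.

    Longer description of what the function does.

    Parameters
    ----------
    data_dir : Path
        Path to the raw CAMELS-IND dataset.
    output_dir : Path
        Directory to which the per-basin csv files are written.
    """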
""" | ||
Handles the merging of all static attribute files into a single CSV. | ||
""" |
Same as above
""" | ||
Orchestrates the full preprocessing of a CAMELS-IND dataset. The dataset can be downloaded from | ||
<https://zenodo.org/records/14999580> |
Missing one-line docstring summary. Also, I think the link will not render correctly in the Sphinx docs. Maybe reference the dataset the same way you do in the docstring of the class?
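For example (an assumption based on how the other dataset classes cite their data, not checked against this PR), a References section in the docstring usually renders fine:

References
----------
.. [#] CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments
    in Peninsular India. Earth System Science Data, 17(2), 461-491.
    https://zenodo.org/records/14999580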
@kratzert thanks for reviewing my code. I have made changes to the code according to your comments.
I hope these changes are enough. Let me know if any more changes are required.
Co-authored-by: Frederik Kratzert <kratzert@users.noreply.github.com>
There still are a few issues with the docstrings that prevent the Sphinx docs from being built. See lines 99-102 in this test: https://github.com/neuralhydrology/neuralhydrology/actions/runs/16467592880/job/46548803896?pr=250
I made a guess in my line-by-line comments what the issue could be, but I am not 100% sure. To iterate more quickly, I suggest you locally install the doc dependencies from https://github.com/neuralhydrology/neuralhydrology/blob/master/environments/rtd_requirements.txt so that you can build the docs locally and make sure they work.
If you have a neuralhydrology environment, the only things missing are:
- sphinx>=3.2.1
- sphinx-rtd-theme>=0.5.0
- nbsphinx>=0.8.0
- nbsphinx-link>=1.3.0
Once they are installed, and assuming you have a Linux environment, you can go into the docs/ directory and run make html. You will then see a build/ directory with an html/ subdirectory. If the make html build process finishes without errors, we should be good to go. If you want to inspect the docs locally, just open the index.html inside the build/html directory once it finishes building.
Let me know if the instructions are unclear or if I can help you with anything.
CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India.
Earth System Science Data, 17(2), 461-491.
I think these two lines need to be indented to the same level as e.g. the explanation of the input Parameters.
Suggested change:
    CAMELS-IND: hydrometeorological time series and catchment attributes for 228 catchments in Peninsular India.
    Earth System Science Data, 17(2), 461-491.
Returns
-------
None
    This function does not return a value but saves files to disk and
    prints summary information to the console.
I think the entire returns section needs to be removed.
Returns
-------
None
    This function does not return a value but saves a file to disk and
    prints progress and summary information to the console.
I think the entire returns section needs to be removed.
print(f"Consolidated attributes file saved to: {attributes_output_file}") | ||
|
||
|
||
# --- DOCSTRING UPDATED --- |
Remove
This function performs two main tasks:
1. Processes time series data (forcings, streamflow) and saves one CSV
You might need a blank line before this enumeration for the list to render correctly.
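I.e., roughly this (items abbreviated):

This function performs two main tasks:

1. Processes time series data (forcings, streamflow) and saves one CSV ...
2. ...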
output_dir/
├── attributes.csv
└── preprocessed/
    ├── [gauge_id_1].csv
    ├── [gauge_id_2].csv
    └── ...
I am not sure if this works. Sphinx (the docs engine) is complaining about unexpected indentation. You might need to wrap this somehow, maybe as a code block.
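For example (untested), marking the tree as a reST literal block with a double colon might avoid the unexpected-indentation warning:

The following folder structure is created::

    output_dir/
    ├── attributes.csv
    └── preprocessed/
        ├── [gauge_id_1].csv
        ├── [gauge_id_2].csv
        └── ...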
@kratzert Thanks for taking the time to review and helping me out. I have made the necessary changes and the code should now pass the docs check and be good to go.
I have added support for the CAMELS-IND dataset following the structure and conventions used for other CAMELS datasets in the repo.