[WIP] Add PReLU and LeakyReLU #586
Conversation
PReLU forward and backward now working properly
@coreylowman I believe this is done for the most part.
src/tensor_ops/prelu/cpu_kernel.rs
Outdated
let r = *rhs.as_vec().get(0).unwrap();

// Should I do this?
let scale = E::from_f32(1.0 / lhs_grad.len() as f32).unwrap();
Is this okay to do? I thought some normalization might be needed since it is being updated once for every entry in the tensor.
You shouldn't. Backward operations should always do the following:
input_grad_arg += partial_derivative_arg * output_grad
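For concreteness, here is a minimal standalone sketch of that rule applied to PReLU; the function name and the plain f32 slices are hypothetical, but the lhs / rhs / lhs_grad / rhs_grad / out_grad buffers mirror the kernel excerpt above, and there is no 1/len normalization anywhere:

// Hypothetical sketch, not dfdx's kernel API. PReLU: f(x, a) = x if x > 0, else a * x,
// so df/dx = 1 or a, and df/da = 0 or x.
fn prelu_backward(lhs: &[f32], rhs: f32, lhs_grad: &mut [f32], rhs_grad: &mut f32, out_grad: &[f32]) {
    for i in 0..lhs.len() {
        let (dfdx, dfda) = if lhs[i] > 0.0 { (1.0, 0.0) } else { (rhs, lhs[i]) };
        // input_grad_arg += partial_derivative_arg * output_grad, and nothing else
        lhs_grad[i] += dfdx * out_grad[i];
        *rhs_grad += dfda * out_grad[i];
    }
}

fn main() {
    let lhs = [2.0f32, -3.0];
    let out_grad = [1.0f32, 1.0];
    let mut lhs_grad = [0.0f32; 2];
    let mut rhs_grad = 0.0f32;
    prelu_backward(&lhs, 0.25, &mut lhs_grad, &mut rhs_grad, &out_grad);
    assert_eq!(lhs_grad, [1.0, 0.25]);
    assert_eq!(rhs_grad, -3.0);
}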
Nevermind, it doesn't run.
Actually runs now
Changed some defaults to match Pytorch
I appreciate the effort that went in to implementing this, but I don't feel that this implementation makes good use of dfdx's internal features for implementing binary tensor operations. I recommend refactoring this and modeling it after the implementation of 'add', with LeakyReLU corresponding to ScalarAdd, and PReLU corresponding to BinaryAdd. Implementing these operations in this way will also be much simpler, as the high-level kernel logic is already implemented.
Using this version of the PReLU op, I think we should have PReLU0D and PReLU1D modules in nn to mirror PyTorch's interface. These would work by broadcasting their a fields to the shape of the input tensor before calling prelu on them.
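As a rough sketch of the BinaryAdd-style route (the BinaryDerivative trait and PReLUKernelOp below are stand-ins for the dfdx internals quoted later in this thread, and the dfdy method is an assumption about that trait rather than something confirmed here):

use num_traits::Float;

// Stand-ins for the crate's internals, included so the sketch compiles on its own.
pub struct PReLUKernelOp;

pub trait BinaryDerivative<E> {
    fn f(&self, x: &E, y: &E) -> E;
    fn dfdx(&self, x: &E, y: &E) -> E;
    fn dfdy(&self, x: &E, y: &E) -> E; // assumed counterpart of dfdx
}

impl<E: Float> BinaryDerivative<E> for PReLUKernelOp {
    // prelu(x, a) = max(x, 0) + a * min(x, 0)
    fn f(&self, x: &E, y: &E) -> E {
        x.max(E::zero()) + *y * x.min(E::zero())
    }
    // d/dx = 1 for x > 0, otherwise the slope a
    fn dfdx(&self, x: &E, y: &E) -> E {
        if *x > E::zero() { E::one() } else { *y }
    }
    // d/da = 0 for x > 0, otherwise x
    fn dfdy(&self, x: &E, _y: &E) -> E {
        if *x > E::zero() { E::zero() } else { *x }
    }
}

The LeakyReLU/ScalarAdd variant would be the same math with the slope held as a constant instead of a second tensor.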
src/tensor_ops/prelu/prelu.cu
Outdated
}
else {
    lhs_grad[i] += rhs * out_grad[i];
    *rhs_grad += lhs[i] * size_f * out_grad[i];
This is unsound: several threads could be trying to add to this location at once, which can cause a race condition and make the sum "lose" some of the inputs. Use atomicAdd when you need several threads to modify the same location.
src/tensor_ops/prelu/cpu_kernel.rs
Outdated
impl<E: Float> BinaryDerivative<E> for PReLUKernelOp {
    fn f(&self, x: &E, y: &E) -> E {
        let zero = E::from(0.0).unwrap();
        x.max(zero) + *y * x.min(zero)
    }

    fn dfdx(&self, x: &E, y: &E) -> E {
        let zero = E::from(0.0).unwrap();
        let one = E::from(1.0).unwrap();
This implementation is good, and can be directly used to implement the CPU binary kernels (see next comment).
I did this using broadcast, but won't this have the same issue with concurrent overlaps?
}

prelu1d!((B, C), Axis<0>);
prelu1d!((B, C, M), Axes2<0, 2>);
Is this how we should handle the 3d case? I would think this use case would usually occur when dealing with convolutions, in which case this should look like the following to mirror bias2d. In any case, this behavior should be documented.
- prelu1d!((B, C, M), Axes2<0, 2>);
+ prelu1d!((C, B, M), Axes2<1, 2>);
Hmm, PyTorch's page says:
Channel dim is the 2nd dim of input. When input has dims < 2, then there is no channel dim and the number of channels = 1.
Unsure what we want the behavior to be TBH.
It won't, because the current implementation of the backward pass of binary operations uses chunk_sum, which avoids data races by synchronizing the reads and writes of all threads. The current version looks really good, good job! The only things that still need doing besides my previous comment are fixing the workflow errors and considering moving prelu from …
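For intuition only, here is a serial sketch of the chunking idea (not dfdx's actual chunk_sum; in the real kernel each chunk would be reduced by its own thread, so no two writers ever share an accumulator):

// Illustrative stand-in: per-chunk partial sums are computed independently,
// then combined in a single final pass.
fn chunk_sum(contributions: &[f32], chunk_size: usize) -> f32 {
    let partials: Vec<f32> = contributions
        .chunks(chunk_size)
        .map(|chunk| chunk.iter().sum::<f32>())
        .collect();
    partials.into_iter().sum()
}

fn main() {
    let grads = vec![0.5f32; 1024];
    assert_eq!(chunk_sum(&grads, 128), 512.0);
}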
Oh, apparently I never actually clicked submit on my review - whoops.
src/nn/activations.rs
Outdated
impl<Ax: Axes, S, E: Dtype, D: Device<E>, T: Tape<E, D>> Module<Tensor<S, E, D, T>> for LeakyReLU<E>
where
    S: Shape<LastAxis = Ax> + ReduceShape<Ax>,
    D: PReLUKernel<Tensor<S, E, D>, Tensor<(), E, D>, Output = Tensor<S, E, D>, Elem = E>,
We should add PReLUKernel to the trait Device list in src/tensor_ops/utilities/devices.rs. Then you shouldn't need this bound 😀
This should be done
src/tensor_ops/prelu/cpu_kernel.rs
Outdated
let zero = E::from(0.0).unwrap();
let one = E::from(1.0).unwrap();
if x > &zero {
    one
I think you can use num_traits::{Zero, One} here. Would look something like:
- let zero = E::from(0.0).unwrap();
- let one = E::from(1.0).unwrap();
- if x > &zero {
-     one
+ if x > &E::zero() {
+     E::one()
This should be done
src/nn/activations.rs
Outdated
type Error = D::Err;

fn try_forward(&self, input: Tensor<S, E, D, T>) -> Result<Self::Output, D::Err> {
    input.try_prelu(self.a.clone())
I'm wondering if we even need a specialized tensor_op for PReLU - we can probably implement this using other already implemented ops:
let scale = self.a.retaped::<T>().broadcast_like(t.shape());
let max_0 = input.with_empty_tape().relu();
let min_0 = input.negate().relu().negate();
max_0 + scale * min_0
Thoughts? It would greatly simplify the implementation
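As a quick scalar sanity check on the identity this composite relies on (plain arithmetic, nothing dfdx-specific):

// prelu(x, a) = x for x > 0 and a * x otherwise; the composite above computes
// relu(x) + a * (-relu(-x)), i.e. max(x, 0) + a * min(x, 0).
fn main() {
    let a = 0.25f32;
    for x in [-2.0f32, -0.5, 0.0, 3.0] {
        let composite = x.max(0.0) + a * x.min(0.0);
        let reference = if x > 0.0 { x } else { a * x };
        assert_eq!(composite, reference);
    }
}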
Until we have operator fusion, we should avoid doing this, because of the added memory use and overhead this brings.
TBH I'd prefer simpler code that can be optimized later with operator fusion. Plus, once we add operator fusion, we don't have to go back and remove all this extra stuff. The fewer specialized kernels we have to write, the better IMO. I think PReLU is niche enough that it's not necessarily on the hot path.
I'm not really convinced that the short-term performance loss is worth the convenience for a long-term optimization, but if we are going to do it for this operator, here's a more efficient implementation:
let scale = self.a.retaped::<T>().broadcast_like(t.shape());
let scaled = input.with_empty_tape() * scale;
(input.scalar_lt(0.0)).choose(scaled, input)
I'd like to start doing this more and more because:
- It lowers the bar to contributing tensor ops because you don't have to be a CUDA expert or even write multiple kernels
- As we look to adding additional devices (e.g. Add OpenCL device #597), each additional tensor op with a custom kernel makes this harder.
- It's more maintainable
- It gives us a wider set of test cases for Operator fusion #607, which is becoming higher priority in my mind as I've looked at optimizations recently
So, what needs to be done on this? Should I change it to a non-device-specific kernel or keep it as is?
@opfromthestart let's change it to a non-device-specific kernel
@coreylowman I removed the device-specific kernels, using nkoppel's implementation, and everything works
Looking great! I think after next round we can merge 👍
src/tensor_ops/prelu/mod.rs
Outdated
/// See [prelu]
fn try_prelu(self, rhs: E) -> Result<Self, Self::Err> {
    let dev = D::default();
Should pull the device from the tensor instead so we aren't re-allocating devices
- let dev = D::default();
+ let dev = self.device.clone();
Done
src/nn/activations.rs
Outdated
type Error = <Tensor<($($InDims),*), E, D, T> as HasErr>::Err;

fn try_forward(&self, input: Tensor<($($InDims),*), E, D, T>) -> Result<Self::Output, Self::Error> {
    input.try_prelu(self.a.clone().broadcast())
Will need to record broadcast on the tape:
- input.try_prelu(self.a.clone().broadcast())
+ input.try_prelu(self.a.retaped::<T>().broadcast())
done
src/nn/activations.rs
Outdated
type Error = <Tensor<S, E, D, T> as HasErr>::Err;

fn try_forward(&self, input: Tensor<S, E, D, T>) -> Result<Self::Output, Self::Error> {
    input.try_prelu(self.a.clone().broadcast())
- input.try_prelu(self.a.clone().broadcast())
+ input.try_prelu(self.a.retaped::<T>().broadcast())
done
src/nn/activations.rs
Outdated
/// Calls [prelu()] with learnable value.
#[derive(Debug, Clone)]
pub struct PReLU<E: Dtype, D: Device<E>> {
    a: Tensor<(), E, D>,
For accessing the scalar outside of the crate:
- a: Tensor<(), E, D>,
+ pub a: Tensor<(), E, D>,
done
src/tensor_ops/prelu/mod.rs
Outdated
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct PReLUKernelOp;

#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct LeakyReLUKernelOp<E> {
    slope: E,
}
Can remove these
done
src/tensor_ops/mod.rs
Outdated
pub use normalize::normalize;
pub use permute_to::PermuteTo;
pub use pow::{powf, powi};
pub use prelu::{leakyrelu, prelu, LeakyReLUKernelOp, PReLUKernelOp, TryPReLU};
- pub use prelu::{leakyrelu, prelu, LeakyReLUKernelOp, PReLUKernelOp, TryPReLU};
+ pub use prelu::{leakyrelu, prelu, TryPReLU};
done
src/nn/activations.rs
Outdated
impl<C: ConstDim, E: Dtype, D: Device<E>> Default for PReLU1D<C, E, D> {
    fn default() -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(E::from_f32(0.25).unwrap()).broadcast(),
        }
    }
}
Let's remove this to make it consistent with the other nn modules (so that you have to go through BuildModule).
done
src/nn/activations.rs
Outdated
/// Calls [prelu()] with constant value.
#[derive(Debug, Clone, Copy)]
pub struct LeakyReLU<E: Dtype>(E);
- pub struct LeakyReLU<E: Dtype>(E);
+ pub struct LeakyReLU<E: Dtype>(pub E);
done
src/nn/activations.rs
Outdated
impl<E: Dtype, D: Device<E>> Default for PReLU<E, D> {
    fn default() -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(E::from_f32(0.25).unwrap()),
        }
    }
}

impl<E: Dtype, D: Device<E>> From<E> for PReLU<E, D> {
    fn from(value: E) -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(value),
        }
    }
}
Let's remove these so it's consistent with the other nn modules - so it has to go through BuildModule to get access to the device.
done
Awesome, thanks for all the work on this! 🚀
Fixes #287, #389, #584
Will add PReLU and LeakyReLU operations and modules.