Conversation

@opfromthestart (Contributor) commented Mar 19, 2023

Fixes #287, #389, #584

Will add PReLU and LeakyReLU operations and modules.

  • Do NoneTape forward
  • Do OwnedTape forward
  • Do backward for LeakyReLU
  • Do backward for PReLU
  • (maybe later) do CUDA version

@opfromthestart changed the title from "Add PReLU and LeakyReLU" to "[WIP] Add PReLU and LeakyReLU" on Mar 19, 2023

@opfromthestart (Contributor, Author) commented:

@coreylowman I believe this is done for the most part.

let r = *rhs.as_vec().get(0).unwrap();

// Should I do this?
let scale = E::from_f32(1.0 / lhs_grad.len() as f32).unwrap();

@opfromthestart (Contributor, Author):

Is this okay to do? I thought some normalization might be needed since it is being updated once for every entry in the tensor.

@nkoppel (Contributor):

You shouldn't. Backward operations should always do the following:

input_grad_arg += partial_derivative_arg * output_grad
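
As an illustration of that rule, here is a minimal plain-Rust sketch of a LeakyReLU backward pass (the function name and signature are hypothetical, not dfdx's kernel API); the gradient is accumulated with += and never rescaled by the tensor size:

fn leaky_relu_backward(input: &[f32], slope: f32, output_grad: &[f32], input_grad: &mut [f32]) {
    for i in 0..input.len() {
        // d/dx [max(x, 0) + slope * min(x, 0)] is 1 for x > 0 and `slope` otherwise.
        let dfdx = if input[i] > 0.0 { 1.0 } else { slope };
        // input_grad_arg += partial_derivative_arg * output_grad, with no normalization.
        input_grad[i] += dfdx * output_grad[i];
    }
}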

@opfromthestart changed the title from "[WIP] Add PReLU and LeakyReLU" to "Add PReLU and LeakyReLU" on Mar 20, 2023

@opfromthestart (Contributor, Author) commented:

Never mind, it doesn't run.

@opfromthestart changed the title from "Add PReLU and LeakyReLU" to "[WIP] Add PReLU and LeakyReLU" on Mar 20, 2023
Actually runs now
Changed some defaults to match Pytorch

@nkoppel (Contributor) left a comment

I appreciate the effort that went into implementing this, but I don't feel that this implementation makes good use of dfdx's internal features for implementing binary tensor operations. I recommend refactoring this and modeling it after the implementation of 'add', with LeakyReLU corresponding to ScalarAdd, and PReLU corresponding to BinaryAdd. Implementing these operations in this way will also be much simpler, as the high-level kernel logic is already implemented.

Using this version of the PReLU op, I think we should have PReLU0D and PReLU1D modules in nn to mirror PyTorch's interface. These would work by broadcasting their a fields to the shape of the input tensor before calling prelu on them.
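
For intuition, a minimal plain-Rust sketch of what that broadcast amounts to in the 1d case (illustrative only, no dfdx types, and the function name is made up): every element of channel c uses that channel's a[c] on its negative side.

// prelu(x, a) = max(x, 0) + a * min(x, 0), with one learnable `a` per channel.
fn prelu_per_channel(input: &[Vec<f32>], a: &[f32]) -> Vec<Vec<f32>> {
    input
        .iter()
        .zip(a)
        .map(|(channel, &a_c)| {
            channel
                .iter()
                .map(|&x| x.max(0.0) + a_c * x.min(0.0))
                .collect()
        })
        .collect()
}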


}
else {
    lhs_grad[i] += rhs * out_grad[i];
    *rhs_grad += lhs[i] * size_f * out_grad[i];

@nkoppel (Contributor):

This is unsound, because several threads could be trying to add to this location at once, which can cause a race condition and make the sum "lose" some of the inputs. Use atomicAdd when you need to have several threads modify the same location.
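
The same rule in safe-Rust terms, using std atomics as the analogue of CUDA's atomicAdd (a standalone sketch, not code from this PR): concurrent accumulation into one location has to be an atomic read-modify-write, otherwise updates can be lost.

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    // Eight threads all accumulate into the same location; fetch_add makes each
    // update atomic, so none of the contributions are lost.
    let sum = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..8 {
            s.spawn(|| {
                for _ in 0..1_000 {
                    sum.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    assert_eq!(sum.load(Ordering::Relaxed), 8_000);
}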

Comment on lines 11 to 19
impl<E: Float> BinaryDerivative<E> for PReLUKernelOp {
    fn f(&self, x: &E, y: &E) -> E {
        let zero = E::from(0.0).unwrap();
        x.max(zero) + *y * x.min(zero)
    }

    fn dfdx(&self, x: &E, y: &E) -> E {
        let zero = E::from(0.0).unwrap();
        let one = E::from(1.0).unwrap();

@nkoppel (Contributor):

This implementation is good, and can be directly used to implement the cpu binary kernels. (see next comment)
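
For reference, the math the quoted impl encodes (including the derivative methods the quote truncates) can be written as standalone functions; this is a sketch using num_traits, with made-up function names rather than the PR's trait methods.

use num_traits::Float;

/// prelu(x, a) = max(x, 0) + a * min(x, 0)
fn prelu_f<E: Float>(x: E, a: E) -> E {
    x.max(E::zero()) + a * x.min(E::zero())
}

/// Partial derivative with respect to x: 1 where x > 0, a elsewhere.
fn prelu_dfdx<E: Float>(x: E, a: E) -> E {
    if x > E::zero() { E::one() } else { a }
}

/// Partial derivative with respect to a: min(x, 0).
fn prelu_dfda<E: Float>(x: E, _a: E) -> E {
    x.min(E::zero())
}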

@opfromthestart (Contributor, Author) commented:

I did this using broadcast, but won't this have the same issue with concurrent overlaps?

}

prelu1d!((B, C), Axis<0>);
prelu1d!((B, C, M), Axes2<0, 2>);

@nkoppel (Contributor) commented Mar 26, 2023:

Is this how we should handle the 3d case? I would think this use case would usually occur when dealing with convolutions, in which case this should look like the following to mirror bias2d. In any case, this behavior should be documented.

Suggested change:
- prelu1d!((B, C, M), Axes2<0, 2>);
+ prelu1d!((C, B, M), Axes2<1, 2>);

@coreylowman (Owner):

Hmm pytorch's page says:

Channel dim is the 2nd dim of input. When input has dims < 2, then there is no channel dim and the number of channels = 1.

Unsure what we want the behavior to be TBH.

@nkoppel (Contributor) commented Mar 26, 2023:

> I did this using broadcast, but won't this have the same issue with concurrent overlaps?

It won't, because the current implementation of the backward pass of binary operations uses chunk_sum, which avoids data races by synchronizing the reads and writes of all threads.
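
Roughly the idea, sketched in plain Rust with made-up names (dfdx's actual chunk_sum kernel differs): each worker writes only to its own slot, and the slots are combined afterwards, so no two threads ever write the same location.

use std::thread;

fn chunked_sum(values: &[f32], n_chunks: usize) -> f32 {
    assert!(n_chunks > 0);
    let chunk_len = ((values.len() + n_chunks - 1) / n_chunks).max(1);
    let mut partials = vec![0.0f32; n_chunks];
    thread::scope(|s| {
        // Each spawned thread owns exactly one `slot`, so there are no shared writes.
        for (slot, chunk) in partials.iter_mut().zip(values.chunks(chunk_len)) {
            s.spawn(move || *slot = chunk.iter().sum());
        }
    });
    partials.iter().sum()
}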

The current version looks really good, good job! Only things that need doing besides my previous comment are fixing the workflow errors and considering moving prelu from activations into its own file.

@coreylowman (Owner) left a comment

Oh, apparently I never actually clicked submit on my review, whoops.

impl<Ax: Axes, S, E: Dtype, D: Device<E>, T: Tape<E, D>> Module<Tensor<S, E, D, T>> for LeakyReLU<E>
where
    S: Shape<LastAxis = Ax> + ReduceShape<Ax>,
    D: PReLUKernel<Tensor<S, E, D>, Tensor<(), E, D>, Output = Tensor<S, E, D>, Elem = E>,

@coreylowman (Owner):

We should add PReLUKernel to the trait Device list in src/tensor_ops/utilities/devices.rs. Then you shouldn't need this bound 😀

@opfromthestart (Contributor, Author):

This should be done

Comment on lines 18 to 21
let zero = E::from(0.0).unwrap();
let one = E::from(1.0).unwrap();
if x > &zero {
    one

@coreylowman (Owner):

I think you can use num_traits::{Zero, One} here. Would look something like:

Suggested change:
- let zero = E::from(0.0).unwrap();
- let one = E::from(1.0).unwrap();
- if x > &zero {
-     one
+ if x > &E::zero() {
+     E::one()

@opfromthestart (Contributor, Author):

This should be done

type Error = D::Err;

fn try_forward(&self, input: Tensor<S, E, D, T>) -> Result<Self::Output, D::Err> {
    input.try_prelu(self.a.clone())

@coreylowman (Owner):

I'm wondering if we even need a specialized tensor_op for PReLU - we can probably implement this using other already implemented ops:

let scale = self.a.retaped::<T>().broadcast_like(t.shape());
let max_0 = input.with_empty_tape().relu();
let min_0 = input.negate().relu().negate();
max_0 + scale * min_0

Thoughts? It would greatly simplify the implementation
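
(As a quick sanity check of the identity this relies on: for x = -2 and a = 0.1, max_0 = 0 and min_0 = -2, so the result is 0 + 0.1 * (-2) = -0.2 = a * x; for x = 3 it is 3 + 0.1 * 0 = 3 = x, which is exactly prelu.)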

@nkoppel (Contributor):

Until we have operator fusion, we should avoid doing this, because of the added memory use and overhead this brings.

@coreylowman (Owner):

TBH I'd prefer simpler code that can be optimized later with operator fusion. Plus, once we add operator fusion we don't have to go back and remove all this extra stuff. The fewer specialized kernels we have to write the better, IMO. I think PReLU is niche enough that it's not necessarily on the hot path.

@nkoppel (Contributor):

I'm not really convinced that the short-term performance loss is worth the convenience for a long-term optimization, but if we are going to do it for this operator, here's a more efficient implementation:

let scale = self.a.retaped::<T>().broadcast_like(t.shape());
let scaled = input.with_empty_tape() * scale;
(input.scalar_lt(0.0)).choose(scaled, input) 
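
(Read as: wherever input < 0 the mask selects scaled = a * input, and elsewhere it keeps input unchanged, which is again exactly prelu.)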

@coreylowman (Owner):

I'd like to start doing this more and more because:

  1. It lowers the bar to contributing tensor ops because you don't have to be a CUDA expert or even write multiple kernels.
     • As we look to adding additional devices (e.g. Add OpenCL device #597), each additional tensor op with a custom kernel makes this harder.
  2. It's more maintainable.
  3. It gives us a wider set of test cases for Operator fusion #607, which is becoming higher priority in my mind as I've looked at optimizations recently.

@opfromthestart (Contributor, Author) commented:

So, what needs to be done on this? Should I change it to a non-device specific kernel or keep it as is?

@nkoppel (Contributor) commented Mar 30, 2023:

@coreylowman

@coreylowman (Owner) commented:

@opfromthestart let's change it to a non-device-specific kernel.

@opfromthestart (Contributor, Author) commented:

@coreylowman I removed the device specific kernels, using nkoppel's implementation, and everything works

@coreylowman (Owner) left a comment

Looking great! I think after next round we can merge 👍


/// See [prelu]
fn try_prelu(self, rhs: E) -> Result<Self, Self::Err> {
    let dev = D::default();

@coreylowman (Owner):

Should pull the device from the tensor instead so we aren't re-allocating devices

Suggested change:
- let dev = D::default();
+ let dev = self.device.clone();

@opfromthestart (Contributor, Author):

Done

type Error = <Tensor<($($InDims),*), E, D, T> as HasErr>::Err;

fn try_forward(&self, input: Tensor<($($InDims),*), E, D, T>) -> Result<Self::Output, Self::Error> {
    input.try_prelu(self.a.clone().broadcast())

@coreylowman (Owner):

Will need to record broadcast on the tape:

Suggested change:
- input.try_prelu(self.a.clone().broadcast())
+ input.try_prelu(self.a.retaped::<T>().broadcast())

@opfromthestart (Contributor, Author):

done

type Error = <Tensor<S, E, D, T> as HasErr>::Err;

fn try_forward(&self, input: Tensor<S, E, D, T>) -> Result<Self::Output, Self::Error> {
    input.try_prelu(self.a.clone().broadcast())

@coreylowman (Owner):

Suggested change:
- input.try_prelu(self.a.clone().broadcast())
+ input.try_prelu(self.a.retaped::<T>().broadcast())

@opfromthestart (Contributor, Author):

done

/// Calls [prelu()] with learnable value.
#[derive(Debug, Clone)]
pub struct PReLU<E: Dtype, D: Device<E>> {
    a: Tensor<(), E, D>,

@coreylowman (Owner):

For accessing the scalar outside of the crate:

Suggested change:
-     a: Tensor<(), E, D>,
+     pub a: Tensor<(), E, D>,

@opfromthestart (Contributor, Author):

done

Comment on lines 12 to 20
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct PReLUKernelOp;

#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct LeakyReLUKernelOp<E> {
    slope: E,
}

@coreylowman (Owner):

Can remove these

@opfromthestart (Contributor, Author):

done

pub use normalize::normalize;
pub use permute_to::PermuteTo;
pub use pow::{powf, powi};
pub use prelu::{leakyrelu, prelu, LeakyReLUKernelOp, PReLUKernelOp, TryPReLU};

@coreylowman (Owner):

Suggested change:
- pub use prelu::{leakyrelu, prelu, LeakyReLUKernelOp, PReLUKernelOp, TryPReLU};
+ pub use prelu::{leakyrelu, prelu, TryPReLU};

@opfromthestart (Contributor, Author):

done


Comment on lines 162 to 169
impl<C: ConstDim, E: Dtype, D: Device<E>> Default for PReLU1D<C, E, D> {
    fn default() -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(E::from_f32(0.25).unwrap()).broadcast(),
        }
    }
}

@coreylowman (Owner):

Let's remove this to make it consistent with the other nn modules (that you have to go through BuildModule)

@opfromthestart (Contributor, Author):

done


/// Calls [prelu()] with constant value.
#[derive(Debug, Clone, Copy)]
pub struct LeakyReLU<E: Dtype>(E);

@coreylowman (Owner):

Suggested change:
- pub struct LeakyReLU<E: Dtype>(E);
+ pub struct LeakyReLU<E: Dtype>(pub E);

@opfromthestart (Contributor, Author):

done

Comment on lines 110 to 126
impl<E: Dtype, D: Device<E>> Default for PReLU<E, D> {
    fn default() -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(E::from_f32(0.25).unwrap()),
        }
    }
}

impl<E: Dtype, D: Device<E>> From<E> for PReLU<E, D> {
    fn from(value: E) -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(value),
        }
    }
}

@coreylowman (Owner):

Let's remove these so it's consistent with other nn modules - so it has to go through BuildModule to get access to device

@opfromthestart (Contributor, Author):

done

@coreylowman (Owner) left a comment

Awesome, thanks for all the work on this! 🚀

@coreylowman merged commit f74a739 into coreylowman:main on Apr 1, 2023