Conversation

@opfromthestart (Contributor) commented Mar 19, 2023

Fixes #287, #389, #584

Will add PReLU and LeakyReLU operations and modules.

  • Do NoneTape forward
  • Do OwnedTape forward
  • Do backward for LeakyReLU
  • Do backward for PReLU
  • (maybe later) do CUDA version

@opfromthestart changed the title from "Add PReLU and LeakyReLU" to "[WIP] Add PReLU and LeakyReLU" on Mar 19, 2023

@opfromthestart (Contributor, Author) commented:

@coreylowman I believe this is done for the most part.

let r = *rhs.as_vec().get(0).unwrap();

// Should I do this?
let scale = E::from_f32(1.0 / lhs_grad.len() as f32).unwrap();

@opfromthestart (Contributor, Author):

Is this okay to do? I thought some normalization might be needed since it is being updated once for every entry in the tensor.

@nkoppel (Contributor):

You shouldn't. Backward operations should always do the following:

input_grad_arg += partial_derivative_arg * output_grad
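
As an illustration of that rule, here is a minimal plain-Rust sketch of a LeakyReLU backward pass (the function name and signature are hypothetical, not dfdx's kernel API); the gradient is accumulated with += and never rescaled by the tensor size:

fn leaky_relu_backward(input: &[f32], slope: f32, output_grad: &[f32], input_grad: &mut [f32]) {
    for i in 0..input.len() {
        // d/dx [max(x, 0) + slope * min(x, 0)] is 1 for x > 0 and `slope` otherwise.
        let dfdx = if input[i] > 0.0 { 1.0 } else { slope };
        // input_grad_arg += partial_derivative_arg * output_grad, with no normalization.
        input_grad[i] += dfdx * output_grad[i];
    }
}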

@opfromthestart changed the title from "[WIP] Add PReLU and LeakyReLU" to "Add PReLU and LeakyReLU" on Mar 20, 2023

@opfromthestart (Contributor, Author) commented:

Never mind, it doesn't run.

@opfromthestart changed the title from "Add PReLU and LeakyReLU" to "[WIP] Add PReLU and LeakyReLU" on Mar 20, 2023
Actually runs now
Changed some defaults to match Pytorch

@nkoppel (Contributor) left a comment

I appreciate the effort that went into implementing this, but I don't feel that this implementation makes good use of dfdx's internal features for implementing binary tensor operations. I recommend refactoring this and modeling it after the implementation of 'add', with LeakyReLU corresponding to ScalarAdd, and PReLU corresponding to BinaryAdd. Implementing these operations in this way will also be much simpler, as the high-level kernel logic is already implemented.

Using this version of the PReLU op, I think we should have PReLU0D and PReLU1D modules in nn to mirror PyTorch's interface. These would work by broadcasting their a fields to the shape of the input tensor before calling prelu on them.
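
For intuition, a minimal plain-Rust sketch of what that broadcast amounts to in the 1d case (illustrative only, no dfdx types, and the function name is made up): every element of channel c uses that channel's a[c] on its negative side.

// prelu(x, a) = max(x, 0) + a * min(x, 0), with one learnable `a` per channel.
fn prelu_per_channel(input: &[Vec<f32>], a: &[f32]) -> Vec<Vec<f32>> {
    input
        .iter()
        .zip(a)
        .map(|(channel, &a_c)| {
            channel
                .iter()
                .map(|&x| x.max(0.0) + a_c * x.min(0.0))
                .collect()
        })
        .collect()
}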


}
else {
    lhs_grad[i] += rhs * out_grad[i];
    *rhs_grad += lhs[i] * size_f * out_grad[i];

@nkoppel (Contributor):

This is unsound, because several threads could be trying to add to this location at once, which can cause a race condition and make the sum "lose" some of the inputs. Use atomicAdd when you need to have several threads modify the same location.
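
The same rule in safe-Rust terms, using std atomics as the analogue of CUDA's atomicAdd (a standalone sketch, not code from this PR): concurrent accumulation into one location has to be an atomic read-modify-write, otherwise updates can be lost.

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    // Eight threads all accumulate into the same location; fetch_add makes each
    // update atomic, so none of the contributions are lost.
    let sum = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..8 {
            s.spawn(|| {
                for _ in 0..1_000 {
                    sum.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    assert_eq!(sum.load(Ordering::Relaxed), 8_000);
}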

Comment on lines 11 to 19
impl<E: Float> BinaryDerivative<E> for PReLUKernelOp {
    fn f(&self, x: &E, y: &E) -> E {
        let zero = E::from(0.0).unwrap();
        x.max(zero) + *y * x.min(zero)
    }

    fn dfdx(&self, x: &E, y: &E) -> E {
        let zero = E::from(0.0).unwrap();
        let one = E::from(1.0).unwrap();

@nkoppel (Contributor):

This implementation is good, and can be directly used to implement the cpu binary kernels. (see next comment)
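
For reference, the math the quoted impl encodes (including the derivative methods the quote truncates) can be written as standalone functions; this is a sketch using num_traits, with made-up function names rather than the PR's trait methods.

use num_traits::Float;

/// prelu(x, a) = max(x, 0) + a * min(x, 0)
fn prelu_f<E: Float>(x: E, a: E) -> E {
    x.max(E::zero()) + a * x.min(E::zero())
}

/// Partial derivative with respect to x: 1 where x > 0, a elsewhere.
fn prelu_dfdx<E: Float>(x: E, a: E) -> E {
    if x > E::zero() { E::one() } else { a }
}

/// Partial derivative with respect to a: min(x, 0).
fn prelu_dfda<E: Float>(x: E, _a: E) -> E {
    x.min(E::zero())
}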

@opfromthestart (Contributor, Author) commented:

I did this using broadcast, but won't this have the same issue with concurrent overlaps?

}

prelu1d!((B, C), Axis<0>);
prelu1d!((B, C, M), Axes2<0, 2>);

@nkoppel (Contributor) commented Mar 26, 2023:

Is this how we should handle the 3d case? I would think this use case would usually occur when dealing with convolutions, in which case this should look like the following to mirror bias2d. In any case, this behavior should be documented.

Suggested change:
- prelu1d!((B, C, M), Axes2<0, 2>);
+ prelu1d!((C, B, M), Axes2<1, 2>);

@coreylowman (Owner):

Hmm pytorch's page says:

Channel dim is the 2nd dim of input. When input has dims < 2, then there is no channel dim and the number of channels = 1.

Unsure what we want the behavior to be TBH.

@nkoppel (Contributor) commented Mar 26, 2023:

> I did this using broadcast, but won't this have the same issue with concurrent overlaps?

It won't, because the current implementation of the backward pass of binary operations uses chunk_sum, which avoids data races by synchronizing the reads and writes of all threads.
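
Roughly the idea, sketched in plain Rust with made-up names (dfdx's actual chunk_sum kernel differs): each worker writes only to its own slot, and the slots are combined afterwards, so no two threads ever write the same location.

use std::thread;

fn chunked_sum(values: &[f32], n_chunks: usize) -> f32 {
    assert!(n_chunks > 0);
    let chunk_len = ((values.len() + n_chunks - 1) / n_chunks).max(1);
    let mut partials = vec![0.0f32; n_chunks];
    thread::scope(|s| {
        // Each spawned thread owns exactly one `slot`, so there are no shared writes.
        for (slot, chunk) in partials.iter_mut().zip(values.chunks(chunk_len)) {
            s.spawn(move || *slot = chunk.iter().sum());
        }
    });
    partials.iter().sum()
}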

The current version looks really good, good job! Only things that need doing besides my previous comment are fixing the workflow errors and considering moving prelu from activations into its own file.

@coreylowman (Owner) left a comment

Oh, apparently I never actually clicked submit on my review, whoops.

impl<Ax: Axes, S, E: Dtype, D: Device<E>, T: Tape<E, D>> Module<Tensor<S, E, D, T>> for LeakyReLU<E>
where
    S: Shape<LastAxis = Ax> + ReduceShape<Ax>,
    D: PReLUKernel<Tensor<S, E, D>, Tensor<(), E, D>, Output = Tensor<S, E, D>, Elem = E>,

@coreylowman (Owner):

We should add PReLUKernel to the trait Device list in src/tensor_ops/utilities/devices.rs. Then you shouldn't need this bound 😀

@opfromthestart (Contributor, Author):

This should be done

Comment on lines 18 to 21
let zero = E::from(0.0).unwrap();
let one = E::from(1.0).unwrap();
if x > &zero {
    one

@coreylowman (Owner):

I think you can use num_traits::{Zero, One} here. Would look something like:

Suggested change:
- let zero = E::from(0.0).unwrap();
- let one = E::from(1.0).unwrap();
- if x > &zero {
-     one
+ if x > &E::zero() {
+     E::one()

@opfromthestart (Contributor, Author):

This should be done

type Error = D::Err;

fn try_forward(&self, input: Tensor<S, E, D, T>) -> Result<Self::Output, D::Err> {
    input.try_prelu(self.a.clone())

@coreylowman (Owner):

I'm wondering if we even need a specialized tensor_op for PReLU - we can probably implement this using other already implemented ops:

let scale = self.a.retaped::<T>().broadcast_like(t.shape());
let max_0 = input.with_empty_tape().relu();
let min_0 = input.negate().relu().negate();
max_0 + scale * min_0

Thoughts? It would greatly simplify the implementation
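
(As a quick sanity check of the identity this relies on: for x = -2 and a = 0.1, max_0 = 0 and min_0 = -2, so the result is 0 + 0.1 * (-2) = -0.2 = a * x; for x = 3 it is 3 + 0.1 * 0 = 3 = x, which is exactly prelu.)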

@nkoppel (Contributor):

Until we have operator fusion, we should avoid doing this, because of the added memory use and overhead this brings.

@coreylowman (Owner):

TBH I'd prefer simpler code that can be optimized later with operator fusion. Plus, once we add operator fusion we don't have to go back and remove all this extra stuff. The fewer specialized kernels we have to write the better, IMO. I think PReLU is niche enough that it's not necessarily on the hot path.

@nkoppel (Contributor):

I'm not really convinced that the short-term performance loss is worth the convenience for a long-term optimization, but if we are going to do it for this operator, here's a more efficient implementation:

let scale = self.a.retaped::<T>().broadcast_like(t.shape());
let scaled = input.with_empty_tape() * scale;
(input.scalar_lt(0.0)).choose(scaled, input) 
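
(Read as: wherever input < 0 the mask selects scaled = a * input, and elsewhere it keeps input unchanged, which is again exactly prelu.)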

@coreylowman (Owner):

I'd like to start doing this more and more because:

  1. It lowers the bar to contributing tensor ops because you don't have to be a CUDA expert or even write multiple kernels.
     • As we look to adding additional devices (e.g. Add OpenCL device #597), each additional tensor op with a custom kernel makes this harder.
  2. It's more maintainable.
  3. It gives us a wider set of test cases for Operator fusion #607, which is becoming higher priority in my mind as I've looked at optimizations recently.

@opfromthestart (Contributor, Author) commented:

So, what needs to be done on this? Should I change it to a non-device specific kernel or keep it as is?

@nkoppel (Contributor) commented Mar 30, 2023:

@coreylowman

@coreylowman (Owner) commented:

@opfromthestart let's change it to a non-device-specific kernel.

@opfromthestart (Contributor, Author) commented:

@coreylowman I removed the device specific kernels, using nkoppel's implementation, and everything works

@coreylowman (Owner) left a comment

Looking great! I think after next round we can merge 👍


/// See [prelu]
fn try_prelu(self, rhs: E) -> Result<Self, Self::Err> {
    let dev = D::default();

@coreylowman (Owner):

Should pull the device from the tensor instead so we aren't re-allocating devices

Suggested change:
- let dev = D::default();
+ let dev = self.device.clone();

@opfromthestart (Contributor, Author):

Done

type Error = <Tensor<($($InDims),*), E, D, T> as HasErr>::Err;

fn try_forward(&self, input: Tensor<($($InDims),*), E, D, T>) -> Result<Self::Output, Self::Error> {
    input.try_prelu(self.a.clone().broadcast())

@coreylowman (Owner):

Will need to record broadcast on the tape:

Suggested change:
- input.try_prelu(self.a.clone().broadcast())
+ input.try_prelu(self.a.retaped::<T>().broadcast())

@opfromthestart (Contributor, Author):

done

type Error = <Tensor<S, E, D, T> as HasErr>::Err;

fn try_forward(&self, input: Tensor<S, E, D, T>) -> Result<Self::Output, Self::Error> {
    input.try_prelu(self.a.clone().broadcast())

@coreylowman (Owner):

Suggested change:
- input.try_prelu(self.a.clone().broadcast())
+ input.try_prelu(self.a.retaped::<T>().broadcast())

@opfromthestart (Contributor, Author):

done

/// Calls [prelu()] with learnable value.
#[derive(Debug, Clone)]
pub struct PReLU<E: Dtype, D: Device<E>> {
    a: Tensor<(), E, D>,

@coreylowman (Owner):

For accessing the scalar outside of the crate:

Suggested change:
-     a: Tensor<(), E, D>,
+     pub a: Tensor<(), E, D>,

@opfromthestart (Contributor, Author):

done

Comment on lines 12 to 20
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct PReLUKernelOp;

#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct LeakyReLUKernelOp<E> {
    slope: E,
}

@coreylowman (Owner):

Can remove these

@opfromthestart (Contributor, Author):

done

pub use normalize::normalize;
pub use permute_to::PermuteTo;
pub use pow::{powf, powi};
pub use prelu::{leakyrelu, prelu, LeakyReLUKernelOp, PReLUKernelOp, TryPReLU};

@coreylowman (Owner):

Suggested change:
- pub use prelu::{leakyrelu, prelu, LeakyReLUKernelOp, PReLUKernelOp, TryPReLU};
+ pub use prelu::{leakyrelu, prelu, TryPReLU};

@opfromthestart (Contributor, Author):

done


Comment on lines 162 to 169
impl<C: ConstDim, E: Dtype, D: Device<E>> Default for PReLU1D<C, E, D> {
    fn default() -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(E::from_f32(0.25).unwrap()).broadcast(),
        }
    }
}

@coreylowman (Owner):

Let's remove this to make it consistent with the other nn modules (that you have to go through BuildModule)

@opfromthestart (Contributor, Author):

done


/// Calls [prelu()] with constant value.
#[derive(Debug, Clone, Copy)]
pub struct LeakyReLU<E: Dtype>(E);

@coreylowman (Owner):

Suggested change:
- pub struct LeakyReLU<E: Dtype>(E);
+ pub struct LeakyReLU<E: Dtype>(pub E);

@opfromthestart (Contributor, Author):

done

Comment on lines 110 to 126
impl<E: Dtype, D: Device<E>> Default for PReLU<E, D> {
    fn default() -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(E::from_f32(0.25).unwrap()),
        }
    }
}

impl<E: Dtype, D: Device<E>> From<E> for PReLU<E, D> {
    fn from(value: E) -> Self {
        let dev = D::default();
        Self {
            a: dev.tensor(value),
        }
    }
}

@coreylowman (Owner):

Let's remove these so it's consistent with other nn modules - so it has to go through BuildModule to get access to device

@opfromthestart (Contributor, Author):

done

@coreylowman (Owner) left a comment

Awesome, thanks for all the work on this! 🚀

@coreylowman merged commit f74a739 into coreylowman:main on Apr 1, 2023