Santiago

@svpino

Feb 08, 2023
I've been studying machine learning for half a decade, and here is the most popular function we use every day:
f(x) = max(0, x)
Yet many people don't understand a fundamental characteristic of this function.
Here is what you need to know:
The name of this function is "Rectified Linear Unit" or ReLU.
Most people think ReLU is both continuous and differentiable.
They are wrong.
I know because I've asked this question several times in @0xbnomial and interviews.
Many make the same mistake.
Let's start by defining ReLU:
f(x) = max(0, x)
In English: if x <= 0, the function will return 0. Otherwise, the function will return x.
If you plot this function, you'll get the attached chart.
Notice there are no discontinuities in the function.
This should be enough to answer half of the original question: the ReLU function is continuous.
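If it helps to see this numerically, here is a minimal Python sketch (just an illustration, not anything from the original thread): evaluating ReLU at points closer and closer to 0 from both sides shows the outputs approaching relu(0) = 0, which is exactly what the plot shows.

```python
def relu(x):
    # Rectified Linear Unit: 0 for x <= 0, x otherwise.
    return max(0.0, x)

# Approach x = 0 from both sides; the outputs approach relu(0) = 0,
# so there is no jump at the "kink".
for x in (-0.1, -0.01, -0.001, 0.0, 0.001, 0.01, 0.1):
    print(f"relu({x}) = {relu(x)}")
```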
Let's now think about the differentiable part.
A necessary condition for a function to be differentiable: it must be continuous.
ReLU is continuous. That's good, but not enough.
Its derivative must also exist at every individual point.
Here is where things get interesting.
We can compute the derivative of a function using the attached formula.
(I'm not going to explain where this is coming from; trust me on this one.)
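In case you don't have it handy, that formula is the limit definition of the derivative:

f'(x) = lim (h -> 0) [f(x + h) - f(x)] / h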
We can use this formula to see whether ReLU is differentiable.
Looking at ReLU's plot again, the interesting point is when x = 0.
That's where the function changes abruptly.
If there's going to be an issue with the function's derivative, it's going to be there!
For ReLU to be differentiable, its derivative must exist at x = 0 (our problematic point).
To see whether the derivative exists, we need to check that the left-hand and right-hand limits exist and are equal at x = 0.
That shouldn't be hard to do.
Going back to our formula:
The first step is to replace f(x) with ReLU's actual function.
It should now look like this:
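f'(x) = lim (h -> 0) [max(0, x + h) - max(0, x)] / h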
So let's find out the left-hand limit.
In English: we want to compute the derivative using our formula when h approaches zero from the left.
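Plugging in x = 0:

lim (h -> 0-) [max(0, 0 + h) - max(0, 0)] / h = lim (h -> 0-) max(0, h) / h

For h < 0 we have max(0, h) = 0, so the quotient is 0 / h = 0 for every such h.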
At x = 0 and h < 0, the left-hand limit works out to 0.
We can now do the same to compute the right-hand limit.
In this case, we want h to approach 0 from the right.
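Again plugging in x = 0:

lim (h -> 0+) [max(0, 0 + h) - max(0, 0)] / h = lim (h -> 0+) max(0, h) / h

For h > 0 we have max(0, h) = h, so the quotient is h / h = 1 for every such h.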
At x = 0 and h > 0, the right-hand limit works out to 1.
This is what we have:
1. The left-hand limit is 0.
2. The right-hand limit is 1.
For the function's derivative to exist at x = 0, both the left-hand and right-hand limits should be the same.
This is not the case. The derivative of ReLU doesn't exist at x = 0.
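If you want to double-check this numerically, here is a quick Python sketch (just an illustration): the difference quotient (f(0 + h) - f(0)) / h stays at 0 for negative h and at 1 for positive h, no matter how small h gets.

```python
def relu(x):
    return max(0.0, x)

def difference_quotient(f, x, h):
    # (f(x + h) - f(x)) / h: the quantity whose limit defines the derivative.
    return (f(x + h) - f(x)) / h

for h in (-0.1, -0.001, -0.00001, 0.00001, 0.001, 0.1):
    print(f"h = {h}: quotient = {difference_quotient(relu, 0.0, h)}")
```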
We now have the complete answer:
• ReLU is continuous
• ReLU is not differentiable (its derivative doesn't exist at x = 0)
But here was the central confusion point:
How come ReLU is not differentiable, yet we can use it as an activation function with Gradient Descent?
What happens is that we don't care that the derivative of ReLU is not defined when x = 0. When this happens, we set the derivative to 0 (or any arbitrary value) and move on with our lives.
A nice hack.
This is the reason we can still use ReLU together with Gradient Descent.
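In code, that convention looks something like this minimal sketch (plain Python, not any particular framework's implementation): we simply define the "gradient" of ReLU to be 1 for x > 0 and 0 for x <= 0, and Gradient Descent uses it as if nothing happened.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # The usual convention: 1 for x > 0, and 0 for x <= 0.
    # At x = 0 the true derivative doesn't exist; we just pick 0 and move on.
    return 1.0 if x > 0 else 0.0

# Toy gradient descent: minimize f(w) = relu(w), starting from w = 2.
w = 2.0
learning_rate = 0.5
for step in range(6):
    grad = relu_grad(w)
    w -= learning_rate * grad
    print(f"step {step}: w = {w:.2f}, relu(w) = {relu(w):.2f}")
```

Once w reaches 0, the gradient is 0 and the updates simply stop, which is exactly the "set it to 0 and move on" behavior described above.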
