Sigmoid in cross-entropy and mean-squared-error

Machine-Learning


I was asked in an interview: what happens if we use a squared loss (applied after the sigmoid) instead of cross-entropy in a binary classification problem?

First of all, if we treat a binary classification problem as a plain regression problem with no sigmoid, we will fail because the raw outputs are unbounded. But what happens if we keep the sigmoid and only replace the cross-entropy (CE) loss with a squared loss? Let's see.

Take a single example $(x, y)$ with label $y \in \{0, 1\}$. The model predicts $a = \sigma(z)$ with logit $z = w^\top x + b$, and the two candidate losses are

$$L_{\text{MSE}} = \frac{1}{2}(a - y)^2, \qquad L_{\text{CE}} = -\big[\,y \log a + (1 - y) \log(1 - a)\,\big].$$
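To make this concrete, here is a minimal NumPy sketch of the two losses for a single example; the function names (`mse_loss`, `ce_loss`) and the sample logit are my own, chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(a, y):
    # squared loss applied to the sigmoid output a = sigmoid(z)
    return 0.5 * (a - y) ** 2

def ce_loss(a, y):
    # binary cross-entropy applied to the sigmoid output
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

a = sigmoid(2.0)          # prediction for some logit z = 2
print(mse_loss(a, 1.0))   # ~0.0071
print(ce_loss(a, 1.0))    # ~0.127
```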

Differentiating each loss with respect to the logit $z$, and using the sigmoid derivative $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big) = a(1 - a)$, we get

$$\frac{\partial L_{\text{MSE}}}{\partial z} = (a - y)\,a(1 - a), \qquad \frac{\partial L_{\text{CE}}}{\partial z} = a - y,$$

and by the chain rule $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z}\,x$ and $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}$.
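As a quick numerical check of these gradients, here is a small sketch (the helper names `mse_grad_z` and `ce_grad_z` are my own) that evaluates both at a confidently wrong, saturated prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad_z(z, y):
    # dL_mse/dz = (a - y) * a * (1 - a)
    a = sigmoid(z)
    return (a - y) * a * (1 - a)

def ce_grad_z(z, y):
    # dL_ce/dz = a - y   (the a*(1-a) factor cancels)
    a = sigmoid(z)
    return a - y

# True label 1, but the logit is very negative, so a ~ 0 (saturated and wrong).
z, y = -10.0, 1.0
print(mse_grad_z(z, y))  # ~ -4.5e-05 -> almost no learning signal
print(ce_grad_z(z, y))   # ~ -1.0     -> full-strength signal
```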
Then with SGD we update the parameters:

$$w \leftarrow w - \eta\,(a - y)\,a(1 - a)\,x \quad \text{(squared loss)}, \qquad w \leftarrow w - \eta\,(a - y)\,x \quad \text{(CE)},$$

and similarly for $b$ with the $x$ factor dropped.
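To see how this plays out during training, here is a toy SGD loop on one example, with the weight deliberately initialised in the saturated region; the setup (one scalar weight, no bias, learning rate 0.5, 200 steps) is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example with true label 1, weight initialised so the
# sigmoid starts out saturated at the wrong end (a ~ 0).
x, y, lr = 1.0, 1.0, 0.5
w_mse = w_ce = -10.0

for step in range(200):
    a_mse = sigmoid(w_mse * x)
    a_ce = sigmoid(w_ce * x)
    # gradient of each loss w.r.t. w (see the derivatives above)
    w_mse -= lr * (a_mse - y) * a_mse * (1 - a_mse) * x
    w_ce -= lr * (a_ce - y) * x

print(sigmoid(w_mse))  # still ~0: squared loss is stuck in the flat region
print(sigmoid(w_ce))   # close to 1: cross-entropy escaped the saturation
```

After 200 steps the squared-loss model has barely moved, while the cross-entropy model has pulled its prediction close to the correct label.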
You can see that with squared loss, whenever $a \approx 0$ or $a \approx 1$, the factor $a(1 - a)$ is close to zero, so the gradient is also close to zero and the parameters barely update even when $a$ is far from $y$, i.e. the prediction is confidently wrong. CE does not have this saturation problem, because the $a(1 - a)$ factor cancels out.

Finally, it reminds me of a point made in the Deep Learning book by Goodfellow, Bengio, and Courville: when your output unit is a sigmoid, you want a log-based loss to undo its exponential, so that the gradient does not saturate.