Sigmoid in cross-entropy and mean-squared-error

Machine-Learning


I was asked in an interview: what happens if we use a squared loss (applied after the sigmoid) instead of cross-entropy in a binary classification problem?

First of all, if we treat a binary classification problem as a plain regression problem with no sigmoid, we will fail because the raw outputs are unbounded. But what happens if we keep the sigmoid and only replace the cross-entropy (CE) loss with a squared loss? Let's see.

Take a single example $(x, y)$ with label $y \in \{0, 1\}$. The model predicts $a = \sigma(z)$ with logit $z = w^\top x + b$, and the two candidate losses are

$$L_{\text{MSE}} = \frac{1}{2}(a - y)^2, \qquad L_{\text{CE}} = -\big[\,y \log a + (1 - y) \log(1 - a)\,\big].$$
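To make this concrete, here is a minimal NumPy sketch of the two losses for a single example; the function names (`mse_loss`, `ce_loss`) and the sample logit are my own, chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(a, y):
    # squared loss applied to the sigmoid output a = sigmoid(z)
    return 0.5 * (a - y) ** 2

def ce_loss(a, y):
    # binary cross-entropy applied to the sigmoid output
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

a = sigmoid(2.0)          # prediction for some logit z = 2
print(mse_loss(a, 1.0))   # ~0.0071
print(ce_loss(a, 1.0))    # ~0.127
```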

Differentiating each loss with respect to the logit $z$, and using the sigmoid derivative $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big) = a(1 - a)$, we get

$$\frac{\partial L_{\text{MSE}}}{\partial z} = (a - y)\,a(1 - a), \qquad \frac{\partial L_{\text{CE}}}{\partial z} = a - y,$$

and by the chain rule $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z}\,x$ and $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}$.
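As a quick numerical check of these gradients, here is a small sketch (the helper names `mse_grad_z` and `ce_grad_z` are my own) that evaluates both at a confidently wrong, saturated prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad_z(z, y):
    # dL_mse/dz = (a - y) * a * (1 - a)
    a = sigmoid(z)
    return (a - y) * a * (1 - a)

def ce_grad_z(z, y):
    # dL_ce/dz = a - y   (the a*(1-a) factor cancels)
    a = sigmoid(z)
    return a - y

# True label 1, but the logit is very negative, so a ~ 0 (saturated and wrong).
z, y = -10.0, 1.0
print(mse_grad_z(z, y))  # ~ -4.5e-05 -> almost no learning signal
print(ce_grad_z(z, y))   # ~ -1.0     -> full-strength signal
```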
Then with SGD we update the parameters:

$$w \leftarrow w - \eta\,(a - y)\,a(1 - a)\,x \quad \text{(squared loss)}, \qquad w \leftarrow w - \eta\,(a - y)\,x \quad \text{(CE)},$$

and similarly for $b$ with the $x$ factor dropped.
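To see how this plays out during training, here is a toy SGD loop on one example, with the weight deliberately initialised in the saturated region; the setup (one scalar weight, no bias, learning rate 0.5, 200 steps) is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example with true label 1, weight initialised so the
# sigmoid starts out saturated at the wrong end (a ~ 0).
x, y, lr = 1.0, 1.0, 0.5
w_mse = w_ce = -10.0

for step in range(200):
    a_mse = sigmoid(w_mse * x)
    a_ce = sigmoid(w_ce * x)
    # gradient of each loss w.r.t. w (see the derivatives above)
    w_mse -= lr * (a_mse - y) * a_mse * (1 - a_mse) * x
    w_ce -= lr * (a_ce - y) * x

print(sigmoid(w_mse))  # still ~0: squared loss is stuck in the flat region
print(sigmoid(w_ce))   # close to 1: cross-entropy escaped the saturation
```

After 200 steps the squared-loss model has barely moved, while the cross-entropy model has pulled its prediction close to the correct label.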
You can see that with squared loss, whenever $a \approx 0$ or $a \approx 1$, the factor $a(1 - a)$ is close to zero, so the gradient is also close to zero and the parameters barely update even when $a$ is far from $y$, i.e. the prediction is confidently wrong. CE does not have this saturation problem, because the $a(1 - a)$ factor cancels out.

Finally, it reminds me of a point made in the Deep Learning book by Goodfellow, Bengio, and Courville: when your output unit is a sigmoid, you want a log-based loss to undo its exponential, so that the gradient does not saturate.