This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Discussion] Softmax and SoftmaxLoss #426

Closed

antinucleon opened this issue Oct 29, 2015 · 16 comments

Comments

@antinucleon
Contributor

Currently, our Softmax is actually a softmax loss. That brings some computational efficiency, but the drawback is that it is hard to change the loss function. I think it is time for us to separate Softmax and SoftmaxLoss; the problem is the historical burden, e.g. pretrained models.

I think this is an urgent problem that needs to be fixed this week, or we may end up with an even heavier burden.

Let's discuss a solution.

@pluskid @piiswrong @tqchen @winstywang

@piiswrong
Contributor

I think we need 5 softmax modes:

  1. Single output integer label softmax (done)
  2. Multi output integer label softmax (done)
  3. Single output distribution label softmax (todo)
  4. Multi output distribution label softmax (todo)
  5. Softmax as internal layer (todo)

Mode 5 can be implemented in cudnn_activation_op. We can rename Softmax to SoftmaxLoss, but install a phony Softmax that gives a warning when used.
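For concreteness, here is a minimal NumPy sketch (not the MXNet operator API) of what the integer-label and distribution-label modes would compute; mode 5 is just the normalization itself with no loss attached:

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def loss_integer_labels(logits, labels):
    # Modes 1/2: logits (N, C), labels (N,) integer class indices.
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def loss_distribution_labels(logits, label_dist):
    # Modes 3/4: logits (N, C), label_dist (N, C) with rows summing to 1.
    p = softmax(logits)
    return -(label_dist * np.log(p)).sum(axis=1).mean()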

@antinucleon
Contributor Author

@piiswrong I think we may introduce a special symbol family called loss symbols, and require the graph to end with a loss symbol. But how to compose a loss symbol has not been investigated yet.

@piiswrong
Contributor

A loss symbol at the top is needed only when you want to train; we can disable backward otherwise. When will you need to compose a loss symbol?


@pluskid
Contributor

pluskid commented Oct 29, 2015

@antinucleon Can you explain what the historical burden with pre-trained models is? Maybe we just need to edit the json file manually if renaming is necessary?

@pluskid
Contributor

pluskid commented Oct 29, 2015

@antinucleon By composing loss, are you talking about networks with multiple loss functions?

I think @piiswrong has a good point here: the loss function is only needed during training. It is kind of inconvenient to have two symbols for training and prediction simply because one has a loss layer and the other does not.

One possible way to get around this is to distinguish the loss from the output, so that:

  • A network could have an output symbol, which could be the softmax as usual; that is the symbol of the network.
  • A network could have one or more losses attached to it, but the loss is not the output node. We would provide an API to collect all the losses (just a few real numbers) from the network after one forward pass is run.

Something like this

net = Variable("data")
net = FullyConnected(data=net, num_hidden=100)
net = Softmax(data=net)
net.attach_loss(MulticlassLogisticLoss())

and with multiple losses:

data = Variable("data")
fc1 = FullyConnected(data=data, name="fc1", num_hidden=100)
net = Softmax(data=fc1, name="smax1")
net2 = FullyConnected(data=fc1, name="fc2", num_hidden=100)
net2 = Softmax(data=net2, name="smax2")
net.attach_loss(MulticlassLogisticLoss())
net2.attach_loss(MulticlassLogisticLoss())

net_comb = Group(net, net2)
net_comb.list_outputs() # => ['smax1_output', 'smax2_output']

For the loss function, we have two choices:

  • Make a switch to distinguish training and prediction; during prediction, the forward pass skips computing the loss functions because no label is presented (see the sketch after this list).
  • Completely ignore the loss functions during forward (that is also the current behavior), because the objective function values are not really needed during back propagation. While this is simpler and more efficient, I have a weak argument against this design: sometimes it is useful to let users observe the objective function. It is especially useful for debugging, and if people are working on new optimization algorithms, the objective function value (instead of the accuracy) is the direct thing to look at.
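A toy Python sketch of the first option, assuming the hypothetical attach_loss API above; none of these names exist in mxnet, they only illustrate how an is_train switch could skip the attached losses during prediction:

class Network:
    # Hypothetical container, only to illustrate the attach_loss idea above;
    # attach_loss and is_train are not existing mxnet APIs.
    def __init__(self, output_fn):
        self.output_fn = output_fn  # computes the prediction, e.g. a softmax
        self.losses = []            # loss functions attached via attach_loss
    def attach_loss(self, loss_fn):
        self.losses.append(loss_fn)
    def forward(self, data, label=None, is_train=False):
        out = self.output_fn(data)
        loss_values = None
        if is_train and label is not None:
            # Only evaluate the attached losses when training, so prediction
            # needs no label and does not pay for the loss computation.
            loss_values = [loss(out, label) for loss in self.losses]
        return out, loss_values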

@pluskid
Contributor

pluskid commented Oct 30, 2015

Also, if we use (N,1) instead of (N,) for the label shape of the current softmax, then there seems to be no need for different operators (or even different modes of the same operator) for single-output and multi-output softmax, since single-output softmax is just multi-output softmax with 1-D output.
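A small NumPy sketch of this point, assuming labels of shape (N, K) and logits of shape (N, K, C): the single-output case is simply K = 1, so one code path covers both:

import numpy as np

def multi_output_softmax_loss(logits, labels):
    # logits: (N, K, C), labels: (N, K) integer class indices.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    n, k, _ = logits.shape
    picked = log_p[np.arange(n)[:, None], np.arange(k)[None, :], labels]
    return -picked.mean()

# The single-output case is just K = 1:
logits = np.random.randn(4, 1, 10)
labels = np.random.randint(0, 10, size=(4, 1))
print(multi_output_softmax_loss(logits, labels))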

@piiswrong
Contributor

With one output, the multi-output softmax is much slower than the single-output softmax.

@tqchen
Member

tqchen commented Oct 30, 2015

There are two ways, as far as I can summarize:

  • Explicitly have an Output symbol that gives the transformation in forward and the gradient in backward.
    • Calling it "output" distinguishes it from "loss", since these are not strictly losses.
    • Softmax can be renamed to SoftmaxOutput, as we already have LinearRegressionOutput.
    • The output is attached by the user during symbolic construction.
  • Use the strictly mathematical way.
    • Softmax and CrossEntropyLoss
      • Softmax implements a generic backprop for all cases, and CrossEntropy adds the log factor.
      • Needs a bit of care with thresholding, for stability (see the sketch below).
    • Alternatively, LogSoftmax and CrossEntropyOnLogProb
      • More stable, but users usually want probabilities, so LogSoftmax is a bit unusual for basic users.

In terms of implementation, we can blend the two implementations together by detecting the shapes. I am fine with either way; as usual, I think the most effective approach is to list the proposals and vote for the one we like. @pluskid @antinucleon @piiswrong
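To illustrate the stability concern in the strict-math route, here is a rough NumPy sketch: a separate CrossEntropyLoss applied to Softmax probabilities needs a clipping threshold so log() never sees zero, whereas the LogSoftmax alternative avoids the issue:

import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_on_prob(prob, labels, eps=1e-12):
    # Separate CrossEntropyLoss on probabilities: needs a threshold so that
    # log() never sees an exact zero.
    prob = np.clip(prob, eps, 1.0)
    return -np.log(prob[np.arange(len(labels)), labels]).mean()

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy_on_log_prob(log_prob, labels):
    # The LogSoftmax + CrossEntropyOnLogProb alternative needs no clipping.
    return -log_prob[np.arange(len(labels)), labels].mean()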

@tqchen
Member

tqchen commented Oct 30, 2015

For the strict math case, maybe we could add a lightweight graph-rewriting pass in the binding that rewrites the softmax/cross-entropy chain into one operator.
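Schematically, such a pass might look like the following toy sketch (this is not MXNet's actual graph representation; the node layout and the operator names, including SoftmaxCrossEntropy, are placeholders):

def fuse_softmax_cross_entropy(nodes):
    # nodes: list of dicts like {"name": ..., "op": ..., "inputs": [...]},
    # where "inputs" holds references to other node dicts.
    rewritten = []
    for node in nodes:
        if (node["op"] == "CrossEntropy"
                and node["inputs"]
                and node["inputs"][0]["op"] == "Softmax"):
            softmax_node = node["inputs"][0]
            # Replace CrossEntropy(Softmax(x), label) with a single fused op;
            # a real pass would also drop the now-unused Softmax node.
            rewritten.append({
                "name": node["name"],
                "op": "SoftmaxCrossEntropy",
                "inputs": [softmax_node["inputs"][0]] + node["inputs"][1:],
            })
        else:
            rewritten.append(node)
    return rewritten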

@tqchen
Member

tqchen commented Oct 30, 2015

Also, the two solutions do not necessarily conflict with each other: the Output symbols and the strict math symbols can both be supported, reusing code by declaring common functions such as a shared softmax forward, etc.

@tqchen
Member

tqchen commented Oct 30, 2015

And here is a fun fact about the different user types:

  • Users who have just started want "LinearRegressionOutput" and "MulticlassProbOutput".
  • For expert users or researchers, the mathy version is probably preferred.

@mli
Contributor

mli commented Oct 30, 2015

I also agree that separating the loss and the output is a good idea. The output is what we get for prediction, while the loss penalizes our predicted output against the ground truth.

In math, softmax means the softmax normalization, so it is odd that in ML it involves the label: https://en.wikipedia.org/wiki/Softmax_function

We can keep Softmax as what we have now, but use SoftmaxOutput to mean the exp normalization (no loss). And we'd better have clear documentation about the terminology we are using.

@mli
Contributor

mli commented Oct 30, 2015

Besides, we'd better have detailed documentation about how to implement a new loss function. This is one of the most frequent questions from DL researchers who want to use mxnet to implement their proposed networks.

@pluskid
Contributor

pluskid commented Oct 30, 2015

I agree. Behind the scenes, code could be shared, but for the interface, I think:

  • Single-output and multi-output should be unified and transparent to the user (however, if we do this, we need to change the label shape convention from (N,) to (N,1)).
  • It could be good to have 3 separate things (though I'm not very sure how to make the naming less confusing); a usage sketch follows after this list:
    • SoftmaxLoss as the multiclass logistic loss (maybe we should name it MulticlassLogisticLoss or CrossEntropyLoss or something clearer than SoftmaxLoss, because in Caffe, SoftmaxLoss means Softmax + MulticlassLogisticLoss).
    • SoftmaxOutput as only doing the softmax, without loss functionality, so it cannot backpropagate unless a loss is attached to it.
    • Softmax as a shortcut for the two above, meaning that it comes with an attached loss by itself.
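A hypothetical usage sketch of the three proposed symbols; these names follow the proposal above, are not an existing mxnet API, and the parameter names are guesses:

prob = SoftmaxOutput(data=fc)               # normalization only, no loss
loss = SoftmaxLoss(data=prob, label=label)  # the multiclass logistic loss on top
net  = Softmax(data=fc, label=label)        # shortcut: output with loss attached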

@tqchen
Member

tqchen commented Oct 30, 2015

I actually think we can make it two cases; this choice removes the need to introduce attach_loss:

  • SoftmaxOutput or MulticlassProbOutput as the current Softmax: attached backprop gradient, outputs the prediction value (multiclass probabilities).
    • Basic users do not know how to choose the loss to combine with the output transformation, and usually there are only a limited number of such combinations, e.g. softmax is usually composed with a cross-entropy loss.
    • The convention is that XXXOutput is the "task-oriented" special symbol that gives the desired output, plus the training rule (gradient) in backprop.
  • A new Softmax that backprops correctly in all cases, so we can later compose any loss on top of it.
    • This can be done in the longer term.

@tqchen
Member

tqchen commented Oct 30, 2015

Opened a vote in #434.
