This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Discussion] Softmax and SoftmaxLoss #426

Closed

antinucleon opened this issue Oct 29, 2015 · 16 comments

Comments

@antinucleon
Contributor

Currently, our Softmax is actually a softmax loss. That brings some computational efficiency, but the drawback is that it is hard to change the loss function. I think it is time for us to separate Softmax and SoftmaxLoss; the problem is the historical burden, e.g. pretrained models.

I think this is an urgent problem that needs to be fixed this week, or we may end up with an even heavier burden.

Let's discuss a solution.

@pluskid @piiswrong @tqchen @winstywang

@piiswrong
Contributor

I think we need 5 softmax modes:

  1. Single output integer label softmax (done)
  2. Multi output integer label softmax (done)
  3. Single output distribution label softmax (todo)
  4. Multi output distribution label softmax (todo)
  5. Softmax as internal layer (todo)

Mode 5 can be implemented in cudnn_activation_op. We can rename Softmax to SoftmaxLoss, but install a phony Softmax that gives a warning when used.
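For concreteness, here is a minimal NumPy sketch (not the MXNet operator API) of what the integer-label and distribution-label modes would compute; mode 5 is just the normalization itself with no loss attached:

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def loss_integer_labels(logits, labels):
    # Modes 1/2: logits (N, C), labels (N,) integer class indices.
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def loss_distribution_labels(logits, label_dist):
    # Modes 3/4: logits (N, C), label_dist (N, C) with rows summing to 1.
    p = softmax(logits)
    return -(label_dist * np.log(p)).sum(axis=1).mean()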

@antinucleon
Contributor Author

@piiswrong I think we may introduce a special symbol family called loss symbols, and require the graph to end with a loss symbol. But how to compose a loss symbol has not been investigated yet.

@piiswrong
Contributor

A loss symbol at the top is needed only when you want to train; we can disable backward otherwise. When will you need to compose a loss symbol?


@pluskid
Contributor

pluskid commented Oct 29, 2015

@antinucleon Can you explain what the historical burden with pre-trained models is? Maybe we just need to edit the json file manually if renaming is necessary?

@pluskid
Contributor

pluskid commented Oct 29, 2015

@antinucleon By composing loss, are you talking about networks with multiple loss functions?

I think @piiswrong has a good point here: the loss function is only needed during training. It is kind of inconvenient to have two symbols for training and prediction simply because one has a loss layer and the other does not.

One possible way to get around this is to distinguish the loss from the output, so that:

  • A network could have an output symbol, which could be the softmax as usual; that is the symbol of the network.
  • A network could have one or more losses attached to it, but the loss is not the output node. We would provide an API to collect all the losses (just a few real numbers) from the network after one forward pass is run.

Something like this

net = Variable("data")
net = FullyConnected(data=net, num_hidden=100)
net = Softmax(data=net)
net.attach_loss(MulticlassLogisticLoss())

and with multiple losses:

data = Variable("data")
fc1 = FullyConnected(data=data, name="fc1", num_hidden=100)
net = Softmax(data=fc1, name="smax1")
net2 = FullyConnected(data=fc1, name="fc2", num_hidden=100)
net2 = Softmax(data=net2, name="smax2")
net.attach_loss(MulticlassLogisticLoss())
net2.attach_loss(MulticlassLogisticLoss())

net_comb = Group(net, net2)
net_comb.list_outputs() # => ['smax1_output', 'smax2_output']

For the loss function, we have two choices:

  • Make a switch to distinguish training and prediction; during prediction, the forward pass skips computing the loss functions because no label is presented (see the sketch after this list).
  • Completely ignore the loss functions during forward (that is also the current behavior), because the objective function values are not really needed during back propagation. While this is simpler and more efficient, I have a weak argument against this design: sometimes it is useful to let users observe the objective function. It is especially useful for debugging, and if people are working on new optimization algorithms, the objective function value (instead of the accuracy) is the direct thing to look at.
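A toy Python sketch of the first option, assuming the hypothetical attach_loss API above; none of these names exist in mxnet, they only illustrate how an is_train switch could skip the attached losses during prediction:

class Network:
    # Hypothetical container, only to illustrate the attach_loss idea above;
    # attach_loss and is_train are not existing mxnet APIs.
    def __init__(self, output_fn):
        self.output_fn = output_fn  # computes the prediction, e.g. a softmax
        self.losses = []            # loss functions attached via attach_loss
    def attach_loss(self, loss_fn):
        self.losses.append(loss_fn)
    def forward(self, data, label=None, is_train=False):
        out = self.output_fn(data)
        loss_values = None
        if is_train and label is not None:
            # Only evaluate the attached losses when training, so prediction
            # needs no label and does not pay for the loss computation.
            loss_values = [loss(out, label) for loss in self.losses]
        return out, loss_values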

@pluskid
Contributor

pluskid commented Oct 30, 2015

Also, if we use (N,1) instead of (N,) for the label shape of the current softmax, then there seems to be no need for different operators (or even different modes of the same operator) for single-output and multi-output softmax, since single-output softmax is just multi-output softmax with 1-D output.
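A small NumPy sketch of this point, assuming labels of shape (N, K) and logits of shape (N, K, C): the single-output case is simply K = 1, so one code path covers both:

import numpy as np

def multi_output_softmax_loss(logits, labels):
    # logits: (N, K, C), labels: (N, K) integer class indices.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    n, k, _ = logits.shape
    picked = log_p[np.arange(n)[:, None], np.arange(k)[None, :], labels]
    return -picked.mean()

# The single-output case is just K = 1:
logits = np.random.randn(4, 1, 10)
labels = np.random.randint(0, 10, size=(4, 1))
print(multi_output_softmax_loss(logits, labels))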

@piiswrong
Contributor

With one output, the multi-output softmax is much slower than the single-output softmax.

@tqchen
Member

tqchen commented Oct 30, 2015

There are two ways, as far as I can summarize:

  • Explicitly have an Output symbol that gives the transformation in forward and the gradient in backward.
    • Calling it "output" distinguishes it from "loss", since these are not strictly losses.
    • Softmax can be renamed to SoftmaxOutput, as we already have LinearRegressionOutput.
    • The output is attached by the user during symbolic construction.
  • Use the strictly mathematical way.
    • Softmax and CrossEntropyLoss
      • Softmax implements a generic backprop for all cases, and CrossEntropy adds the log factor.
      • Needs a bit of care with thresholding, for stability (see the sketch below).
    • Alternatively, LogSoftmax and CrossEntropyOnLogProb
      • More stable, but users usually want probabilities, so LogSoftmax is a bit unusual for basic users.

In terms of implementation, we can blend the two implementations together by detecting the shapes. I am fine with either way; as usual, I think the most effective approach is to list the proposals and vote for the one we like. @pluskid @antinucleon @piiswrong
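To illustrate the stability concern in the strict-math route, here is a rough NumPy sketch: a separate CrossEntropyLoss applied to Softmax probabilities needs a clipping threshold so log() never sees zero, whereas the LogSoftmax alternative avoids the issue:

import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_on_prob(prob, labels, eps=1e-12):
    # Separate CrossEntropyLoss on probabilities: needs a threshold so that
    # log() never sees an exact zero.
    prob = np.clip(prob, eps, 1.0)
    return -np.log(prob[np.arange(len(labels)), labels]).mean()

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy_on_log_prob(log_prob, labels):
    # The LogSoftmax + CrossEntropyOnLogProb alternative needs no clipping.
    return -log_prob[np.arange(len(labels)), labels].mean()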

@tqchen
Member

tqchen commented Oct 30, 2015

For the strict math case, maybe we could add a lightweight graph-rewriting pass in the binding that rewrites the softmax/cross-entropy chain into one operator.
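Schematically, such a pass might look like the following toy sketch (this is not MXNet's actual graph representation; the node layout and the operator names, including SoftmaxCrossEntropy, are placeholders):

def fuse_softmax_cross_entropy(nodes):
    # nodes: list of dicts like {"name": ..., "op": ..., "inputs": [...]},
    # where "inputs" holds references to other node dicts.
    rewritten = []
    for node in nodes:
        if (node["op"] == "CrossEntropy"
                and node["inputs"]
                and node["inputs"][0]["op"] == "Softmax"):
            softmax_node = node["inputs"][0]
            # Replace CrossEntropy(Softmax(x), label) with a single fused op;
            # a real pass would also drop the now-unused Softmax node.
            rewritten.append({
                "name": node["name"],
                "op": "SoftmaxCrossEntropy",
                "inputs": [softmax_node["inputs"][0]] + node["inputs"][1:],
            })
        else:
            rewritten.append(node)
    return rewritten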

@tqchen
Member

tqchen commented Oct 30, 2015

Also, the two solutions do not necessarily conflict with each other: the Output symbols and the strict math symbols can both be supported, reusing code by declaring common functions such as a shared softmax forward, etc.

@tqchen
Member

tqchen commented Oct 30, 2015

And here is a fun fact about the different user types:

  • Users who have just started want "LinearRegressionOutput" and "MulticlassProbOutput".
  • For expert users or researchers, the mathy version is probably preferred.

@mli
Contributor

mli commented Oct 30, 2015

I also agree that separating the loss and the output is a good idea. The output is what we get for prediction, while the loss penalizes our predicted output against the ground truth.

In math, softmax means the softmax normalization, so it is odd that in ML it involves the label: https://en.wikipedia.org/wiki/Softmax_function

We can keep Softmax as what we have now, but use SoftmaxOutput to mean the exp normalization (no loss). And we'd better have clear documentation about the terminology we are using.

@mli
Contributor

mli commented Oct 30, 2015

Besides, we'd better have detailed documentation about how to implement a new loss function. This is one of the most frequent questions from DL researchers who want to use mxnet to implement their proposed networks.

@pluskid
Contributor

pluskid commented Oct 30, 2015

I agree. Behind the scenes, code could be shared, but for the interface, I think:

  • Single-output and multi-output should be unified and transparent to the user (however, if we do this, we need to change the label shape convention from (N,) to (N,1)).
  • It could be good to have 3 separate things (though I'm not very sure how to make the naming less confusing); a usage sketch follows after this list:
    • SoftmaxLoss as the multiclass logistic loss (maybe we should name it MulticlassLogisticLoss or CrossEntropyLoss or something clearer than SoftmaxLoss, because in Caffe, SoftmaxLoss means Softmax + MulticlassLogisticLoss).
    • SoftmaxOutput as only doing the softmax, without loss functionality, so it cannot backpropagate unless a loss is attached to it.
    • Softmax as a shortcut for the two above, meaning that it comes with an attached loss by itself.
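A hypothetical usage sketch of the three proposed symbols; these names follow the proposal above, are not an existing mxnet API, and the parameter names are guesses:

prob = SoftmaxOutput(data=fc)               # normalization only, no loss
loss = SoftmaxLoss(data=prob, label=label)  # the multiclass logistic loss on top
net  = Softmax(data=fc, label=label)        # shortcut: output with loss attached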

@tqchen
Member

tqchen commented Oct 30, 2015

I actually think we can make it two cases; this choice removes the need to introduce attach_loss:

  • SoftmaxOutput or MulticlassProbOutput as the current Softmax: attached backprop gradient, outputs the prediction value (multiclass probabilities).
    • Basic users do not know how to choose the loss to combine with the output transformation, and usually there are only a limited number of such combinations, e.g. softmax is usually composed with a cross-entropy loss.
    • The convention is that XXXOutput is the "task-oriented" special symbol that gives the desired output, plus the training rule (gradient) in backprop.
  • A new Softmax that backprops correctly in all cases, so we can later compose any loss on top of it.
    • This can be done in the longer term.

@tqchen
Member

tqchen commented Oct 30, 2015

Opened a vote in #434.
