[Discussion] Softmax and SoftmaxLoss #426
Comments
I think we need 5 softmax modes:
These 5 can be implemented in cudnn_activation_op. We can rename Softmax to SoftmaxLoss, but install a phony Softmax and give a warning when it is used.
@piiswrong I think we may introduce a special symbol family called loss symbols, and the graph must end with a loss symbol. But how to compose a loss symbol has not been investigated yet.
A loss symbol at the top is needed only when you want to train. We can disable it when we only want to predict.
@antinucleon Can you explain what the historical burden with pre-trained models is? Maybe we just need to edit the json file manually if renaming is necessary?
@antinucleon By composing loss, are you talking about networks with multiple loss functions? I think @piiswrong has a good point here: the loss function is only needed during training. It is kind of inconvenient to have two symbols for training and predicting simply because one is with a loss layer and the other is without. One possible way to go around this is to distinguish the output from the loss attached to it.
Something like this:

```python
net = Variable("data")
net = FullyConnected(data=net, num_hidden=100)
net = Softmax(data=net)
net.attach_loss(MulticlassLogisticLoss())
```

and with multiple losses:

```python
data = Variable("data")
fc1 = FullyConnected(data=data, name="fc1", num_hidden=100)
net = Softmax(data=fc1, name="smax1")
net2 = FullyConnected(data=fc1, name="fc2", num_hidden=100)
net2 = Softmax(data=net2, name="smax2")
net.attach_loss(MulticlassLogisticLoss())
net2.attach_loss(MulticlassLogisticLoss())
net_comb = Group(net, net2)
net_comb.list_outputs() # => ['smax1_output', 'smax2_output']
```

For the loss function, we have two choices:
Also, if we use
With one output, multi softmax is much slower than a single softmax.
There are two ways as far as I can summarize:
In terms of implementation, we can blend the two implementations together by detecting the shapes. I think I am fine with either way. As usual, I think the most effective way is to list the proposals and vote for the one we like. @pluskid @antinucleon @piiswrong
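A minimal numpy sketch of how the shape-based detection could work (an illustration under assumed conventions, not MXNet operator code): a label with the same shape as the prediction is treated as a full target distribution, while a label with one fewer dimension is treated as integer class indices.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_loss_grad(x, label):
    # Gradient of softmax cross-entropy w.r.t. x, dispatching on label shape
    # (x assumed 2-d: batch x classes).
    prob = softmax(x)
    if label.shape == x.shape:
        # label is a full target distribution
        return (prob - label) / x.shape[0]
    elif label.ndim == x.ndim - 1:
        # label holds integer class indices
        onehot = np.zeros_like(prob)
        onehot[np.arange(x.shape[0]), label.astype(int)] = 1.0
        return (prob - onehot) / x.shape[0]
    else:
        raise ValueError("incompatible label shape %s for data shape %s"
                         % (label.shape, x.shape))
```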
For the strict math case, maybe we could add a lightweight graph-rewriting pass in binding to rewrite the softmax plus cross-entropy chain into one operator.
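A toy sketch of that rewrite idea (the `Node` class and op-name strings are made up for illustration; this is not MXNet's internal graph API): walk the graph bottom-up and splice in a single fused node wherever a cross-entropy node consumes a softmax node.

```python
class Node:
    def __init__(self, op, inputs=None):
        self.op = op              # e.g. "variable", "softmax", "cross_entropy"
        self.inputs = inputs or []

def fuse_softmax_ce(node):
    # rewrite children first, then check for the pattern at this node
    node.inputs = [fuse_softmax_ce(i) for i in node.inputs]
    if (node.op == "cross_entropy" and len(node.inputs) == 2
            and node.inputs[0].op == "softmax"):
        x = node.inputs[0].inputs[0]   # the softmax's input
        y = node.inputs[1]             # the label
        return Node("softmax_ce", [x, y])
    return node

# usage: cross_entropy(softmax(x), y) collapses to softmax_ce(x, y)
x, y = Node("variable"), Node("variable")
out = fuse_softmax_ce(Node("cross_entropy", [Node("softmax", [x]), y]))
print(out.op)  # => softmax_ce
```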
Also, the two solutions do not necessarily conflict with each other.
And here is the fun fact about different user types.
I also agree that separating the loss and the output is a good idea. The output is the one we get for prediction, while the loss penalizes our predicted output against the ground truth. In math, softmax means the softmax normalization; why should it involve the label in ML? https://en.wikipedia.org/wiki/Softmax_function We can use
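To make the separation concrete, here is a small numpy illustration (not actual mxnet code; the loss name just echoes the `MulticlassLogisticLoss` proposal above): the softmax output never touches the label, and the loss is a separate function applied only at training time.

```python
import numpy as np

def softmax(x):
    # prediction output: pure normalization over classes, no label involved
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multiclass_logistic_loss(prob, label):
    # training-time loss: penalizes predicted probabilities against
    # ground-truth class indices
    idx = np.arange(prob.shape[0])
    return -np.mean(np.log(prob[idx, label.astype(int)] + 1e-12))
```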
Besides, we'd better have a detailed document about how to implement a new loss function. This is one of the most frequent questions asked by DL researchers who want to use mxnet to implement their proposed networks.
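As a rough idea of what such a document could cover (illustrative conventions only, not an existing mxnet interface), a new loss needs just two pieces: the scalar loss value and its gradient with respect to the network output. For example, a multi-class squared-error loss:

```python
import numpy as np

class SquaredLoss:
    # A user-defined loss: 0.5 * ||pred - onehot(label)||^2, averaged over the batch.
    def forward(self, pred, label):
        onehot = np.zeros_like(pred)
        onehot[np.arange(pred.shape[0]), label.astype(int)] = 1.0
        self._diff = pred - onehot          # cached for the backward pass
        return 0.5 * np.mean(np.sum(self._diff ** 2, axis=1))

    def gradient(self):
        # d(loss)/d(pred)
        return self._diff / self._diff.shape[0]
```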
I agree. Behind the scenes, code could be shared, but for the interface, I think
I actually think we can make two cases; this choice removes the need of introducing
Opened a vote in #434.
Currently, our softmax is a softmax loss. This brings some computational efficiency; however, the drawback is that it is hard to change the loss function. I think it is time for us to separate Softmax and SoftmaxLoss, but the problem is the historical burden, e.g. pretrained models.
I think this is an urgent problem that needs to be fixed this week, or we may end up with an even heavier burden.
Let's discuss a solution.
@pluskid @piiswrong @tqchen @winstywang
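For context on the efficiency point, a numpy illustration (not the actual operator code): when softmax and the cross-entropy loss are fused, the backward pass collapses to `softmax(x) - onehot(label)`, which is cheap and numerically stable, but it also hard-wires cross-entropy into the operator, which is exactly why swapping in another loss is hard today.

```python
import numpy as np

def fused_softmax_loss_backward(x, label):
    # gradient of the fused softmax + cross-entropy loss w.r.t. x
    e = np.exp(x - x.max(axis=1, keepdims=True))
    prob = e / e.sum(axis=1, keepdims=True)
    prob[np.arange(x.shape[0]), label.astype(int)] -= 1.0
    return prob / x.shape[0]
```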