
The second original Swish paper #1

Open
EliasHasle opened this issue Oct 31, 2018 · 4 comments

Comments

@EliasHasle

EliasHasle commented Oct 31, 2018

Here: https://www.semanticscholar.org/paper/Searching-for-Activation-Functions-Ramachandran-Zoph/c8c4ab59ac29973a00df4e5c8df3773a3c59995a

It was published on arXiv before your paper, so in my opinion it should be cited and discussed. Through their search, they found (or "found") the swish function with a beta factor inside the sigmoid, whereas you add one outside.

As far as I can see, for unconstrained weights a beta outside the sigmoid does exactly the same thing as scaling all the weights out of the node, so the network can represent exactly the same functions as pure swish (except that the last layer may have no outgoing weights). Likewise, a beta inside the sigmoid is equivalent to rescaling all the weights into the node (except that the first layer may have no incoming weights).
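For concreteness, here is a minimal numerical sketch of the first equivalence. The setup is a toy one I made up (a single hidden unit with an outgoing weight `w_out`; the function names are only illustrative), not code from either paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(x):
    return x * sigmoid(x)          # plain swish (beta = 1)

def e_swish(x, beta):
    return beta * x * sigmoid(x)   # beta applied outside the sigmoid

x = np.linspace(-5.0, 5.0, 11)     # pre-activations of a hidden unit
beta, w_out = 1.5, 0.7             # arbitrary values for the check

# e-swish feeding the original outgoing weight ...
a = w_out * e_swish(x, beta)
# ... matches plain swish feeding a rescaled outgoing weight.
b = (beta * w_out) * swish(x)

print(np.allclose(a, b))  # True
```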

So basically, the beta parameters only affect the learning process, and they will obviously interact with other training choices and with regularization. (Using SGD instead of Adam for a comparison based on another paper counts as such a choice.)

Please enlighten me if I am wrong.

@MichaelFomenko

He clearly doesn't understand anything about deep learning; he only published this paper to have a published paper for his career.

@hypnopump
Owner

Hi, @EliasHasle
I already cite the paper by Ramachandran et al. in my paper.
With respect to the concern about the beta parameter, the two are not the same:

  • As you increase the beta in swish, the function approaches ReLU.
  • As you increase the beta in e-swish, you scale up the x*sigmoid(x) function, amplifying its properties (see the sketch below).

I hope the image clarifies it!

[image: comparison of swish and e-swish as beta increases]
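A rough numerical sketch of the same point (my own toy comparison in plain NumPy with illustrative values, not the figure above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta):
    return x * sigmoid(beta * x)   # beta inside the sigmoid (Swish)

def e_swish(x, beta):
    return beta * x * sigmoid(x)   # beta outside the sigmoid (E-swish)

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-3.0, 3.0, 121)
for beta in (1.0, 10.0, 100.0):
    gap_swish = np.max(np.abs(swish(x, beta) - relu(x)))      # shrinks as beta grows
    gap_eswish = np.max(np.abs(e_swish(x, beta) - relu(x)))   # grows as beta grows
    print(f"beta={beta:6.1f}  max|swish-relu|={gap_swish:.4f}  max|e-swish-relu|={gap_eswish:.4f}")
```

With beta inside the sigmoid, swish approaches ReLU as beta grows, while e-swish with beta outside remains a scaled copy of x*sigmoid(x).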

@hypnopump
Owner

@MichaelFomenko Always glad to receive constructive criticism.

@MichaelFomenko

Sorry, EricAlcaide, to tell you the truth, but you clearly don't understand deep learning. If you understood deep learning, you would know that the beta in your E-Swish function is just a weight of the next layer. This means that mathematically there is no difference between your E-Swish and the Swish function.
