
Please clarify if the Annotated Transformer Encoder LayerNorm implementation is correct.


The Transformer paper says the output of each sub-layer is LayerNorm(x + Dropout(SubLayer(x))).


LayerNorm should be applied after Dropout(SubLayer(x)), as per the paper.


However, the Annotated Transformer implementation does x + Dropout(SubLayer(LayerNorm(x))), where LayerNorm is applied before the sub-layer, i.e. the other way around:

import torch.nn as nn
# LayerNorm here is the Annotated Transformer's own LayerNorm module, defined earlier in that notebook.

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))   # <--- LayerNorm before SubLayer

1 Answer

The original paper applies Dropout to the output of the sub-layer (e.g. Multi-Head Attention) before the residual connection and Layer Normalization. This is called Post-Normalization (Post-LN).

From the paper: "We apply dropout to the output of each sub-layer, before it is added to the sub-layer input (x) and (layer) normalized."
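
For concreteness, here is a minimal sketch of what a Post-LN sublayer connection would look like, following the paper's formula LayerNorm(x + Dropout(SubLayer(x))). The class name PostLNSublayerConnection is made up for this comparison, and PyTorch's built-in nn.LayerNorm stands in for the Annotated Transformer's custom LayerNorm:

import torch
import torch.nn as nn

class PostLNSublayerConnection(nn.Module):
    """Residual connection followed by LayerNorm, matching the paper's
    Post-LN formula: LayerNorm(x + Dropout(SubLayer(x)))."""

    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Sub-layer output -> dropout -> residual add -> normalize last
        return self.norm(x + self.dropout(sublayer(x)))

# Example: pass a tensor through with an identity "sub-layer"
x = torch.randn(2, 10, 512)
out = PostLNSublayerConnection(512, dropout=0.1)(x, lambda t: t)

The only structural difference from the quoted SublayerConnection is where self.norm sits: here it wraps the residual sum, whereas in the Annotated Transformer it wraps the sub-layer input.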

However, a more recent approach is Pre-Normalization (Pre-LN), where LayerNorm is applied to the input x before it enters the sub-layer, as explained in Let's build GPT: from scratch, in code, spelled out:

Very few details about the Transformer have changed in the last five years, but there is something that slightly departs from the original paper. You see that Add and Norm is applied after the transformation (Multi-Head Attention). But now it is more common to apply the LayerNorm before the transformation, so there is a reshuffling of the Layer Norm. This is called the pre-norm formulation, and that is the one we are going to implement as well.

This is proposed in On Layer Normalization in the Transformer Architecture.


The Annotated Transformer also follows this approach.
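
Below is a rough sketch of the pre-norm formulation as a full transformer block, in the spirit of the video's description. It assumes PyTorch's built-in nn.LayerNorm and nn.MultiheadAttention; the class name PreLNBlock and the feed-forward sizes are illustrative, not taken from either source:

import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN transformer block: LayerNorm is applied to the input of each
    sub-layer, and the residual adds the un-normalized input back,
    i.e. x + Dropout(SubLayer(LayerNorm(x)))."""

    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sub-layer: normalize first, then attend, then add residual
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)
        # Feed-forward sub-layer: same pre-norm pattern
        x = x + self.dropout(self.ffn(self.ln2(x)))
        return x

x = torch.randn(2, 10, 512)
out = PreLNBlock(d_model=512, num_heads=8)(x)

Note that the residual path is a plain identity (x + ...) with no normalization on it, which is the direct "highway" for information flow that Pre-LN is usually credited with.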


1 Comment

Pre-LN normalizes the inputs immediately before they go through further computation such as MHA or FFN, which exploits the full potential of normalization. And from the perspective of the residual connection, Pre-LN gives a purer, more direct highway for information flow. Of course this is a hindsight judgement; performance speaks.
