
Please clarify if the Annotated Transformer Encoder LayerNorm implementation is correct.


The Transformer paper says the output of each sub-layer is LayerNorm(x + Dropout(SubLayer(x))).


LayerNorm should be applied after Dropout(SubLayer(x)), as per the paper.


However, the Annotated Transformer implementation does x + Dropout(SubLayer(LayerNorm(x))), where LayerNorm is applied before the sub-layer, i.e. the other way around:

import torch.nn as nn
# LayerNorm here is the Annotated Transformer's own LayerNorm module, defined earlier in that notebook.

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))   # <--- LayerNorm before SubLayer

1 Answer

The original paper applies Dropout to the output of the sub-layer (e.g. Multi-Head Attention) before the residual connection and Layer Normalization. This is called Post-Normalization (Post-LN).

From the paper: "We apply dropout to the output of each sub-layer, before it is added to the sub-layer input (x) and (layer) normalized."
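
For concreteness, here is a minimal sketch of what a Post-LN sublayer connection would look like, following the paper's formula LayerNorm(x + Dropout(SubLayer(x))). The class name PostLNSublayerConnection is made up for this comparison, and PyTorch's built-in nn.LayerNorm stands in for the Annotated Transformer's custom LayerNorm:

import torch
import torch.nn as nn

class PostLNSublayerConnection(nn.Module):
    """Residual connection followed by LayerNorm, matching the paper's
    Post-LN formula: LayerNorm(x + Dropout(SubLayer(x)))."""

    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Sub-layer output -> dropout -> residual add -> normalize last
        return self.norm(x + self.dropout(sublayer(x)))

# Example: pass a tensor through with an identity "sub-layer"
x = torch.randn(2, 10, 512)
out = PostLNSublayerConnection(512, dropout=0.1)(x, lambda t: t)

The only structural difference from the quoted SublayerConnection is where self.norm sits: here it wraps the residual sum, whereas in the Annotated Transformer it wraps the sub-layer input.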

However, a more recent approach is Pre-Normalization (Pre-LN), where LayerNorm is applied to the input x before it enters the sub-layer, as explained in Let's build GPT: from scratch, in code, spelled out:

Very few details about the Transformer have changed in the last five years, but there is something that slightly departs from the original paper. You see that Add and Norm is applied after the transformation (Multi-Head Attention). But now it is more common to apply the LayerNorm before the transformation, so there is a reshuffling of the Layer Norm. This is called the pre-norm formulation, and that is the one we are going to implement as well.

This is proposed in On Layer Normalization in the Transformer Architecture.


The Annotated Transformer also follows this approach.
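
Below is a rough sketch of the pre-norm formulation as a full transformer block, in the spirit of the video's description. It assumes PyTorch's built-in nn.LayerNorm and nn.MultiheadAttention; the class name PreLNBlock and the feed-forward sizes are illustrative, not taken from either source:

import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN transformer block: LayerNorm is applied to the input of each
    sub-layer, and the residual adds the un-normalized input back,
    i.e. x + Dropout(SubLayer(LayerNorm(x)))."""

    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sub-layer: normalize first, then attend, then add residual
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)
        # Feed-forward sub-layer: same pre-norm pattern
        x = x + self.dropout(self.ffn(self.ln2(x)))
        return x

x = torch.randn(2, 10, 512)
out = PreLNBlock(d_model=512, num_heads=8)(x)

Note that the residual path is a plain identity (x + ...) with no normalization on it, which is the direct "highway" for information flow that Pre-LN is usually credited with.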


1 Comment

Pre-LN normalizes the inputs immediately before they go through further computation such as MHA or FFN, which exploits the full potential of normalization. And from the perspective of the residual connection, Pre-LN gives a purer, more direct highway for information flow. Of course this is a hindsight judgement; performance speaks.
