diff --git a/big-bird.md b/big-bird.md
index e76a4bd939..11eb2955b5 100644
--- a/big-bird.md
+++ b/big-bird.md
@@ -179,11 +179,11 @@ Random attention is ensuring that each query token will attend a few random toke
 
 ```python
 # r1, r2, r are some random indices; Note: r1, r2, r3 are different for each row 👇
-Q[1] x [Q[r1], Q[r2], ......, Q[r]]
+Q[1] x [K[r1], K[r2], ......, K[r]]
 .
 .
 .
-Q[n-2] x [Q[r1], Q[r2], ......, Q[r]]
+Q[n-2] x [K[r1], K[r2], ......, K[r]]
 # leaving 0th & (n-1)th token since they are already global
 ```
 
@@ -209,7 +209,7 @@ Attention score for \\(\mathbf{q}_{1}\\) represented by \\(a_1\\) where \\(a_1=S
 
 ---
 
-For calculating attention score for tokens in seconcd block, we are gathering the first three blocks, the last block, and the fifth block. Then we can compute \\(a_2 = Softmax(q_2 * concat(k_1, k_2, k_3, k_5, k_7)\\).
+For calculating the attention score for tokens in the second block, we gather the first three blocks, the last block, and the fifth block. Then we can compute \\(a_2 = Softmax(q_2 * concat(k_1, k_2, k_3, k_5, k_7))\\).
 
 ![BigBird block sparse attention](assets/18_big_bird/q2.png)
 
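For reference, a minimal numpy sketch of the two patterns the hunks above describe: each non-global query row attending to a few randomly chosen keys, and the second block's score computed over the gathered blocks k_1, k_2, k_3, k_5, k_7. This is an illustration only, not the post's code or the Hugging Face implementation, and it assumes toy tensors with a hypothetical block size of one token per block.

```python
# Illustration only: toy random attention and a block-sparse score, assuming
# block size 1 (each "block" is a single token) and small numpy tensors.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 7, 4                    # toy sequence length (7 "blocks") and head dim
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

# Random attention: rows 1..n-2 each attend to a few randomly chosen keys,
# i.e. Q[i] x [K[r1], K[r2], ..., K[r]]; row 0 and row n-1 are global, so skipped.
num_rand = 2
random_scores = {}
for i in range(1, n - 1):
    rand_idx = rng.choice(n, size=num_rand, replace=False)   # different per row
    random_scores[i] = softmax(Q[i] @ K[rand_idx].T)

# Second block's score: gather the first three blocks, the fifth block and the
# last block, then a_2 = softmax(q_2 @ concat(k_1, k_2, k_3, k_5, k_7)).
gathered = K[[0, 1, 2, 4, 6]]  # k_1, k_2, k_3, k_5, k_7 (0-based indexing)
a_2 = softmax(Q[1] @ gathered.T)
```

Gathering only a handful of key blocks per query row is what lets the attention cost grow linearly with sequence length rather than quadratically.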