
Notes for Harvard's CS 287 NLP Course

Posted on March 14, 2023 • Tags: AI nlp lecture notes

NOTE: These are all taken from Chris Tanner’s great NLP course CS 287 (taught at Harvard). None of this is my own work. This is just a collection of screenshots and notes from my own reading of the course’s lecture slides for my own reference and understanding. I would highly recommend reading the full lecture slides available here.


Types of Tokens

  • <S> = start of sentence token

    • Insert these manually at start of each sentence

    • To generate a new sentence, feed <S> into model and take most likely prediction

  • <UNK> = unknown word (i.e. not in vocabulary V)
  • <EOS> = end of text generation
  • <CLS> = class token
    • Insert at start of sequence fed into BERT
    • Used to represent overall aggregate learned representation of a sequence
  • <SEP> = separator token
    • Insert between sentences in sequence fed into BERT


  • Logit = Log-odds of a probability
    • $F(p): [0,1] \rightarrow [-\infty, \infty] = \log(\frac{p}{1 - p})$
    • Typically the last layer of a classifier
    • If logit < 0, then prob. < 0.5. If logit > 0, then prob. > 0.5
  • Softmax = Turns logits back into probabilities
    • $F(\mathbf{x})_i: [-\infty, \infty] \rightarrow [0,1] = \frac{e^{x_i}}{\sum_{j = 1}^N e^{x_j}}$
    • Immediately after the logit layer
    • Normalizes the sum of outputs to be 1 (otherwise, same as sigmoid)
    • Here, $\mathbf{x}$ is the output vector where each element corresponds to a class of labels
    • Used for multi-class classification where outputs are mutually exclusive
  • Sigmoid = 2-class special case of softmax, where you only apply the sigmoid to one class to find $P(Y = 1)$ (and simply do $P(Y = 0) = 1 - P(Y = 1)$)
    • $F(\mathbf{x}_i): [ -\infty, \infty] \rightarrow [0, 1] = \frac{1}{1 + e^{-x_i}}$
    • Immediately after the logit layer
    • Used for 2-class classification where outputs are mutually exclusive
    • Used for multi-class classification where outputs are not mutually exclusive, i.e. apply to each class independently
  • High temperature = more uniform softmax probability distribution
  • Low temperature = sharper (i.e. increased likelihood of high probability outcomes)
    • Temperature = 0 <=> Greedy decoding
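The effect of temperature on the softmax can be sketched as follows. This is a minimal illustration in plain Python; the function name `softmax` and the example logits are my own assumptions, not taken from the lecture. Note that the code divides by the temperature, so temperature = 0 is only the limiting case (greedy decoding) and cannot be passed directly.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; logits are divided by the temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # hypothetical classifier outputs
sharp = softmax(logits, temperature=0.5)  # low temperature: mass concentrates on the top class
flat = softmax(logits, temperature=5.0)   # high temperature: closer to uniform
```

Comparing `sharp` and `flat` shows the top class gaining probability as the temperature drops.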

Helpful explainer: https://huggingface.co/blog/how-to-generate

Where to find papers

  • nlpprogress.com
  • connectedpapers.com
  • paperswithcode.com/sota

3) Language Modeling [Source]


Our document consists of words $\{w_1, w_2, \ldots, w_T\}$

$V$ is our vocabulary of unique words


Language Model: “A Language Model estimates the probability of any sequence of words”

Token: Specific occurrence of a word, e.g. “I ran and ran and ran” -> {I, ran, and, ran, and, ran }

Type: General form of word, e.g. “I ran and ran and ran” -> {I, ran, and}

OOV: “Out of vocabulary” words, replace with <UNK>


Screen Shot 2022-08-17 at 2.27.26 AM


Assume each 1-gram is independent.

\(P(w_1,...,w_T) = \prod_{t = 1}^T P(w_t)\)

Screen Shot 2022-08-16 at 5.58.17 PM

Screen Shot 2022-08-16 at 6.00.37 PM


Assume each word depends only on the word before it. \(P(w_1,...,w_T) = P(w_1) \prod_{t = 2}^T P(w_t | w_{t-1})\)
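A bigram model can be estimated by simple counting. The sketch below is a minimal maximum-likelihood version with no smoothing; the function names and the toy corpus are assumptions for illustration. It prepends `<S>` to each sentence so that the first word is also conditioned on something.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and (previous word, word) pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<S>"] + sent.split()
        unigrams.update(tokens[:-1])                  # every token that acts as a context
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous, current) pairs
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """MLE estimate of P(word | prev); 0.0 for an unseen context."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

corpus = ["I ran and ran", "I ran home"]  # toy corpus (hypothetical)
unigrams, bigrams = train_bigram(corpus)
p_ran_given_i = bigram_prob("I", "ran", unigrams, bigrams)
p_and_given_ran = bigram_prob("ran", "and", unigrams, bigrams)
```

In practice unseen bigrams would get zero probability here, which is why real n-gram models add smoothing.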

Screen Shot 2022-08-16 at 6.01.37 PM

Screen Shot 2022-08-16 at 6.01.27 PM

Screen Shot 2022-08-16 at 10.11.49 PM

N-Gram Model

Condition on all previous words in document \(P(w_1,...,w_T) = \prod_{t = 1}^T P(w_t | w_{t-1},...,w_1)\)



  • Interpretation: Avg number of bits needed to represent a word
\[H = -\frac{1}{N} \sum_{i = 1}^N \log_2(P(w_i))\]


  • Definition: Inverse probability of test set, normalized by number of words
    • Aka: “exponentiated, per-word cross-entropy”
  • Interpretation: Branching factor needed to predict next word – more branches = more uncertainty
\[\begin{align*} PP(w_1,...,w_T) &= \sqrt[T]{\frac{1}{P(w_1,...,w_T)}}\\ &= 2^{-\frac{1}{T} \sum_{t = 1}^T \log_2{P(w_t)}} \end{align*}\]
  • Good models have $PP \in [40, 100]$

Screen Shot 2022-08-16 at 10.05.49 PM

  • If model assumes uniform distribution of words, then $PP = |V|$
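As a quick check of the perplexity definition, the sketch below computes it from per-word probabilities as 2 raised to the negative average log2 probability. The function name and the toy vocabulary size are assumptions; the input is whatever probability the model assigned to each word of the test text.

```python
import math

def perplexity(word_probs):
    """word_probs: probability the model assigned to each word in the test text."""
    T = len(word_probs)
    avg_log2 = sum(math.log2(p) for p in word_probs) / T
    return 2 ** (-avg_log2)

vocab_size = 8                                 # toy vocabulary size (hypothetical)
uniform_pp = perplexity([1 / vocab_size] * 5)  # a uniform model gives every word 1/|V|
```

With a uniform model the result equals the vocabulary size, matching the "branching factor" interpretation.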

Screen Shot 2022-08-16 at 10.05.33 PM

Featurized Model

Goal: Have $\#\text{features} \ll |V|$

Screen Shot 2022-08-16 at 10.07.34 PM

Screen Shot 2022-08-16 at 10.13.00 PM

Screen Shot 2022-08-16 at 10.08.20 PM

Goal: Learn embedding $v_i$ for each word $w_i$ i.e. learn the red bias vector and blue matrix $N\times V$

Screen Shot 2022-08-16 at 10.14.14 PM

4) Neural Nets [Source]


Distributional: Meaning of word is determined by its context, i.e. “You shall know a word by the company it keeps”

Distributional Representation: Dense embedding vectors convey meaning of token by factoring in context

  • Word Embedding (“Type-based”): Distributional representation unique for each word type, i.e. all “banks” have the same learned vector
    • Examples: Bengio 2003, Word2Vec
  • Contextualized Embedding (“Token-based”): Distributional representation unique for each token, i.e. the word “banks” can have different vectors depending on where it is used in the document
    • Examples: RNNs, LSTMs, ELMo

Autoregressive LM: Predict next word only using previous words (previous outputs become inputs)

  • e.g. I want a _____
  • Evaluation: Perplexity

Masked LM: Predict “masked” word in middle of sequence using before/after words

  • e.g. I want to _____ a bagel.
  • Evaluation: Downstream NLP tasks which use learned embeddings

Bengio (2003)

Idea: Simultaneously learn representations + do modeling

Screen Shot 2022-08-16 at 10.53.00 PM


  1. Use weight matrix + bias to calculate probability of label
  2. Calculate CE loss
  3. Backprop to calculate gradients
  4. Update weight matrix + bias

Word2Vec (2013)

Goal: Create word embeddings such that words that appear in similar contexts have similar embeddings

Approach 1) Continuous Bag-of-Words (CBOW)

Goal: Predict current word based on surrounding words


  1. Iterate over corpus using sliding “context window” of size $N$, step size 1
  2. Use $2N$ context words (except for word at center of window) to predict center word
  3. Apply softmax, calculate loss

Screen Shot 2022-08-17 at 2.02.18 AM

\(\begin{align*} Hx &= (D \times V)(V \times 2N)\\ &= D \times 2N\\ sum(Hx) &= \text{row-wise sum across each index in the feature vector for each context word}\\ &= D \times 1\\ U * sum(Hx) &= (V \times D) (D \times 1)\\ &= V \times 1\\ y &= V \times 1 \end{align*}\)
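The shape bookkeeping above can be traced with a small NumPy sketch of one CBOW forward pass. The sizes, the random weights, and the context word indices are toy assumptions; only the matrix shapes follow the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, N = 10, 4, 2           # vocab size, embedding dim, window half-width (toy values)
H = rng.normal(size=(D, V))  # input embedding matrix, D x V
U = rng.normal(size=(V, D))  # output projection matrix, V x D

context_ids = [1, 3, 5, 7]   # indices of the 2N context words (hypothetical)
x = np.zeros((V, 2 * N))
x[context_ids, np.arange(2 * N)] = 1.0       # one one-hot column per context word

hidden = (H @ x).sum(axis=1, keepdims=True)  # sum over context words, D x 1
scores = U @ hidden                          # one score per vocabulary word, V x 1
probs = np.exp(scores - scores.max())
probs /= probs.sum()                         # softmax over the vocabulary
```

Because `x` is one-hot, `H @ x` just selects the embedding columns of the context words, which is why implementations usually use a lookup instead of a matrix product.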

Approach 2) Skip-gram + negative sampling

Goal: Predict surrounding words given current word


  1. Iterate over corpus using sliding “context window” of size $N$, step size 1
  2. Use center word to predict all $2N$ context words
  3. Apply softmax, calculate loss

Screen Shot 2022-08-17 at 2.10.21 AM

Need to add negative samples (randomly drawn words that are not in the context window), otherwise the model only ever sees positive pairs and will simply learn to predict 1.

General Word2Vec Takeaways


  • Smaller window size -> similar embeddings means “interchangeable” words
  • Larger window size -> similar embeddings means “related” words

Screen Shot 2022-08-17 at 2.34.28 AM

Screen Shot 2022-08-17 at 2.33.50 AM

Screen Shot 2022-08-17 at 2.31.21 AM


Word similarity

Use SimLex-999 dataset to compare embedding distance with word similarity

Screen Shot 2022-08-17 at 2.24.49 AM

Word analogy

Check whether analogies hold in embedding space

Screen Shot 2022-08-17 at 2.24.58 AM

Downstream NLP tasks

“External” to model to evaluate utility of embeddings

Screen Shot 2022-08-17 at 2.34.04 AM

5) RNNs [Source]

Definition: NN with a non-linear combination of recurrent state (i.e. “hidden layer”) and the input

Goal: Model long-range dependencies in language (i.e. have “infinite” concept of past words, as opposed to fixed window used by Word2Vec)

Idea: Re-use hidden layer from previous output to predict next output

Hidden layer represents the “meaning” of a word


  • Regardless of output predictions, feed in the actual ground truth inputs at each step (i.e. not autoregressive)
  • Total loss = avg across all words

Screen Shot 2022-08-17 at 2.42.53 AM

Alternative view of the same RNN model:

Screen Shot 2022-08-17 at 2.38.23 AM


  • Feed previous output as input to next output
  • NOTE: Same word (“Harry”) can yield different most probable outputs depending on context (thanks to hidden embedding, unlike vanilla NNs + n-grams)

Screen Shot 2022-08-17 at 2.48.07 AM

Strengths + Issues

Screen Shot 2022-08-17 at 2.49.19 AM

Issue: Exploding + Vanishing Gradients

Caused by applying the chain rule across many time steps (the same gradient factors get multiplied repeatedly)

  • Small gradient = far away context is “forgotten”

  • Large gradient = recency bias without context

Screen Shot 2022-08-17 at 1.59.35 PM

Solution: Gradient Clipping

Fixes the exploding gradients problem

  • Sets max magnitude (i.e. “norm”) of gradient to some threshold
  • Helps with numerical stability of training
  • Doesn’t make model more accurate
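Clipping by norm can be sketched in a few lines. This is a minimal plain-Python version operating on a flat list of gradient values; the function name and threshold are assumptions, and deep learning frameworks provide their own equivalents.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)        # already small enough, leave as-is
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is scaled down to norm 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left alone
```

Only the magnitude changes, so the update still points in the same direction as the original gradient.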

Screen Shot 2022-08-17 at 1.36.25 PM

6) LSTMs [Source]

Idea: Better RNN by fixing long-term forgetting issue

Solution: Add a dedicated memory cell $C$ for long-term memory, in addition to the usual hidden state $h$

Modify $C$ via three gates:

  1. Forget
  2. Input
  3. Output

Each gate looks like this in diagrams:

Screen Shot 2022-08-17 at 4.44.30 PM


Screen Shot 2022-08-17 at 4.40.32 PM

Images in below sections are taken from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Cell State

Cell state $C$ is a “conveyor belt” – it runs through the LSTM and gets modified by $x$ and $h$

Screen Shot 2022-08-17 at 4.41.07 PM


1) Forget Gate

  • Goal: Decide what info to throw out from $C_{t-1}$, based on $h_{t-1}$ and $x_t$
  • Interpretation: Forget old memories
  • $\sigma$ generates a value between $[0,1]$ for each element in $C_{t-1}$
    • Value of 1 = “completely keep this element in long-term memory $C_{t-1}$”
    • Value of 0 = “completely forget this element from long-term memory $C_{t-1}$”

Screen Shot 2022-08-17 at 4.38.35 PM

2) Input Gate

  • Goal: Decide what info to change in $C_{t-1}$, based on $h_{t-1}$ and $x_t$
  • Interpretation: Make new memories
  • $\sigma$ generates a value between $[0,1]$ for each element in $C_{t-1}$ to decide which values to update
  • $\tanh$ squashes $h_{t-1}, x_t$ into new “candidate” values $\tilde{C}_t$ for the cell state


2b) Make Updates

Forget old memories: $f_t * C_{t-1}$

Update memory with new values (scaled by importance $i$): $i_t * \tilde{C}_t$

Screen Shot 2022-08-17 at 4.54.05 PM

3) Output Gate

  • Goal: Decide what to output, based on $C_t, h_{t-1}, x_t$
  • Interpretation: Read from memory (decide which parts of the cell state to expose)
  • $\sigma$ generates a value between $[0,1]$ for each element in $C_t$ to decide which values of cell state go into new hidden state $h_t$
  • $\tanh$ squashes $C_t$ into new hidden state values $h_t$
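Putting the forget, input, and output gates together, one LSTM time step can be sketched as below. This is a minimal NumPy version assuming a single stacked weight matrix `W` over the concatenation of $h_{t-1}$ and $x_t$; the layout of `W`, the function names, and the toy sizes are my assumptions rather than the lecture's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the four stacked gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:n])               # forget gate: what to keep from C_prev
    i_t = sigmoid(z[n:2 * n])           # input gate: which candidate values to write
    C_tilde = np.tanh(z[2 * n:3 * n])   # candidate values for the cell state
    o_t = sigmoid(z[3 * n:4 * n])       # output gate: what to expose as h_t
    C_t = f_t * C_prev + i_t * C_tilde  # forget old memories, then add new ones
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 2            # toy sizes (hypothetical)
W = rng.normal(size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
h_t, C_t = lstm_step(rng.normal(size=input_dim),
                     np.zeros(hidden_dim), np.zeros(hidden_dim), W, b)
```

The cell state update is purely additive and elementwise, which is what lets information (and gradients) pass through many steps largely unchanged.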

Screen Shot 2022-08-17 at 4.58.29 PM

Bi-Directional LSTMs

If full text is available at test time, then use context in both left-to-right and right-to-left directions

Screen Shot 2022-08-17 at 6.58.23 PM

Screen Shot 2022-08-17 at 6.59.16 PM

Stacked LSTMs

Screen Shot 2022-08-17 at 6.59.52 PM

ELMo = Stacked, Bi-directional LSTM

  • Yields “incredibly good contextualized embeddings”

Screen Shot 2022-08-17 at 7.01.30 PM


Screen Shot 2022-08-17 at 6.56.16 PM

7) Sequence Generation [Source]

Types of prediction

| | Input | Output |
|---|---|---|
| Regression | I love hiking! | 0.9 |
| Binary classification | I love hiking! | + or - sentiment |
| Multi-class classification | I love hiking! | Category 1, 2, 3, 4, or 5 |
| Structured | I love hiking! | PRP VBP NN |

Unconditioned + Conditional Prediction

Screen Shot 2022-08-22 at 2.52.46 PM

Seq2Seq Model

Problem: With LSTMs/RNNs, we can only have a fixed length output (either equal to the input sequence length or some fixed constant).

  • We go from $N \rightarrow \{ 1, N \}$

What if we want a variable length output? e.g. when translating between languages

  • We want to go from $N \rightarrow M$

Solution: Treat “sequences” as the fundamental unit we work with

  • Have two RNNs, one “encoder” and one “decoder” – “seq2seq” model

Screen Shot 2022-08-22 at 2.28.57 PM

Training + Inference

Training: Backprop loss from decoder outputs all the way back to beginning (i.e. both encoder and decoder)

Testing: Run decoder until it outputs the <EOS> token. Each decoder output $\hat{y}_i$ becomes the subsequent input $x_{i + 1}$


  • Main benefit of seq2seq = having a separate encoder/decoder allows outputs to be variable length


Insight: Instead of just paying attention to the last embedding, pay attention (weighted by importance) to all hidden states generated by the encoder as it reads the input sequence.

Definition: Attention allows a decoder, at each time step, to focus/use different amounts of the encoder’s hidden states

  1. Each hidden state $h_i^E$ gets a unique raw weight $e_i$ based on its relevance to the decoder state $h_j^D$. This raw weight is calculated via a separate NN (but can be calculated via any arbitrary function).

Screen Shot 2022-08-22 at 2.36.09 PM

  2. Softmax across all $e_i$ to get an “attention score” $a_i$

Screen Shot 2022-08-22 at 2.39.29 PM

  3. Multiply each hidden state $h_i^E$ by $a_i$, then sum across all hidden states to get a context vector $c_j^D$

Screen Shot 2022-08-22 at 2.39.57 PM

  4. Use $h_j^D, c_j^D$ to predict $\hat{y}_j$
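The scoring, softmax, and weighted-sum steps above can be sketched in NumPy. The notes say the raw weight can come from a separate NN or any function; the sketch below assumes a plain dot product as the scoring function, and the function name and toy vectors are hypothetical.

```python
import numpy as np

def attend(encoder_states, decoder_state):
    """encoder_states: (T, d) array of h_i^E; decoder_state: (d,) vector h_j^D."""
    e = encoder_states @ decoder_state  # raw scores e_i (dot product is an assumed choice)
    a = np.exp(e - e.max())
    a /= a.sum()                        # attention scores a_i, summing to 1
    context = a @ encoder_states        # context vector c_j^D: weighted sum of the h_i^E
    return a, context

enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three encoder hidden states (toy)
dec = np.array([2.0, 0.0])                            # current decoder state (toy)
weights, context = attend(enc, dec)
```

The final prediction step would then feed both `dec` and `context` into the output layer.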

Screen Shot 2022-08-22 at 2.39.16 PM

Attention Formulas + Functions

Screen Shot 2022-08-22 at 2.42.23 PM

Screen Shot 2022-08-22 at 2.43.38 PM


  • Greatly improves seq2seq results by conditionally weighting model’s focus
  • Allows us to visualize contribution that each encoding word has for each decoder output

Screen Shot 2022-08-22 at 3.13.43 PM


  • LSTMs were SoTA on most NLP tasks (2014-18)
  • Seq2seq + Attention is even better
    • Place appropriate weight on encoder’s hidden states when decoding
  • Drawback: LSTMs still require iteratively reading each word and waiting until we’ve read the entire sentence before we can start predicting

8) Machine Translation [Source]

Definition: Machine translation (MT) = convert text from one language into another

Seq2Seq Decoding for MT

Greedy Decoding: Pick most likely word at each step

Beam Search: Sequentially consider $k$ most likely words at each step. Prune before expanding
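The difference between greedy decoding and beam search can be sketched with a toy model. Everything here is an illustrative assumption: `next_log_probs` is a hypothetical interface that returns log-probabilities for the next token given a prefix, and `toy_model` is a hand-written distribution in which the greedy first choice leads to a worse overall sequence.

```python
import math

def beam_search(next_log_probs, k, max_len, eos="<EOS>"):
    """Keep the k highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypotheses carry over
                continue
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        # prune to the k best before expanding again
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams

def toy_model(prefix):
    # Hypothetical next-token distributions; any other prefix ends the sequence.
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
        ("b",): {"z": math.log(0.9), "w": math.log(0.1)},
    }
    return table.get(prefix, {"<EOS>": 0.0})

best_tokens, best_score = beam_search(toy_model, k=2, max_len=3)[0]
greedy_tokens, greedy_score = beam_search(toy_model, k=1, max_len=3)[0]
```

With `k=1` the search reduces to greedy decoding and commits to "a"; with `k=2` it keeps "b" alive long enough to find the higher-probability sequence.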

__TODO: Add here__

Strengths / Issues of Seq2Seq


  • SoTA performance
  • Uses context robustly
  • Minimal feature engineering
  • End-to-end optimization


  • OOV
  • Training data (domain mismatch, low resource languages, biases)
  • Long context still hard
  • Not interpretable
  • Hard to control
  • Exploding/vanishing gradient issues

Screen Shot 2022-08-22 at 3.18.36 PM

BLEU Score

“Bilingual Evaluation Understudy”


  • Eval metric that measures similarity between gold-standard and candidate translation


  • Perfect match: $s = 1$
  • Perfect mismatch: $s = 0$

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. […] on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.


  • Weighted geometric mean of “modified n-gram precisions” for $n \in \{1, 2, \ldots, N \}$, multiplied by a “brevity penalty” for translations that are too short
  • Typically use $N = 4$, i.e. use 4-gram as maximum n-gram size
  • “Clip” repeated n-grams, i.e. count each candidate n-gram at most as many times as it appears in any single reference translation

“The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.”

Unigram Precision = % of unigrams in the model’s output sentence that occur in at least one reference translation

Formula [Source]


  • $\hat{y}^x$ be the candidate translation of source sentence $x$, compared against its reference translations $y^{(x, i)}$, $\forall i \in \{1, \ldots, N_x\}$
  • $\hat{S}$ be the set of candidate translations
    • Let $|\hat{S}| = M$ be the number of candidate translations
  • $S$ be the set of reference translations for each candidate translation
    • Let $|S_i| = N_i$ be the number of reference translations for translation $i$
    • e.g. for $\hat{y}^x \in \hat{S}$, there is a set of reference translations $y^{(x, 1)}, …, y^{(x, N_x)} \in S_x$ taken from $N_x$ different reference sources
  • $w_n$ be the weight assigned to the $n$-grams
    • $N$ be max $n$-gram length
  • $r$ be the length of the reference corpus
  • $c$ be the length of the candidate corpus
\[\hat{y}^x := \text{"The yellow dog walked"}\\ y^{(x, i)} := \text{"The yellow dog started to run"}\\ \hat{S} := (\hat{y}^1,...,\hat{y}^M)\\ S := \{ S_1,...,S_M \}\\ S_i := (y^{(i,1)},...,y^{(i,N_i)})\\ w := (w_1, ..., w_N) \text{ , such that } \sum_{i = 1}^{N} w_i = 1 \text{ and } w_i \in [0, 1]\\ p_n(\hat{S}, S) = \frac{ \sum_{i = 1}^M \sum_{s \in G_n(\hat{y}^i)} \min(C(s, \hat{y}^i), \max_{y \in S_i} C(s, y)) }{ \sum_{i = 1}^M \sum_{s \in G_n(\hat{y}^i)} C(s, \hat{y}^i) } \\ BP(\hat{S}, S) = -\max(0, r/c-1)\\ BLEU_w(\hat{S}, S) = \exp\{BP(\hat{S}, S) + \sum_{n = 1}^N w_n \log(p_n(\hat{S}, S)) \}\]

Where… \(C(s, y) = \text{\# of times $s$ appears as substring of $y$}\\ G_n(y) = \text{set of $n$-grams in $y$}\\ \sum_{s \in G_n(\hat{y})} C(s, y) = \text{total \# of times the $n$-grams of $\hat{y}$ appear in $y$}\)
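The formula can be sketched for a single candidate sentence (i.e. $M = 1$) in plain Python. This is a simplified illustration, not a replacement for a standard implementation: it assumes uniform weights $w_n = 1/N$, takes $r$ to be the length of the reference closest in length to the candidate, and returns 0 when any precision is 0. All function names are my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = ngrams(candidate, n)
    clipped = 0
    for gram, count in cand_counts.items():
        max_ref = max(ngrams(ref, n)[gram] for ref in references)
        clipped += min(count, max_ref)  # clip by the max count in any single reference
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                      # log(0) is undefined; score the sentence as 0
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    log_bp = -max(0.0, r / c - 1)       # log of the brevity penalty
    return math.exp(log_bp + sum(math.log(p) / max_n for p in precisions))

cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
p1 = modified_precision(cand, refs, 1)  # "the" is clipped to 2 occurrences out of 7
```

Because sentence-level scores are often 0 for short outputs, BLEU is normally reported at the corpus level, as in the formula above.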

Example [Source]

Screen Shot 2022-08-23 at 9.25.57 PM


  • $m$ = # of words in candidate that are found in the reference set

  • $w_t$ = total # of words in candidate

Unigram Precision $P$: \(P = \frac{m}{w_t} = \frac{7}{7} = 1\)


  • $m_{max}$ = max total count of a word in any of the reference translations

Modified Unigram Precision $P$: \(P = \frac{m_{max}}{w_t} = \frac{2}{7}\)

Screen Shot 2022-08-23 at 9.30.30 PM

\(\begin{align*} \text{Unigram Precision} &= \text{"the", "the", "cat"} = \frac{1 + 1 + 1}{3} = 1\\ \text{Modified Unigram Precision} &= \text{"the", "cat"} = \frac{1 + 1}{3} = \frac{2}{3}\\ \text{Bigram Precision} &= \text{"the the", "the cat"} = \frac{0 + 1}{2} = \frac{1}{2}\\ \text{BLEU} &= \text{weighted geometric mean of above entries} \end{align*}\)

Strengths / Issues of BLEU [Source]


  • Fast and simple


  • Ignores meaning
  • Ignores sentence structure
  • Assumes sentence is already tokenized (so must use same tokenizer to compare different models)
    • Solution: Use SacreBLEU as alternative method

9) Self-Attention

Issues with LSTMs:

  • Unparallelizable (sequential in nature)
  • No explicit modeling of long- and short-range dependencies
  • We don’t use attention to craft our encoder representations

Question: Can we use attention to improve not just our decoder’s output, but also our encoder’s contextualized representations (i.e. embeddings)?


  • Have each word determine how much it should be influenced by its neighbors
  • Preserve positionality

Self-attention allows us to “create great, context-aware representations”

Screen Shot 2022-08-22 at 3.25.59 PM

Screen Shot 2022-08-22 at 3.33.02 PM

Screen Shot 2022-08-22 at 3.49.46 PM

Screen Shot 2022-08-22 at 3.50.09 PM

Screen Shot 2022-08-22 at 4.21.52 PM

Screen Shot 2022-08-22 at 4.26.54 PM

10) Transformers

Self-Attention v. Seq2Seq Attention

Screen Shot 2022-08-22 at 4.31.18 PM

Screen Shot 2022-08-22 at 4.31.48 PM

Screen Shot 2022-08-22 at 4.32.00 PM

Transformer Encoder

Goal: Create good contextualized embedding $r_i$ for each word $x_i$

Screen Shot 2022-08-22 at 4.47.09 PM

Screen Shot 2022-08-22 at 4.47.54 PM

Position Encodings

Under the set-up above, ordering of words is ignored by model.

Need to add positional info to each word’s embedding to preserve positional info.

Screen Shot 2022-08-22 at 4.50.49 PM

Multi-head attention: Have multiple query/key/value matrices $W_q, W_k, W_v$, then concatenate the $z_i$’s they generate.

Screen Shot 2022-08-22 at 4.52.57 PM

Stack multiple transformers together:

Screen Shot 2022-08-22 at 4.54.59 PM

Transformer Decoder

Goal: Generate new sequence of text

Screen Shot 2022-08-22 at 4.56.16 PM

Decoder has two attention layers:

  1. Masked Multi-Head Attention (aka “Self-Attention Head”)
    1. query, key, value = outputs of the previous decoder layer
    2. Each position can only attend to previous words (so mask future words) – preserves auto-regressive LM behavior
  2. Multi-Head Attention (aka “Attention Head”) in between the Self-Attention and FFNN layers
    1. query = output of the previous decoder layer
    2. key, value = outputs of the encoder

Screen Shot 2022-08-22 at 4.58.32 PM

Overview Diagrams

Screen Shot 2022-08-28 at 9.52.01 PM

Screen Shot 2022-08-22 at 5.09.13 PM

Loss function: Cross-entropy of output word

Simplified Encoder Diagram

Screen Shot 2022-09-07 at 10.20.04 PM

Simplified Decoder Diagram

Two key differences from Encoder:

  1. Has attention over Encoder’s outputs
  2. Has masked self-attention which masks out future tokens

Screen Shot 2022-09-07 at 10.20.16 PM

Types of Attention

  1. Encoder-Decoder Attention: Decoder attends to all of encoder’s outputs
  2. Encoder Self-Attention: Encoder attends to all of its inputs
  3. Decoder Masked Self-Attention: Decoder attends to all of its prior outputs

Screen Shot 2022-08-28 at 11.11.14 PM

BERT uses encoder self-attention

GPT-2 only uses decoder masked self-attention

Big-O Analysis


  • $n$ = input sequence length
  • $d$ = length of embedding vector

Screen Shot 2022-08-23 at 7.12.02 PM

Shorter maximum path lengths = stronger learned dependencies between words

11) BERT

Model Categorization

Types of Data


  • Raw text (web pages, books)
  • Parallel corpora (translations)


  • Unstructured
    • N-to-1 (sentiment analysis)
    • N-to-N (POS tagging)
    • N-to-M (summarization)
  • Structured
    • Dependency parse trees
    • Constituency parse trees
    • Semantic role labelling

Types of Learning

Style of Learning

  • Multi-task: Train on multiple tasks
  • Transfer learning: Subset of multi-task learning where we only care about one downstream task
  • Pre-training: Subset of transfer learning where we first focus on one task, then apply it to multiple downstream tasks

Ideally, tasks are closely related

Multi-task is most useful on tasks with limited data.

Type of Data Available

  • Supervised
  • Unsupervised
  • Self-supervised
  • Semi-supervised

BERT (Bidirectional Encoder Representations from Transformers)

Goal: Language model that builds rich representations


Model: Transformer encoders (xN)


  1. Input a “sequence”, which is simply a list of tokens that can represent either 1 or 2 sentences
    1. <SEP> to separate sentences within a sequence
    2. Also add a learned embedding to each token to indicate if it’s in the 1st or 2nd sentence
  2. <CLS> is always the 1st token of each sequence, and represents aggregate sequence representation
  3. Use WordPiece embeddings with a 30k token vocab


  1. Masked language modeling (“MLM”), i.e. predict masked word
  2. Next-sentence prediction (“NSP”), i.e. predict if two sentences are next to each other


  1. BooksCorpus
  2. Wikipedia

Training Objectives

  1. Predict a masked word (similar to CBOW)

    1. 15% of input words randomly masked
      1. 80% => [MASK]
      2. 10% => keep the original word unchanged
      3. 10% => replace with a random (deliberately wrong) word

    Screen Shot 2022-08-28 at 10.10.05 PM

  2. Given two sentences, predict if the second follows the first

    1. Start the first sentence with a <CLS> token
    2. Separate each sentence with <SEP> token
    3. 50% of the time, 2nd sentence actually follows 1st sentence; 50% of the time, 2nd sentence is randomly sampled from corpus
    4. Make prediction based on embedding of <CLS> token

    Screen Shot 2022-08-28 at 10.10.46 PM
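The 15% selection and 80/10/10 replacement rule from the first objective can be sketched as below. This is a simplified illustration on whole-word tokens in plain Python; the function name, the `None` label for unselected positions, and the toy vocabulary are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (model inputs, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            inputs.append(tok)      # not selected: passed through, no loss on it
            labels.append(None)
            continue
        labels.append(tok)          # selected: the model must recover this word
        roll = rng.random()
        if roll < 0.8:
            inputs.append("[MASK]")             # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(tok)                  # 10%: keep the original word
        else:
            inputs.append(rng.choice(vocab))    # 10%: replace with a random word
    return inputs, labels

tokens = ["the", "cat", "sat"] * 100
inputs, labels = mask_tokens(tokens, vocab=["dog", "ran", "mat"])
```

Keeping or randomizing some selected words means the model cannot assume that every non-`[MASK]` token is correct.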


Input tokens are represented as the sum of three embeddings:

  1. Token embeddings – Taken from WordPiece, which is a “sub-word tokenization learns to merge and use characters based on which pairs maximize the likelihood of the training data if added to the vocab”
  2. Segment Embeddings – Indicates if token is from the 1st or 2nd sentence
  3. Position Embeddings – Indicates token’s position in overall sequence

Screen Shot 2022-08-28 at 10.13.20 PM


Concatenate last 4 hidden layers to serve as contextualized embedding of input sentence

Screen Shot 2022-08-23 at 11.41.39 PM


Screen Shot 2022-08-28 at 11.07.47 PM


Takeaway: BERT learns SoTA contextualized embeddings, great for downstream tasks (e.g. classification)

Limitation: Can’t generate new sentences b/c no decoder.



  • ALBERT - A Lite BERT
  • RoBERTa - Robustly Optimized BERT
  • DistilBERT - Small BERT
  • ELECTRA - Pre-training Text Encoders as Discriminators not Generators
  • Longformer - Long-Document Transformer


  • XLNet
  • GPT - Generative Pre-Training
  • CTRL - Conditional Transformer LM for Controllable Generation
  • Reformer

12) GPT-2

Goal: Generate a new output sequence

Idea: Use only Decoders (no Encoders); use only self-attention (no encoder-decoder attention)

Differences with BERT

  • BERT uses only encoders, GPT-2 only uses decoders
  • BERT uses unmasked (bidirectional) self-attention, GPT-2 only uses masked self-attention
  • GPT-2 is autoregressive – BERT is not
    • Autoregression = use previous outputs as model inputs

Screen Shot 2022-08-28 at 11.09.13 PM

It only uses masked self-attention (mask out future words)

Screen Shot 2022-08-28 at 11.10.46 PM

Screen Shot 2022-09-27 at 11.52.13 AM


  • Uses BytePair Encodings (i.e. sub-words) instead of words, similar to BERT’s WordPieces

Screen Shot 2022-08-28 at 11.14.10 PM

BytePair Encodings (BPE)

Look at individual characters and repeatedly merge most frequent pairs (similar to agglomerative clustering)

Stops after $N$ merges. GPT uses $N = 40k$
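The merge loop can be sketched as follows. This is a minimal character-level version in plain Python that ignores word-boundary markers and byte-level details; the function name and the toy word list are assumptions.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merges; returns the merged symbol pairs in order."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq           # count adjacent symbol pairs
        if not pairs:
            break                             # every word is already a single symbol
        best = max(pairs, key=pairs.get)      # most frequent pair (first one wins ties)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

On this toy input the first two merges build "lo" and then "low", so the shared stem becomes a single sub-word unit.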

Screen Shot 2022-09-07 at 10.27.15 PM

All 1,024 positions in the input are given a unique positional encoding

Screen Shot 2022-09-07 at 10.28.00 PM

Masked Attention

Top row of Scores is how much the word “robot” should attend to each of the other words (“robot” in 1st column, “must” in 2nd column, “obey” in 3rd column, “orders” in 4th column).

We mask out the words “must obey orders” b/c they happen in the future.

Thus, the softmax’d attention for “robot” is all on “robot”

Screen Shot 2022-09-01 at 11.40.39 AM

Screen Shot 2022-09-01 at 11.41.03 AM

Screen Shot 2022-09-01 at 11.41.19 AM

Forward Pass

  • Each decoder has its own weights ($W_q, W_k, W_v$ )
  • But the entire model shares one token embedding matrix and one positional encoding matrix

Screen Shot 2022-10-13 at 2.17.26 PM

Sampling Words

Tune “Top-K” parameter to have GPT-2 consider $K$ most probable words
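Top-K sampling can be sketched as: keep the $K$ highest-scoring words, renormalize, and sample among them. The plain-Python sketch below works on a list of logits; the function name and example logits are assumptions.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample an index from the k highest logits, weighted by their softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # unnormalized softmax over the top k
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [5.0, 4.0, 1.0, 0.0]  # hypothetical next-word logits
samples = {top_k_sample(logits, 2, rng) for _ in range(200)}
```

Setting `k=1` recovers greedy decoding, while a large `k` approaches sampling from the full distribution.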


Screen Shot 2022-09-27 at 11.57.11 AM

Downstream Tasks

Take the learned embeddings from the last layer and add a FFNN to the end of GPT to do downstream tasks:

Screen Shot 2022-09-27 at 11.58.40 AM

Machine Translation

Screen Shot 2022-10-13 at 2.21.04 PM


Screen Shot 2022-10-13 at 2.21.50 PM

Training data:

Screen Shot 2022-10-13 at 2.22.04 PM

Music Generation

Map musical notes to one-hot vectors, treat musical piece as a sentence of notes, pass through decoder.

Screen Shot 2022-10-13 at 2.26.59 PM

Screen Shot 2022-10-13 at 2.26.47 PM

Screen Shot 2022-10-13 at 2.25.58 PM

Model Scales

| Model | Dataset | Architecture | # Params |
|---|---|---|---|
| BERT-Base | BooksCorpus (800M words) + English Wikipedia (2.5B words) | 12 transformer blocks + 12 attention heads | 110M |
| BERT-Large | BooksCorpus (800M words) + English Wikipedia (2.5B words) | 24 transformer blocks + 16 attention heads | 340M |
| GPT-2 | 40GB text data | 12-48 decoders | 1.5B |
| GPT-3 | | | 175B |


  • $2.5k - $50k = 110M params
  • $10k - $200k = 340M params
  • $80k - $1.6M = 1.5B params

Screen Shot 2022-09-27 at 12.10.31 PM

14) Summarization

Types of Input

  • Single-doc = Given a single document, produce summary
  • Multi-doc = Given multiple documents, produce summary

Types of Output

  • Extractive = Select spans from source text that capture key info
  • Abstractive = Generate new text to summarize key info

Types of Focus

  • Generic = Summarize content of docs
  • Query-focused = Summarize with respect to a user’s query (e.g. answer a question by summarizing a doc that has info to construct the answer)


These are the most commonly used datasets for summarization NLP tasks:

Screen Shot 2022-09-27 at 12.48.55 PM


ROUGE-N = (“Recall Oriented Understudy for Gisting Evaluation with $N$ Grams”)

  • $n$-gram based comparison motivated by BLEU
  • Match machine-generated summary $X$ against $h$ human reference summaries, count total number of $n$-gram overlaps between $X$ and the $h$ reference summaries
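The overlap count can be sketched as a recall: matched $n$-grams divided by the total number of $n$-grams in the references. The plain-Python sketch below is a minimal illustration; the function name and example sentences are assumptions.

```python
from collections import Counter

def rouge_n(candidate, references, n):
    """Recall of reference n-grams that also appear in the candidate summary."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = grams(candidate)
    overlap = total = 0
    for ref in references:
        ref_counts = grams(ref)
        # each reference n-gram is matched at most as often as it occurs in the candidate
        overlap += sum(min(count, cand[g]) for g, count in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0

summary = "the cat sat".split()
reference = "the cat sat on the mat".split()
r1 = rouge_n(summary, [reference], 1)
r2 = rouge_n(summary, [reference], 2)
```

Unlike BLEU's precision, the denominator here comes from the references, so short summaries that omit content are penalized.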

Has very high correlation with human evaluation

Traditional Methods


  1. Content selection - choose sentences to extract
    1. Choose sentences with “salient/informative” words
      1. tf-idf - weight each word’s informativeness inverse to occurrence
      2. topic signature - choose a small set of salient words that appear in query
    2. Weight sentence by average weight of its words
  2. Information ordering - order extracted sentences
  3. Sentence realization - clean up ordered sentences into a summary

15) Entity Linking (Named Entity Disambiguation)

Task: Identify all named mentions (not nominal mentions) of an entity, and disambiguate them by linking them to nodes in an external knowledge graph (KG)

Two Stage Process:

  1. Identify mention in text
  2. Link mentions to entities

Discourse and Pragmatics

Pragmatics is a branch of linguistics dealing with language use in context (non-local meaning phenomena)

Screen Shot 2022-10-13 at 2.29.52 PM


  • TAC-KBP 2010
    • 2k annotated mention/entity pairs
    • Linked to TAC Reference Knowledgebase w/ 818k entities
    • 26k annotated mention/entity pairs


  • Disambiguation-only
    • Micro-precision = % of correctly disambiguated entities in full corpus
    • Macro-precision = % of correctly disambiguated entities, averaged by doc
  • End-to-end approaches
    • Micro-F1 =
    • Macro-F1
  • Recall @ N
  • Accuracy


Link popularity

  1. Build dictionary of name variants for each entity
  2. Inspect all KB entities that have a name variant which matches the query mention
  3. Choose the entity w/ highest # of inlinks with query



End-to-End DL Approach


16) Coreference Resolution

This section was taken directly from: https://web.stanford.edu/~jurafsky/slp3/21.pdf

Task: Determine which words refer to the same real-world entity (basically a clustering task)

Evaluation: Given text $T$, find all entities + coreference links between them. Compare our graph to a gold-standard human-annotated graph for $T$.

  • Lack of singletons in evaluation set makes task easier, b/c singletons are harder to detect
  1. Identify mentions of entities
  2. Cluster discourse entities into sets of coreferring expressions (aka “coreference chains” or “clusters”)
  3. Link discourse entities to real-world entities via ontologies

In below excerpt, superscripts corefer to the same entity

Screen Shot 2022-11-08 at 10.15.03 PM


Anaphor = expression that references a previously mentioned entity

Antecedent = entity that anaphor references

Singleton = entity with only one mention (i.e. no antecedent)

Linguistic Background

Types of referring expressions:

  1. Indefinite Noun Phrases (NPs)
    1. “a XXXX”
    2. Introduces a new entity to the hearer
  2. Definite NPs
    1. “the XXXX”
    2. References a previously mentioned entity or entity already known to the hearer (e.g. “the USA”)
  3. Pronouns
    1. Personal: “he/she/it/they”
    2. Demonstrative: “this/that/these/those”
  4. Names
    1. IBM/John Smith/New York

Information status of referring expressions:

  1. New NPs
    1. Introduce new entities into discourse
  2. Old NPs (“evoked NPs”)
    1. Entities already in discourse
  3. Inferrables
    1. Can be inferred from prior in the conversation to exist via a “bridging inference”
    2. e.g. “I went to a restaurant yesterday. The chef had just opened it.”

Non-referring expressions:

  1. Appositives
    1. Sometimes counted as coreferential (e.g. in OntoNotes) even though they describe the head NP rather than corefer with it
    2. e.g. “Victoria, CFO of Megabucks, saw that…”
    3. e.g. “United, a unit of UAL, matched the fares…”
  2. Predicative and Prenominal NPs
    1. Describe properties of a head entity, rather than referring to that distinct entity
    2. e.g. “United is a unit of UAL”
    3. e.g. “her pay jumped to $2.3 million”
  3. Expletives
    1. Pronouns that don’t refer to anything
    2. e.g. “It was Emma who founded the company.”
    3. e.g. “We hit it off”
  4. Generics
    1. Expression that refers to a class in general rather than to a specific entity explicitly evoked in the text
    2. e.g. “I love mangoes. They are tasty.”

Properties of coreference relationship:

  1. Number agreement
    1. Referring expressions and referents must usually agree in number (e.g. “she” v. “they”)
  2. Person agreement
    1. 1st/2nd/3rd person – e.g. “I” v. “you” v. “he”
  3. Gender or noun class agreement
    1. e.g. “John is great. He is awesome.”
  4. Binding theory constraints
    1. e.g. “Jane bought herself a bottle of sauce.”
  5. Recency
    1. e.g. “The doctor found an old map in the captain’s chest. Jim found an even older map hidden on the shelf. It described an island.”, where “it” refers to Jim’s map
  6. Grammatical role
    1. Entities in subject position are more salient than those in object position, which are in turn more salient than oblique references
  7. Verb semantics
    1. e.g. “John called Bill. He had lost the laptop”. He refers to John
    2. e.g. “John criticized Bill. He had lost the laptop.” He refers to Bill


  1. Mention detection
    1. Emphasizes recall
    2. Run parser that identifies every NP, pronoun, or named entity
  2. Run anaphoricity detector on parsed entities
    1. Only keep anaphoric mentions


Two approaches:

  1. Entity-based – represent each entity in discourse model
  2. Mention-based – consider each mention independently

Mention-Pair Architecture

  • Input: (anaphor, antecedent)
  • Output: 1 if coreferring, 0 else

Screen Shot 2022-11-08 at 11.08.31 PM

  • Training sample selection strategy:
    • Choose the closest antecedent of each anaphor as the (+) example
    • Pair the anaphor with every mention between it and that antecedent as (-) examples
      • Avoids flooding the training set with (-) examples
  • Evaluation on test set:
    • For each mention $i$ in document, consider each of prior $i-1$ mentions
    • Closest-first clustering – run classifier from $i-1$ to 1, and first antecedent with prob > 0.5 is linked to $i$
    • Best-first – run classifier from $i-1$ to 1, antecedent with highest overall prob is linked to $i$
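
The two test-time clustering strategies can be sketched as below. This is a minimal sketch: `prob(i, j)` stands in for the trained pairwise classifier, the `PAIR_PROB` values are invented, and applying the 0.5 threshold to best-first as well is an assumption (the notes only state it for closest-first).

```python
def closest_first(i, prob, threshold=0.5):
    """Scan antecedents from i-1 down to 1; link the first with prob > threshold."""
    for j in range(i - 1, 0, -1):
        if prob(i, j) > threshold:
            return j
    return None  # no antecedent found: mention i starts a new cluster

def best_first(i, prob, threshold=0.5):
    """Score all prior mentions; link the most probable one if it clears the threshold."""
    candidates = [(prob(i, j), j) for j in range(i - 1, 0, -1)]
    if not candidates:
        return None
    best_prob, best_j = max(candidates)
    return best_j if best_prob > threshold else None

# Hypothetical classifier outputs for mention 4 against mentions 1..3.
PAIR_PROB = {(4, 3): 0.6, (4, 2): 0.2, (4, 1): 0.9}

def toy_prob(i, j):
    return PAIR_PROB.get((i, j), 0.0)

print(closest_first(4, toy_prob))  # 3 (first mention, scanning backwards, above 0.5)
print(best_first(4, toy_prob))     # 1 (highest probability overall)
```

The example shows how the two strategies can disagree: closest-first stops at mention 3, while best-first prefers the more distant but more probable mention 1.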


  1. Doesn’t directly compare candidate antecedents
  2. Ignores discourse model
  3. Only considers local pairwise info

Mention-Rank Architecture

Idea: Directly compare candidate antecedents

For the $i$th mention (anaphor), we have a rv $y_i \in \{1, \dots, i-1, \epsilon\}$ where $\epsilon$ means there is no antecedent

Screen Shot 2022-11-08 at 11.18.12 PM

  • Training sampling strategy:
    • A mention can have several legal gold antecedents. Rather than choosing one of them to train on, sum the probability assigned to all legal antecedents
  • Evaluation on test set:
    • Compute one softmax over all antecedents (and $\epsilon$)
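
A minimal sketch of the softmax over antecedents and the summed-probability training loss. Assumptions: the scores are invented, the `"eps"` key is a stand-in for $\epsilon$, and fixing the $\epsilon$ score at 0 follows the convention stated in the e2e-coref section of these notes.

```python
import math

EPSILON = "eps"  # stands for the "no antecedent" option

def antecedent_distribution(scores):
    """One softmax over all candidate antecedents plus EPSILON (score fixed at 0)."""
    all_scores = dict(scores)
    all_scores[EPSILON] = 0.0
    max_score = max(all_scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - max_score) for k, v in all_scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def marginal_nll(scores, gold_antecedents):
    """Negative log of the total probability given to all legal gold antecedents."""
    probs = antecedent_distribution(scores)
    legal = gold_antecedents or {EPSILON}  # no gold antecedent -> epsilon is correct
    return -math.log(sum(probs[a] for a in legal))

# Hypothetical scores of mentions 1..3 as antecedents of mention 4.
scores = {1: 2.0, 2: -1.0, 3: 0.5}
probs = antecedent_distribution(scores)
print(max(probs, key=probs.get))       # 1
print(marginal_nll(scores, {1, 3}))    # loss when both 1 and 3 are legal antecedents
```

Summing over all legal antecedents means the model is not penalized for picking any one of the correct candidates.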

Entity-based Models

Idea: Instead of linking mentions to previous mentions, link them to previous entities (i.e. clusters of mentions)

  • Can turn a mention-ranking model -> entity-ranking model by having the classifier make decisions over clusters of mentions rather than individual mentions


Screen Shot 2022-11-08 at 11.24.50 PM

e2e-coref Model

  • Mention-ranking algorithm

  • Given document $D$ with $T$ words, considers all $n$-grams up to size $n \le 10$

  • Task: Assign each span $i$ an antecedent $y_i \in \{1, \dots, i - 1, \epsilon\}$.

  • For each pair of spans $i,j$, assign a score $s(i,j)$ for the coreference link between $i$ and $j$.

    • $P(y_i) = \text{softmax}(s(i, y_i))$, normalized over all candidate antecedents of $i$ (including $\epsilon$)

    • where

      • $s(i,j) = m(i) + m(j) + c(i,j)$

      • $m(i)$ = score for whether span $i$ is a mention

      • $m(j)$ = score for whether span $j$ is a mention

      • $c(i,j)$ = score for whether $j$ is the antecedent of $i$

      • and $s(i, \epsilon) = 0$ is fixed
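
A minimal sketch of how the pair score is assembled and used at prediction time. Assumptions: $m$ and $c$ are treated as real-valued scores produced by learned scorers, and the numbers in `m` and `c` are invented. Because softmax is monotonic, the most probable antecedent is simply the one with the highest $s(i,j)$, and $\epsilon$ wins whenever every candidate scores below 0.

```python
def pair_score(i, j, m, c):
    """s(i, j) = m(i) + m(j) + c(i, j)."""
    return m[i] + m[j] + c[(i, j)]

def predict_antecedent(i, m, c):
    """Return the highest-scoring antecedent of span i, or None for epsilon."""
    best_j, best_score = None, 0.0  # s(i, epsilon) = 0 is the baseline to beat
    for j in range(1, i):
        score = pair_score(i, j, m, c)
        if score > best_score:
            best_j, best_score = j, score
    return best_j

# Hypothetical mention scores and antecedent scores for three spans.
m = {1: 1.5, 2: -2.0, 3: 0.8}
c = {(2, 1): -3.0, (3, 1): 0.4, (3, 2): 1.0}

print(predict_antecedent(3, m, c))  # 1
print(predict_antecedent(2, m, c))  # None (epsilon: no antecedent)
```

Span 2 has a low mention score, so it is unlikely to be chosen as an antecedent even though $c(3,2)$ is the largest pairwise term.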

We define:

Screen Shot 2022-11-08 at 11.39.21 PM

To generate span representations $g_i$:

  • Run each paragraph through BERT to generate embedding $h_i$ for each token $i$
  • Define $h_{att}$ as the likely head-word of the span, $h_{start/end}$ as the start and end word of the span
  • Each span $g_i$ is:
    • $g_i = [ h_{start}, h_{end}, h_{att} ]$
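
A minimal sketch of building $g_i$ from token embeddings. Assumptions: `h` is a plain list of token vectors standing in for BERT outputs, and $h_{att}$ is computed as an attention-weighted average of the tokens in the span, with `head_scores` standing in for learned per-token attention scores. The notes only describe $h_{att}$ as "the likely head-word", so this weighting is an interpretation.

```python
import math

def span_representation(h, start, end, head_scores):
    """g = [h_start, h_end, h_att] for the span covering tokens start..end (inclusive)."""
    span = range(start, end + 1)
    # Softmax over the head scores of the tokens inside the span.
    max_score = max(head_scores[t] for t in span)
    weights = [math.exp(head_scores[t] - max_score) for t in span]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Attention-weighted average of the token embeddings in the span.
    dim = len(h[0])
    h_att = [sum(w * h[t][d] for w, t in zip(weights, span)) for d in range(dim)]
    return h[start] + h[end] + h_att  # list concatenation

# Hypothetical 2-d embeddings for a 3-token document.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
g = span_representation(h, 0, 1, head_scores=[0.0, 0.0, 0.0])
print(g)
```

With equal head scores, $h_{att}$ reduces to the mean of the span's token embeddings; a sharply peaked score makes it approach the embedding of a single head word.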

Screen Shot 2022-11-08 at 11.38.43 PM

Screen Shot 2022-11-08 at 11.41.18 PM

Screen Shot 2022-11-08 at 11.42.08 PM


  • OntoNotes
    • Hand-annotated Chinese + English of ~1M words each + 300k words of Arabic newswire
    • No labels for singletons (which are ~70% of all entities)
  • ISNotes - portion of OntoNotes annotated for info status
  • LitBank = 210k tokens from 100 novels (includes singletons)
  • ARRAU = 350k English words (includes singletons)
    • Diverse genre of content
  • ECB+ = 982 short documents on “event coreference”
    • Example: Screen Shot 2022-11-08 at 10.09.59 PM
  • Winograd
    • Example: Screen Shot 2022-11-08 at 11.58.58 PM
    • Example: Screen Shot 2022-11-08 at 10.08.51 PM
  • WinoBias
    • Example: Screen Shot 2022-11-09 at 12.01.12 AM

Evaluation + Metrics

Model outputs series of clusters $H$ v. gold standard set of clusters $R$

  1. MUC F-Measure
    1. Link-based
    2. Based on # of coreference links (i.e. pairs of mentions) common to $H$ and $R$
    3. Precision = # of common links / # of links in $H$
    4. Recall = # of common links / number of links in $R$
    5. CONs
      1. Biased toward models that produce large chains
      2. Ignores singletons
  2. $B^3$
    1. Mention-based
    2. Given mention $i$, let $H_i$ be the hypothesis chain containing $i$ and $R_i$ the gold chain containing $i$. The set of correct mentions in $H_i$ is $H_i \cap R_i$
      1. Precision = $\frac{ |H_i \cap R_i| }{ |H_i| }$
      2. Recall = $\frac{ |H_i \cap R_i| }{ |R_i| }$
    3. Total precision/recall = weighted sum of the per-mention precision/recall across all mentions in $R$
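
A minimal sketch of the $B^3$ computation. Assumptions: every mention gets uniform weight, mentions missing from the hypothesis are treated as singletons, and the example clusters are invented.

```python
def b_cubed(hyp_clusters, gold_clusters):
    """Return (precision, recall) under B^3 with uniform mention weights."""
    hyp_of = {m: set(cluster) for cluster in hyp_clusters for m in cluster}
    gold_of = {m: set(cluster) for cluster in gold_clusters for m in cluster}
    mentions = list(gold_of)
    precision = recall = 0.0
    for m in mentions:
        h_i = hyp_of.get(m, {m})  # assumption: unclustered mentions are singletons
        r_i = gold_of[m]
        common = len(h_i & r_i)   # correct mentions in the hypothesis chain
        precision += common / len(h_i)
        recall += common / len(r_i)
    n = len(mentions)
    return precision / n, recall / n

gold = [{"a", "b", "c"}, {"d", "e"}]
hyp = [{"a", "b"}, {"c", "d", "e"}]
print(b_cubed(hyp, gold))  # both values are 11/15, roughly 0.733
```

In this example mention "c" is placed in the wrong chain, which lowers precision for "c", "d", and "e" and recall for "a", "b", and "c".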

17) Common Sense


Screen Shot 2022-10-18 at 8.37.34 PM

Screen Shot 2022-10-18 at 8.39.02 PM

Screen Shot 2022-10-18 at 8.39.19 PM

Screen Shot 2022-10-18 at 8.39.26 PM

Screen Shot 2022-10-18 at 8.39.33 PM

Knowledge Bases

Screen Shot 2022-10-18 at 8.42.31 PM

Screen Shot 2022-10-18 at 8.43.16 PM

Screen Shot 2022-10-18 at 8.43.36 PM


COMET = Common Sense Transformer model

  • Generate commonsense knowledge for any input concept using a language model
  • Input: (Head entity, relation)
  • Output: (Target entity)

Screen Shot 2022-10-18 at 8.50.18 PM

18) Adversarial NLP

Textual Entailment = Task of predicting whether facts of Sentence 1 necessarily imply facts of Sentence 2

Screen Shot 2022-10-18 at 8.52.37 PM

Threat Model

Let $(x,y)$ be (input, output) and $x'$ be an altered version of $x$ which yields $y'$

A successful attack minimizes $\| x - x' \|$ while maximizing $\| y - y' \|$ to get $class(y) \ne class(y')$


How can we change text while preserving its meaning?

Word-Level Substitutions (aka lexical)

Substitute synonyms at the word level to preserve sentence meaning.

  • Embeddings - Search for nearest-neighbor in embedding space
  • Thesaurus - Lookup word in thesaurus, WordNet, PPDB
  • Hybrid - Search for nearest-neighbors in “counter-fitted” embedding space
    • Inject antonymy + synonymy constraints into vector space representations
    • Separates conceptual association from semantic similarity
    • Screen Shot 2022-10-18 at 9.10.49 PM
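
A minimal sketch of the embedding-based lookup, using cosine similarity for the nearest-neighbor search. The 2-d vectors in `EMBEDDINGS` are invented for illustration. They are chosen so that an antonym ("bad") lands near "good", which is the failure mode that counter-fitted embeddings are meant to address.

```python
import math

# Hypothetical 2-d embeddings, invented for illustration only.
EMBEDDINGS = {
    "good": [0.9, 0.1],
    "great": [0.85, 0.2],
    "bad": [0.8, -0.3],   # antonym that still sits close to "good"
    "table": [-0.5, 0.7],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, embeddings, k=2):
    """Return the k words most similar to `word`, excluding the word itself."""
    query = embeddings[word]
    others = [(cosine(query, vec), w) for w, vec in embeddings.items() if w != word]
    return [w for _, w in sorted(others, reverse=True)[:k]]

print(nearest_neighbors("good", EMBEDDINGS))  # ['great', 'bad']
```

Swapping "good" for "bad" would change the meaning of the sentence, so a plain nearest-neighbor substitution is not guaranteed to preserve semantics.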

Sentence-Level Substitutions (aka syntactic)

Goal: Get adversarial inputs that are grammatical, preserve input semantics, have minimal lexical substitution, and high syntactic diversity

  • Cosine similarity btwn $x$ and $x'$ sentence embeddings
  • Substitute phrases (PPDB)
  • Machine translation

Screen Shot 2022-10-18 at 9.12.25 PM

Screen Shot 2022-10-18 at 9.16.23 PM

Screen Shot 2022-10-18 at 9.16.34 PM

“TextAttack” Framework

Checklist for adversarial attacks

  1. Goal function - determines whether attack is successful
    1. e.g. minimum BLEU score
  2. Constraints - determine if perturbation $x'$ is valid w/ respect to original input $x$
    1. e.g. max word embedding distance
  3. Transformation - generates perturbation $x'$ from $x$
    1. e.g. thesaurus word swap
  4. Search method - select promising $x'$ by querying model
    1. e.g. beam search
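
The four components of the checklist can be sketched in plain Python. This is not the TextAttack API: the toy classifier, the thesaurus, and all function names below are hypothetical, and the search is an exhaustive pass over single-word swaps rather than beam search.

```python
# Toy victim model (hypothetical): predicts "pos" if it sees a known positive word.
POSITIVE_WORDS = {"good", "great"}

def classify(words):
    return "pos" if any(w in POSITIVE_WORDS for w in words) else "neg"

# Hypothetical thesaurus used by the transformation step.
THESAURUS = {"good": ["fine", "decent"], "movie": ["film"]}

def goal_reached(original_label, words):
    """Goal function: the attack succeeds once the predicted label changes."""
    return classify(words) != original_label

def satisfies_constraints(original, perturbed, max_swaps=1):
    """Constraint: same length, and at most max_swaps words differ."""
    swaps = sum(a != b for a, b in zip(original, perturbed))
    return len(original) == len(perturbed) and swaps <= max_swaps

def transformations(words):
    """Transformation: yield every single-word thesaurus swap."""
    for pos, word in enumerate(words):
        for synonym in THESAURUS.get(word, []):
            yield words[:pos] + [synonym] + words[pos + 1:]

def attack(words):
    """Search method: try each candidate, querying the model until one succeeds."""
    original_label = classify(words)
    for candidate in transformations(words):
        if satisfies_constraints(words, candidate) and goal_reached(original_label, candidate):
            return candidate
    return None  # no valid perturbation flips the prediction

print(attack("a good movie".split()))  # ['a', 'fine', 'movie']
print(attack("a bad movie".split()))   # None
```

Keeping the four pieces as separate functions mirrors the checklist: any one of them can be swapped (e.g. a different constraint or search method) without touching the others.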

Screen Shot 2022-10-18 at 9.20.39 PM

19) Debugging


  • Is loss function going down?
  • Is loss going to 0 after running for long enough?
  • Is loss going to 0 using a small training set?

Deliberately try to overfit a small training set. If the model can’t drive the loss toward 0, there is likely a bug.
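
A minimal sketch of the overfitting sanity check, using a tiny logistic regression written from scratch. The four-example dataset, the learning rate, and the 0.05 loss threshold are all illustrative assumptions; for a real model the same check would be run on a handful of training batches.

```python
import math

def train_logistic(data, steps=2000, lr=0.5):
    """Full-batch gradient descent on binary cross-entropy; returns the loss per step."""
    w, b = [0.0] * len(data[0][0]), 0.0
    losses = []
    n = len(data)
    for _ in range(steps):
        grad_w, grad_b, loss = [0.0] * len(w), 0.0, 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            # Small constant guards against log(0).
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for d, xi in enumerate(x):
                grad_w[d] += (p - y) * xi
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
        losses.append(loss / n)
    return losses

# Tiny, linearly separable training set: the label equals the first feature.
tiny = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
losses = train_logistic(tiny)

# Sanity checks: the loss should go down and approach 0 on a set this small.
assert losses[-1] < losses[0], "loss is not decreasing"
assert losses[-1] < 0.05, "model cannot overfit 4 examples: suspect a bug"
```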

Model Size

Try increasing model size.

Larger models learn with fewer steps

Screen Shot 2022-10-18 at 9.53.56 PM


  • Optimizer - Adam or SGD?
  • Learning rate - Standard or decay?
  • Initialization - Uniform or Glorot?
  • Minibatch - Large enough batch?

Longer $K$ should yield better performance

Early Stopping

Early stop on the evaluation metric not the loss b/c the two aren’t necessarily correlated
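
A minimal sketch of patience-based early stopping on a metric where higher is better. The `patience` value and the validation numbers are invented; they are chosen so that stopping on the loss and stopping on the metric pick different epochs.

```python
def early_stopping_epoch(metric_per_epoch, patience=2):
    """Return the best epoch seen before `patience` epochs pass without improvement."""
    best_epoch, best_metric = None, float("-inf")
    for epoch, metric in enumerate(metric_per_epoch):
        if metric > best_metric:
            best_epoch, best_metric = epoch, metric
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation curves from the same training run.
val_loss = [0.90, 0.70, 0.72, 0.75, 0.80]
val_f1 = [0.60, 0.65, 0.70, 0.72, 0.71]

print(early_stopping_epoch(val_f1))                  # 3
print(early_stopping_epoch([-l for l in val_loss]))  # 1 (loss is negated so higher is better)
```

Here the loss starts rising after epoch 1 while F1 keeps improving until epoch 3, so stopping on the loss would discard two epochs of useful training.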

Screen Shot 2022-10-18 at 9.57.03 PM

Common NLP Types of Errors

Screen Shot 2022-10-18 at 9.59.31 PM

Interpretable Methods

  • LIME
  • Attention


  • BERTology - access all hidden states of BERT
  • AllenNLP Interpret


Goal: Understand + interpret the language features that the model is encoding in its embeddings

Idea: Freeze BERT’s weights, extract an input’s representation, and feed that representation into a classifier that predicts some linguistic property of the input. If the classifier is able to predict it, then that property must be encoded somewhere in the representation.
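
A minimal sketch of a probing classifier. Assumptions: the `FROZEN` lookup table stands in for a frozen BERT encoder, its vectors are invented, the probed property is "is this noun plural?", and the probe is a simple perceptron. Only the probe's `w` and `b` are trained; the representations never change.

```python
# Hypothetical frozen "encoder": fixed vectors standing in for BERT outputs.
FROZEN = {
    "cat": [0.2, 1.0], "cats": [0.9, 1.1],
    "dog": [0.1, 0.8], "dogs": [1.0, 0.7],
    "tree": [0.3, 0.2], "trees": [0.8, 0.3],
}

def encode(word):
    return FROZEN[word]  # never updated during probe training

def predict(w, b, word):
    x = encode(word)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_probe(examples, epochs=20, lr=0.1):
    """Perceptron probe over frozen features; only w and b are learned."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for word, label in examples:
            error = label - predict(w, b, word)
            if error != 0:
                x = encode(word)
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
    return w, b

def probe_accuracy(w, b, examples):
    return sum(predict(w, b, word) == label for word, label in examples) / len(examples)

train = [("cat", 0), ("cats", 1), ("dog", 0), ("dogs", 1)]
held_out = [("tree", 0), ("trees", 1)]

w, b = train_probe(train)
print(probe_accuracy(w, b, held_out))  # 1.0
```

High held-out accuracy suggests the property is linearly recoverable from the frozen representation. It does not by itself show that the probe is not simply memorizing, which is what the selectivity criterion is meant to address.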

Want “high selectivity” (Hewitt et al. 2019)

Screen Shot 2022-10-18 at 10.14.41 PM


Screen Shot 2022-10-18 at 10.16.37 PM

Screen Shot 2022-10-18 at 10.15.29 PM

Screen Shot 2022-10-18 at 10.15.54 PM

Screen Shot 2022-10-18 at 10.18.12 PM

Screen Shot 2022-10-18 at 10.19.10 PM