## 0. Environment Setup
- Check whether Python 3 is installed: `python3 --version`
    - If it is not, install it from https://www.python.org/downloads/
- Create a directory for this class. From that directory, create and activate a virtual environment:
    - `python3 -m venv responsible-ml-env`
    - `source responsible-ml-env/bin/activate`
- Install data science / ML packages: 
    - `pip3 install jupyterlab pandas scikit-learn torch transformers`
- Download this notebook from the course website and place it in the class directory. Launch it with:
    - `python3 -m jupyter lab`

## 1. Data

**Background**: Google/Jigsaw's Conversation AI team built the *Perspective API*, a publicly accessible toxicity detection model for text. In 2018, they sponsored a Kaggle competition to incentivize the creation of better models for predicting specific types of toxicity (e.g. obscenity vs. identity-based hate). The data consists of Wikipedia comments labeled by human raters for toxic behavior.

*Content Warning: the dataset for this competition contains text that may be considered profane, vulgar, or offensive.*

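Download the data from [Kaggle](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data) and unzip it into the same directory as this notebook. The code below expects to find `jigsaw-toxic-comment-classification-challenge/train.csv.zip` there.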

## 2. Exploratory Data Analysis

In [None]:
import pandas as pd
data = pd.read_csv("jigsaw-toxic-comment-classification-challenge/train.csv.zip")

In [None]:
# Let's take a look at some random example data points
data.sample(5, random_state=0)

**Question 1**: What is our input variable?

**Question 2**: What are our outcome variables? 

**Question 3**: What kind of outcome variables do we have (continuous, ordinal, binary)?

In [None]:
# How many positive examples of each outcome variable do we have?
outcomes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
data[outcomes].sum()

In [None]:
# What fraction of examples is positive for each outcome variable?
data[outcomes].mean()

**Question 4**: What are some ethically relevant things we don't know about this dataset?

## 3. Evaluation Part 1

In [None]:
from sklearn.model_selection import train_test_split

# We're taking a small sample of the data so neural network training and prediction don't take too long
# We're focusing on one outcome variable: toxic
sample = data.sample(2000, random_state=0) 
X_train, X_val, y_train, y_val = train_test_split(
    sample["comment_text"], 
    sample["toxic"], 
    test_size=0.5,
    random_state=0,
    stratify=sample["toxic"]
)
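
The `stratify` argument asks for the share of toxic comments to be about the same in both subsets. The idea can be sketched in plain Python as below. This is a simplified illustration, not scikit-learn's implementation; the helper `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.5, seed=0):
    """Split indices so each class keeps (about) the same share in both parts."""
    rng = random.Random(seed)

    # group the example indices by their label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])    # this class's share of the validation set
        train_idx.extend(indices[n_val:])  # the rest goes to training
    return sorted(train_idx), sorted(val_idx)

# hypothetical toy labels: 10% positive, mimicking an imbalanced outcome
toy_labels = [1] * 10 + [0] * 90
train_idx, val_idx = stratified_split(toy_labels)

train_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
val_rate = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(train_rate, val_rate)
```

Both subsets end up with the same positive rate as the full toy sample, which a purely random split of a rare class would not guarantee.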

**Question 5**: Why is it important to split our dataset into separate training and validation subsets?

**HW 1**: What is cross validation and how is it different from what we did above? What is one benefit it has over a single split of the data? What is one drawback?

## 4. Modeling Approach 1

a. First, we have to decide how to represent our complex, high-dimensional input feature (the comment text)

In [None]:
# One simple representation is called a "bag of words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)

In [None]:
# How many words are in our vocabulary?
len(vectorizer.get_feature_names_out())

In [None]:
# Here is one comment in our sample
X_train.iloc[1]

In [None]:
# And its "bag of words" representation (very sparse)
X_train_bow[1]
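
To make the representation concrete, here is a minimal hand-rolled bag of words. It is only a sketch: it splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules, and the helper `bag_of_words` and the toy comments are hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for a list of strings."""
    # simplification: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in documents]

    # the vocabulary is every distinct token, in sorted order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {token: j for j, token in enumerate(vocabulary)}

    # one row per document, one column per vocabulary word
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[column[token]] = count
        matrix.append(row)
    return vocabulary, matrix

toy_comments = ["you are great", "you are not great not at all"]
vocabulary, matrix = bag_of_words(toy_comments)
print(vocabulary)
print(matrix)
```

Word order is discarded: each comment becomes a vector of word counts. With a realistic vocabulary almost every entry is zero, which is why the real matrix is stored in a sparse format.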

b. Second, we have to decide how to model `Pr(toxic|comment_text)`

In [None]:
# a baseline model for binary outcomes is logistic regression (https://en.wikipedia.org/wiki/Logistic_regression)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(penalty=None, max_iter=1000)
logreg = logreg.fit(X_train_bow, y_train)
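
Logistic regression models `Pr(toxic|comment_text)` by passing a weighted sum of the input features through the logistic (sigmoid) function. The sketch below shows only that prediction step. The weights, intercept, and helper names are hypothetical values chosen for illustration, not parameters learned by `logreg`.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(counts, weights, intercept):
    """Probability of the positive class for one bag-of-words vector."""
    score = intercept + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(score)

# hypothetical weights for a 3-word vocabulary
weights = [2.0, -1.0, 0.0]
intercept = -1.0

p_a = predict_proba([1, 0, 3], weights, intercept)  # score = -1 + 2 = 1.0
p_b = predict_proba([0, 2, 0], weights, intercept)  # score = -1 - 2 = -3.0
print(p_a, p_b)
```

A positive score gives a probability above 0.5 and a negative score gives one below 0.5. Fitting the model means choosing the weights so these probabilities match the training labels as closely as possible.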

## 5. Evaluation Part 2

In [None]:
# first, we'll use accuracy to evaluate our model
from sklearn.metrics import accuracy_score
X_val_bow = vectorizer.transform(X_val)
y_pred = logreg.predict(X_val_bow)
accuracy_score(y_val, y_pred)

**Question 6**: 90% accuracy! Is this good performance?

In [None]:
# what if we predicted 0 for every example?
accuracy_score(y_val, [0] * len(y_val))

In [None]:
# let's look at two more informative metrics: precision and recall
# https://en.wikipedia.org/wiki/Precision_and_recall
from sklearn.metrics import precision_score, recall_score
print(f"precision: {precision_score(y_val, y_pred)}")
print(f"recall: {recall_score(y_val, y_pred)}")
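
Both metrics come from simple counts of true positives, false positives, and false negatives. The sketch below computes them by hand on hypothetical toy labels. Returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # precision: of everything we flagged, how much was truly positive?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: of everything truly positive, how much did we flag?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical toy data: 3 positives out of 10 examples
toy_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

precision, recall = precision_recall(toy_true, toy_pred)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"accuracy: {accuracy}, precision: {precision}, recall: {recall}")
```

In this toy case accuracy is 0.7 even though only one of the three positive examples was found.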

**Question 7**: Should we care more about precision or recall for detecting toxic comments?

**HW 2**: Another useful metric to evaluate a binary classifier is the ROC curve. Plot the ROC curve and compute the closely related ROC-AUC score. (*Hint: this [section](https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc) is helpful*). How is the ROC-AUC score related to the ROC curve? 

## 6. Modeling Approach 2

In [None]:
# let's replace our BOW representation with vectors from a pretrained language model
# https://jalammar.github.io/illustrated-bert/
# https://huggingface.co/nreimers/MiniLM-L6-H384-uncased

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
model = AutoModel.from_pretrained("nreimers/MiniLM-L6-H384-uncased")
model = model.eval() # some layers behave differently in training vs. prediction/inference

In [None]:
# first we need to convert text into input tokens
def tokenize(comment):
    return tokenizer(
        comment, 
        truncation=True, 
        max_length=100, 
        padding="max_length", 
        return_tensors="pt"
    )

X_train_tokenized = [tokenize(comment) for comment in X_train]
X_val_tokenized = [tokenize(comment) for comment in X_val]

In [None]:
# How many tokens are in our vocabulary?
tokenizer.vocab_size

In [None]:
# let's look at one tokenized example
X_train_tokenized[1]["input_ids"]

In [None]:
X_train_tokenized[1]["attention_mask"]
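
The `truncation`, `max_length`, and `padding` arguments force every comment to the same length, and the attention mask records which positions hold real tokens. A minimal sketch of that bookkeeping follows. The helper `pad_or_truncate`, the token ids, and `pad_id=0` are assumptions for illustration, and the sketch ignores how the real tokenizer handles special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]                    # truncation: drop the overflow
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# hypothetical token ids for a short and a long comment
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The model uses the mask to ignore padded positions, so padding does not change what a short comment means.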

In [None]:
import numpy as np
import torch

# now we'll use the pretrained model to convert the tokens into a dense vector representation
def vectorize(comments):
    vecs = []
    for comment in comments:
        with torch.no_grad():
            vec = model(**comment)["pooler_output"]
        vec = vec.detach().numpy()
        vecs.append(vec)
    return np.concatenate(vecs)

X_train_vecs = vectorize(X_train_tokenized)

In [None]:
# finally, we'll fit the same type of model to this other representation of the inputs
logreg = logreg.fit(X_train_vecs, y_train)

In [None]:
# now let's vectorize the validation comments and see how good our predictions are
X_val_vecs = vectorize(X_val_tokenized)
y_pred = logreg.predict(X_val_vecs)
print(f"precision: {precision_score(y_val, y_pred)}")
print(f"recall: {recall_score(y_val, y_pred)}")

**Question 8**: Features from the pretrained language model performed worse!? Why might this be?

**HW 3**: Use the original BERT model instead of the smaller model we used in class (use the string "bert-base-cased"). Note: vectorizing the text will take longer because BERT is a bigger model (it took ~4 minutes on my computer). How did precision, recall, and ROC-AUC change when using the representations from this model? 

**HW 4**: Look at BERT's [model card](https://huggingface.co/bert-base-uncased). What data was BERT trained on? What were its two training objectives? How does this influence what biases it might have learned?

## 7. Modeling Approach 3

In [None]:
# now we'll fine-tune (i.e. update) the model specifically for our classification task
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("nreimers/MiniLM-L6-H384-uncased", num_labels=2)

In [None]:
# pytorch (developed by Facebook) is one framework for training neural networks
# it requires you to define a custom dataset class for your data 
from torch.utils.data import Dataset, DataLoader

class CommentDataset(Dataset):
    def __init__(self, vecs, labels):
        self.vecs = vecs
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val.reshape(-1) for key, val in self.vecs[idx].items()}
        item['labels'] = torch.tensor(self.labels.iloc[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = CommentDataset(X_train_tokenized, y_train)
val_dataset = CommentDataset(X_val_tokenized, y_val)

train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8)

In [None]:
# we'll use a popular variant of stochastic gradient descent to optimize the parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for batch in train_dataloader:
    outputs = model(**batch) # predict with current model parameters
    loss = outputs.loss # for classification: cross entropy loss
    loss.backward() # backpropagation (i.e. compute the gradient of the loss with respect to each parameter)

    optimizer.step() # update model parameters: one step in the direction that reduces the loss (against the gradient)
    optimizer.zero_grad() # clears, or zeros, the current gradients
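
The same loop structure can be seen on a toy problem with a single parameter. The sketch below uses plain gradient descent rather than AdamW. The loss function, learning rate, and step count are hypothetical choices for illustration.

```python
def loss_fn(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad_fn(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # initial parameter value
learning_rate = 0.1
losses = []

for step in range(50):
    losses.append(loss_fn(w))          # loss under the current parameter
    gradient = grad_fn(w)              # plays the role of loss.backward()
    w = w - learning_rate * gradient   # plays the role of optimizer.step()

print(w, losses[0], losses[-1])
```

Each update moves the parameter against the gradient, so the loss shrinks and `w` approaches 3. Fine-tuning does the same thing for all of the model's parameters at once, one batch at a time.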

In [None]:
# finally, we'll make predictions on our validation dataset
all_preds = []
all_labels = []
model.eval()
for batch in val_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    preds = torch.argmax(outputs.logits, dim=-1)
    all_preds.append(preds.detach().numpy())
    all_labels.append(batch["labels"].detach().numpy())
    
all_preds = np.concatenate(all_preds)
all_labels = np.concatenate(all_labels)
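
The model outputs one logit per class. A softmax turns the logits into probabilities, the cross entropy loss is the negative log of the probability assigned to the true label, and `argmax` picks the most probable class. The sketch below works through this for a single example. The logit values and helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log probability assigned to the true label."""
    return -math.log(softmax(logits)[label])

# hypothetical logits for one comment: [non-toxic, toxic]
toy_logits = [1.5, -0.5]
probs = softmax(toy_logits)

# softmax preserves the ordering, so argmax over logits or probabilities agrees
pred = max(range(len(toy_logits)), key=lambda k: toy_logits[k])

print(probs, pred)
print(cross_entropy(toy_logits, 0), cross_entropy(toy_logits, 1))
```

The loss is small when the true label matches the confident prediction and large when it does not.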

In [None]:
# and see how we did!
print(f"precision: {precision_score(all_labels, all_preds)}")
print(f"recall: {recall_score(all_labels, all_preds)}")

**HW 5**: If you restart the notebook and re-run modeling approach 3 (fine-tuning) do you get the same precision and recall scores on the validation dataset? Why or why not?