"Sushi was great, the food was awesome, but the service was terrible" :name: bow-sushi | sushi | was | great | the | food | awesome | but | service | terrible | |-------|-----|-------|-----|------|---------|-----|----------|----------| | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | ``` The terminology "bag of words" comes from the fact that this representation does not care about the order that the words appear in the sentence. In this feature representation, the sentences "John ate an apple" and "an apple at John" are equivalent. Obviously this is a simplification of all of the details necessary to represent text, but it is one that works decently in practice. There are many more advanced ways of representing text that we won't discuss in too much detail other than the fact that many of them produce some vector of numbers in $\mathbb{R}^D$. An unspecified note is what to do with punctuation such as periods or apostrophes. This is generally quite tricky in practice, so we will use the simplifying assumption that we don't care about punctuation and simply remove them. In effect, this would treat the words "it's" and "its" as the same word, which you can tell is a limitation. So if we gathered a set of input data that is the review (text) and the predicted sentiment, we can apply this computation to every sentence in the dataset to come up with a numeric vector for each. `scikit-learn` provides a helpful pre-processor called [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to help us with this task. In the code cell below, we show how to use this class on a very small dataset. In our notation, we would treat the count in the output at row $i$ and index $j$ to be the feature $h_j(x_i)$. 
```{code-cell} ipython3
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Make input data
data = pd.DataFrame([
    {"review": "Sushi was great, the food was awesome, but the service was terrible",
     "sentiment": +1},
    {"review": "Terrible food; the sushi was rancid.",
     "sentiment": -1},
])
print("Input Data")
print(data)
print()

# Transform into features with bag of words representation
counter = CountVectorizer()
bag_of_words = counter.fit_transform(data["review"]).toarray()
print("Feature names")
print(counter.get_feature_names_out())
print()
print("Features")
print(bag_of_words)
```

## Building a Classifier

Let's now turn to how we can develop the ideas and concepts to build a classification model. In this chapter and the next, we will outline the idea for our first classification model called **logistic regression**. But before we get to that final model in the next chapter, we will start simple and build up to it with two other simpler models as starting points. The hope in telling the story this way is that we build up some important intuition and ideas before heading into more complicated math and notation.

1. Simple Threshold Classifier
2. Simple Linear Classifier
3. End point: Logistic Regression

### 1. Simple Threshold Classifier

```{admonition} Big Idea
:class: tip
Use a list of "positive" and "negative" words to classify the review by which type of word appears most frequently.
```

Let's start by assuming that some hard-working netizen on the internet has done most of the hard work for us, identifying which words in English are more likely to be related to "positive sentiment" and which are more likely to be related to "negative sentiment". Suppose we downloaded their table of word categories shown in the table below.
```{table} Pre-defined list of words and their sentiments
:name: simple-model-1

| word     | sentiment |
|----------|-----------|
| sushi    | Neutral   |
| was      | Neutral   |
| great    | Good      |
| the      | Neutral   |
| food     | Neutral   |
| but      | Neutral   |
| awesome  | Good      |
| service  | Neutral   |
| terrible | Bad       |
| rancid   | Bad       |
| ...      |           |
```

Again, assuming we had such a reference in the first place, we could build our first idea of a classifier.

```{prf:algorithm} Idea 1: Simple Threshold Classifier
:label: simple-threshold-classifier

**Input** A sentence $x$, and a reference table of word sentiments $T$

**Output** A prediction of the class $\hat{y} \in \{+1, -1\}$

1. `num_positive` $\gets$ Count the number of words in $x$ that are *positive* according to $T$
2. `num_negative` $\gets$ Count the number of words in $x$ that are *negative* according to $T$
3. if `num_positive` > `num_negative`:
    * predict $\hat{y} = +1$
4. otherwise:
    * predict $\hat{y} = -1$
```

So if we had this magical reference table, this algorithm seems somewhat intuitive, but you might imagine it has some major limitations:

* Even amongst positive sentiment words, some words convey a more intense sentiment than others. "Awesome" is often considered a more positive claim than "good". So we will likely need a way to weight how positive or negative each word is.
* Single words are often not enough to capture meaning. According to the algorithm above, the sentence "not good" would likely be considered a positive sentiment! In reality, looking at one word at a time may not sufficiently capture the context of the surrounding sentence.
    * To be fair, this is not really a fault of the Simple Threshold Classifier, but a fault of our features only containing a single word. We will discuss later some approaches to potentially address this problem.
* It's not clear how we can even get this list of positive/negative words in the first place!

### 2. Simple Linear Classifier

In our last section of the book on regression, we discussed the surprising effectiveness of using linear models, where we learn coefficients related to our features to use in our predictions. What if we try to borrow that idea for a classifier? Instead of using a list of good/bad words, we can instead somehow learn a coefficient for each word. We will come back and discuss details for how to learn coefficients for the words in a bit, but our goal would be to learn something like:

```{table} Example learned coefficients for each word
:name: linear-model-1

| feature | word     | coefficient |
|---------|----------|-------------|
| w_1     | sushi    | 0           |
| w_2     | was      | 0           |
| w_3     | great    | 1           |
| w_4     | the      | 0           |
| w_5     | food     | 0           |
| w_6     | but      | 0           |
| w_7     | awesome  | 2           |
| w_8     | service  | 0           |
| w_9     | terrible | -1          |
| ...     | ...      | ...         |
```

The intent here is that the more positive a coefficient is for a word, the more positive its sentiment, and vice versa for negative coefficients. Words with coefficient 0 have neutral sentiment.

Then using our bag of words feature extraction $h(x)$ on a sentence $x$, we can now use these coefficients to make predictions. We define the notion of a $Score(x_i)$: a number that, when more positive, means the sentence has more positive sentiment, and when more negative, more negative sentiment. We use the notation $s_i$ as shorthand for $Score(x_i)$.

$$s_i = Score(x_i) = \sum_{j=0}^D w_j h_j(x_i) = w^Th(x_i)$$

Using this score, we can then make our prediction $+1$ when the $Score$ is positive, and $-1$ otherwise. The $sign$ function returns $+1$ if the number is positive, and $-1$ otherwise. In the case of the $Score$ being exactly 0, you have to make a determination of whether you should predict positive, negative, or something else entirely.
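A minimal sketch of this score-then-sign prediction, using the hypothetical coefficients from the table above (all other words get coefficient 0) and our earlier simplifying assumption of lowercasing and stripping punctuation; on a tie at score 0 we arbitrarily predict $-1$:

```{code-cell} ipython3
# Hypothetical coefficients from the table above; unlisted words are 0
coefficients = {"great": 1.0, "awesome": 2.0, "terrible": -1.0}

def score(sentence):
    """Sum the coefficient of each word in the sentence (unknown words count as 0)."""
    words = sentence.lower().replace(",", "").split()
    return sum(coefficients.get(word, 0.0) for word in words)

def predict(sentence):
    """Predict +1 when the score is positive; arbitrarily predict -1 on a tie."""
    return 1 if score(sentence) > 0 else -1

print(score("Sushi was great, the food was awesome, but the service was terrible"))    # 2.0
print(predict("Sushi was great, the food was awesome, but the service was terrible"))  # 1
```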
So now our predicted labels are:

$$\hat{y}_i = sign(Score(x_i)) = sign(\hat{s}_i) = sign(\hat{w}^Th(x_i))$$

With the sentence "Sushi was great, the food was awesome, but the service was terrible" (ignoring all 0 coefficients), $s_i = 1 \cdot 1 + 2 \cdot 1 - 1 \cdot 1 = 2$, which means our prediction for this sentence is $\hat{y}_i = +1$.

We actually won't discuss how to learn the coefficients for our linear model in this chapter, instead saving that discussion for the next chapter. So for now, let's just pretend we have some learning algorithm that tells us the coefficients based on our training data.

## Decision Boundaries

Consider a simplified model where we only have two features/coefficients.

```{table} Example coefficients for a two-feature model
:name: linear-model-2

| feature     | word          | coefficient |
|-------------|---------------|-------------|
| $\hat{w}_0$ | *(intercept)* | 0.0         |
| $\hat{w}_1$ | awesome       | 1.0         |
| $\hat{w}_2$ | awful         | -1.5        |
```

To understand the predictions of this model, it sometimes helps to visualize what the model would predict for any setting of the features $h_1(x)$ and $h_2(x)$. In general, the score for a sentence using these coefficients is the following.

$$\hat{s} = 1 \cdot \text{#awesome} - 1.5 \cdot \text{#awful}$$

So let's consider plotting all possible values of $(\text{#awesome}, \text{#awful})$, marking the ones with score $\hat{s} > 0$ as positive and the ones with score $\hat{s} < 0$ as negative. We highlight the region where $\hat{s} = 0$ exactly as the **decision boundary** in black. Note that all of the places below the line are predicted positive, since in that region all of the scores are *positive*. Above the line, all of the scores are *negative*, so the predictions are negative. Also note that the decision boundary $\hat{s} = 0$ is a line! We call this model a *linear classifier* because its decision boundary, where the decisions switch from positive to negative, is a linear function.
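One way to see where the decision boundary falls is to evaluate the score over a small grid of counts. This sketch uses the coefficients above ($\hat{w}_1 = 1$, $\hat{w}_2 = -1.5$) and marks each $(\text{#awesome}, \text{#awful})$ point with the sign of its score, with `0` marking points on the boundary itself:

```{code-cell} ipython3
# Label each (#awesome, #awful) grid point by the sign of its score
w_awesome, w_awful = 1.0, -1.5

grid = []
for num_awful in range(3):           # rows: #awful = 0, 1, 2
    row = []
    for num_awesome in range(4):     # columns: #awesome = 0, 1, 2, 3
        s = w_awesome * num_awesome + w_awful * num_awful
        row.append("+" if s > 0 else ("-" if s < 0 else "0"))
    grid.append(" ".join(row))
print("\n".join(grid))
```

Each row fixes a value of $\text{#awful}$; as $\text{#awesome}$ grows left to right, the scores cross from negative to positive, tracing out the line $\hat{s} = 0$.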
To make this plot, we actually just imagined making a prediction for every possible $(\text{#awesome}, \text{#awful})$, for example $(5, 2)$, and then finding its prediction $Score((5, 2)) = 1 \cdot 5 - 1.5 \cdot 2 = 2$.

```{figure} 2d-decision-boundary.png
:alt: A picture of a decision boundary (explained in last paragraph)
```

```{margin}
{{ref_hyperplane}}\. Also called a hyperplane
```

While we show this decision boundary as a 2D plot (using color to highlight positive/negative predictions), we can also think about these predictions as a 3D plot where the z-axis shows the value of $Score(x)$. Because the score is a linear function of the features, when visualizing the output $Score(x)$ for every input $(h_1(x), h_2(x))$, we would draw this function as a *plane*.

"Sushi was great, the food was awesome, but the service was terrible" :name: bigram-sushi | sushi was | was good | good the | the food | food was | the service | service was | was not | not good | |-----------|----------|----------|----------|----------|-------------|-------------|---------|----------| | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | ``` We can generalize this concept to an **n-gram** model that looks at the last $n$ words at each point. Clearly the larger $n$ is, the more complicated our set of features will be as there are more and more possible combinations of $n$ words as $n$ grows. Everything we have discussed before about model complexity applies here when considering using more and more context in the counts. ## Evaluating Our Classifier TODO Quality Metric highlight ML Pipeline So again, supposing we have some way to learn coefficients for our linear classifier, let's discuss how we might evaluate it. This will help us start the conversation of how we actually learn these coefficients in the next chapter. In this section, we will outline the *Quality Metric* we are interested in using to evaluate how good a particular predictor is. In some sense, it is a bit easier to discuss the error of a classification system. In binary classification, in particular, there are really only two possibilities for a prediction: it is either right or it is wrong. We can actually break down the "wrong" case into two possibilities that are often useful to name explicitly. The two types of being wrong are: * If the true label was positive ($y = +1$), but the predicted label was negative ($\hat{y} = -1$). This is often called a **False Negative**. * If the true label was negative ($y = -1$), but the predicted label was positive ($\hat{y} = +1$). This is often called a **False Positive**. 
In some scenarios you might care specifically about which type of error is happening, but in many cases we simplify and ignore these sub-cases, considering an error to be an error, regardless of type.

```{margin}
{{ref_indicator}}\. 📝 *Notation:* $\indicator{c}$ is an indicator function that evaluates to 1 if $c$ is true, else 0.
```

We can then define the **classification error** of our predictor to be the following. Note that, as defined, the error will be a number between 0 and 1, where a higher number indicates a larger fraction of the examples were classified incorrectly.

$$error(\hat{f}) = \frac{\text{# mistakes}}{\text{# examples}} = \frac{1}{n}\sum_{i=1}^n \indicator{\hat{y}_i \neq y_i}$$

Similarly, we can also define the concept of **classification accuracy** as we do below. Note the special relationship between accuracy and error: they are complements in the binary classification setting.

$$accuracy(\hat{f}) = \frac{\text{# correct}}{\text{# examples}} = 1 - error(\hat{f}) = \frac{1}{n}\sum_{i=1}^n \indicator{\hat{y}_i = y_i}$$

### Interpreting Error/Accuracy

While, in some sense, classification error/accuracy are easier to understand than our more complicated $MSE$ from regression, we need to be careful with how we interpret these evaluation metrics. We might expect that our model should beat at least random guessing. So if we compare our model to a baseline model that just randomly picks a label, our accuracy should be better than random. For binary classification, we might expect an accuracy of at least 50%, and for multi-class classification with $k$ classes, an accuracy of at least $1/k$. Beyond that, we might believe that an accuracy closer to 1 inherently means a model that is good at the learning task. However, consider the case of building a model to predict whether or not you would click on an ad on a webpage (the positive label is you click, negative is you don't).
It turns out that if we just always predict a negative label, we would have an accuracy around [94%](https://www.smartinsights.com/internet-advertising/internet-advertising-analytics/display-advertising-clickthrough-rates/)! The reason is that in this example there is a **class imbalance** present: most of the labels are negative, since most people don't click on most ads they see. This simple classifier that just always predicts the majority class is called the **majority class classifier**. So in order for an accuracy to be considered "good", it should at least beat the majority class classifier.

So what should we do about this? Well, we should always be asking critical questions of why we see the accuracy we do and how we should interpret our results. A common workflow for considering your model's performance might ask the following questions:

* Is there a class imbalance present in my data?
* How does my model's accuracy compare to some sort of baseline (e.g., a random guesser, or a majority class classifier)?
* *What does my application need in terms of accuracy?* What's good enough for the user experience? What are the impacts of the errors we make?

Importantly, with that last question, we don't ask "What if an error happens?", but "What is the cost of that error?" In machine learning applications, errors are a fact of life that cannot be avoided. Our hope is to minimize errors, but with statistical modeling, errors are guaranteed. So the question we have to ask is not "if" an error happens, but "how many" and what those errors "cost".

(classification:confusion_matrix)=
## Confusion Matrix

While classification accuracy and error are intuitive, they are often lacking in that they smooth over all the possible mistakes that can happen.
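To see how misleading accuracy can be under class imbalance, here is a small sketch of the majority class classifier, using made-up labels with a 90/10 split (the 94% figure above comes from real click-through data; these numbers are only illustrative):

```{code-cell} ipython3
from collections import Counter

# Made-up imbalanced training labels: 90% negative, 10% positive
y_train = [-1] * 90 + [1] * 10

# The majority class classifier always predicts the most common training label
majority_label = Counter(y_train).most_common(1)[0][0]

# Made-up test labels with the same imbalance
y_test = [-1] * 45 + [1] * 5
predictions = [majority_label] * len(y_test)
accuracy = sum(yh == yi for yh, yi in zip(predictions, y_test)) / len(y_test)
print(majority_label, accuracy)  # -1 0.9
```

Despite ignoring its inputs entirely, this classifier reaches 90% accuracy, which is why any "good" model must at least beat this baseline.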
We indicated before that we might care about the specific types of error happening:

* $\hat{y} = +1, y = -1$, or a False Positive
* $\hat{y} = -1, y = +1$, or a False Negative

It often helps to lay out the types of errors that occur in a **confusion matrix**. The rows of this matrix correspond to the possible true labels (positive, negative) and the columns to the possible predictions (positive, negative). The matrix outlines all possible combinations of these cases.

```{figure} confusion_matrix.png
:alt: A grid showing the confusion matrix (described below)
```

The four regions of this matrix correspond to the possible combinations of our labels and predictions.

* $\hat{y} = +1, y = +1$ is a **True Positive**
* $\hat{y} = +1, y = -1$ is a **False Positive**
* $\hat{y} = -1, y = +1$ is a **False Negative**
* $\hat{y} = -1, y = -1$ is a **True Negative**

In the following code sample, we show a `scikit-learn` function for plotting these confusion matrices on some synthetic examples of labels and predictions.

```{code-cell} ipython3
---
mystnb:
  image:
    width: 400px
    align: center
---
from sklearn.metrics import ConfusionMatrixDisplay

y = [1, 1, -1, 1, -1, 1, 1, 1, 1, -1]
y_hat = [1, -1, 1, 1, -1, 1, -1, -1, 1, -1]

_ = ConfusionMatrixDisplay.from_predictions(y, y_hat, labels=[1, -1])
```

This confusion matrix encodes more information than accuracy/error do on their own, but you can always recover those metrics from the confusion matrix. For example, in that last code output the accuracy of that classifier would be $\frac{4 + 2}{4 + 3 + 1 + 2} = 0.6$.

```{margin}
{{ref_positive}}\. Note that deciding which label is considered "positive" and which "negative" is somewhat arbitrary; all of the terminology flips if you swap which label you call positive. Note that in many cases where our model is trying to detect something (e.g., spam or a disease), the "positive label" is often the presence of that thing, even if that thing being present isn't actually a good thing.
```

So with this idea of a confusion matrix in mind, we might now ask: "Which is worse, a false negative or a false positive?" Like most of our interpretations of ML models, this entirely depends on context. Consider the "cost" of these two types of errors in two different scenarios. What is the cost incurred by each mistake in the following situations?