CSC 4220 · Machine Learning

The Optimization Lens

Most algorithms in machine learning can be largely understood by asking two questions. Your job today is to apply that lens to an algorithm we never covered — and find out how far it reaches.

Question 1: What is being minimized or maximized? Is there an explicit loss function, or is the objective implicit in the procedure?

Question 2: How does the algorithm move toward that objective? Note that optimization does not have to be iterative, and it does not have to be global. A single-pass closed-form solution is optimization. A greedy local choice repeated at each step is optimization. Gradient descent is just one optimizer among many.

Worked Example · Read This First

Cobweb

Unsupervised Learning · Incremental Concept Formation

[Figure: a category tree. Root splits into Category A (birds: robins, eagles) and Category B (mammals); a new example "?" is placed wherever CU is maximized]

Cobweb is an incremental concept formation algorithm. It takes unlabeled examples one at a time and organizes them into a hierarchy of categories — a tree where each node represents a concept defined by the probability distributions of feature values in that group. It requires no predefined number of clusters and builds its structure as it sees data.

Given: a stream of unlabeled examples, each with categorical features

Initialize: empty root node

For each new example x:

    Starting at the root, evaluate four operators:
        1. Place x into the best existing category
        2. Create a new category containing only x
        3. Merge the two best categories into one
        4. Split the best category into its children

    For each operator, compute the Category Utility (CU)
    of the resulting tree

    Choose the operator that produces the highest CU
    Recurse down the chosen category until x is placed

Category Utility (CU):
    A score that measures how much knowing an example's category
    improves your ability to predict its feature values,
    compared to not knowing the category at all.
    Higher CU = more useful category structure.

Each time a new example arrives, Cobweb tries four possible ways to fit it into the existing tree and picks whichever one produces the most useful category structure. "Useful" means: knowing which category an example belongs to gives you the most additional information about its feature values.
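To make CU concrete before the walkthrough, here is a minimal pure-Python sketch of one standard form of Category Utility (due to Gluck and Corter) for categorical features. The helper names and the toy bird/mammal examples are illustrative, not part of Cobweb itself.

```python
from collections import Counter

def category_utility(partition):
    """Category Utility of a partition of categorical examples.

    partition: list of clusters; each cluster is a list of examples;
    each example is a dict mapping feature name -> value.
    Gluck & Corter form:
        CU = (1/K) * sum_k P(C_k) * [ sum_{f,v} P(f=v | C_k)^2 - sum_{f,v} P(f=v)^2 ]
    """
    all_examples = [x for cluster in partition for x in cluster]
    n = len(all_examples)
    features = list(all_examples[0].keys())

    def sq_prob_mass(examples):
        # sum over features f and values v of P(f=v)^2 within `examples`
        total = 0.0
        for f in features:
            counts = Counter(x[f] for x in examples)
            total += sum((c / len(examples)) ** 2 for c in counts.values())
        return total

    baseline = sq_prob_mass(all_examples)   # predictive power with no categories
    k = len(partition)
    return sum(len(c) / n * (sq_prob_mass(c) - baseline)
               for c in partition) / k

# Toy data: a clean birds-vs-mammals split versus a shuffled one
birds   = [{"flies": "yes", "fur": "no"},  {"flies": "yes", "fur": "no"}]
mammals = [{"flies": "no",  "fur": "yes"}, {"flies": "no",  "fur": "yes"}]
clean = category_utility([birds, mammals])
mixed = category_utility([[birds[0], mammals[0]], [birds[1], mammals[1]]])
```

The clean split scores higher than the shuffled one, which is exactly the signal Cobweb's four operators compete on.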


1 · Find what the algorithm is trying to improve

Look at the pseudocode for the thing being evaluated at every decision point. Before placing each example, Cobweb computes the Category Utility of four possible outcomes and picks the highest. That computation happens every single step — which means CU is what the algorithm is optimizing. Start there.

2 · Understand what the objective is measuring

CU measures the predictive usefulness of a category structure — how much knowing which category an example belongs to improves your ability to predict its feature values. Good categories are ones where feature values cluster predictably within them. The exact formula isn't important here; what matters is that CU is a well-defined, computable score, and Cobweb is explicitly maximizing it at every step.

3 · Name the objective

CU is a measure of predictability — how well the category structure explains the feature distributions. Maximizing it means finding categories where features are as predictable as possible given membership. This is sometimes called predictive accuracy of the partition.

4 · Find the optimizer

There is no gradient here. Cobweb never computes a derivative. Instead, at each step it evaluates a small fixed set of candidate moves (the four operators), computes CU for each outcome, and picks the best one. That is greedy local search — also called hill climbing. It makes the locally optimal choice at each step without backtracking or considering the global effect of that choice.

5 · Notice what this shares with the course

Greedy local search is the same family of optimizer as the splitting criterion in Decision Trees — at each node, pick the split with the highest information gain and move on. Neither algorithm reconsiders earlier decisions. Neither uses a gradient. Both embed the objective directly in a greedy selection procedure. This is a pattern you will see repeatedly in classical algorithms.

Answer

What is being maximized? Category Utility — a measure of how much knowing an example's category improves your ability to predict its feature values, compared to not knowing the category at all.

How? Greedy local search. At each step, evaluate four candidate placements and pick the one with the highest CU. No gradient. No backtracking. The objective is implicit in the selection procedure — it is not written as a loss function to be minimized globally, but it is present and precise.

Handout 1

Random Forest

Classification & Regression · Ensemble of Decision Trees

[Figure: T trees each cast a vote; the majority vote decides]

Random Forest is a classification and regression algorithm. Given a set of labeled examples, it learns to predict the label of examples it has never seen.

A Decision Tree makes predictions by asking a sequence of yes/no questions about the input features. At each step it picks the question that best splits the data into pure groups — groups where most examples share the same label. It keeps splitting until the groups are pure enough or the tree hits a size limit. A single Decision Tree tends to overfit — it learns the noise in training data and performs poorly on unseen data.

Given: dataset of N examples, each with F features and a label
Choose: number of trees T, features per split m (typically √F)

For t = 1 to T:
    1. Draw a bootstrap sample of N examples
       (random with replacement — some appear twice, some not at all)

    2. Grow a Decision Tree on the bootstrap sample:
          At each node:
              Select m features randomly from the F available
              Find the feature and split point among those m
              that best separates the labels
              Split the node into two child nodes
          Stop when nodes are pure or too small

    3. Store the trained tree

Prediction on new example x:
    Pass x through all T trees → majority vote (or average for regression)

Random Forest builds many Decision Trees, each on a slightly different random sample of the data, using a random subset of features at each split. When predicting, every tree votes and the majority wins. The randomness ensures the trees make different mistakes, so averaging across them cancels individual errors and produces more reliable predictions than any single tree.
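The recipe above can be sketched in a few dozen lines of pure Python. To stay short, this toy version grows one-split stumps rather than full trees; the function names and the four-example dataset are illustrative only, not a reference implementation.

```python
import random
from collections import Counter

def train_stump(data, m):
    """Fit a one-split 'tree' on numeric features, considering only m
    randomly chosen feature indices (the random-subset step). A real
    Random Forest grows full trees; a stump keeps the sketch short.
    data: list of (features, label) pairs, features a list of floats."""
    n_features = len(data[0][0])
    best = None
    for f in random.sample(range(n_features), m):
        for threshold in sorted({x[f] for x, _ in data}):
            left  = [y for x, y in data if x[f] <= threshold]
            right = [y for x, y in data if x[f] >  threshold]
            if not left or not right:
                continue
            # majority label on each side of the split; count training errors
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:                       # degenerate bootstrap: no valid split
        lab = Counter(y for _, y in data).most_common(1)[0][0]
        return lambda x: lab
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def random_forest(data, n_trees=25, m=1):
    trees = []
    for _ in range(n_trees):
        bootstrap = [random.choice(data) for _ in data]   # sample with replacement
        trees.append(train_stump(bootstrap, m))
    # prediction = majority vote across all trees
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

random.seed(0)
data = [([0.0, 1.0], "a"), ([0.2, 0.8], "a"),
        ([0.9, 0.1], "b"), ([1.0, 0.3], "b")]
predict = random_forest(data)
```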

Your Questions

1 · What is being minimized or maximized?

2 · How does the algorithm move toward that objective?

Handout 2

AdaBoost

Classification · Adaptive Boosting

[Figure: rounds of boosting. h₁ is weak (α₁=0.3), h₂ better (α₂=0.6), h₃ stronger (α₃=1.1); misclassified examples are reweighted upward (w↑) each round]

AdaBoost (Adaptive Boosting) is a classification algorithm. It combines many simple, weak classifiers into a single strong classifier that performs well on examples the weak classifiers struggle with.

A weak classifier performs only slightly better than random guessing — for example, a Decision Tree with only one split (a decision stump). Alone, it is not useful. AdaBoost's insight is that many weak classifiers, combined carefully, can produce a powerful one.

Given: dataset of N examples with labels y_i ∈ {-1, +1}
Choose: number of rounds T

Initialize: weights w_i = 1/N for all examples

For t = 1 to T:
    1. Train weak classifier h_t using current weights w
       (getting higher-weight examples right matters more)

    2. Compute weighted error:
       error_t = sum of w_i for examples where h_t(x_i) ≠ y_i

    3. Compute classifier importance:
       alpha_t = 0.5 * ln( (1 - error_t) / error_t )
       (lower error → higher alpha → louder vote)

    4. Update example weights:
       if h_t got example i right:  w_i = w_i * exp(-alpha_t)
       if h_t got example i wrong:  w_i = w_i * exp(+alpha_t)
       Normalize all w_i so they sum to 1

Final classifier: H(x) = sign( Σ alpha_t * h_t(x) )

AdaBoost runs in rounds. Each round it trains a simple classifier, then increases the weight of examples it got wrong, so the next classifier pays more attention to them. Classifiers that perform better earn a louder vote in the final answer. The result is a sequence of classifiers that progressively correct each other's mistakes.
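The pseudocode translates nearly line-for-line into code. Below is a minimal pure-Python sketch using decision stumps on 1-D inputs; the function names and the six-point toy dataset are illustrative.

```python
import math

def train_adaboost(data, rounds=10):
    """AdaBoost with decision stumps on 1-D inputs.
    data: list of (x, y) pairs with y in {-1, +1}. Returns H(x)."""
    n = len(data)
    w = [1.0 / n] * n                            # uniform example weights
    ensemble = []                                # list of (alpha, stump) pairs

    for _ in range(rounds):
        # 1. weak learner: best threshold and direction under current weights
        best = None
        for threshold in sorted({x for x, _ in data}):
            for sign in (+1, -1):
                stump = lambda x, t=threshold, s=sign: s if x <= t else -s
                err = sum(wi for wi, (x, y) in zip(w, data) if stump(x) != y)
                if best is None or err < best[0]:
                    best = (err, stump)
        error, h = best

        # 2-3. importance: lower weighted error -> louder vote
        error = min(max(error, 1e-10), 1 - 1e-10)    # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, h))

        # 4. reweight: mistakes get heavier, then normalize to sum to 1
        w = [wi * math.exp(-alpha if h(x) == y else alpha)
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]

    def H(x):                                    # weighted vote of all stumps
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H

# Toy 1-D data that no single stump can classify perfectly
data = [(0.0, 1), (1.0, 1), (2.0, -1), (3.0, 1), (4.0, -1), (5.0, -1)]
H = train_adaboost(data)
```

Note the clipping of error before the log: a weak classifier with zero weighted error would otherwise produce an infinite alpha.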

Your Questions

1 · What is being minimized or maximized?

2 · How does the algorithm move toward that objective?

Handout 3

Support Vector Machine

Classification · Maximum Margin Classifier

[Figure: two classes separated by a hyperplane; the goal is to maximize the margin width]

A Support Vector Machine (SVM) is a classification algorithm. Given labeled examples, it finds a decision boundary that separates the classes — and specifically tries to find the boundary that is as far as possible from the nearest examples on either side.

A logistic regression model finds a hyperplane that separates two classes. There are infinitely many valid boundaries for a linearly separable dataset. Logistic regression picks one based on maximum likelihood. SVM picks one based on a different criterion: maximum margin — the widest possible gap between the two classes.

Given: labeled examples (x_i, y_i) where y_i ∈ {-1, +1}
Find: weights w and bias b defining a hyperplane

The margin = 2 / ||w||
Maximizing the margin = minimizing ||w||

Hard margin (perfectly separable data):
    Minimize:   (1/2) * ||w||²
    Subject to: y_i * (w · x_i + b) ≥ 1   for all i

Soft margin (data with noise / overlap):
    Introduce slack variables ξ_i ≥ 0 (allow some violations)

    Minimize:   (1/2) * ||w||²  +  C * Σ ξ_i
    Subject to: y_i * (w · x_i + b) ≥ 1 - ξ_i   for all i

    C is a hyperparameter:
        large C = penalize violations heavily (tight, may overfit)
        small C = allow violations (wider margin, more robust)

Support vectors: the examples that sit exactly on the margin.
    They are the ONLY examples that determine the boundary.
    All other examples could be removed without changing it.

Prediction:
    sign( w · x + b )   →   +1 or -1

SVM finds the decision boundary with the widest possible gap between the two classes. The examples nearest that boundary — the support vectors — are the only ones that matter. The soft margin version allows a small number of violations, trading some accuracy for robustness.
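The handout specifies the objective but not a solver; production SVMs typically use quadratic programming or SMO. One simple illustration, not the standard method: minimize the soft-margin objective directly by subgradient descent on its equivalent hinge-loss form. The toy dataset and learning-rate settings are invented for this sketch.

```python
def train_svm(data, C=10.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM by subgradient descent on the hinge-loss form:
        (1/2) * ||w||^2  +  C * sum_i max(0, 1 - y_i * (w . x_i + b))
    data: list of (x, y) pairs, x a list of floats, y in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)                       # gradient of the (1/2)||w||^2 term
        grad_b = 0.0
        for x, y in data:
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                     # inside the margin: hinge is active
                for j in range(dim):
                    grad_w[j] -= C * y * x[j]
                grad_b -= C * y
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data in 2-D
data = [([0.0, 0.0], -1), ([0.0, 1.0], -1), ([2.0, 2.0], 1), ([2.0, 3.0], 1)]
w, b = train_svm(data)
```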

Your Questions

1 · What is being minimized or maximized?

2 · How does the algorithm move toward that objective?

Handout 4

K-Means Clustering

Unsupervised Learning · Partition into K Groups

[Figure: scattered points in K groups; "+" marks the centroids]

K-Means is a clustering algorithm. Given a set of unlabeled examples, it groups them into K clusters such that examples within the same cluster are similar to each other. There are no labels — the algorithm finds structure in the data on its own.

A centroid is the center point of a cluster — the average position of all examples assigned to it. K-Means represents each cluster by its centroid and assigns each example to whichever centroid it is closest to.

Given: dataset of N examples with F features (no labels)
Choose: number of clusters K

1. Initialize: randomly place K centroids in feature space

Repeat until centroids stop moving:

    2. Assignment step:
       For each example x_i:
           Assign x_i to the nearest centroid
           nearest_k = argmin over k of  ||x_i - centroid_k||²

    3. Update step:
       For each cluster k:
           centroid_k = mean of all examples assigned to cluster k

Return: K centroids and cluster assignment for every example

Prediction on new example x:
    nearest_k = argmin over k of  ||x - centroid_k||²

K-Means alternates between two steps: assign every example to its nearest centroid, then recompute each centroid as the average of its assigned examples. These two steps repeat until assignments stop changing. The result is K clusters where every example is closer to its own cluster center than to any other.
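The two alternating steps can be written out directly. A minimal pure-Python sketch; the six-point toy dataset and the choice to seed centroids from data points are illustrative.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on points given as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # init from data points
    assign = []
    for _ in range(iters):
        # assignment step: index of nearest centroid for each point
        def nearest(p):
            return min(range(k),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(p, centroids[j])))
        assign = [nearest(p) for p in points]
        # update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])   # keep empty clusters in place
        if new_centroids == centroids:               # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, assign

points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]]
centroids, assign = kmeans(points, k=2)
```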

Your Questions

1 · What is being minimized or maximized?

2 · How does the algorithm move toward that objective?

Handout 5

Naïve Bayes

Classification · Probabilistic Model

[Figure: spam filtering example. P(spam | words) = 0.85; in general P(class | features) ∝ P(features | class) · P(class)]

Naïve Bayes is a classification algorithm. Given labeled examples, it learns to predict the probability that a new example belongs to each class. It is particularly common in text classification — for example, deciding whether an email is spam.

Bayes' theorem gives a way to compute the probability of a class given observed features: P(class | features) = P(features | class) × P(class) / P(features). The "naïve" assumption is that all features are independent of each other given the class. This is rarely true in practice but makes the math tractable and the algorithm fast.

Given: dataset of N examples, each with F features and a class label

Training — one pass through the data:

    For each class c:
        1. Compute the prior:
           P(c) = (examples with class c) / N

        2. For each feature f:
           P(feature_f = value | class = c) =
               (examples in class c where feature f = value)
               / (examples in class c)
               + smoothing to avoid zero probabilities

Return: all priors P(c) and likelihoods P(feature_f | c)

Prediction on new example x with features x_1 ... x_F:

    For each class c:
        score(c) = log P(c) + Σ log P(x_f | c)
        (log turns products of small numbers into sums)

    Return: class c with the highest score

Naïve Bayes learns two things: how common each class is, and how often each feature value appears within each class. At prediction time it scores each class by combining these and picks the most probable one. Taking the log turns a product of probabilities into a sum, which is more numerically stable.
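Training and prediction both fit in a short pure-Python sketch. The feature names and four toy emails are invented for illustration; alpha is the smoothing mentioned in the pseudocode (Laplace, add-alpha).

```python
import math
from collections import Counter, defaultdict

def train_nb(data, alpha=1.0):
    """Categorical Naive Bayes with add-alpha (Laplace) smoothing.
    data: list of (features, label) pairs, features a dict name -> value.
    Returns a predict(features) function."""
    n = len(data)
    class_counts = Counter(label for _, label in data)
    priors = {c: cnt / n for c, cnt in class_counts.items()}

    # counts[c][f][v] = examples of class c where feature f takes value v
    counts = {c: defaultdict(Counter) for c in class_counts}
    values = defaultdict(set)                    # all observed values per feature
    for features, label in data:
        for f, v in features.items():
            counts[label][f][v] += 1
            values[f].add(v)

    def likelihood(c, f, v):
        # smoothed P(feature f = v | class c); never exactly zero
        return ((counts[c][f][v] + alpha) /
                (class_counts[c] + alpha * len(values[f])))

    def predict(features):
        def score(c):                            # log P(c) + sum of log-likelihoods
            return (math.log(priors[c]) +
                    sum(math.log(likelihood(c, f, v))
                        for f, v in features.items()))
        return max(priors, key=score)
    return predict

emails = [
    ({"has_offer": "yes", "has_greeting": "no"},  "spam"),
    ({"has_offer": "yes", "has_greeting": "no"},  "spam"),
    ({"has_offer": "no",  "has_greeting": "yes"}, "ham"),
    ({"has_offer": "no",  "has_greeting": "yes"}, "ham"),
]
predict = train_nb(emails)
```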

Your Questions

1 · What is being minimized or maximized?

2 · How does the algorithm move toward that objective?

Handout 6

Variational Autoencoder

Unsupervised Learning · Generative Model

[Figure: VAE pipeline. Input x → encoder → μ, σ² → sample z → decoder; loss = reconstruction + KL divergence, pushing z toward N(0,1)]

A Variational Autoencoder (VAE) is an unsupervised learning algorithm. Like the autoencoder built in this course, it learns a compressed representation of data. But it goes further: it learns a structured latent space that allows new data to be generated by sampling from it.

In this course you built an autoencoder that compressed MNIST images into a small latent vector and reconstructed them. The loss was reconstruction error. The limitation: the latent space has no guaranteed structure — points sampled randomly from it may decode into nonsense, because the encoder was never required to organize the space in any particular way.

Given: dataset of N unlabeled examples x_i

Key difference from standard autoencoder:
    Standard AE: encoder maps x → latent vector z
    VAE:         encoder maps x → mean vector μ and variance σ²
                 z is then sampled: z ~ N(μ, σ²)

For each example x_i during training:

    1. Encode:
       Run x_i through encoder → output μ_i and σ²_i

    2. Sample (reparameterization trick):
       z_i = μ_i + σ_i * ε    where ε ~ N(0, 1)
       (keeps the sampling step differentiable for backprop)

    3. Decode:
       Run z_i through decoder → reconstructed x̂_i

    4. Compute loss (two terms):

       Term 1 — Reconstruction loss:
           How different is x̂_i from x_i?
           (same MSE as the course autoencoder)

       Term 2 — KL divergence:
           How different is N(μ_i, σ²_i) from N(0, 1)?
           KL = -0.5 * Σ( 1 + log(σ²) - μ² - σ² )
           (penalizes encoder for straying from standard normal)

       Total loss = Reconstruction loss + KL divergence

    5. Backpropagate and update weights with gradient descent

Generation (not possible with standard autoencoder):
    Sample z ~ N(0,1) → run through decoder → new example

The VAE adds one idea to the autoencoder: instead of compressing each input to a fixed point in latent space, it compresses it to a probability distribution. The KL divergence term pushes all those distributions toward a standard normal, organizing the latent space into a coherent structure you can sample from to generate new data.
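The two ingredients that distinguish a VAE from the course autoencoder, the KL term and the reparameterization trick, are small enough to write out exactly. One caveat: this sketch takes log(σ²) rather than σ² itself, a common implementation convention (it keeps σ² positive automatically) that the pseudocode above does not show.

```python
import math, random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.
    Matches the pseudocode term: -0.5 * sum(1 + log(var) - mu^2 - var)."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, 1): sampling written as a
    deterministic function of (mu, log_var) plus external noise, so
    gradients can flow through mu and log_var during backprop."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

# KL is zero exactly when the encoder outputs a standard normal
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

Any deviation of μ from 0 or of σ² from 1 makes the KL term strictly positive, which is the pressure that organizes the latent space.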

Your Questions

1 · What is being minimized or maximized?

2 · How does the algorithm move toward that objective?

Worked Example · Hard Case

DBSCAN

Unsupervised Learning · Density-Based Spatial Clustering

[Figure: dense regions form clusters; sparse points beyond radius ε of any dense region are noise]

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together and marks points in low-density regions as noise. Unlike K-Means it requires no predefined number of clusters and can find clusters of arbitrary shape.

Given: dataset of N points, two hyperparameters: ε (radius), minPts (minimum neighbors)

Label each point as one of:
    Core point:    has at least minPts neighbors within distance ε
    Border point:  within ε of a core point but has fewer than minPts neighbors
    Noise:         not within ε of any core point

For each unvisited core point p:
    Create a new cluster C
    Add p to C
    Collect all points within ε of p into seed set S
    For each point q in S:
        If q is unvisited:
            Mark q as visited
            If q is a core point:
                Add q's neighbors to S   ← expand the frontier
        If q is not yet in any cluster:
            Add q to C

Return: cluster assignments and noise labels

Key definitions:
    Directly density-reachable: q is within ε of core point p
    Density-reachable:          chain of directly density-reachable steps from p to q
    Density-connected:          both p and q are density-reachable from some core point
    Cluster:                    maximal set of density-connected points

DBSCAN starts at a core point, expands outward along density-reachable paths, and stops when it hits low-density space. Everything reachable in one connected dense region becomes one cluster. Isolated points with no dense neighbors are labeled noise.
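The labeling and expansion loop can be sketched in pure Python. The label representation (integer cluster ids plus a "noise" marker) and the seven-point toy dataset are illustrative choices, not part of DBSCAN's definition.

```python
def dbscan(points, eps, min_pts):
    """DBSCAN on points given as lists of floats.
    Returns one label per point: a cluster id (0, 1, ...) or "noise"."""
    n = len(points)

    def neighbors(i):
        # indices within distance eps of point i (includes i itself)
        return [j for j in range(n)
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    core = [i for i in range(n) if len(neighbors(i)) >= min_pts]
    labels = ["noise"] * n
    cluster = 0
    for p in core:
        if labels[p] != "noise":
            continue                      # already absorbed into a cluster
        # expand a new cluster outward from this unvisited core point
        labels[p] = cluster
        seeds = neighbors(p)
        while seeds:
            q = seeds.pop()
            if labels[q] == "noise":
                labels[q] = cluster       # border or core, either way it joins
                if q in core:
                    seeds.extend(neighbors(q))   # expand the frontier
        cluster += 1
    return labels

points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # dense region A
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense region B
          [2.5, 9.0]]                            # isolated point
labels = dbscan(points, eps=0.5, min_pts=3)
```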


1 · Look for an explicit loss function

There isn't one. Unlike the six group handouts — and unlike Cobweb — DBSCAN has no formula being evaluated at each step. No score is computed. No quantity is compared before and after each decision. This is the first thing to notice: the objective is not written down anywhere in the algorithm.

2 · Look at what the algorithm produces instead

Focus on the definition of a cluster: a maximal set of density-connected points. "Maximal" is doing important work here — it means the algorithm never stops expanding a cluster while there are more density-reachable points to include. Every cluster is as large as it can possibly be under the density constraints. That is not arbitrary — it reflects a specific preference baked into the procedure.

3 · Reconstruct the implicit objective from the maximality rule

Because every cluster is maximally expanded, DBSCAN implicitly produces the clustering with the fewest possible clusters consistent with the density constraints — it refuses to split a dense region into two clusters when one will do. Simultaneously, the density thresholds (ε and minPts) enforce a minimum intra-cluster density. Together, the algorithm implicitly maximizes density-reachability within clusters while minimizing the total number of clusters. Neither of these objectives is written in the pseudocode — they emerge from the definition of density-connectivity and the maximality expansion rule.

4 · Find the optimizer

The seed set expansion loop is the optimizer. Starting from each core point, it greedily absorbs every reachable point into the cluster in a single deterministic pass. There is no iteration, no score comparison, no backtracking. The algorithm does not try multiple configurations and pick the best — it constructs the only valid solution under the density definition directly. This is closer to a constraint-satisfaction procedure than an optimization loop, but it is pursuing a well-defined implicit objective.

5 · Compare to the other algorithms

Cobweb's objective (CU) was implicit in its selection procedure but still computed explicitly at each step. DBSCAN's objective is implicit in its definitions — it never computes anything like CU. The objective has to be reconstructed by reading what the algorithm's constraints force it to produce. This is a harder case than Cobweb, but the framework still applies — it just requires more work to find where the objective is hiding.

Answer

What is being maximized? Implicitly: cluster density-reachability (all density-connected points end up together) while minimizing the number of clusters (maximality prevents arbitrary splitting of dense regions). No explicit loss function exists — the objective is embedded in the definitions of density-connectivity and maximal expansion.

How? Greedy deterministic expansion. From each core point, absorb all reachable points in a single pass. No gradient, no score comparison, no iteration. The procedure directly constructs the unique valid solution under the density constraints rather than searching for it.

Key lesson: The optimization framework still applies, but the objective must be reconstructed from the algorithm's behavior rather than read off a formula. This is common in classical algorithms — the objective is baked into the procedure rather than separated out as a swappable component.

Worked Example · The Hard Question

K-Nearest Neighbors

Classification & Regression · Instance-Based Learning

[Figure: a query point "?" with its k=3 nearest neighbors; the training data is stored as-is, no parameters learned]

K-Nearest Neighbors is a classification and regression algorithm. To predict the label of a new example, it finds the k most similar examples in the training set and returns their majority label (classification) or average value (regression). The entire training dataset is stored and used directly at prediction time.

Training:
    Store all N training examples. That's it.
    No parameters are learned. No optimization is performed.

Prediction on new example x:
    1. Compute distance from x to every stored training example
       (usually Euclidean: d(x, x_i) = ||x - x_i||)

    2. Find the k examples with the smallest distance to x

    3. Classification: return majority label among the k neighbors
       Regression:     return average value among the k neighbors

Hyperparameters:
    k:        number of neighbors to consult
    distance: Euclidean, Manhattan, cosine, etc.

KNN stores every training example and makes predictions by looking up the most similar ones. There is no training phase in any meaningful sense. All computation happens at prediction time.
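The entire algorithm fits in a few lines of pure Python, and notice where the work happens: all of it at query time, none at training time. The toy dataset is illustrative.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among the k nearest stored examples.
    train: list of (features, label) pairs; features are lists of floats.
    'Training' is just having the list; everything happens here, at query time."""
    by_distance = sorted(train,
                         key=lambda ex: math.dist(ex[0], x))  # Euclidean distance
    top_k = [label for _, label in by_distance[:k]]
    return Counter(top_k).most_common(1)[0][0]

train = [([0.0, 0.0], "blue"), ([0.1, 0.2], "blue"), ([0.2, 0.0], "blue"),
         ([5.0, 5.0], "red"),  ([5.1, 4.9], "red"),  ([4.9, 5.2], "red")]
```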


1 · Look for an optimization loop

There is none. Read the training procedure again: store all examples. That is the entirety of training. No loss function is evaluated. No parameters are updated. No gradient is computed. No score is compared before and after. The algorithm does not get better at anything during training — because nothing happens during training.

2 · Look for parameters

Every algorithm in this course had parameters — values that were adjusted during training to minimize a loss. Linear regression learned m and b. The neural network learned weights W and biases b. KNN has no such parameters. The distance metric and k are hyperparameters chosen before training, not learned from data. The training examples themselves are stored, not learned — they are the data, not a compressed representation of it.

3 · Apply the framework honestly

The course framework requires: a model (parameterized function), a loss (scalar measure of badness), and an optimizer (mechanism to reduce the loss). KNN satisfies none of these in the standard sense. There is no parameterized function — the prediction rule is fixed. There is no loss being minimized during training. There is no optimizer because there is nothing to optimize. The framework does not fit.

4 · But KNN still generalizes — so what is it doing?

KNN can predict on examples it has never seen. It achieves the practical goal of machine learning — generalization — without the mechanism the course described. It generalizes by proximity: the assumption that nearby examples share labels. This is an inductive bias (a built-in assumption about the world), not a learned one. Every algorithm has inductive bias, but most algorithms adjust their parameters to fit data within that bias. KNN does not adjust anything. The bias is the entire model.

5 · Recognize what KNN actually is

KNN belongs to a category formally called lazy learners. The name is precise: the algorithm does no work during training and defers everything to inference time. The model is the data — not a representation derived from the data, not parameters adjusted to fit the data, but the raw data itself stored verbatim. At prediction time it performs retrieval: find the nearest stored examples, return their labels. There is no learned function. There is no optimization. By the definition this course has used all semester — learning is optimization — KNN technically does not learn anything.

Answer — As Far As the Framework Reaches

What is being minimized? Nothing. KNN has no loss function and no training objective. Nothing is evaluated, compared, or improved during training.

How does it optimize? It doesn't. Training is storage. The optimization framework does not apply because there is no optimization. The model is the data.

The Takeaway

KNN doesn't learn anything. It stores the training data and performs retrieval at inference time. This has a formal name: lazy learning. The word lazy is precise — no work is done during training, all work is deferred to prediction.

Learning, in the sense this course has used all semester, involves building a representation that is optimized for some criterion. A linear model learns weights. A neural network learns features. Both produce something smaller and more structured than the raw data — a representation adjusted through optimization to capture what matters for the task. KNN produces no such representation. The data goes in unchanged and comes back out at query time. There is no optimization, no criterion, no learned structure. That is why it sits outside the framework — not as an edge case, but as a fundamentally different kind of algorithm.