CSC 4220 · Machine Learning
Most algorithms in machine learning can be largely understood by asking two questions. Your job today is to apply that lens to an algorithm we never covered — and find out how far it reaches.
Question 1: What is being minimized or maximized? Is there an explicit loss function, or is the objective implicit in the procedure?
Question 2: How does the algorithm move toward that objective? Note that optimization does not have to be iterative, and it does not have to be global. A single-pass closed-form solution is optimization. A greedy local choice repeated at each step is optimization. Gradient descent is just one optimizer among many.
What is Cobweb?
Cobweb is an incremental concept formation algorithm. It takes unlabeled examples one at a time and organizes them into a hierarchy of categories — a tree where each node represents a concept defined by the probability distributions of feature values in that group. It requires no predefined number of clusters and builds its structure as it sees data.
The Algorithm
Given: a stream of unlabeled examples, each with categorical features
Initialize: empty root node
For each new example x:
Starting at the root, evaluate four operators:
1. Place x into the best existing category
2. Create a new category containing only x
3. Merge the two best categories into one
4. Split the best category into its children
For each operator, compute the Category Utility (CU)
of the resulting tree
Choose the operator that produces the highest CU
Recurse down the chosen category until x is placed
Category Utility (CU):
A score that measures how much knowing an example's category
improves your ability to predict its feature values,
compared to not knowing the category at all.
Higher CU = more useful category structure.
Plain English
Each time a new example arrives, Cobweb tries four possible ways to fit it into the existing tree and picks whichever one produces the most useful category structure. "Useful" means: knowing which category an example belongs to gives you the most additional information about its feature values.
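The greedy CU step above can be sketched in a few lines of Python. This is a deliberately simplified, flat (non-hierarchical) version with only two of the four operators — place into an existing category or create a new one — using one common form of the CU formula for categorical features. The names `category_utility` and `place` are illustrative, not from any library:

```python
from collections import Counter

def category_utility(clusters):
    """One common form of Category Utility for a flat partition.
    clusters: list of categories; each category is a list of examples,
    and each example is a dict mapping feature name -> categorical value."""
    n = sum(len(c) for c in clusters)
    # Baseline predictability: sum of P(f = v)^2 over all feature-value
    # pairs, ignoring category membership entirely
    base = Counter()
    for c in clusters:
        for x in c:
            base.update(x.items())
    base_score = sum((cnt / n) ** 2 for cnt in base.values())
    cu = 0.0
    for c in clusters:
        within = Counter()
        for x in c:
            within.update(x.items())
        within_score = sum((cnt / len(c)) ** 2 for cnt in within.values())
        # How much better you predict feature values knowing the category
        cu += (len(c) / n) * (within_score - base_score)
    return cu / len(clusters)

def place(clusters, x):
    """Greedy step: try adding x to each existing category, or as a new
    singleton category, and keep whichever placement scores highest CU."""
    candidates = []
    for i in range(len(clusters)):
        trial = [list(c) for c in clusters]
        trial[i].append(x)
        candidates.append(trial)
    candidates.append([list(c) for c in clusters] + [[x]])
    return max(candidates, key=category_utility)
```

Given one "dog" category and one "bird" category, a second dog joins the existing dog category rather than founding a new one, because that placement yields the higher CU.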
Worked Answer — How to Apply the Optimization Lens
Find what the algorithm is trying to improve
Look at the pseudocode and find the thing being evaluated at every decision point. Before placing each example, Cobweb computes the Category Utility of four possible outcomes and picks the highest. That computation happens at every single step — which means CU is what the algorithm is optimizing. Start there.
Understand what the objective is measuring
CU measures the predictive usefulness of a category structure — how much knowing which category an example belongs to improves your ability to predict its feature values. Good categories are ones where feature values cluster predictably within them. The exact formula isn't important here; what matters is that CU is a well-defined, computable score, and Cobweb is explicitly maximizing it at every step.
Name the objective
CU is a measure of predictability — how well the category structure explains the feature distributions. Maximizing it means finding categories where features are as predictable as possible given membership. This is sometimes called predictive accuracy of the partition.
Find the optimizer
There is no gradient here. Cobweb never computes a derivative. Instead, at each step it evaluates a small fixed set of candidate moves (the four operators), computes CU for each outcome, and picks the best one. That is greedy local search — also called hill climbing. It makes the locally optimal choice at each step without backtracking or considering the global effect of that choice.
Notice what this shares with the course
Greedy local search is the same family of optimizer as the splitting criterion in Decision Trees — at each node, pick the split with the highest information gain and move on. Neither algorithm reconsiders earlier decisions. Neither uses a gradient. Both embed the objective directly in a greedy selection procedure. This is a pattern you will see repeatedly in classical algorithms.
Answer
What is being maximized? Category Utility — a measure of how much knowing an example's category improves your ability to predict its feature values, compared to not knowing the category at all.
How? Greedy local search. At each step, evaluate four candidate placements and pick the one with the highest CU. No gradient. No backtracking. The objective is implicit in the selection procedure — it is not written as a loss function to be minimized globally, but it is present and precise.
What problem does this solve?
Random Forest is a classification and regression algorithm. Given a set of labeled examples, it learns to predict the label of examples it has never seen.
Background: Decision Trees
A Decision Tree makes predictions by asking a sequence of yes/no questions about the input features. At each step it picks the question that best splits the data into pure groups — groups where most examples share the same label. It keeps splitting until the groups are pure enough or the tree hits a size limit. A single Decision Tree tends to overfit — it learns the noise in training data and performs poorly on unseen data.
The Algorithm
Given: dataset of N examples, each with F features and a label
Choose: number of trees T, features per split m (typically √F)
For t = 1 to T:
1. Draw a bootstrap sample of N examples
(random with replacement — some appear twice, some not at all)
2. Grow a Decision Tree on the bootstrap sample:
At each node:
Select m features randomly from the F available
Find the feature and split point among those m
that best separates the labels
Split the node into two child nodes
Stop when nodes are pure or too small
3. Store the trained tree
Prediction on new example x:
Pass x through all T trees → majority vote (or average for regression)
Plain English
Random Forest builds many Decision Trees, each on a slightly different random sample of the data using a random subset of features at each split. When predicting, every tree votes and the majority wins. The randomness makes the trees' mistakes largely uncorrelated, so aggregating their votes produces more reliable predictions than any single tree.
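The bootstrap-and-vote loop can be sketched with decision stumps (depth-1 trees) standing in for full trees — with one split per tree, the per-split feature subset and a per-tree subset coincide, which keeps the sketch short. Function names and the stump learner are illustrative simplifications, not the full algorithm:

```python
import random
from collections import Counter

def fit_stump(sample, feature_indices):
    """Pick the (feature, threshold, direction) among the allowed features
    that best separates the labels on this sample, by training accuracy."""
    best = None
    for f in feature_indices:
        for t in sorted({x[f] for x, _ in sample}):
            for flip in (1, -1):
                preds = [(flip if x[f] <= t else -flip) for x, _ in sample]
                acc = sum(p == y for p, (_, y) in zip(preds, sample))
                if best is None or acc > best[0]:
                    best = (acc, f, t, flip)
    _, f, t, flip = best
    return lambda x: flip if x[f] <= t else -flip

def random_forest(data, n_trees=25, m=1, seed=0):
    """data: list of (features, label) with label in {-1, +1}."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    trees = []
    for _ in range(n_trees):
        # Bootstrap: sample N examples with replacement
        sample = [rng.choice(data) for _ in data]
        # Random feature subset of size m for this tree's single split
        feats = rng.sample(range(n_features), m)
        trees.append(fit_stump(sample, feats))
    def predict(x):
        votes = Counter(t(x) for t in trees)   # every tree votes
        return votes.most_common(1)[0][0]      # majority wins
    return predict
```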
Your Questions
What is being minimized or maximized?
How does the algorithm move toward that objective?
What problem does this solve?
AdaBoost (Adaptive Boosting) is a classification algorithm. It combines many simple, weak classifiers into a single strong classifier that performs well on examples the weak classifiers struggle with.
Background: Weak Classifiers
A weak classifier performs only slightly better than random guessing — for example, a Decision Tree with only one split (a decision stump). Alone, it is not useful. AdaBoost's insight is that many weak classifiers, combined carefully, can produce a powerful one.
The Algorithm
Given: dataset of N examples with labels y_i ∈ {-1, +1}
Choose: number of rounds T
Initialize: weights w_i = 1/N for all examples
For t = 1 to T:
1. Train weak classifier h_t using current weights w
(examples with higher weight matter more to get right)
2. Compute weighted error:
error_t = sum of w_i for examples where h_t(x_i) ≠ y_i
3. Compute classifier importance:
alpha_t = 0.5 * ln( (1 - error_t) / error_t )
(lower error → higher alpha → louder vote)
4. Update example weights:
if h_t got example i right: w_i = w_i * exp(-alpha_t)
if h_t got example i wrong: w_i = w_i * exp(+alpha_t)
Normalize all w_i so they sum to 1
Final classifier: H(x) = sign( Σ alpha_t * h_t(x) )
Plain English
AdaBoost runs in rounds. Each round it trains a simple classifier, then increases the weight of examples it got wrong, so the next classifier pays more attention to them. Classifiers that perform better earn a louder vote in the final answer. The result is a sequence of classifiers that progressively correct each other's mistakes.
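The pseudocode above translates almost line for line into Python. This sketch uses decision stumps as the weak learners; the small floor on the error is a common guard against division by zero, not part of the handout's pseudocode:

```python
import math

def fit_stump(data, w):
    """Weak learner: a decision stump (one split) minimizing weighted error."""
    best = None
    for f in range(len(data[0][0])):
        for t in sorted({x[f] for x, _ in data}):
            for flip in (1, -1):
                err = sum(wi for (x, y), wi in zip(data, w)
                          if (flip if x[f] <= t else -flip) != y)
                if best is None or err < best[0]:
                    best = (err, f, t, flip)
    err, f, t, flip = best
    return err, (lambda x: flip if x[f] <= t else -flip)

def adaboost(data, T=10):
    """data: list of (features, label) with label in {-1, +1}."""
    n = len(data)
    w = [1.0 / n] * n                              # initialize: uniform weights
    ensemble = []
    for _ in range(T):
        err, h = fit_stump(data, w)                # step 1: train on current w
        err = max(err, 1e-10)                      # guard: avoid log(1/0)
        alpha = 0.5 * math.log((1 - err) / err)    # step 3: importance
        ensemble.append((alpha, h))
        # step 4: exp(-alpha) if right, exp(+alpha) if wrong, then normalize
        w = [wi * math.exp(-alpha * y * h(x)) for (x, y), wi in zip(data, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    # Final classifier: sign of the alpha-weighted vote
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

A useful sanity check: no single stump can carve out a middle interval on a line, but three boosted stumps can — each round's reweighting forces the next stump to fix what the previous ones got wrong.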
Your Questions
What is being minimized or maximized?
How does the algorithm move toward that objective?
What problem does this solve?
A Support Vector Machine (SVM) is a classification algorithm. Given labeled examples, it finds a decision boundary that separates the classes — and specifically tries to find the boundary that is as far as possible from the nearest examples on either side.
Background: Decision Boundaries
A logistic regression model finds a hyperplane that separates two classes. There are infinitely many valid boundaries for a linearly separable dataset. Logistic regression picks one based on maximum likelihood. SVM picks one based on a different criterion: maximum margin — the widest possible gap between the two classes.
The Algorithm
Given: labeled examples (x_i, y_i) where y_i ∈ {-1, +1}
Find: weights w and bias b defining a hyperplane
The margin = 2 / ||w||
Maximizing the margin = minimizing ||w||
Hard margin (perfectly separable data):
Minimize: (1/2) * ||w||²
Subject to: y_i * (w · x_i + b) ≥ 1 for all i
Soft margin (data with noise / overlap):
Introduce slack variables ξ_i ≥ 0 (allow some violations)
Minimize: (1/2) * ||w||² + C * Σ ξ_i
Subject to: y_i * (w · x_i + b) ≥ 1 - ξ_i for all i
C is a hyperparameter:
large C = penalize violations heavily (tight, may overfit)
small C = allow violations (wider margin, more robust)
Support vectors: the examples that sit exactly on the margin (or, in the soft-margin case, inside or beyond it).
They are the ONLY examples that determine the boundary.
All other examples could be removed without changing it.
Prediction:
sign( w · x + b ) → +1 or -1
Plain English
SVM finds the decision boundary with the widest possible gap between the two classes. The examples nearest that boundary — the support vectors — are the only ones that matter. The soft margin version allows a small number of violations, trading some training accuracy for robustness to noise and overlap.
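The handout states the soft-margin problem in constrained form. One common way to solve it is to rewrite it as an equivalent unconstrained objective — (1/2)||w||² + C · Σ max(0, 1 − y(w·x + b)), the hinge-loss form — and run subgradient descent. A sketch under that assumption (learning rate and epoch count are arbitrary choices for the toy scale):

```python
def svm_sgd(data, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via full-batch subgradient descent on
    (1/2)||w||^2 + C * sum of hinge losses. data: (features, label) pairs."""
    d = len(data[0][0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        gw = list(w)          # gradient of the (1/2)||w||^2 regularizer is w
        gb = 0.0
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:    # inside the margin or misclassified: hinge active
                for j in range(d):
                    gw[j] -= C * y * x[j]
                gb -= C * y
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    """sign(w . x + b)"""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

Only the points whose margin is below 1 ever contribute a gradient — the code makes the "support vectors are the only examples that determine the boundary" claim visible.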
Your Questions
What is being minimized or maximized?
How does the algorithm move toward that objective?
What problem does this solve?
K-Means is a clustering algorithm. Given a set of unlabeled examples, it groups them into K clusters such that examples within the same cluster are similar to each other. There are no labels — the algorithm finds structure in the data on its own.
Background: Centroids
A centroid is the center point of a cluster — the average position of all examples assigned to it. K-Means represents each cluster by its centroid and assigns each example to whichever centroid it is closest to.
The Algorithm
Given: dataset of N examples with F features (no labels)
Choose: number of clusters K
1. Initialize: randomly place K centroids in feature space
Repeat until centroids stop moving:
2. Assignment step:
For each example x_i:
Assign x_i to the nearest centroid
nearest_k = argmin over k of ||x_i - centroid_k||²
3. Update step:
For each cluster k:
centroid_k = mean of all examples assigned to cluster k
Return: K centroids and cluster assignment for every example
Prediction on new example x:
nearest_k = argmin over k of ||x - centroid_k||²
Plain English
K-Means alternates between two steps: assign every example to its nearest centroid, then recompute each centroid as the average of its assigned examples. These two steps repeat until assignments stop changing. The result is K clusters where every example is closer to its own cluster center than to any other.
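The two alternating steps can be sketched directly. One simplification relative to the pseudocode: rather than placing centroids at random positions in feature space, this version initializes them at k randomly chosen data points, a common variant that avoids empty clusters at the start:

```python
import random

def kmeans(points, k, seed=0):
    """Lloyd's algorithm. points: list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # init at k distinct data points
    while True:
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = []
        for j in range(k):
            if clusters[j]:
                m = len(clusters[j])
                new_centroids.append(tuple(sum(v) / m for v in zip(*clusters[j])))
            else:
                new_centroids.append(centroids[j])   # empty cluster: keep as-is
        if new_centroids == centroids:               # converged: nothing moved
            return centroids, clusters
        centroids = new_centroids
```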
Your Questions
What is being minimized or maximized?
How does the algorithm move toward that objective?
What problem does this solve?
Naïve Bayes is a classification algorithm. Given labeled examples, it learns to predict the probability that a new example belongs to each class. It is particularly common in text classification — for example, deciding whether an email is spam.
Background: Bayes' Theorem
Bayes' theorem gives a way to compute the probability of a class given observed features: P(class | features) = P(features | class) × P(class) / P(features). The "naïve" assumption is that all features are independent of each other given the class. This is rarely true in practice but makes the math tractable and the algorithm fast.
The Algorithm
Given: dataset of N examples, each with F features and a class label
Training — one pass through the data:
For each class c:
1. Compute the prior:
P(c) = (examples with class c) / N
2. For each feature f:
P(feature_f = value | class = c) =
(examples in class c where feature f = value)
/ (examples in class c)
+ smoothing to avoid zero probabilities
Return: all priors P(c) and likelihoods P(feature_f | c)
Prediction on new example x with features x_1 ... x_F:
For each class c:
score(c) = log P(c) + Σ log P(x_f | c)
(log turns products of small numbers into sums)
Return: class c with the highest score
Plain English
Naïve Bayes learns two things: how common each class is, and how often each feature value appears within each class. At prediction time it scores each class by combining these and picks the most probable one. Taking the log turns a product of probabilities into a sum, which is more numerically stable.
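Training and prediction can be sketched for categorical features with Laplace (add-α) smoothing standing in for the handout's "+ smoothing" line. The helper names and the model tuple are illustrative choices:

```python
import math
from collections import Counter

def train_nb(data, alpha=1.0):
    """One pass over the data. data: list of (features, label) where
    features is a list of categorical values. Returns counts, not probabilities,
    so smoothing can be applied at prediction time."""
    class_counts = Counter(y for _, y in data)
    # value_counts[c][f] counts feature f's values within class c
    value_counts = {c: [Counter() for _ in data[0][0]] for c in class_counts}
    values_seen = [set() for _ in data[0][0]]       # vocabulary per feature
    for x, y in data:
        for f, v in enumerate(x):
            value_counts[y][f][v] += 1
            values_seen[f].add(v)
    return class_counts, value_counts, values_seen, len(data), alpha

def predict_nb(model, x):
    class_counts, value_counts, values_seen, n, alpha = model
    best, best_score = None, None
    for c, nc in class_counts.items():
        score = math.log(nc / n)                    # log prior
        for f, v in enumerate(x):
            # Laplace smoothing: add alpha to every (value, class) count
            num = value_counts[c][f][v] + alpha
            den = nc + alpha * len(values_seen[f])
            score += math.log(num / den)            # log likelihood, summed
        if best_score is None or score > best_score:
            best, best_score = c, score
    return best
```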
Your Questions
What is being minimized or maximized?
How does the algorithm move toward that objective?
What problem does this solve?
A Variational Autoencoder (VAE) is an unsupervised learning algorithm. Like the autoencoder built in this course, it learns a compressed representation of data. But it goes further: it learns a structured latent space that allows new data to be generated by sampling from it.
Background: The Autoencoder You Built
In this course you built an autoencoder that compressed MNIST images into a small latent vector and reconstructed them. The loss was reconstruction error. The limitation: the latent space has no guaranteed structure — points sampled randomly from it may decode into nonsense, because the encoder was never required to organize the space in any particular way.
The Algorithm
Given: dataset of N unlabeled examples x_i
Key difference from standard autoencoder:
Standard AE: encoder maps x → latent vector z
VAE: encoder maps x → mean vector μ and variance σ²
z is then sampled: z ~ N(μ, σ²)
For each example x_i during training:
1. Encode:
Run x_i through encoder → output μ_i and σ²_i
2. Sample (reparameterization trick):
z_i = μ_i + σ_i * ε where ε ~ N(0, 1)
(keeps the sampling step differentiable for backprop)
3. Decode:
Run z_i through decoder → reconstructed x̂_i
4. Compute loss (two terms):
Term 1 — Reconstruction loss:
How different is x̂_i from x_i?
(same MSE as the course autoencoder)
Term 2 — KL divergence:
How different is N(μ_i, σ²_i) from N(0, 1)?
KL = -0.5 * Σ( 1 + log(σ²) - μ² - σ² )
(penalizes encoder for straying from standard normal)
Total loss = Reconstruction loss + KL divergence
5. Backpropagate and update weights with gradient descent
Generation (not possible with standard autoencoder):
Sample z ~ N(0,1) → run through decoder → new example
Plain English
The VAE adds one idea to the autoencoder: instead of compressing each input to a fixed point in latent space, it compresses it to a probability distribution. The KL divergence term pushes all those distributions toward a standard normal, organizing the latent space into a coherent structure you can sample from to generate new data.
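The two-term loss can be sketched for a single example. One assumption beyond the pseudocode: the encoder is taken to output log(σ²) rather than σ² directly, which is common practice for numerical stability. The inputs here are toy stand-ins for real encoder and decoder outputs:

```python
import math

def vae_loss(x, x_hat, mu, log_var):
    """Per-example VAE loss: reconstruction + KL divergence.
    x, x_hat: input and reconstruction; mu, log_var: encoder outputs
    per latent dimension (log_var = log(sigma^2))."""
    # Term 1: reconstruction loss (MSE, as in the course autoencoder)
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    # Term 2: KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims:
    # -0.5 * sum( 1 + log(sigma^2) - mu^2 - sigma^2 )
    kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + kl

def reparameterize(mu, log_var, eps):
    """z = mu + sigma * eps, with eps ~ N(0, 1) drawn outside the network.
    Keeps the sampling step differentiable with respect to mu and log_var."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
```

Note that the KL term is exactly zero when the encoder outputs μ = 0 and σ² = 1 — the penalty vanishes precisely when the latent distribution matches the standard normal it will be sampled from at generation time.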
Your Questions
What is being minimized or maximized?
How does the algorithm move toward that objective?
What is DBSCAN?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together and marks points in low-density regions as noise. Unlike K-Means it requires no predefined number of clusters and can find clusters of arbitrary shape.
The Algorithm
Given: dataset of N points, two hyperparameters: ε (radius), minPts (minimum neighbors)
Label each point as one of:
Core point: has at least minPts neighbors within distance ε
Border point: within ε of a core point but has fewer than minPts neighbors
Noise: not within ε of any core point
For each unvisited core point p:
Create a new cluster C
Add p to C
Collect all points within ε of p into seed set S
For each point q in S:
If q is unvisited:
Mark q as visited
If q is a core point:
Add q's neighbors to S ← expand the frontier
If q is not yet in any cluster:
Add q to C
Return: cluster assignments and noise labels
Key definitions:
Directly density-reachable: q is within ε of core point p
Density-reachable: chain of directly density-reachable steps from p to q
Density-connected: both p and q are density-reachable from some core point
Cluster: maximal set of density-connected points
Plain English
DBSCAN starts at a core point, expands outward along density-reachable paths, and stops when it hits low-density space. Everything reachable in one connected dense region becomes one cluster. Isolated points with no dense neighbors are labeled noise.
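The labeling-and-expansion procedure can be sketched as follows. One convention choice to note: in this sketch a point counts as its own neighbor, so minPts includes the point itself (formulations differ on this):

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.
    points: list of equal-length numeric tuples."""
    n = len(points)
    def neighbors(i):
        # All points within distance eps of point i (including i itself)
        return [j for j in range(n)
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                <= eps ** 2]
    labels = [None] * n
    next_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:            # not core: provisionally noise
            labels[i] = -1
            continue
        labels[i] = next_id                # new cluster starts at this core point
        seeds = list(nbrs)
        while seeds:                       # expand along density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = next_id        # noise reachable from a core -> border
            if labels[j] is None:
                labels[j] = next_id
                j_nbrs = neighbors(j)
                if len(j_nbrs) >= min_pts: # j is core: push the frontier outward
                    seeds.extend(j_nbrs)
        next_id += 1
    return labels
```

Notice there is no score anywhere in the loop — the cluster is simply grown until no density-reachable point remains, which is the "maximal" expansion the worked answer below builds its argument on.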
Worked Answer — How to Apply the Optimization Lens
Look for an explicit loss function
There isn't one. Unlike the six group handouts — and unlike Cobweb — DBSCAN has no formula being evaluated at each step. No score is computed. No quantity is compared before and after each decision. This is the first thing to notice: the objective is not written down anywhere in the algorithm.
Look at what the algorithm produces instead
Focus on the definition of a cluster: a maximal set of density-connected points. "Maximal" is doing important work here — it means the algorithm never stops expanding a cluster while there are more density-reachable points to include. Every cluster is as large as it can possibly be under the density constraints. That is not arbitrary — it reflects a specific preference baked into the procedure.
Reconstruct the implicit objective from the maximality rule
Because every cluster is maximally expanded, DBSCAN implicitly produces the clustering with the fewest possible clusters consistent with the density constraints — it refuses to split a dense region into two clusters when one will do. Simultaneously, the density thresholds (ε and minPts) enforce a minimum intra-cluster density. Together, the algorithm implicitly maximizes density-reachability within clusters while minimizing the total number of clusters. Neither of these objectives is written in the pseudocode — they emerge from the definition of density-connectivity and the maximality expansion rule.
Find the optimizer
The seed set expansion loop is the optimizer. Starting from each core point, it greedily absorbs every reachable point into the cluster in a single deterministic pass. There is no iteration, no score comparison, no backtracking. The algorithm does not try multiple configurations and pick the best — it constructs the only valid solution under the density definition directly. This is closer to a constraint-satisfaction procedure than an optimization loop, but it is pursuing a well-defined implicit objective.
Compare to the other algorithms
Cobweb's objective (CU) was implicit in its selection procedure but still computed explicitly at each step. DBSCAN's objective is implicit in its definitions — it never computes anything like CU. The objective has to be reconstructed by reading what the algorithm's constraints force it to produce. This is a harder case than Cobweb, but the framework still applies — it just requires more work to find where the objective is hiding.
Answer
What is being maximized? Implicitly: cluster density-reachability (all density-connected points end up together) while minimizing the number of clusters (maximality prevents arbitrary splitting of dense regions). No explicit loss function exists — the objective is embedded in the definitions of density-connectivity and maximal expansion.
How? Greedy deterministic expansion. From each core point, absorb all reachable points in a single pass. No gradient, no score comparison, no iteration. The procedure directly constructs the unique valid solution under the density constraints rather than searching for it.
Key lesson: The optimization framework still applies, but the objective must be reconstructed from the algorithm's behavior rather than read off a formula. This is common in classical algorithms — the objective is baked into the procedure rather than separated out as a swappable component.
What is KNN?
K-Nearest Neighbors is a classification and regression algorithm. To predict the label of a new example, it finds the k most similar examples in the training set and returns their majority label (classification) or average value (regression). The entire training dataset is stored and used directly at prediction time.
The Algorithm
Training:
Store all N training examples. That's it.
No parameters are learned. No optimization is performed.
Prediction on new example x:
1. Compute distance from x to every stored training example
(usually Euclidean: d(x, x_i) = ||x - x_i||)
2. Find the k examples with the smallest distance to x
3. Classification: return majority label among the k neighbors
Regression: return average value among the k neighbors
Hyperparameters:
k: number of neighbors to consult
distance: Euclidean, Manhattan, cosine, etc.
Plain English
KNN stores every training example and makes predictions by looking up the most similar ones. There is no training phase in any meaningful sense. All computation happens at prediction time.
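The entire algorithm fits in a few lines — which is itself the point: there is nothing to train, and all the work is distance computation at query time. A sketch of the classification case with Euclidean distance:

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """train: list of (features, label) tuples — the 'model' IS this list.
    All computation happens here, at prediction time."""
    # 1. Distance from x to every stored example (Euclidean)
    ranked = sorted(train, key=lambda ex: math.dist(x, ex[0]))
    # 2. The k nearest neighbors
    top = [label for _, label in ranked[:k]]
    # 3. Majority vote among their labels
    return Counter(top).most_common(1)[0][0]
```

There is no `fit` function to write — storing the list is the whole of training, exactly as the pseudocode says.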
Worked Answer — Applying the Optimization Lens
Look for an optimization loop
There is none. Read the training procedure again: store all examples. That is the entirety of training. No loss function is evaluated. No parameters are updated. No gradient is computed. No score is compared before and after. The algorithm does not get better at anything during training — because nothing happens during training.
Look for parameters
Every algorithm in this course had parameters — values that were adjusted during training to minimize a loss. Linear regression learned m and b. The neural network learned weights W and biases b. KNN has no such parameters. The distance metric and k are hyperparameters chosen before training, not learned from data. The training examples themselves are stored, not learned — they are the data, not a compressed representation of it.
Apply the framework honestly
The course framework requires: a model (parameterized function), a loss (scalar measure of badness), and an optimizer (mechanism to reduce the loss). KNN satisfies none of these in the standard sense. There is no parameterized function — the prediction rule is fixed. There is no loss being minimized during training. There is no optimizer because there is nothing to optimize. The framework does not fit.
But KNN still generalizes — so what is it doing?
KNN can predict on examples it has never seen. It achieves the practical goal of machine learning — generalization — without the mechanism the course described. It generalizes by proximity: the assumption that nearby examples share labels. This is an inductive bias (a built-in assumption about the world), not a learned one. Every algorithm has inductive bias, but most algorithms adjust their parameters to fit data within that bias. KNN does not adjust anything. The bias is the entire model.
Recognize what KNN actually is
KNN belongs to a category formally called lazy learners. The name is precise: the algorithm does no work during training and defers everything to inference time. The model is the data — not a representation derived from the data, not parameters adjusted to fit the data, but the raw data itself stored verbatim. At prediction time it performs retrieval: find the nearest stored examples, return their labels. There is no learned function. There is no optimization. By the definition this course has used all semester — learning is optimization — KNN technically does not learn anything.
Answer — As Far As the Framework Reaches
What is being minimized? Nothing. KNN has no loss function and no training objective. Nothing is evaluated, compared, or improved during training.
How does it optimize? It doesn't. Training is storage. The optimization framework does not apply because there is no optimization. The model is the data.
The Takeaway
KNN doesn't learn anything. It stores the training data and performs retrieval at inference time. This has a formal name: lazy learning. The word lazy is precise — no work is done during training, all work is deferred to prediction.
Learning, in the sense this course has used all semester, involves building a representation that is optimized for some criterion. A linear model learns weights. A neural network learns features. Both produce something smaller and more structured than the raw data — a representation adjusted through optimization to capture what matters for the task. KNN produces no such representation. The data goes in unchanged and comes back out at query time. There is no optimization, no criterion, no learned structure. That is why it sits outside the framework — not as an edge case, but as a fundamentally different kind of algorithm.