Modern machine learning models, such as deep neural networks, have given us the ability to observe how models with immense hypothesis classes (i.e. the set of functions that are learnable by the model) perform on complex, high-dimensional datasets. Based on what we know about the bias-variance tradeoff, if we keep increasing the complexity of our model (keeping the dataset fixed), we will reach a model capacity that causes the model to overfit and perform worse on out-of-sample data. However, today's neural networks seem to contradict this. Even with millions of parameters, they routinely perform better than smaller models on most tasks. This has led to the conventional wisdom in the deep learning community that "larger models are better", and it has led researchers to believe that our current model of the bias-variance tradeoff may be incomplete.
Recent empirical evidence suggests that there exists more than one "training regime" in today's machine learning practice. This second regime is proposed to exhibit itself when our hypothesis class is so large that our model is well past simply interpolating the data (i.e. when our model's empirical loss \(\mathcal{L}_{S}(h) = 0\)). Typically, we would consider a model with \(0\) training loss to be overfitting the data, but this may not be the case. A 2018 paper by Belkin et al. shows that interpolating the training data can still achieve good generalization in nonparametric regression problems.
Fig 1. Our Reproduction of the Interpolating Nadaraya-Watson Estimator for Classification and Regression
Since this paper was published, this idea has been extended to deep learning models, and the results have matched Belkin’s results with smaller-scale nonparametric regression.
Fig 2. The Double Descent Curve
Because of the two U-shaped curves in the plot of test risk vs. model capacity, this phenomenon, in which the model achieves better generalization as the capacity of the hypothesis class \(\mathcal{H}\) increases, is called the double descent phenomenon.
Since double descent is a new phenomenon, there is no central hub where researchers can compare and contrast each other's results. Researchers Ishaan Gulrajani and David Lopez-Paz from Facebook noticed a similar problem in the field of domain generalization. This led them to create DomainBed, a testbed for domain generalization. They implemented seven multi-domain datasets, nine baseline algorithms, and three model selection criteria, and they allow domain generalization researchers to contribute to the testbed.
The goal of this project was to create a similar platform for double descent researchers. DomainBed takes proven algorithms and allows researchers to consistently reproduce results and directly compare algorithms. Given that double descent has been shown to appear in several models across different datasets, a testbed that includes these models and datasets would fix a major issue in this research field. Understanding the double descent phenomenon can potentially lead to more robust and accurate machine learning algorithms at no "extra cost", so it is imperative that there be a standardized way to research it.
Fig 3. Architecture of the Double Descent Testbed
The testbed has been designed in an object-oriented way. This allows users to simply import models from the module, run two or three commands, and have complex experiments running without any boilerplate code. Because the platform is designed for scientists, users are also given access to the source code and all of the included utilities. This allows unique experiments to be conducted on the platform: users can choose whatever level of abstraction or granularity they are comfortable with, without having to write most of the code themselves.
The project consists of four sub-modules: models, data, utils, and plots. The models sub-module contains abstracted versions of models that have exhibited double descent. At this time, models contains a fully connected neural network model and a random forest classifier. The data sub-module contains abstracted versions of datasets that have been used in double descent experiments. The models are written in PyTorch or scikit-learn, so the data sub-module is divided into two parts, TorchData and SKLearnData. This allows for more compatibility when a user wants to add a model from either library. The utils sub-module contains any tools that are needed to train models, process data, etc. The main feature of utils is a parameter count generation algorithm that will be discussed later on in the blog post. Lastly, the plots sub-module contains a class with utilities that are specific to plotting, written using Matplotlib. These utilities can be used with the data that is returned by the models after training.
The data module has two versions of the MNIST dataset: a PyTorch implementation and a scikit-learn implementation. The PyTorch implementation contains two PyTorch dataloaders along with exposed parameters such as batch sizes and number of training samples. The PyTorch dataloaders are iterable objects that contain a dataset object within them. By using a dataloader, we can apply transformations to the data and shuffle the dataset prior to training our model. The scikit-learn implementation of MNIST simply returns an \(N \times 784\) matrix where \(N\) is the number of training samples and each row is a 784-dimensional vector that encodes a 28 x 28 image. Both of these implementations download and save the dataset to a local repository where it can be reused without having to wait for the data to be downloaded a second time.
Fig 4. A Sample of Images from the MNIST Dataset
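The flattening described above, from 28 x 28 images to an \(N \times 784\) matrix, can be sketched with synthetic data. This is an illustrative snippet, not the testbed's actual implementation:

```python
import numpy as np

# Hypothetical stand-in for a batch of MNIST images: N images of 28 x 28 pixels
N = 64
images = np.random.randint(0, 256, size=(N, 28, 28))

# Flatten each image into a 784-dimensional row vector, giving the
# N x 784 matrix shape used by the scikit-learn implementation
X = images.reshape(N, -1)
print(X.shape)
```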
The multilayer perceptron is made up of three layers: an input layer, a hidden layer, and an output layer (this type of model is sometimes referred to as a two-layer neural network). All three layers have variable sizes depending on the dataset being trained on. The input layer has \(d = n \cdot m\) units, where \(n\) and \(m\) are the dimensions of the input data, as the images are flattened before being passed through the network. The hidden layer has \(h_{i}\) units, where \(h_{i}\) is computed from a desired total number of parameters using the formula for total parameters in Figure 5. The output layer has \(K\) units, where \(K\) is the number of output classes. This model also has ReLU activation functions on both the input layer and hidden layer.
Fig 5. Equation to Calculate Number of Parameters in a Two-Layer Neural Network
Fig 6. Visualization of our Two-Layer Neural Network
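Assuming the standard parameter count for a fully connected two-layer network with bias terms, \((d+1)h + (h+1)K\), a sketch of the computation that derives the hidden-layer width from a target parameter count might look like the following (the names `count_params` and `hidden_units` are illustrative, not the testbed's API):

```python
def count_params(d, h, K):
    """Parameter count of a two-layer network with biases:
    input-to-hidden weights and biases, plus hidden-to-output weights and biases."""
    return (d + 1) * h + (h + 1) * K

def hidden_units(target_params, d, K):
    """Invert the count: total = h * (d + K + 1) + K, solved for h."""
    return max(1, round((target_params - K) / (d + K + 1)))

# MNIST-sized example: d = 28 * 28 = 784 inputs, K = 10 classes
h = hidden_units(10000, d=784, K=10)
print(h, count_params(784, h, 10))
```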
One main feature of the multilayer perceptron wrapper is its built-in TensorBoard functionality. TensorBoard is a visualization dashboard for machine learning experiments. It runs in a web server and reads from a log directory that is produced by the neural network training loop in the double descent testbed. Throughout the training process, we log all training and testing losses for each individual model, as well as the final losses of each model. On this dashboard, we also expose the architecture of the current model that is being experimented on (i.e. its computational graph) and a sample of the dataset that is being used to train the model.
Fig 7. A Screenshot of Tensorboard (Test Loss vs. Model Capacity in # parameters)
The random forest classifier wrapper follows suit by exposing certain features and parameters, such as the maximum number of leaf nodes, the number of trees, and the criterion, to the user when they create an instance of it. Unlike the multilayer perceptron wrapper, which is implemented in PyTorch, the random forest wrapper is implemented in scikit-learn. This means that there is no GPU support, and we cannot take advantage of built-in PyTorch dataloaders. This is not an issue that needs immediate attention, though, as these classifiers only took about half a minute to train in our experiments. Unifying all models under the PyTorch umbrella will be a main area of future work on this testbed.
When designing the neural network wrapper, we noticed that each of the models took a while to train. Even though the individual models are quite small, training several of them sequentially for a large number of epochs meant single experiments could take several days. To reduce the time it takes to run experiments, we developed a parameter count generation algorithm that intelligently chooses the next model to train using the number of parameters in the previous model and the final test loss that model produced. This allowed us to avoid iterating through model capacities (i.e., numbers of parameters in the model) at a fixed step size. The algorithm was designed to have the highest resolution around the area where models interpolate, or overfit, the data, as this is where the double descent curve is supposed to exhibit itself. We do this by assuming that the double descent curve should look roughly like a \(3^{rd}\) degree polynomial, fitting a \(3^{rd}\) degree polynomial to the model capacity vs. test loss data, and examining the first derivative of the fitted polynomial. Algorithm 1 shows the pseudocode for this parameter count generation algorithm.
The algorithm takes a list of previous parameter counts, a list of previous test losses, a flag indicating whether the interpolation threshold has been reached, and a tuning parameter \(\alpha\) as input:

    def Parameter_Count_Generation(param_counts, test_losses, past_interpolation_threshold, alpha):
        current_count = param_counts[-1]  # Take the last element of param_counts
        # Weight more recent parameter counts more heavily
        weight_vector = [1/n, ..., 1/2, 1]  # where n = len(param_counts)
        poly = fit_polynomial(param_counts, test_losses, weight_vector)  # 3rd-degree fit
        # Examine the first derivative of the polynomial just past the current count
        dy = derivative(poly)(current_count + eps)
        if dy < 0:
            sgn = 1
        else:
            sgn = 0
            past_interpolation_threshold = True
        next_step = sgn * max(alpha * dy, 3) + 1
        if sgn and past_interpolation_threshold:
            # Descending again after the threshold: take larger steps
            return ceil(next_step) + current_count + 10, past_interpolation_threshold
        else:
            return ceil(next_step) + current_count, past_interpolation_threshold
Fig 8. Our Parameter Counts Generation Algorithm Running on a Synthetic Double Descent Curve
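The weighted polynomial fit at the heart of the algorithm can be sketched with NumPy: `np.polyfit` accepts per-point weights, and `np.polyder` gives the derivative. This is a sketch of the idea with synthetic data, not the testbed's exact code:

```python
import numpy as np

# Synthetic (capacity, test-loss) history; in the testbed these come from training runs
param_counts = np.array([10., 20., 30., 40., 50., 60.])
test_losses = 0.001 * (param_counts - 35) ** 2 + 0.5  # stand-in U-shaped curve

# Weight more recent models more heavily: 1/n, ..., 1/2, 1
n = len(param_counts)
weights = np.arange(1, n + 1) / n

# Fit a 3rd-degree polynomial and examine its first derivative past the last capacity
coeffs = np.polyfit(param_counts, test_losses, deg=3, w=weights)
dy = np.polyval(np.polyder(coeffs), param_counts[-1] + 1e-3)

# The stand-in curve is rising at capacity 60, so the fitted derivative is positive,
# signaling that the fit's local minimum has been passed
print(dy)
```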
Each of the models in the testbed has an associated double_descent method that performs the same training procedure for a model over several different capacities. In general, the double_descent method loops over a chosen list of parameter counts for a model (or generates them in an online fashion using the parameter count generation algorithm) and trains the model to whatever completion criterion the user has chosen. At the end of each training procedure, final train losses, final test losses, and parameter counts for models of different capacities are aggregated and output to a dictionary of arrays that can be used for visualization or analysis. In the case of the MultilayerPerceptron class, this method contains TensorBoard functionality. To speed up convergence of the multilayer perceptron model, we can toggle a reuse_weights flag to use the weights from the previous model as the initialization for the next model.
If there is a pre-specified list of parameter counts, the double descent training loop is the following:
    for i in range(len(parameter_counts)):
        current_index = i
        current_parameter_count = parameter_counts[current_index]
        # Create a new model with the specified number of parameters
        reinitialize_model(current_parameter_count)
        losses = train_model()
        # Log losses to TensorBoard or arrays
If the user instead wants to generate the parameter counts, then after a single iteration of the training loop above, the double descent training loop is the following:

    if generate_parameters_enabled:
        steps_past_dd = 0
        past_interpolation_threshold = False
        while steps_past_dd < 4:
            next_count, past_interpolation_threshold = Parameter_Count_Generation(
                parameter_counts, test_losses, past_interpolation_threshold, alpha)
            parameter_counts.append(next_count)
            current_parameter_count = parameter_counts[current_index]
            reinitialize_model(current_parameter_count)
            losses = train_model()
            # Log losses to TensorBoard or arrays
            current_index += 1
            if past_interpolation_threshold:
                steps_past_dd += 1
The closed-form expression for the Fibonacci numbers is Binet's formula,

\[F_{n} = \frac{\varphi^{n} - \psi^{n}}{\sqrt{5}}, \qquad \varphi = \frac{1 + \sqrt{5}}{2}, \quad \psi = \frac{1 - \sqrt{5}}{2},\]

where \(F_{n}\) is the \(n^{th}\) Fibonacci number (the recursive definition for the Fibonacci numbers is \(F_{n} = F_{n-1} + F_{n-2}\) with \(F_{1} = 1\) and \(F_{2} = 1\); we can also include \(F_{0} = 0\)).
Typically, Binet’s formula over \(\mathbb{N}\) gives us \(F_{1} = 1, F_{2} = 1, F_{3} = 2\), and so on, but what happens when we use Binet’s formula to find the “\(0.5^{th}\) Fibonacci number” or the “\(\pi^{th}\) Fibonacci number” (if they even exist)? Well, if we try to find \(F_{\pi}\), what we end up with is roughly \(2.11702 + 0.04244i\). We end up with complex numbers because Binet’s formula raises a negative number to a non-integer power, which produces a complex result. So, let’s take a look at the outputs of Binet’s formula over some continuous, real domain (e.g. \(\left[0, 5\right]\)).
Notice that the only places where Binet’s formula has real outputs on this interval are at the natural numbers, where the outputs are the typical Fibonacci numbers. What about the “negative Fibonacci numbers”? Let’s see what the outputs of Binet’s formula look like on the interval \(\left[-5, 0\right]\).
We end up with \(F_{-1} = 1, F_{-2} = -1, F_{-3} = 2, F_{-4} = -3\) … This large spiral that’s travelling around the complex plane actually intersects the real line at the usual Fibonacci numbers with alternating signs! There is actually a generalization of the typical recurrence relation that allows us to have negative values for \(n\):
\[F_{-n} = \left(-1\right)^{n+1}F_{n}\]

Extending discrete mathematical structures, such as the Fibonacci sequence, to have continuous properties often leads to interesting results. In this example, we saw how Binet’s formula allows us to find complex and negative “Fibonacci numbers”. The field of math that seeks to solve discrete problems about integers using tools from analysis is known as analytic number theory, and it has provided number theorists with other interesting results, such as bounds for the prime counting function and solutions to Diophantine equations.
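The complex values above come directly from evaluating Binet's formula with a complex logarithm. A quick sketch, using the principal branch of \(\log\psi\):

```python
import cmath
import math

SQRT5 = math.sqrt(5)
PHI = (1 + SQRT5) / 2  # golden ratio
PSI = (1 - SQRT5) / 2  # its negative conjugate

def binet(n):
    """Binet's formula for real n, via the principal branch of the complex log
    (PSI is negative, so PSI**n must be computed over the complex numbers)."""
    return (cmath.exp(n * cmath.log(PHI)) - cmath.exp(n * cmath.log(PSI))) / SQRT5

print(binet(10))       # an ordinary Fibonacci number, ~55
print(binet(math.pi))  # the complex "pi-th Fibonacci number"
print(binet(-4))       # ~-3, consistent with F_{-n} = (-1)^{n+1} F_n
```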
I am currently taking an introductory class in abstract algebra, and we have been learning about different types of groups. One of these groups is called the symmetric group. The symmetric group defined over any set \(\Omega\) is denoted \(S_{\Omega}\). This group is comprised of all of the bijections \(\sigma : \Omega \rightarrow \Omega\) of the set onto itself, and its group operation is defined as the composition of these bijections. Since we will be looking at finite symmetric groups, we can denote the symmetric group over a finite set of \(n\) symbols as \(S_{n}\).
An example of an element in \(S_{10}\) could be the permutation (map) \(\sigma\) that rearranges [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] to be [3, 4, 6, 8, 10, 7, 9, 2, 1, 5].
(i.e. \(\sigma([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) = [3, 4, 6, 8, 10, 7, 9, 2, 1, 5]\))
Now that we know how the elements of \(S_{n}\) act on the underlying set, what are their cyclic decompositions and orders? The order of an element in a group is the smallest positive integer \(m\) such that \(\sigma^{m} = \sigma \circ \sigma \circ \cdots \circ \sigma = \textbf{id}\), where id is that group’s identity element. The cyclic decomposition of one of these group elements refers to the “cycles” formed when repeatedly applying the same permutation to the underlying set. In other words, the cyclic decomposition describes the “path” that each individual set element takes under repeated permutation to get mapped back to itself.
Viewing the permutation as a mapping of individual elements instead of a rearrangement of the entire set can aid in understanding how cyclic decompositions and repeated permutations work. Using the \(\sigma\) that we defined above, we can write out the following mapping:
\[1 \mapsto 3\] \[2 \mapsto 4\] \[3 \mapsto 6\] \[4 \mapsto 8\] \[5 \mapsto 10\] \[6 \mapsto 7\] \[7 \mapsto 9\] \[8 \mapsto 2\] \[9 \mapsto 1\] \[10 \mapsto 5\]If we repeat this individual mapping repeatedly, we will eventually encounter elements that map back to their original positions. While sitting in class, it became apparent that this process could be automated using a directed graph and a depth-first search algorithm. The nodes of the graph would represent the set elements and the edges would represent their mapping under \(\sigma\). When the graph is drawn, the cyclic decompositions become obvious. The directed graph representing our \(\sigma\) on the set of 10 symbols is the following:
Now that we can see the cycles in the form of a directed graph, let’s take a look at the code that would allow us to generalize the process of finding the cyclic decomposition and order of any permutation.
from math import gcd
Firstly, we can use a dictionary to stand in as our \(\sigma\) as dictionaries have keys that map to values. Since our \(\sigma\) is bijective, a dictionary with unique (key, value) pairs is precisely what we need.
sigma = {1: 3, 2: 4, 3: 6, 4: 8, 5: 10, 6: 7, 7: 9, 8: 2, 9: 1, 10: 5}
dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
dict_values([3, 4, 6, 8, 10, 7, 9, 2, 1, 5])
Now, we use the following algorithm to find the cyclic decomposition of \(\sigma\):
1. Instantiate an array, cycles, to store cycles and a set, already_seen, to store elements that have been encountered.
2. Iterate over the elements of the underlying set. If the current element is not in already_seen, use DFS to repeatedly apply \(\sigma\) until the element repeats, then append the resulting cycle to cycles and update already_seen to include its elements.
3. Return cycles.
In Python code, this algorithm would be
def find_cyles(sigma):
"""Find cycles of a map using a depth first search"""
cycles = []
already_seen = set()
for element in sigma.keys():
if element not in already_seen:
cycles.append(dfs(sigma, element, set(), []))
already_seen.update(cycles[-1])
return cycles
def dfs(sigma, element, memo, cycle):
    """DFS helper: follow sigma from element until the cycle closes"""
    if element in memo:
        return cycle
    memo.add(element)
    cycle.append(element)  # record elements in the order they are visited
    return dfs(sigma, sigma[element], memo, cycle)
DFS (depth-first search) is a graph traversal algorithm that starts at a “root” node and explores as far down each of the root’s branches as possible before backtracking and moving on to the next branch. Below is a gif that shows how DFS traverses a graph (Source).
To find the order, we can use a theorem which states that the order \(m\) of a permutation is the least common multiple of the lengths of its disjoint cycles. By using this theorem, we can reuse the result from our DFS and avoid the brute-force solution of composing \(\sigma\) with itself until we map back to the original ordering of elements.
def find_order(cycle_list):
"""Compute LCM of Cycle Lengths"""
cycle_lengths = [len(x) for x in cycle_list]
lcm = cycle_lengths[0]
for length in cycle_lengths[1:]:
lcm = lcm*length//gcd(lcm, length)
return lcm
Now that we have these Python functions, we can use them in conjunction to find the order and cyclic decomposition of any map that belongs to a finite symmetric group! (Note: the notation for the cyclic decomposition (1 3 6 7 9) (2 4 8) (5 10) refers to the disjoint cycles that are produced. 1 maps to 3, 3 maps to 6, and so on.)
cycles = find_cyles(sigma)
cycles_string = ''
for cycle in cycles:
cycles_string += str(tuple(cycle)) + ' '
print('The Cyclic Decomposition of Sigma is {}'.format(cycles_string))
print('\nSigma has order {}'.format(find_order(cycles)))
The Cyclic Decomposition of Sigma is (1, 3, 6, 7, 9) (2, 4, 8) (5, 10)
Sigma has order 30
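We can sanity-check this result by brute force, using exactly the approach the LCM theorem lets us avoid: composing \(\sigma\) with itself until every element returns home should take 30 applications.

```python
sigma = {1: 3, 2: 4, 3: 6, 4: 8, 5: 10, 6: 7, 7: 9, 8: 2, 9: 1, 10: 5}

def order_by_composition(sigma):
    """Compose sigma with itself until it becomes the identity map."""
    current = dict(sigma)  # current = sigma^1
    m = 1
    while any(k != v for k, v in current.items()):
        current = {k: sigma[v] for k, v in current.items()}  # sigma^(m+1)
        m += 1
    return m

print(order_by_composition(sigma))
```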
Graphs are an extremely versatile tool that allows us to represent both abstract mathematical objects and physical networks as data in memory. By using DFS, we can traverse these graphs that represent the permutations of a set to learn more about their underlying structures… and we can automate some problems from our abstract algebra homework.
During one of my biweekly research meetings, my group reviewed Does data interpolation contradict statistical optimality? by Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov.
The aim of this paper was to show that interpolating training data can still lead to optimal results in nonparametric regression and prediction with square loss. Since the double descent phenomenon exhibits itself when the model capacity surpasses the “interpolation threshold”, I thought that reproducing the results from this paper would help me understand how a model interpolates data.
import numpy as np
import os
import numpy.linalg as lin
import matplotlib.pyplot as plt
import scipy.stats as stats
This paper takes a look at interpolation using the Nadaraya-Watson Estimator.
Let \((X, Y)\) be a random pair on \(\mathbb{R}^{d} \times \mathbb{R}\) with distribution \(P_{XY}\), and let \(f(x) = \mathbb{E}[Y \vert X = x]\) be the regression function.
Given a sample \((X_{1}, Y_{1}),\ldots,(X_{n}, Y_{n})\) drawn independently from \(P_{XY}\), we can approximate \(f(x)\) using the Nadaraya-Watson estimator

\[\hat{f}_{n}(x) = \frac{\sum_{i=1}^{n} Y_{i} K\left(\frac{x - X_{i}}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x - X_{i}}{h}\right)},\]

where \(K: \mathbb{R}^{d} \rightarrow \mathbb{R}\) is a kernel function and \(h > 0\) is a bandwidth.
def nadaraya_watson_estimator(x, X, Y, h, K=stats.norm.pdf):
    """Evaluate the Nadaraya-Watson estimate at each point in x"""
    cols = []
    for i in range(len(X)):
        cols.append(np.array(K((x - X[i])/h)))
    Kx = np.column_stack(tuple(cols))  # kernel weights, shape (len(x), n)
    row_sums = np.sum(Kx, axis=1)
    W = Kx / row_sums[:, None]  # normalize each row to sum to 1
    result = np.matmul(W, Y)
    result.shape = (result.shape[0], 1)
    return result
Note: since we are dealing with singular kernels that approach infinity as their argument tends to zero, we have to use a modified version that handles the case where an evaluation point coincides with a training point:
def singular_nadaraya_watson_estimator(x, X, Y, h, K=stats.norm.pdf, a=1):
    """Nadaraya-Watson estimate for singular kernels"""
    cols = []
    for i in range(len(X)):
        if np.any(x - X[i] == 0):
            # An evaluation point coincides with X[i]; avoid dividing by zero
            cols.append(np.array([Y[i]] * len(x)))
        else:
            cols.append(np.array(K((x - X[i])/h, a)))
    Kx = np.column_stack(tuple(cols))  # kernel weights, shape (len(x), n)
    row_sums = np.sum(Kx, axis=1)
    W = Kx / row_sums[:, None]  # normalize each row to sum to 1
    result = np.matmul(W, Y)
    result.shape = (result.shape[0], 1)
    return result
The two singular kernels we will be focusing on are:
def sing_kernel_1(x,a):
return 1/(abs(x))**a
and
def sing_kernel_2(x, a):
return (1/(abs(x))**a)*(1-abs(x))**2
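To see the singular behavior numerically, we can evaluate both kernels near zero (they are redefined here so the snippet is self-contained):

```python
def sing_kernel_1(x, a):
    # Mirrors the first singular kernel above
    return 1 / abs(x) ** a

def sing_kernel_2(x, a):
    # Mirrors the second singular kernel above
    return (1 / abs(x) ** a) * (1 - abs(x)) ** 2

# Both kernels blow up as the argument tends to zero, which is what
# forces the estimator to interpolate the training points exactly
for x in [0.1, 0.01, 0.001]:
    print(x, sing_kernel_1(x, a=1), sing_kernel_2(x, a=1))
```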
The data generating functions we will look at are
def actual_distribution_1(x):
    return 5*np.sin(x)
and
def binary_distribution(x):
    outs = np.array([])
    for element in x:
        if abs(element) > .4:
            outs = np.append(outs, [1])
        else:
            outs = np.append(outs, [0])
    return outs
A few implementation notes:

- I use abs() instead of linalg.norm() since I pass in the elements as arrays, but they are really one-dimensional real numbers.
- Part of sing_kernel_1 gave me errors when I tried implementing it, so I removed it. The kernel’s singularity as the argument tends to zero is still present.

Results with sing_kernel_1: the estimator fits the curve fairly well for values of a > .8. For some reason, the bandwidth h doesn’t do anything to this specific example. Because the best value of h in the paper was .4, I chose to keep h constant. There are no animations for different parameters because the only tunable parameter is the bandwidth h, which was held constant at .4.
#np.random.seed(100)
n = 8
#epsilon = np.random.normal(loc = 0, scale = 2, size = n)
X = np.linspace(0, 2*np.pi, n)
#np.random.normal(loc = 0, scale = 3, size=n)
Y = actual_distribution_1(X)
x_axis = np.linspace(-1, 7, 100000)
# Singular Kernel
h=.4
a=1.5
plt.scatter(X, Y, color='k')
plt.plot(x_axis, actual_distribution_1(x_axis), 'b-')
plt.plot(x_axis, singular_nadaraya_watson_estimator(x_axis, X, Y, h, K=sing_kernel_1,a=a), 'r-')
plt.xlim(min(X) - .5, max(X) + .5)
plt.ylim(min(Y) - .5, max(Y) + .5)
plt.legend(['True Regression', 'Estimator'])
plt.show()
# Non-singular Kernel
h=n**(-1/(2*0 + len(x_axis)))
plt.scatter(X, Y, color='k')
plt.plot(x_axis, actual_distribution_1(x_axis), 'b-')
plt.plot(x_axis, nadaraya_watson_estimator(x_axis, X, Y, h), 'r-')
plt.xlim(min(X) - .5, max(X) + .5)
plt.ylim(min(Y) - .5, max(Y) + .5)
plt.legend(['True Regression', 'Estimator'])
plt.show()
Results with sing_kernel_2: in this animation, I sweep through several values of h while holding a constant, then hold h constant and sweep through several values of a. The only tunable parameter here is h.
#np.random.seed(100)
n = 8
#epsilon = np.random.normal(loc = 0, scale = 2, size = n)
X = np.random.choice(np.linspace(-1, 1, 20), n)
Y = binary_distribution(X)
x_axis = np.linspace(-1, 1, 500)
h=n**(-1/(2*0 + len(x_axis)))
a=1
plt.scatter(X, Y, color='k')
plt.plot(x_axis, np.array([.5] * len(x_axis)), 'b--')
plt.plot(x_axis, singular_nadaraya_watson_estimator(x_axis, X, Y, h, K=sing_kernel_2, a=a), 'r-')
plt.xlim(min(X) -.1, max(X) + .1)
plt.ylim(min(Y) - .1, max(Y) + .1)
plt.title('a = {:.2f}; h = {:.2f}'.format(a, h), fontsize=12)
plt.legend(['Boundary', 'Estimator'])
plt.show()
h =.4
plt.scatter(X, Y, color='k')
plt.plot(x_axis, np.array([.5] * len(x_axis)), 'b--')
plt.plot(x_axis, nadaraya_watson_estimator(x_axis, X, Y, h), 'r-')
plt.xlim(min(X) - .1, max(X) + .1)
plt.ylim(min(Y) - .1, max(Y) + .1)
plt.title('h = {:.2f}'.format(h))
plt.legend(['Boundary', 'Estimator'])
plt.show()
After reproducing these results, I emailed Mikhail Belkin to get his thoughts on the connection between this paper and a previous paper he wrote on the double descent phenomenon. His reply was very similar to what my group thought the connection may be: interpolation is consistent with the current practice of deep learning. The most important thing to realize is that there is still not a proven, complete connection between modern machine learning methods and the model shown in this post.