Hello everyone. My name is Ilker, and welcome to my thesis presentation, which is titled Token Interchangeability and Alpha-Equivalence: Enhancing the Generalization Capacity of Language Models for Formal Logic.
I would like to start by emphasizing the difference between machine learning and numerical optimization to set the stage.
Consider a curve fitting problem with these training samples.
I could fit an extremely complex polynomial to these points, yielding a curve like this. This would be perfectly fine if this weren't a machine learning problem, because it fits the given points perfectly.
However, this learned curve does not reflect the underlying distribution of the given points, which is much simpler.
When we evaluate the model outside the training distribution,
the model's prediction will be completely incorrect.
This shows that the intention to generalize to out-of-distribution samples is what separates machine learning from pure numerical optimization.
However, we don't have to restrict ourselves to samples that have the same structure.
In other words, generalization is not limited to input content, we can also generalize to different input structures.
For example, we could expect an image model to generalize to larger images after training.
Or we may want a graph neural network to generalize to different graph structures than those seen during training.
It's important to note here that, unlike other forms of generalization, structure generalization requires architectural support: if the network is not designed with structure generalization in mind, it won't even be able to accept such inputs, let alone generalize to them. For example, an image model designed for a constant image size cannot even take larger images as input. Now, we will take a look at a machine learning problem and identify a new form of structure generalization.
This problem involves formal logic.
First, we have the constant True, represented by capital T.
We also have atomic propositions, denoted by p here, which can either be true or false.
Then we have two essential operators, with the first one being the negation
and the second one being conjunction, also known as the AND operator, which is true if and only if both operands are true.
With these two operators, we effectively have all logical operators, because the rest can be derived from them. But we won't go into further details here.
The problem we are trying to solve here is Assignment Prediction. Given a logic formula, we want to assign values to propositions such that the formula evaluates to true.
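To make assignment prediction concrete, here is a minimal sketch in Python; the formula encoding, the helper name, and the example formula are mine for illustration, not taken from the thesis.

```python
# Minimal sketch of assignment prediction for propositional logic.
# The formula (p AND NOT q) is satisfied by the assignment {p: True, q: False}.

def evaluate(formula, assignment):
    """Evaluate a formula given as a nested tuple, e.g. ('and', 'p', ('not', 'q'))."""
    if formula == 'T':            # the constant True
        return True
    if isinstance(formula, str):  # an atomic proposition
        return assignment[formula]
    op, *args = formula
    if op == 'not':
        return not evaluate(args[0], assignment)
    if op == 'and':
        return evaluate(args[0], assignment) and evaluate(args[1], assignment)
    raise ValueError(f"unknown operator: {op}")

formula = ('and', 'p', ('not', 'q'))
prediction = {'p': True, 'q': False}          # what the model should output
assert evaluate(formula, prediction) is True  # the formula evaluates to true
```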
There's also an extension of propositional logic called linear temporal logic.
Instead of reasoning about constant propositions, we consider propositions that change over time. By default, propositions refer to the current time step.
To refer to the future time steps, we introduce a new operator called next, which asserts that the given statement holds in the next time step.
For more complex changes, we introduce the until operator, which denotes that the left operand, phi 1, holds until the second operand, phi 2, holds.
Our goal in LTL is still the same as assignment prediction, but since propositions change over time, we call these assignments traces.
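As a rough illustration of how the next and until operators read off a trace, here is a small sketch; it represents a trace as a finite list of assignments purely for illustration, and the function names are mine.

```python
# Sketch: a trace is one assignment per time step; "next" and "until" are
# checked by looking ahead in the trace. (Finite-trace simplification,
# used here only for illustration.)

def holds_next(prop, trace, t=0):
    """X prop: prop holds at the next time step."""
    return trace[t + 1][prop]

def holds_until(prop1, prop2, trace, t=0):
    """prop1 U prop2: prop1 holds until prop2 holds at some step >= t."""
    for i in range(t, len(trace)):
        if trace[i][prop2]:
            return all(trace[j][prop1] for j in range(t, i))
    return False

trace = [{'p': True, 'q': False},
         {'p': True, 'q': False},
         {'p': False, 'q': True}]
print(holds_next('p', trace))        # True: p holds at step 1
print(holds_until('p', 'q', trace))  # True: p holds until q becomes true
```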
Previously, deep learning methods were utilized on these problems to address the scalability concerns. The pioneering work was called DeepLTL.
Interestingly, this method can generalize to longer formulas not seen during training, thanks to its tree-based positional encoding method.
Here, we see that the DeepLTL model was trained on formulas with a maximum length of 35, and produces mostly correct predictions, which are denoted by green. But it can also handle longer formulas, up to 50 tokens. This is an example of structure generalization. But there's another type of structure generalization in this domain that has been neglected in the literature.
Consider a formal expression like this.
Here, we can rename the bound variables without changing the meaning of the expression. For example, instead of a and b,
we could use x and y.
This concept is called alpha-equivalence.
Regardless of how we name the variables, the expression has the same meaning.
Alpha-equivalence is not specific to this domain. It emerges in many formal languages, including programming languages such as Python.
For example, consider a Python function like this, taking a and b as parameters.
Here is the same function with different variable names, x and y. But it's still functionally equivalent to the previous one. This demonstrates the concept of alpha-equivalence.
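The exact function shown on the slide is not reproduced in this transcript, so here is a representative pair of alpha-equivalent Python functions.

```python
# Two alpha-equivalent Python functions: only the bound parameter names differ.
def f(a, b):
    return a + 2 * b

def g(x, y):
    return x + 2 * y

# They compute the same thing for every input.
assert f(3, 4) == g(3, 4)
```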
Based on this concept, there are several desirable properties for formal language models.
First, we would like to enforce alpha-equivalence so that changing variable names does not change how the model processes the input.
Secondly, we want to enable generalization across tokens. Since all variables share the same underlying concept,
a model trained on 3 variables should generalize to more variables without additional training.
Before we continue to the proposed method, let's see the prior work on extensible vocabulary.
Firstly, there's a method that uses dictionary definitions to create extensible word embeddings. However, because it relies on external dictionary definitions to build new embeddings, it is not related to the alpha-equivalence concept we have seen earlier, nor to the structure generalization capability we discussed at the start.
The second method proposes a component based sign language recognition framework. Similarly to the first work, this method also depends on external information to extend the vocabulary. It doesn't explore structure generalization in an alpha-equivalence context.
To the best of my knowledge, there is no prior machine learning work that has considered alpha-equivalence and the generalization problem it poses.
In the next section, we will see how the proposed method tackles this problem.
Here's an embedding matrix that converts tokens to vectors in a language model.
Rows represent tokens, and the columns are the dimensions of the embedding vectors.
The last 3 tokens are subject to alpha-equivalence. We call these interchangeable tokens.
Here's how we modify the embedding matrix in our method.
Interchangeable token embeddings have two parts. The first one contains learnable parameters that are shared across interchangeable tokens.
The second part, on the other hand, is not learnable. It features randomly generated vectors which make the embeddings distinguishable from each other.
During training, this second part will be randomized for each forward pass, forcing the model to distinguish variables without learnable parameters.
This embedding strategy enables generalization across interchangeable tokens.
Suppose that two new tokens are added during test time.
We extend the first part since all learnable parameters are shared in that region.
And since the second part is not learnable, we can simply generate it randomly as before. Now, the embedding matrix is extended without any additional training.
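Here is a minimal PyTorch sketch of this embedding construction as I have described it; the class and argument names are mine, and normalization as well as the ordinary-token rows are omitted for brevity.

```python
import torch
import torch.nn as nn

class InterchangeableEmbedding(nn.Module):
    """Sketch of the proposed embedding: a learnable part shared by all
    interchangeable tokens plus a non-learnable, randomly generated part.
    (Simplified; normalization and ordinary-token handling are omitted.)"""

    def __init__(self, num_ordinary, num_interchangeable, dim, random_dim):
        super().__init__()
        self.ordinary = nn.Embedding(num_ordinary, dim)             # normal tokens
        self.shared = nn.Parameter(torch.randn(dim - random_dim))   # shared learnable part
        self.num_interchangeable = num_interchangeable
        self.random_dim = random_dim

    def interchangeable_rows(self, num_tokens=None):
        n = num_tokens or self.num_interchangeable
        # Non-learnable part: re-sampled on every call (every forward pass
        # during training), so the model cannot memorize specific variables.
        random_part = torch.randn(n, self.random_dim)
        shared_part = self.shared.expand(n, -1)
        return torch.cat([shared_part, random_part], dim=-1)

emb = InterchangeableEmbedding(num_ordinary=10, num_interchangeable=3,
                               dim=8, random_dim=4)
print(emb.interchangeable_rows().shape)   # 3 tokens, as in training
print(emb.interchangeable_rows(5).shape)  # extended to 5 tokens at test time
```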
To make this method work within a transformer encoder-decoder, we use three-way weight tying.
Since some parts of the embedding matrix are manually built in the proposed method, we need to tie these parameters.
This will ensure that both the encoder and decoder use the same embedding matrix. It will also make sure that the input embeddings are consistent with the output embedding space, which is determined by the final linear layer in the network.
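A minimal sketch of what three-way weight tying can look like in PyTorch, assuming a single shared embedding matrix; the module names here are illustrative.

```python
import torch.nn as nn

# Sketch of three-way weight tying: the encoder embedding, the decoder
# embedding, and the final output projection all share one weight matrix.
vocab_size, dim = 13, 8
shared_embedding = nn.Embedding(vocab_size, dim)

encoder_embedding = shared_embedding
decoder_embedding = shared_embedding
output_projection = nn.Linear(dim, vocab_size, bias=False)
output_projection.weight = shared_embedding.weight  # same underlying tensor

assert output_projection.weight is encoder_embedding.weight
```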
Let's go back to the embedding matrix. In the next section, we will take a look at how the random part is generated.
For simplicity, random vectors will be visualized in this 2D space.
The first method is also the most straightforward one: Normal Distribution. Here, we simply take a sample for each dimension from the normal distribution.
The second method is called Neighboring Points, and it uses a discrete sampling set, unlike the normal distribution.
For each dimension, we take a sample from the set containing -1, 0, and 1, while ensuring that the resulting vector is not completely zero.
With the normal distribution, the probability of sampling the same vector twice was negligible. But since the sampling set is discrete here, we need to make sure that there are no collisions between the sampled vectors.
The next method is called Hypercube Vertices, which is also a discrete method. But this time each dimension is sampled from the set containing -1 and 1. This also naturally ensures that the resulting vector is never zero.
We call this method Hypercube Vertices because its sampling set is composed of the vertices of an n-dimensional hypercube. This would simply be a square in this 2D space.
And it looks like this.
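The three generation strategies can be sketched in a few lines of PyTorch; the function names are mine, and the collision check between rows for Neighboring Points is only hinted at in a comment.

```python
import torch

def normal_distribution(n, dim):
    # Each dimension is an independent sample from the standard normal.
    return torch.randn(n, dim)

def neighboring_points(n, dim):
    # Each dimension is drawn from {-1, 0, 1}; all-zero rows are re-sampled.
    # (A full implementation would also reject collisions between rows.)
    vecs = torch.randint(-1, 2, (n, dim)).float()
    while (zero := (vecs.abs().sum(dim=-1) == 0)).any():
        vecs[zero] = torch.randint(-1, 2, (int(zero.sum()), dim)).float()
    return vecs

def hypercube_vertices(n, dim):
    # Each dimension is drawn from {-1, 1}: the vertices of an n-d hypercube.
    return torch.randint(0, 2, (n, dim)).float() * 2 - 1

print(hypercube_vertices(3, 4))
```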
Next, we will take a look at normalization techniques in the proposed method. These are crucial for ensuring training stability.
The primary purpose of normalization is to keep the balance between these two parts.
We apply normalization for each row separately. Thus, we will focus on one row.
In the first step, we separate these parts, and normalize each of them independently. Here, normalization refers to L2 normalization. Normalizing these parts separately ensures that neither part overpowers the other one.
Then, we concatenate these parts again and normalize the row as a whole. Our reasoning here is that this further improves the gradient flow.
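As a sketch, the two-step normalization of a single row might look like this, assuming the random part occupies the last dimensions of the row; the function and argument names are mine.

```python
import torch
import torch.nn.functional as F

def normalize_row(row, random_dim):
    """Sketch of the two-step normalization for one embedding row:
    L2-normalize the learnable and random parts separately, then
    L2-normalize the concatenated row as a whole."""
    learnable, random_part = row[:-random_dim], row[-random_dim:]
    learnable = F.normalize(learnable, dim=-1)      # part normalization
    random_part = F.normalize(random_part, dim=-1)
    return F.normalize(torch.cat([learnable, random_part]), dim=-1)  # row normalization

row = torch.randn(8)
print(normalize_row(row, random_dim=4).norm())  # 1.0: unit-length row
```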
Now that the matrix is normalized, let's turn our attention to how the logits are calculated. Logits can be thought of as unnormalized token probabilities, since applying softmax to logits yields the token probabilities.
Logit calculation happens in the final linear layer, which is the same as this embedding matrix due to weight tying.
Given a feature vector v, the logit calculation looks like this.
It's simply a dot product between this matrix, denoted by U, and the feature vector v.
Now, let's focus on the logit for a single token. We take a single row from the embedding matrix.
Since this is a dot product between two vectors, we can write it like this. Here, theta is the angle between these two vectors.
And since each row is normalized, the first term here disappears.
The dot product boils down to the length of v multiplied by cosine of theta.
This observation leads us to feature normalization.
What if we normalized the feature vector in addition to the embedding?
The only thing that remains from the dot product is the cosine term.
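Written out, the derivation from the last few slides is as follows, with u_i denoting one row of the embedding matrix and v the feature vector:

```latex
\begin{align*}
  \text{logit}_i = u_i \cdot v
    &= \lVert u_i \rVert\,\lVert v \rVert \cos\theta_i \\
    &= \lVert v \rVert \cos\theta_i  && \text{(row normalization: } \lVert u_i \rVert = 1) \\
    &= \cos\theta_i                  && \text{(feature normalization: } \lVert v \rVert = 1)
\end{align*}
```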
In the literature, this is known as the cosine loss, which has been frequently used in face recognition and representation learning. But, to the best of my knowledge, it was never used in a language modeling context as we did here.
One of the disadvantages of the cosine loss is that it is sensitive to hyperparameters. To address this issue, we apply the AdaCos loss function, which scales this cosine term adaptively throughout training. We adapt AdaCos to language modeling by ignoring the padding tokens and treating the sequence dimension as another batch dimension. We also clip the scaling factor to avoid numerical issues.
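Below is a simplified sketch of how such a scaled cosine loss can be set up for language modeling. The actual AdaCos scale update is omitted; only the clipping, the padding mask, and the folding of the sequence dimension into the batch dimension are shown, and all names are mine.

```python
import torch
import torch.nn.functional as F

def cosine_logits_loss(features, embedding, targets, scale, pad_id=0, max_scale=64.0):
    """Simplified sketch of a scaled-cosine (AdaCos-style) loss for language
    modeling. The adaptive update of `scale` is omitted; here it is only
    clipped, padding tokens are ignored, and the sequence dimension is
    folded into the batch dimension."""
    features = F.normalize(features, dim=-1)     # feature normalization
    embedding = F.normalize(embedding, dim=-1)   # row normalization
    cos = features @ embedding.t()               # (batch, seq, vocab) cosine similarities
    scale = min(scale, max_scale)                # clip the scaling factor
    logits = scale * cos
    return F.cross_entropy(logits.flatten(0, 1),    # fold sequence into batch
                           targets.flatten(),
                           ignore_index=pad_id)     # ignore padding tokens

features = torch.randn(2, 5, 8)     # (batch, seq, dim)
embedding = torch.randn(13, 8)      # (vocab, dim)
targets = torch.randint(0, 13, (2, 5))
print(cosine_logits_loss(features, embedding, targets, scale=30.0))
```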
Before we move on to the experiments, let me explain the baselines we considered.
The first one is the naive baseline, which uses the traditional embeddings without any modifications.
Naturally, this baseline fails to generalize to larger vocabularies.
The second one is called Full-Vocabulary baseline. It also uses traditional embeddings, but it sidesteps the vocabulary generalization challenge by using a training dataset with a larger vocabulary.
This represents what a transformer model can do when given the full dataset. Hence, we call it the golden baseline.
The final baseline is the Alpha-Renaming baseline. Unlike the other baselines, this one can actually generalize similarly to our method.
Let's dive into its details. Here is the default embedding matrix.
In this baseline, we extend the embedding matrix with test-only tokens before training. Unlike our method, all of these are learnable parameters.
To ensure that the model learns all tokens, we shuffle these embeddings in each forward pass. This is equivalent to augmenting the training dataset by renaming variables.
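A minimal sketch of this shuffling, assuming the interchangeable tokens occupy the last rows of the embedding matrix; the variable names are illustrative.

```python
import torch

# Sketch of the Alpha-Renaming baseline: the rows belonging to interchangeable
# tokens are fully learnable, but they are shuffled on every forward pass,
# which is equivalent to renaming variables in the training data.
vocab_size, dim, num_interchangeable = 13, 8, 5
embedding = torch.nn.Parameter(torch.randn(vocab_size, dim))

def shuffled_embedding(embedding, num_interchangeable):
    perm = torch.randperm(num_interchangeable)
    ordinary = embedding[:-num_interchangeable]
    interchangeable = embedding[-num_interchangeable:][perm]
    return torch.cat([ordinary, interchangeable], dim=0)

print(shuffled_embedding(embedding, num_interchangeable).shape)  # (13, 8)
```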
Next we will take a look at the Experiments.
We created a new variant of the copying task that involves vocabulary extension.
We also tested our method on two logic tasks: Linear Temporal Logic
and Propositional Logic.
We will start with the copying task
which is a common toy task in the literature.
The motivation here is that many complex tasks in the real world contain copying as a subtask. For example, when you ask a language model to fix some bug in a given code snippet, it needs to copy most of the code exactly as it is and only change the problematic part. Thus, language models should perform copying without any issues.
In this work, we introduce a new variant for this task in which the vocabulary is extended at test time. This allows us to evaluate vocabulary generalization in a relatively isolated manner.
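For illustration, a sample for this task variant might be generated like this; the particular characters and lengths below are made up, not the dataset's.

```python
import random

# Sketch of the copying task with vocabulary extension: the model must output
# its input unchanged; at test time the character set is larger than in training.
train_vocab = list("abcde")                  # characters seen during training
test_vocab = train_vocab + list("fghij")     # extended vocabulary at test time

def make_sample(vocab, length):
    source = [random.choice(vocab) for _ in range(length)]
    return source, list(source)              # target is an exact copy of the source

print(make_sample(train_vocab, 6))
print(make_sample(test_vocab, 12))           # longer, with unseen characters
```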
We will plot the experimental results on this graph.
Here is a sample from the dataset, taken from the pointed location. The x-axis represents the formula length.
Moving right makes the formula longer without affecting the variable count, which is determined by the y-axis.
Moving down in this graph increases the variable count without affecting the formula length.
To denote the sample count, we interpolate the brightness of each cell. White represents 100 samples,
and black represents 0 samples.
This rectangle denotes the limits of the training dataset. The model is expected to generalize in terms of
both formula length
and vocabulary size.
Since the difficulty of this task is trivial for all methods except for the naive baseline when given enough scale, we perform a hyperparameter search experiment on a smaller dataset.
We train the models on 5 unique characters and up to a length of 10.
We expect the models to generalize to 30 unique characters and a length of 30.
We train 277 models. To interpret the results from hundreds of models, we analyze the correlation coefficients between hyperparameters and edit distance.
The best random generation method is Neighboring Points, since it correlates negatively with the edit distance. In other words, using Neighboring Points decreases the edit distance.
Both row normalization and feature normalization with AdaCos reduce the edit distance. Part normalization seems to increase the edit distance, but the correlation coefficient is small.
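The correlation analysis itself is straightforward; here is a sketch with made-up values, where each row is one trained model and each hyperparameter choice is encoded as 0 or 1.

```python
import pandas as pd

# Sketch: correlating hyperparameter choices with the resulting edit distance
# across many trained models (the values here are made up for illustration).
runs = pd.DataFrame({
    "neighboring_points": [1, 0, 1, 0, 1],
    "row_normalization":  [1, 1, 0, 0, 1],
    "edit_distance":      [0.2, 0.9, 0.4, 1.3, 0.1],
})
print(runs.corr()["edit_distance"])  # negative => the option reduces edit distance
```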
Next, we will look at the Experiments in Linear Temporal Logic.
We will plot the results on the same type of graph as before: the x-axis is the formula length, the y-axis is the variable count, the brightness of each cell denotes the sample count, and the rectangle marks the limits of the training dataset, beyond which the model is expected to generalize in both formula length and vocabulary size.
These are the results from the baseline model. Although modern positional encodings allow length generalization, the model cannot handle larger vocabularies. As a result, the overall accuracy is only about 42%.
On the other hand, our embedding method enables vocabulary generalization, and combines well with generalizable positional encodings, achieving an overall accuracy of approximately 91%.
In fact, our method almost matches the performance of the full vocabulary baseline model that has been trained on all variables. Here, the training dataset for the baseline was extended. But its accuracy is only a few percentage points above our model.
Finally, the Alpha-Renaming baseline, despite being the only baseline that can generalize to larger vocabularies, attains an overall accuracy approximately 8 percentage points lower than our method.
Full-Vocabulary is the only baseline that outperforms the proposed method.
Furthermore, the full vocabulary approach is quite costly. Finding or generating samples that involve more variables is significantly more difficult. In temporal logic, even if we keep the formula length constant, increasing the vocabulary size in the training dataset requires a lot more compute.
Next, we will look at the Experiments in Propositional Logic.
Next, we will look at the ablation experiments in both logic problems.
For more information, please feel free to read the full paper. Thank you for listening.