Deep Learning frameworks like Keras lower the barrier to entry for the masses and democratize the development of DL models for inexperienced practitioners, who can rely on reasonable defaults and simplified APIs to do the brunt of the heavy lifting and still produce decent results.
A common confusion arises among newer deep learning practitioners when using Keras loss functions for classification, such as CategoricalCrossentropy and SparseCategoricalCrossentropy:
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Or
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
What does the from_logits flag refer to?
The answer is fairly simple, but requires a look at the output of the network we're trying to grade using the loss function.
Logits and SoftMax Probabilities
Long story short:
Probabilities are normalized - i.e. have a range between [0..1]. Logits aren't normalized, and can have a range between [-inf...+inf].
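To make this concrete, here's a minimal sketch (the logit values below are made up for illustration) that passes an arbitrary vector of logits through tf.nn.softmax and inspects the result:
import tensorflow as tf

# Arbitrary, unnormalized logits - any real numbers are fair game
logits = tf.constant([2.0, -1.0, 0.5])
# SoftMax squashes them into probabilities in [0..1]
probabilities = tf.nn.softmax(logits)
print(probabilities)                 # roughly [0.79, 0.04, 0.18]
print(tf.reduce_sum(probabilities))  # sums to ~1.0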
Depending on the output layer of your network:
output = keras.layers.Dense(n, activation='softmax')(x)
# Or
output = keras.layers.Dense(n)(x)
The output of the Dense layer will either return:
- probabilities: The output is passed through a SoftMax function which normalizes the output into a set of probabilities over n, that all add up to 1.
- logits: n activations.
This misconception possibly arises from the short-hand syntax that allows you to add an activation to a layer, seemingly as a single layer, even though it's just shorthand for:
output = keras.layers.Dense(n, activation='softmax')(x)
# Equivalent to
dense = keras.layers.Dense(n)(x)
output = keras.layers.Activation('softmax')(dense)
Your loss function has to be informed as to whether it should expect a normalized distribution (output passed through a SoftMax function) or logits. Hence, the from_logits flag!
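As a quick sanity check - a minimal sketch with made-up logits and a made-up label - the two configurations agree as long as each one is fed the kind of output it expects:
import tensorflow as tf
from tensorflow import keras

logits = tf.constant([[2.0, -1.0, 0.5]])   # raw model output (logits)
probabilities = tf.nn.softmax(logits)      # the same output, normalized
labels = tf.constant([0])                  # true class index

loss_on_logits = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_on_probs = keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# Both calls compute (approximately) the same cross-entropy value
print(loss_on_logits(labels, logits).numpy())
print(loss_on_probs(labels, probabilities).numpy())
Mismatching the two (feeding probabilities to a loss that expects logits, or vice versa) typically produces a skewed loss value rather than a hard error, which is exactly why the flag matters.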
When Should from_logits=True?
If your output layer has a 'softmax' activation, from_logits should be False. If your output layer doesn't have a 'softmax' activation, from_logits should be True.
If your network normalizes the output into probabilities, your loss function should set from_logits to False, as it's not accepting logits. This is also the default value of all loss classes that accept the flag, as most people add an activation='softmax' to their output layers:
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
keras.layers.Input(shape=(10, 1)),
# Outputs normalized probability - from_logits=False
keras.layers.Dense(10, activation='softmax')
])
input_data = tf.random.uniform(shape=[1, 1])
output = model(input_data)
print(output)
This results in:
tf.Tensor(
[[[0.12467965 0.10423233 0.10054766 0.09162105 0.09144577 0.07093797
0.12523937 0.11292477 0.06583504 0.11253635]]], shape=(1, 1, 10), dtype=float32)
Since this network results in a normalized distribution - when comparing the outputs with target outputs, and grading them via a classification loss function (for the appropriate task) - you should set from_logits to False, or let the default value stay.
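For instance, compiling this model for training would pair it with the default flag value - a sketch, with the optimizer and metric chosen arbitrarily for illustration:
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])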
On the other hand, if your network doesn't apply SoftMax on the output:
model = keras.Sequential([
keras.layers.Input(shape=(10, 1)),
# Outputs logits - from_logits=True
keras.layers.Dense(10)
])
input_data = tf.random.uniform(shape=[1, 1])
output = model(input_data)
print(output)
This results in:
tf.Tensor(
[[[-0.06081138 0.04154852 0.00153442 0.0705068 -0.01139916
0.08506121 0.1211026 -0.10112958 -0.03410497 0.08653068]]], shape=(1, 1, 10), dtype=float32)
You'd need to set from_logits to True for the loss function to properly treat the outputs.
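Again as a sketch (optimizer chosen arbitrarily) - the matching compile() call tells the loss to handle the raw logits internally, and if you want human-readable probabilities at inference time, you can still apply SoftMax manually to the logits:
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Convert logits to probabilities only when you actually need them
probabilities = tf.nn.softmax(model(input_data))
print(probabilities)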
When to Use SoftMax on the Output?
Most practitioners apply SoftMax on the output to get a normalized probability distribution, as this is in many cases what you'll use a network for - especially in simplified educational material. However, in some cases, you don't want to apply the function to the output, because you want to process it in a different way before applying either SoftMax or another function.
A notable example comes from NLP models, in which the output tensor can hold the probabilities (or logits) over a really large vocabulary. Applying SoftMax over all of them and greedily taking the argmax typically doesn't produce very good results.
However, if you observe the logits, extract the Top-K (where K can be any number, but is typically somewhere between [0...10]), and only then apply SoftMax to the Top-K possible tokens in the vocabulary, the distribution shifts significantly, and usually produces more realistic results.
This is known as Top-K sampling, and while it isn't the ideal strategy, it usually significantly outperforms greedy sampling.
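A rough sketch of the idea - the function name and the k value here are arbitrary, and logits is assumed to be a language model's raw output over its vocabulary for a single position:
import tensorflow as tf

def sample_top_k(logits, k=10):
    # Keep only the k largest logits and their vocabulary indices
    top_values, top_indices = tf.math.top_k(logits, k=k)
    # SoftMax over just the top-k logits - a much sharper distribution
    top_probs = tf.nn.softmax(top_values)
    # Sample one of the k candidates proportionally to these probabilities
    sampled = tf.random.categorical(tf.math.log(top_probs)[None, :], num_samples=1)
    return tf.gather(top_indices, tf.squeeze(sampled))

# Hypothetical usage - vocab_logits would come from the model's output layer
# next_token_id = sample_top_k(vocab_logits, k=10)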
Advice: To see this principle in action in an end-to-end project that teaches you the crux of building autoregressive language models with TensorFlow and Keras - read our "5-Line GPT-Style Text Generation in Python with TensorFlow/Keras"!
Conclusion
In this short guide, we've taken a look at the from_logits argument for Keras loss classes, which oftentimes raises questions with newer practitioners.
The confusion possibly arises from the short-hand syntax that allows the addition of activation layers on top of other layers, within the definition of a layer itself. Finally, we've taken a look at when the argument should be set to True or False, and when an output should be left as logits or passed through an activation function such as SoftMax.