Guide to String Interning in Python

Introduction

One of the first things you encounter while learning the basics of programming is the concept of strings. As in many other programming languages, a Python string is a sequence of characters; more precisely, it is an immutable sequence of Unicode code points. Unlike many programming languages, Python doesn't have a distinct character datatype, and characters are considered strings of length 1.

You can define a string using single or double quotation marks, for example, a = "Hello World" or a = 'Hello World'. To access a specific element of a string, you would use square brackets ([]) with the index of the character you wish to access (indexing starts at 0). Calling a[0], for example, would return H.

That being said, let's take a look at this code example:

a = 'Hello World'
b = 'Hello World'
c = 'Hello Worl'

print(a is b)
print(a == b)
print(a is c+'d')
print(a == c+'d')

All of the strings we compare hold the value of Hello World (a, b, and c +'d'). It might be intuitive to assume that the output would be True for all these statements.

However, when we run the code, it results in:

True
True
False
True

What might look unintuitive about this output is that a is c + 'd' returns False, while the very similar statement a is b returns True. From this, we can conclude that a and b are the same object, while c + 'd' is a different one, even though they hold the same value.

If you are unfamiliar with the difference between == and is: is checks whether two variables refer to the same object in memory, while == checks whether they have the same value.
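A minimal sketch of that difference, using lists instead of strings so that interning plays no role in the result. The variable names are purely illustrative:

```python
# Two separately built lists with equal contents
first = [1, 2, 3]
second = [1, 2, 3]
alias = first  # a second name for the same list object

print(first == second)  # True: the values are equal
print(first is second)  # False: these are two distinct objects
print(first is alias)   # True: both names refer to one object
```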

This distinction between a, b and c is the product of String Interning.

Note: The environment in which you run the code affects how string interning works. The previous results come from running the code as a script in a non-interactive environment, using the latest version of Python at the time of writing (version 3.8.5). The behavior can differ in the console or in Jupyter, and between versions of Python, because each of them compiles and optimizes the code differently.

String Interning

Strings are immutable objects in Python. This means that once strings are created, we can not change or update them. Even if it seems like a string has been modified, under the hood, a copy with the modified value was created and assigned to the variable, while the original string remained the same.

Let's attempt to modify a string:

name = 'Wtack Abuse!'
name[0] = 'S'

As the string name is immutable, this code will fail on the last line:

name[0] = 'S'
TypeError: 'str' object does not support item assignment

Note: If you really did want to change a particular character of a string, you could convert the string into a mutable object like a list, and change the desired element:

name = 'Wtack Abuse!'
name = list(name)
name[0] = 'S'
# Converting back to string
name = "".join(name) 

print(name)

Which gives us the desired output:

Stack Abuse!

We could change the character in the list (and not in the string) because lists are mutable, meaning that we can change their elements.
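As an alternative to the list round-trip, the same result can be sketched with slicing, which also makes it visible that a new string object is built while the original stays untouched. The names original, alias, and fixed are illustrative and not part of the example above:

```python
original = 'Wtack Abuse!'
alias = original  # second reference to the same string object

# Build a new string from a replacement character and a slice of the old one
fixed = 'S' + original[1:]

print(fixed)              # Stack Abuse!
print(alias)              # Wtack Abuse! (the original object is unchanged)
print(fixed is original)  # False: a new object was created

try:
    original[0] = 'S'     # in-place modification is not allowed
except TypeError as error:
    print(f"TypeError: {error}")
```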

String Interning is a process of storing only one copy of each distinct string value in memory.

This means that, when we create two strings with the same value - instead of allocating memory for both of them, only one string is actually committed to memory. The other one just points to that same memory location.

Given this information, let's go back to the initial Hello World example:

a = 'Hello World'
b = 'Hello World'
c = 'Hello Worl'

When string a is created, the compiler checks if Hello World is present in interned memory. Since it is the first occurrence of this string value, Python creates an object and caches this string in memory and points a to this reference.

When b is created, Hello World is found by the compiler in the interned memory so instead of creating another string, b simply points to the previously allocated memory.

[Figure: Python string values in memory]

a is b and a == b in this case.

Finally, when we create the string c = 'Hello Worl', the compiler instantiates another object in interned memory because it could not find the same object for reference.

When we compare a and c+'d', the latter evaluates to Hello World. However, c+'d' is computed at runtime, and Python doesn't automatically intern strings created at runtime, so a new object is created instead. Since no interning was done, these two aren't the same object and is returns False.

In contrast to the is operator, the == operator compares the values of the strings after computing runtime expressions - Hello World == Hello World.

At that point, a and c+'d' are the same value-wise, so this returns True.

Verification

Let's see the id of string objects we created. The id(object) function in Python returns the ID of object, which is guaranteed to be unique during the lifetime of the said object. If two variables point to the same object, calling id would return the same number:

letter_d = 'd'

a = 'Hello World'
b = 'Hello World'
c = 'Hello Worl' + letter_d
d = 'Hello Worl' + 'd'

print(f"The ID of a: {id(a)}")
print(f"The ID of b: {id(b)}")
print(f"The ID of c: {id(c)}")
print(f"The ID of d: {id(d)}")

This results in:

The ID of a: 16785960
The ID of b: 16785960
The ID of c: 17152424
The ID of d: 16785960

Only c has a different id. The names a, b, and d all point to the same object holding the value Hello World. The value of c depends on the variable letter_d, so it was not computed at compile-time, but rather at runtime. In contrast, d was built from two literals, so its value could be computed at compile-time, and it points to the same object that a and b point to.

How Strings Are Interned

In Python, there are two ways strings can be interned based on programmer-interaction:

  • Implicit interning
  • Explicit interning

Implicit Interning

Python automatically interns some strings at the moment of their creation. Whether or not a string is interned depends on several factors:

  • All empty strings and strings of length 1 are interned.

  • Before version 3.7, Python used peephole optimization, and strings longer than 20 characters were not interned. Since version 3.7, it uses the AST optimizer, and (most) strings up to 4096 characters are interned.

  • Names of functions, classes, variables, arguments, etc. are implicitly interned.

  • The keys of dictionaries used to hold module, class, or instance attributes are interned.

  • Strings are interned only at compile-time. This means that they will not be interned if their value can't be computed at compile-time.

    • These strings will be interned, for example:
    a = 'why'
    b = 'why' * 5
    
    • The following expression is computed at runtime, so the string is not interned:
    b = "".join(['w','h','y'])
    
  • Strings containing non-ASCII characters are most likely not going to be interned.

If you recall, we said that 'Hello Worl' + letter_d was computed at runtime, and therefore it will not be interned. Since there is no consistent standard on string interning, a good rule of thumb to use is the compile-time/runtime idea, where you can assume that a string will be interned if it can be computed at compile-time.
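One way to probe the compile-time/runtime rule of thumb is to compile a statement and look at the string constants the compiler stored for it. This is only a sketch that assumes CPython: the helper string_constants is hypothetical, co_consts is an implementation detail, and a value appearing there shows that it was computed at compile-time, not that it is guaranteed to be interned:

```python
def string_constants(source):
    """Return the string constants stored in the compiled code for `source`."""
    code = compile(source, "<example>", "exec")
    return [const for const in code.co_consts if isinstance(const, str)]

# Only literals are involved: the compiler can evaluate the expression itself
print(string_constants("d = 'Hello Worl' + 'd'"))
print(string_constants("b = 'why' * 5"))

# A variable or a method call is involved: only the pieces are stored,
# and the final value is built when the code runs
print(string_constants("c = 'Hello Worl' + letter_d"))
print(string_constants("b = ''.join(['w', 'h', 'y'])"))
```

The exact contents of the printed lists can vary between Python versions, so it is safer to check whether the final value is present than to rely on the full list.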

Explicit Interning

We often come across strings that do not meet the conditions for implicit interning in Python, but there is a way to intern any string you want. The sys module provides a function called intern(string), which tells Python to store the given string in the interned memory table and returns the interned string. It accepts only strings, not other immutable objects.

You can intern any kind of string as follows:

import sys
c = sys.intern('Hello World'+'!')

We can see that this would work in our previous example:

import sys

letter_d = 'd'

a = sys.intern('Hello World')
b = sys.intern('Hello Worl' + letter_d)

print(f"The ID of a: {id(a)}")
print(f"The ID of b: {id(b)}")
print(f"a is b? {a is b}")

This gives the output:

The ID of a: 26878464
The ID of b: 26878464
a is b? True
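Explicit interning is mostly useful for strings that only come into existence at runtime, such as words split out of a larger text. The following sketch uses a made-up sample sentence and illustrative variable names to show that every equal word ends up as one shared object after sys.intern():

```python
import sys

text = "the cat saw the dog and the dog saw the cat"

# split() builds every word at runtime, so equal words may be separate objects
words = text.split()

# sys.intern() returns the single shared object for each distinct value
interned_words = [sys.intern(word) for word in words]

the_objects = {id(word) for word in interned_words if word == "the"}
print(len(the_objects))  # 1: every "the" is the same object

# sys.intern() accepts only str; other types raise TypeError
try:
    sys.intern(b"bytes are not accepted")
except TypeError as error:
    print(f"TypeError: {error}")
```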

Now that we know how and which strings are interned in Python, one question remains - why was string interning introduced?

Advantages of String Interning

String interning has several advantages:

  • Saving Memory: We never have to save two string objects in memory separately if they are the same. Every new variable with the same content just points to the object in the interned table. If, for some reason, you wanted to have a list containing every single occurrence of every word in Jane Austen's Pride and Prejudice, you'd need 4,006,559 bytes without explicit interning, and only 785,509 bytes with explicit interning of each word.
  • Fast Comparisons: Comparing interned strings is much faster than comparing non-interned strings, which is useful when your program performs lots of comparisons. This happens because, in order to compare interned strings, you only need to check whether their memory addresses are the same, instead of comparing the contents.
  • Fast Dictionary Lookups: If the lookup keys are interned, the comparison can be done by pointer comparison instead of string comparison, which works on the same principle as the previous point.
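The memory effect can be estimated with a rough sketch like the one below. It is not the measurement behind the figures above: the sample text is a short repeated phrase standing in for a real book, total_size is a hypothetical helper, and only the string objects themselves are counted, not the lists that hold them:

```python
import sys

def total_size(strings):
    """Sum sys.getsizeof() over the distinct objects in `strings`."""
    distinct = {id(s): s for s in strings}
    return sum(sys.getsizeof(s) for s in distinct.values())

# Hypothetical sample text standing in for a real book
text = "it is a truth universally acknowledged " * 1000
words = text.split()                             # runtime-built strings
interned = [sys.intern(word) for word in words]  # one object per value

print(f"distinct values:           {len(set(words))}")
print(f"objects without interning: {len({id(w) for w in words})}")
print(f"objects with interning:    {len({id(w) for w in interned})}")
print(f"bytes without interning:   {total_size(words)}")
print(f"bytes with interning:      {total_size(interned)}")
```

The exact numbers depend on the interpreter, but with interning the number of objects equals the number of distinct values.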

Disadvantages of String Interning

However, interning strings does have some drawbacks, and there are things to consider before using it:

  • Memory Cost: The interned table itself consumes memory, which becomes a cost when your program has a large number of strings with different values and relatively few comparisons overall. This means that you want to intern strings when you have relatively few distinct strings and a lot of comparisons between them.
  • Time Cost: The call to the intern() function is expensive, as it has to manage the interned table.
  • Multi-threaded Environments: The interned memory (table) is a global resource, so in a multi-threaded environment access to it needs to be synchronized. This synchronization might only be needed when the interned table is accessed, i.e. when a new string is created, but it can be expensive.
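Whether the time cost matters for a given program is best measured rather than assumed. The sketch below uses timeit to compare building a list with and without a call to sys.intern() on each element. The sample data and function names are illustrative, and the timings vary by machine and Python version:

```python
import sys
import timeit

words = ("hello world " * 5000).split()  # runtime-built sample strings

def build_plain():
    """Copy the words without interning them."""
    return [word for word in words]

def build_interned():
    """Copy the words, interning each one on the way."""
    return [sys.intern(word) for word in words]

plain_seconds = timeit.timeit(build_plain, number=100)
intern_seconds = timeit.timeit(build_interned, number=100)

print(f"plain list:    {plain_seconds:.4f} s")
print(f"with intern(): {intern_seconds:.4f} s")
```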

Conclusion

By using string interning, you ensure that only one object is created even if you define multiple strings with the same content. You should however keep in mind the balance between the advantages and disadvantages of string interning, and only use it when you think your program could benefit.

Always remember to add comments or documentation if you are using string interning so other team members know how to deal with strings in the program.

While the results may vary with the implementation of your Python interpreter, as well as the environment you run your code in, you should definitely play around with the intern() function to make yourself comfortable with it. This concept can help you improve the design and performance of your code. It might help you in your next job interview as well.

I am a software engineer currently working on web/mobile app development. I try to make everyday count by learning or by teaching. Either way, I am learning. https://www.linkedin.com/in/abouthashir/