Python internals: Arbitrary-precision integer implementation | Artem Golubin

Excerpt

Have you ever noticed that Python supports integers of any size? Here is a quick glance at it.


Have you ever noticed that Python supports integers of any size? Here is a quick glance at it.

Python represents all objects by C structures. The following data structure is responsible for all integer objects:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
} PyLongObject; 

After macros expansions, a simplified version of the struct looks as follows:

struct {
    ssize_t ob_refcnt;
    struct _typeobject *ob_type;
    ssize_t ob_size; 
    uint32_t ob_digit[1];
};

The ob_refcnt field is responsible for reference counting technique which is used in garbage collection mechanism, whereas ob_type is a pointer to a structure which describes an integer type.

Generally, In languages like C/C++, the precision of integers is limited to 64-bit, but Python has built-in support for Arbitrary-precision integers. Since Python 3 there is no longer simple integer type, and all integers are represented as a bignum.

How to store arbitrarily large integer?

One of the solutions is to represent the integer as an array of digits. To do it in an efficient way we need to convert our number from base 10 to base 230 numeral system, where each element represents a single ‘digit’ in a range from 0 to 230−1. Depending on the platform, Python uses either 32-bit unsigned integer arrays with 30-bit digits or 16-bit unsigned integer arrays with 15-bit digits. Using such approach introduces additional requirements, that’s why we can’t use all bits of the integer. The ob_digit field in a structure above is responsible for such arrays.

To eliminate unnecessary computation CPython has a “fast path” implementation for integers in a range of −230 to 230. Such integers are stored as an array of one element and if it’s possible treated as fixed 32-bit integers.

Note that, unlike classical approach, the sign of an integer is stored separately in ob_size field. This field stores the size of the ob_digit array. So if you want to change the sign of an array of size 2 you need to set ob_size to -2.

The comment from the source code describes it as follows:

/* Long integer representation.
   The absolute value of a number is equal to
    SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero.  Also, in all cases, for all valid i,
    0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.
 
   CAUTION:  Generic code manipulating subtypes of PyVarObject has to aware that integers abuse  ob_size's sign bit.
*/

The representation of 123456789101112131415 will look as follows:

ob_size3
ob_digit43797691987719511107

Algorithm of converting our representation back:

(437976919∗230∗0)+(87719511∗230∗1)+(107∗230∗2)

Optimization of commonly-used integers

Small integer objects in a range of -5 to 256 are always pre-allocated during initialization. Because Python integers are immutable, we can use them as singletons. Every time you need to create small integer instead of creating new object Python just points to already allocated one. Thereby it saves a lot of space and computation for commonly-used integers.

Interestingly enough, the PyLongObject structure takes at least 28 bytes for every allocated integer and therefore takes three times as much memory as a simple 64-bit C integer.

Number system conversion

Let’s see how Python converts integers to an array.

We are going to look at the simple example where we need to convert a long unsigned 64-bit integer to Python’s integer representation. Note that it’s the biggest possible standard C type that can be converted to our PyLongObject directly. Larger numbers can be converted from strings or byte arrays.

If you are not familiar with number system conversion, you can read this tutorial.

Here is a simplified algorithm translated from C to Python:

SHIFT = 30  # number of bits for each 'digit'
MASK = (2 ** SHIFT)
bignum = 18446744073709551615
 
def split_number(bignum):
    t = abs(bignum)
 
    num_list = []
    while t != 0:
        # Get remainder from division
        small_int = t % MASK  # more efficient bitwise analogue: (t & (MASK-1))
        num_list.append(small_int)
 
        # Get integral part of the division (floor division)
        t = t // MASK  # more efficient bitwise analogue: t >>= SHIFT
 
    return num_list
 
def restore_number(num_list):
    bignum = 0
    for i, n in enumerate(num_list):
        bignum += n * (2 ** (SHIFT * i))
    return bignum
 
num_list = split_number(bignum)
assert bignum == restore_number(num_list)

To verify our solution we can also take a look at the internal representation:

import ctypes
 
class PyLongObject(ctypes.Structure):
    _fields_ = [("ob_refcnt", ctypes.c_long),
                ("ob_type", ctypes.c_void_p),
                ("ob_size", ctypes.c_ulong),
                ("ob_digit", ctypes.c_uint * 3)]
 
 
bignum = 18446744073709551615
 
for d in PyLongObject.from_address(id(bignum)).ob_digit:
    print(d)

Arithmetic

Basic arithmetic operations are implemented using simple school math with only one difference — every element of an array is treated like a single “digit”.

Every operation results in a minimum one object creation. For example, when you want to execute c += 10 roughly these steps are performed:

  • Get an address of an already allocated object for 10 (because it’s a small number we don’t have to create a new object, see above).
  • Create a new integer object so we can store the result of addition in it.
  • Add c and 10 together. Store the result in the newly created object.
  • Replace variable c with reference to our new object.
  • Decrease the reference count of an old c so garbage collector can destroy this object in the future.

For example, let’s take a look at addition, where carrying is used:

def add_bignum(a, b):
    z = []
 
    if len(a) < len(b):
        # Ensure a is the larger of the two
        a, b = b, a
 
    carry = 0
 
    for i in range(0, len(b)):
        carry += a[i] + b[i]
        z.append(carry % MASK)
        carry = carry // MASK
 
    for i in range(i + 1, len(a)):
        carry += a[i]
        z.append(carry % MASK)
        carry = carry // MASK
 
    z.append(carry)
 
    # remove trailing zeros
    i = len(z)
    while i > 0 and z[i-1] == 0:
        i -= 1
    z = z[0:i]
 
    return z
 
 
a = 8223372036854775807
b = 100037203685477
assert restore_number(add_bignum(split_number(a), split_number(b))) == a + b

Diving deeper

There are a lot of small details that can’t be described in one article. You can read more about integers in CPython (1, 2, 3) source code.

I have plans to continue this series, in the next posts I will describe internals of class and function objects.


If you have any questions, feel free to ask them via e-mail displayed in the footer.