BitTokens — ICML 2026 Spotlight · Efficient numeracy via single-token number embeddings

Comparison of long reasoning chains and many numeric sub-tokens versus BitTokens single-token encoding

Figure 1. LLMs perform poorly on arithmetic tasks, requiring excessive reasoning tokens to achieve good performance. Our BitTokens tokenization strategy allows language models to solve arithmetic tasks both effectively and efficiently.

Motivation

Many researchers share the vision that large language models (LLMs) will not only alleviate routine work but also drive scientific and technological innovation. In many fields, such as physics and engineering, complex tasks require processing large amounts of numerical data and performing extensive calculations. Thus, to aid advancements in these fields, LLMs must possess efficient and effective numeracy, defined as the ability to represent and compute with numbers.

Research to improve the numeracy of LLMs has predominantly focused on two strategies:

1. Arithmetic tool use

Tool-augmented LLMs leverage external calculators or code to bypass the need for internal arithmetic computation. However, this approach forces the model to correctly identify and construct every mathematical operation on the fly and then wait for its result, introducing non-negligible latency and sources of error. Furthermore, outsourcing all calculations prevents the model from generating and calculating with intermediate numeric representations during a forward pass, restricting the efficiency of latent calculations.

2. Reasoning chains

Reasoning chains, on the other hand, prompt LLMs to generate logically consistent text, step-by-step, enabling them to solve complex problems by breaking them into smaller parts. However, reasoning chains can be very inefficient, sometimes requiring tens of thousands of tokens to solve a single calculation, as illustrated in Figure 1 and quantified in our evaluation below. This limits the length and complexity of problems that LLMs can solve due to context window and cost constraints.

We argue that both tooling and reasoning chains are merely crutches, allowing LLMs to solve arithmetic calculations but preventing them from obtaining the intrinsic numeric computation skills required to efficiently solve complex tasks in advanced technical domains. We hypothesize that addressing this problem requires rethinking the way LLMs tokenize and encode numbers.

Takeaway 1: External tools and long reasoning are partial fixes: they do not give the model an intrinsic, efficient numeric representation for latent computation and long numeric-heavy contexts.

Contributions

In our work, we answer the following questions:

How do frontier LLMs perform on arithmetic tasks?

What attributes should a single-token number encoding have?

How do we build a better single-token number encoding?

Evaluation of numeracy in frontier LLMs

What we tested

Popular math benchmarks often mix language understanding with computation. We built a controlled benchmark that isolates numeric skill: comparisons and ordering, core arithmetic, and short multi-step statistics, across diverse magnitudes and precisions within what float64 can represent.

Minimum / maximum of a list

What is the minimum of the list [1566.28, 1571.08, 2329.9, 1190.28, 878.242]?

Interval membership

What interval does x=57.667 belong to? A: x < 13.626, B: 13.626 <= x < 26.239, C: 26.239 <= x < 29.012, D: 29.012 <= x

List sorting (asc / desc)

Sort the list [6539.243, 2263.345, 14708.01, 15953.11] in ascending order.

Addition and subtraction

What is 9.6 − 77.96?

Multiplication

What is −56.228 × 37,485,561?

Division

What is 0.04561093096763438 ÷ −0.00563?

Exponentiation

What is 0.548309577⁻⁷?

Mean of a list

What is the mean of the list [−54733, −4216.9, 38145, 20819]?

Standard deviation of a list

What is the std of the list [−26975.4, 86144.6, −51820.1, −7862.07]?

We design our tasks to isolate individual aspects of numeracy as well as core mathematical operations. Numbers contain at most 15 significant decimal digits. Number magnitudes, differences between numbers, and precision are all densely and uniformly sampled.

Metric: log-sMAPE

We report exact match where appropriate. For continuous predictions, we use a scaled logarithmic view of the symmetric mean absolute percentage error, $\mathrm{sMAPE}(\hat y, y) = \dfrac{\lvert \hat y - y\rvert}{\lvert y\rvert + \lvert \hat y\rvert + \varepsilon}$, so that errors in later significant digits remain visible:

$\mathrm{log\text{-}sMAPE}(\hat y, y) = \min\left(1,\; \dfrac{\log_{10}\left(\mathrm{sMAPE}(\hat y, y)+\varepsilon\right)}{-M} \right)$

with $\varepsilon=10^{-100}$ for stability. Here $M=15$ denotes the maximum number of significant digits tested.
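The metric can be sketched in a few lines of plain Python. This is a minimal illustration of the two formulas above, not the benchmark's evaluation code; the function names `smape` and `log_smape` are chosen here for illustration.

```python
import math

EPS = 1e-100  # stability constant from the text
M = 15        # maximum number of significant digits tested

def smape(y_pred: float, y_true: float) -> float:
    """Symmetric absolute percentage error of a single prediction."""
    return abs(y_pred - y_true) / (abs(y_true) + abs(y_pred) + EPS)

def log_smape(y_pred: float, y_true: float) -> float:
    """Scaled logarithmic sMAPE, clipped to at most 1."""
    return min(1.0, math.log10(smape(y_pred, y_true) + EPS) / -M)

exact = log_smape(1566.28, 1566.28)    # exact match
wrong_sign = log_smape(-1.0, 1.0)      # opposite sign, sMAPE = 1
sixth_digit = log_smape(1.000001, 1.0) # error around the 6th to 7th digit
```

An exact match scores 1, a prediction with the wrong sign scores 0, and a relative error of about $10^{-6}$ lands near $0.42$, i.e. roughly 6 of 15 digits.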

Takeaway 2: Think of log-sMAPE as summarizing how many of the 15 significant digits are trustworthy before the first mistake.

What we found

Radar chart of frontier models on numeracy tasks

Figure 2. While simple tasks such as addition and ordering are almost perfectly solved by frontier LLMs, other tasks such as multiplication, division, calculating the standard deviation, or exponentiation remain difficult and require extensive reasoning tokens to solve.

Figure 3. Difficult numeracy tasks such as multiplication, division, exponentiation, and standard deviation can only be solved by frontier models using excessive reasoning tokens.

Immediately apparent is the perfect accuracy of almost all models on the basic number comparison tasks of MinMax, Sorting, and Interval, which are effectively solved. When considering the basic arithmetic tasks, there is a clear separation between high-reasoning models and non- or minimal-reasoning models. While both perform excellently on addition and mean calculation, all non-reasoning models perform very poorly on multiplication, division, exponentiation, and standard deviation tasks, with none achieving above 60% $\mathrm{log\text{-}sMAPE}$. Only once we allow a large number of reasoning tokens do we see improved performance on these tasks. This separation between thinking and non-thinking models is underscored in Figure 3: LLMs use between 5,000 and 30,000 tokens to solve a single calculation. While future advances in post-training could reduce the number of reasoning tokens, it is clear that current LLMs rely heavily on reasoning chains to compensate for their poor native arithmetic performance.

Takeaway 3: Current LLMs rely heavily on reasoning chains to compensate for their poor native arithmetic performance.

Desiderata for single-token number encodings

Our goal is to develop a number encoding strategy that generates representations which both maximize token efficiency and allow for neural networks to learn algorithms that perfectly execute numeric calculations.

D1 - Token efficiency

Every number is represented by a single token.

representation

D2 - Uniqueness

Each value has exactly one valid encoding, with a unique inverse mapping.

representation

D3 - Structured

The encoding geometry reflects numeric order and distance, facilitating generalizable algorithms.

representation

D4 - Scale invariance

The desired range of input magnitudes and precisions can be represented.

representation

D5 - Normalization

Encodings are bounded and information preserving under standard normalization functions used in language models (e.g., LayerNorm, RMSNorm).

training

D6 - Numerical stability

Representations remain accurate when using low-precision activations (e.g., FP8).

training

D7 - Continuity

Encodings vary smoothly with the underlying value, making them compatible with gradient-based optimization.

training

D8 - Robustness

Values can be decoded reliably under stochastic noise, allowing for stochastic training.

training

D9 - Arithmetic

Encodings admit learnable algorithms for core mathematical operations.

arithmetic

Limitations of existing single-token designs

xVal

Learns a numeric token scaled by value: conceptually simple and injective on reasonable ranges. To survive normalization, magnitudes are squeezed into a narrow band (e.g., $[-5, 5]$), which limits representable range and precision (D4, D6) and ultimately arithmetic flexibility (D9).

FoNE (sinusoidal)

Fourier-style embeddings behave well on several geometric and robustness criteria. Addition maps to a simple operation in frequency space; multiplication, however, does not admit a cheap local operator. Networks need to decode → multiply → re-encode (D9 bottleneck).

Technical deep dive

xVal (continuous scaled token)

xVal attaches a trainable [NUM] embedding and multiplies it by the scalar being encoded. When the model emits a number token, a dedicated head reads the continuous value from hidden states. That design is injective and smooth, satisfying D1, D2, D3, and D7. However, as shown by its authors, this strategy must map all inputs to the range [-5, 5] to satisfy D5 and to keep layer normalization from interfering with the information stored in the embedding magnitude. This severely limits the range and precision of numbers it can encode, violating D4 and D6 and, by extension, D9.
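To see why magnitude-based encodings interact badly with normalization, consider a simplified sketch: a hypothetical [NUM] embedding is scaled by the encoded value and then passed through a plain RMS normalization (no learned gain, no additional embeddings). This is an illustration of the underlying issue only, not xVal's implementation; `num_embedding` and the helper names are assumptions.

```python
import math

def rms_norm(vec):
    """Divide a vector by its root-mean-square (RMSNorm without gain)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec))
    return [v / rms for v in vec]

# Hypothetical learned [NUM] embedding (values chosen arbitrarily).
num_embedding = [0.5, -1.0, 2.0, 0.25]

def scaled_embedding(value):
    """Value-scaled embedding: the number lives only in the magnitude."""
    return [value * e for e in num_embedding]

small = rms_norm(scaled_embedding(3.0))
large = rms_norm(scaled_embedding(3000.0))

# After normalization both vectors coincide, so 3 and 3000 are indistinguishable.
same = all(math.isclose(a, b) for a, b in zip(small, large))
```

In this idealized setting the normalized vectors are identical, which is why a value-scaled encoding has to confine its inputs to a narrow band where the magnitude still survives.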

FoNE (sinusoidal features on a $\Phi$-dimensional torus)

FoNE builds a vector of sines and cosines at multiple base-10 frequencies (covering integer and fractional digit positions), then adds this to a learned [NUM] token.

Definition (sinusoidal encoding)

A sinusoidal encoding $\mathcal F:\mathbb R\mapsto \mathbb T^{|\Phi|}$ maps real numbers to a $|\Phi|$-dimensional torus, which forms a compact abelian Lie group. Given frequencies $b^\phi$ with base $b>1$ and $\phi\in\Phi\subset\mathbb Z$, then $\mathcal F$ is defined as:

$\mathcal F(x):=\left[\cos(2\pi b^\phi x),\, \sin(2\pi b^\phi x)\right]_{\phi \in \Phi} = \left[e^{i2\pi b^\phi x}\right]_{\phi\in\Phi} $

When a language model predicts a number token, the final hidden state can directly be interpreted as a sinusoidal encoding. The output number is predicted digit by digit through the maximum cosine similarity between each dimension of the final hidden state (corresponding to each digit of the output) and the encodings of the digits $0$ through $9$. This fulfills D1 through D8. In contrast to simply scaling a [NUM] token, using paired sine and cosine functions guarantees a constant RMS norm, satisfying D5. Such a digit-wise maximum similarity decoding also allows for robustness to noise, satisfying D8. The key limitation of sinusoidal encodings is that they are poorly suited for solving certain arithmetic operations with neural networks, most notably multiplication. In order to satisfy condition D9, we require a learnable mapping $o'$ that computes the encoding, $\xi$, of the result of an operation $o$ from the encodings of its operands:

$\forall_{o}\ \exists_{o'}:\ \xi\big( o(x_1, \ldots, x_n) \big) \;=\; o'\big( \xi(x_1), \ldots, \xi(x_n) \big)$

Why addition is easy in Fourier feature space

The encoding of the sum of two numbers $x_1$ and $x_2$ can be computed by a simple component-wise multiplication of their respective encodings.

Lemma (additive homomorphism)

The encoding map $\mathcal F$ is a group homomorphism from the additive group of real numbers $(\mathbb R, +)$ to the multiplicative torus $(\mathbb T^{|\Phi|}, \odot)$, where $\odot$ denotes the element-wise (Hadamard) product.

Proof. The claim follows directly from Euler's formula:

$\mathcal F(x_1+x_2)= \left[ e^{i2\pi b^\phi (x_1+x_2)} \right]_{\phi\in\Phi} = \left[ e^{i2\pi b^\phi x_1} e^{i2\pi b^\phi x_2} \right]_{\phi\in\Phi} = \mathcal F(x_1) \odot \mathcal F(x_2)$

This homomorphism elegantly transforms addition in the number domain into a simple, local operation in the encoded domain. Notably, this operation does not require carry-over logic. However, for multiplication such a simple mapping does not exist.
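The lemma can be checked numerically with complex exponentials. The sketch below is an illustration under assumed parameters (base 10 and a small, arbitrarily chosen frequency set `PHIS`); it is not FoNE's implementation.

```python
import cmath
import math

BASE = 10
PHIS = [-2, -1, 0, 1]  # assumed frequency exponents, for illustration only

def encode(x):
    """Sinusoidal encoding as complex phases e^{i 2 pi b^phi x}."""
    return [cmath.exp(2j * math.pi * BASE**phi * x) for phi in PHIS]

def hadamard(a, b):
    """Element-wise product of two encodings."""
    return [u * v for u, v in zip(a, b)]

x1, x2 = 12.5, 7.25
lhs = encode(x1 + x2)                   # encoding of the sum
rhs = hadamard(encode(x1), encode(x2))  # product of the encodings

max_err = max(abs(u - v) for u, v in zip(lhs, rhs))
```

Each output component depends only on the matching input components, so no information has to flow between frequencies and no carry logic is needed.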

Why multiplication is hard in Fourier feature space

Proposition (complexity of multiplication)

Let $\mathcal X:=\{\varepsilon,\dots,U\}$ be the set of input numbers with resolution $\varepsilon=b^{m}$ and choose $\Phi = \{m,\dots,n\}$ so that $\mathcal F$ uniquely encodes the entire number range. Assume each encoding component has a finite precision $P$ (i.e., can represent $P$ distinct states). Suppose there exists an operator $\otimes_\phi:\mathbb T^{|\Phi|}\times\mathbb T^{|\Phi|}\mapsto\mathbb T$ such that $\otimes_\phi(\mathcal F(x),\mathcal F(y)) := \mathcal F_\phi(xy)$ for each output frequency $\phi\in\Phi$. Let $S_\phi^x,S_\phi^y\subseteq\Phi$ be the subsets of input frequencies that $\otimes_\phi$ is required to take as input from $\mathcal F(x)$ and $\mathcal F(y)$, respectively. Then:

1. Non-locality. The operator $\otimes_\phi$ must access at least $d = \Omega\left(\log_P(U/\varepsilon)\right)$ components from each input vector, i.e., $|S_\phi^x|,\;|S_\phi^y|\;\ge\;\left\lceil \log_P |\mathcal X| \right\rceil$.
2. Computational complexity. The operator $\otimes$ must perform a computation functionally equivalent to polynomial multiplication.

Proof.

1. Non-locality. The proof follows from a counting argument. Assume for contradiction that $|S_\phi^x|<\log_P|\mathcal X|$. Then the restriction $\mathcal{F}(x)|_{S_\phi^x}$, which is the only information about $x$ available to $\otimes_\phi$, can represent at most $P^{|S_\phi^x|} < |\mathcal X|$ states. By the pigeonhole principle, there exist $x_1, x_2 \in \mathcal{X}$ with $x_1 \neq x_2$ that are indistinguishable to the operator, i.e., $\mathcal{F}(x_1)|_{S_\phi^x} = \mathcal{F}(x_2)|_{S_\phi^x}$. Let $\Delta:=x_1-x_2\neq 0$. The set $Y^*:=\{y\in \mathcal X:\; b^\phi \Delta y\in\mathbb Z\}$ is a proper subset of $\mathcal X$, so we can pick $y^\star\in\mathcal X \setminus Y^*$. Because $b^\phi x_1 y^\star - b^\phi x_2 y^\star = b^\phi \Delta y^\star \notin \mathbb Z$, the two phases differ and $\mathcal F_\phi(x_1 y^\star)\neq \mathcal F_\phi(x_2 y^\star)$. The operator $\otimes_\phi$ is required by its definition to produce these two different outputs. However, since its inputs for $x_1$ and $x_2$ are identical, it is forced as a function to produce the same output for both, which is a contradiction. Thus, the initial assumption must be false and $|S_\phi^x|\ge\lceil\log_P|\mathcal X|\rceil$. The same holds for $|S_\phi^y|$.

2. Computational complexity. Multiplication in any positional system is equivalent to the discrete convolution of the digit coefficients of the two operands, followed by carry propagation.
Write $x=\sum_{i} k_i b^i$ and $y=\sum_{j} l_j b^j$ in base $b$ with digit coefficients $k_i$ and $l_j$. Then

$xy=\sum_{\tau}(k*l)_\tau b^\tau, \quad (k*l)_\tau=\sum_{i+j=\tau}k_i l_j.$

Interpreting the circular sinusoidal component in its interval $[0,1)$ wrap-around form, we can view the phase vector of the encoding, $\Theta_x \in [0, 1)^{|\Phi|}$, as a linear transformation of the coefficient vector $k$ modulo 1. From the definition of the encoding, the component at frequency $\phi$ accumulates contributions from lower-order coefficients:

$[\Theta_x]_\phi = \big( \sum_{i < \phi} k_i b^{i-\phi} \big) \bmod 1.$

This forms a linear system $\Theta_x \equiv Mk \bmod 1$, where $M$ is a lower-triangular mixing matrix with entries $M_{\phi,i} = b^{i-\phi}$ for $i < \phi$ and $0$ otherwise. Since the encoding uniquely represents the number range, $M$ is invertible over the domain.
To compute the encoding $\mathcal{F}(xy)$ of the product, the operator must produce the phase vector $\Theta_{xy} \equiv M(k * l) \bmod 1$. We analyze two pathways for this from inputs $\Theta_x$ and $\Theta_y$:

Pathway A: Disentangle First. One effectively inverts the mixing matrix to recover coefficients: $k = M^{-1}\Theta_x$ and $l = M^{-1}\Theta_y$. The product is then computed as $\Theta_{xy} = M(M^{-1}\Theta_x * M^{-1}\Theta_y)$. The operation $M^{-1}$ represents the full cost of decoding the sinusoidal representation into positional digits.

Pathway B: Disentangle Later. Alternatively, one might attempt to compute the result directly from the entangled phases. Any bilinear operation on the inputs can be expressed via the Kronecker product $\Theta_x \otimes_K \Theta_y$. Substituting the linear forms yields:

$\Theta_x \otimes_K \Theta_y = (Mk) \otimes_K (Ml) = (M \otimes_K M)(k \otimes_K l).$

The vector $k \otimes_K l$ contains all cross-terms $k_i l_j$. To construct the convolution $k * l$, one must sum, for each $\tau$, the cross-terms with $i+j=\tau$. However, the term $\Theta_x \otimes_K \Theta_y$ does not provide direct access to $k_i l_j$, only to linear combinations weighted by the expanded mixing matrix $M \otimes_K M$.
Isolating the necessary convolution terms from the bilinear expansion requires inverting this mixing process. This removal $(M \otimes_K M)^{-1} = M^{-1} \otimes_K M^{-1}$ of the cross-term redundancies is therefore computationally at least as expensive as the initial decoding $M^{-1}$. Moreover, under finite precision, the condition number satisfies $\kappa(M \otimes_K M)=\kappa(M)^2$ with $\kappa(M)\ge1$, which means that postponing disentanglement amplifies quantization errors.
Since both pathways require inverting the linear mixing $M$, there exists no shortcut in the sinusoidal domain. Any correct operator $\otimes$ must inherently learn the multi-stage procedure: (1) a non-local decoding of the sinusoidal inputs into internal coefficient sequences for $x$ and $y$, (2) a convolution of these sequences, followed by (3) a carry propagation step, and finally (4) a re-encoding of the result into the sinusoidal format.

The preceding proposition demonstrates that performing multiplication in sinusoidal encoding space requires a transformation that is both computationally intensive and prone to precision errors. Any network implementing such an operation is forced to first decode, then calculate, and finally re-encode the encoding. This leads us to conclude that sinusoidal encodings alone are not well-suited as a general-purpose number representation.
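The "convolution followed by carry propagation" structure that any such operator has to reproduce can be made concrete on plain digit sequences. The sketch below works on non-negative integers in base 10 and is only an illustration of the positional algorithm named in the proof; all helper names are assumptions.

```python
def to_digits(n, base=10):
    """Little-endian digit coefficients of a non-negative integer."""
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(r)
        if n == 0:
            return digits

def convolve(k, l):
    """Discrete convolution: (k*l)_tau = sum over i+j=tau of k_i * l_j."""
    out = [0] * (len(k) + len(l) - 1)
    for i, ki in enumerate(k):
        for j, lj in enumerate(l):
            out[i + j] += ki * lj
    return out

def propagate_carries(coeffs, base=10):
    """Reduce convolution coefficients to valid digits in [0, base)."""
    digits, carry = [], 0
    for c in coeffs:
        carry, d = divmod(c + carry, base)
        digits.append(d)
    while carry:
        carry, d = divmod(carry, base)
        digits.append(d)
    return digits

def from_digits(digits, base=10):
    return sum(d * base**i for i, d in enumerate(digits))

def multiply(x, y, base=10):
    coeffs = convolve(to_digits(x, base), to_digits(y, base))
    return from_digits(propagate_carries(coeffs, base), base)

product = multiply(1234, 5678)
```

Every output coefficient mixes many input digits, and carries can ripple across all positions, which is the non-local behavior the proposition formalizes.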

Why not encode $\ln x$ to turn multiplication into addition?

In log space, $\ln(x_1 x_2) = \ln x_1 + \ln x_2$, so multiplication becomes another additive homomorphism—at the price that addition is no longer local (classic log-number-system trade-off). Recovering $\mathcal{F}_{\log}(x_1 + x_2)$ from $\mathcal{F}_{\log}(x_1)$ and $\mathcal{F}_{\log}(x_2)$ needs something like exp–sum–log, reintroducing the same mixing difficulties (see paper discussion).
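The trade-off is easy to see on scalars. In the sketch below (positive inputs only, values chosen arbitrarily), the product is a single addition of logarithms, whereas the sum needs an exp, sum, log round trip; this illustrates the log-number-system trade-off in general and is not code from the paper.

```python
import math

x1, x2 = 3.5, 12.0
log_x1, log_x2 = math.log(x1), math.log(x2)

# Multiplication is local in log space: just add the logarithms.
log_product = log_x1 + log_x2

# Addition is not: it requires exp, sum, log (written in a numerically stable form).
m = max(log_x1, log_x2)
log_sum = m + math.log(math.exp(log_x1 - m) + math.exp(log_x2 - m))

product = math.exp(log_product)
total = math.exp(log_sum)
```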

Takeaway 4: No prior single-token recipe simultaneously satisfies engineering constraints and supports learning general and robust arithmetic algorithms.

The solution: BitTokens

Guided by the desiderata introduced in the previous section, we propose BitTokens, a novel numeric encoding algorithm. BitTokens uses a dedicated, learnable [NUM] token to which a numeric encoding is added that is based on the IEEE 754 double-precision binary floating-point format (i.e., float64). The floating-point format writes a signed real value $v$ as

$v = (-1)^{s} \big( 1+\sum_{i=1}^{52}{b_{52-i}2^{-i}} \big) \times 2^{E-1023},$

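with $s \in \{0,1\}$ denoting the sign bit, $E \in \{0,\dots,2047\}$ the 11-bit exponent field offset by $1023$, and $b_j \in \{0,1\}$ for $j=0,\dots,51$ the 52 significand bits. This expression holds for normal numbers ($1 \le E \le 2046$); $E=0$ encodes zeros and subnormal numbers, and $E=2047$ encodes infinities and NaN.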
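The decomposition can be reproduced with Python's standard `struct` module. The following sketch splits a float64 into sign, exponent field, and fraction bits and rebuilds the value with the formula above; it covers normal numbers only, and the function names are chosen here for illustration.

```python
import struct

def float64_fields(v):
    """Return (sign bit s, exponent field E, 52-bit fraction as an integer)."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", v))
    sign = raw >> 63
    exponent = (raw >> 52) & 0x7FF
    fraction = raw & ((1 << 52) - 1)
    return sign, exponent, fraction

def reconstruct_normal(sign, exponent, fraction):
    """Evaluate (-1)^s * (1 + fraction / 2^52) * 2^(E - 1023) for normal numbers."""
    significand = 1 + fraction / 2**52
    return (-1) ** sign * significand * 2.0 ** (exponent - 1023)

fields_of_one = float64_fields(1.0)
roundtrip = reconstruct_normal(*float64_fields(-56.228))
```

The integer `fraction` equals $\sum_{j=0}^{51} b_j 2^{j}$, so dividing it by $2^{52}$ yields exactly the sum in the formula.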

Each bit of the IEEE 754 representation becomes one coordinate of the additive encoding (64 coordinates for float64), covering the full range as well as special values ($\pm0$, $\pm\infty$, NaN). The checklist below matches the paper’s desiderata analysis: what each single-token design achieves in principle, where it breaks, and where BitTokens still involves a deliberate trade-off.

The IEEE 754 double-precision binary floating-point format (float64).

Desiderata at a glance

The table uses the same taxonomy as the desiderata section; columns follow the paper’s comparison of xVal, FoNE, and BitTokens.

Desideratum	xVal	FoNE	BitTokens
D1 Token efficiency
D2 Uniqueness
D3 Structured
D4 Scale invariance
D5 Normalization
D6 Numerical stability
D7 Continuity
D8 Robustness
D9 Arithmetic

Legend: satisfies, trade-off / partial, breaks

Takeaway 5: BitTokens deliberately mirrors how computers store numbers to efficiently encode and decode numbers. It fulfills all desiderata for single-token number encodings.

How this looks in code

Raw text → Regex spans → [NUM] → Bit encode + pad → add to embeddings → Transformer layers → Linear num head → BCE / decode

Full training code, tokenizers, datasets, and evaluation live in the official repository: github.com/KreitnerL/BitTokens

(a) Parsing numeric spans with a regex

Parsing numbers

import re
import torch

sample_text = "What is the minimum of the list [1566.28, 1571.08, 2329.9]?"  # example input
NUMERIC_SPAN_REGEX = re.compile(r"[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))")
raw_numbers = [m.group(0).strip() for m in NUMERIC_SPAN_REGEX.finditer(sample_text)]
nums = torch.tensor([float(n) for n in raw_numbers], dtype=torch.float64)
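To get a feel for what this pattern does and does not capture, the standalone sketch below applies the same regex to two short strings using only the standard library. The helper `extract_numbers` and the example strings are assumptions made for illustration.

```python
import re

NUMERIC_SPAN_REGEX = re.compile(r"[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))")

def extract_numbers(text):
    """Return every numeric span the pattern finds, in order of appearance."""
    return [m.group(0) for m in NUMERIC_SPAN_REGEX.finditer(text)]

# Decimals, a signed value, and an integer followed by a sentence-ending period.
plain = extract_numbers("Sort [6539.243, 2263.345] and add -0.5 to 10.")

# Scientific notation is not matched as one number by this pattern.
scientific = extract_numbers("1e-5")
```

Here `plain` is `['6539.243', '2263.345', '-0.5', '10']`, while `scientific` is split into `['1', '-5']`, so inputs in exponent notation would need to be normalized before parsing.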

(b) Forward pass — IEEE bits from float tensors + combine at token positions

Encoding numbers

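float64_bit_shifts = torch.arange(63, -1, -1, dtype=torch.int64)

def float64_tensor_to_binary_tensor(tensor_in: torch.DoubleTensor) -> torch.LongTensor:
    # Reinterpret each float64 as int64 and extract its 64 bits, most significant first
    int_representation = tensor_in.view(torch.int64).unsqueeze(-1)
    bits = (int_representation >> float64_bit_shifts) & 1
    return bits

# IEEE 754 bit pattern, shape [num_numbers, 64]
bits = float64_tensor_to_binary_tensor(nums)

# Optional reciprocal encoding (1/x), concatenated to shape [num_numbers, 128]
reciprocal_bits = float64_tensor_to_binary_tensor(nums.reciprocal())
full_encoding = torch.cat([bits, reciprocal_bits], dim=-1)

# Combine with token embeddings (sum). inputs_embeds and number_mask come from the model.
# Assumption: the encoding is zero-padded on the right up to the hidden width.
pad_width = inputs_embeds.shape[-1] - full_encoding.shape[-1]
padded_encoding = torch.nn.functional.pad(full_encoding.to(inputs_embeds.dtype), (0, pad_width))
combined = inputs_embeds.clone()
combined[number_mask] = inputs_embeds[number_mask] + padded_encoding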

(c) Number head, prediction, and BCE loss

Computing number loss


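hidden_states = combined  # Dummy assignment; in practice, the output of the transformer layers
hidden_dim = hidden_states.shape[-1]
output_size = bits.shape[-1]  # number of predicted bits (64)
target_bits = bits.to(torch.float32)
freq_weights = torch.ones(output_size).unsqueeze(0)  # uniform per-bit weights in this example

num_head_linear = torch.nn.Linear(hidden_dim, output_size)
pred_bits_logits = num_head_linear(hidden_states[number_mask])

# Per-bit binary cross-entropy, then a weighted mean over the bit dimension
bce_per_bit = torch.nn.functional.binary_cross_entropy_with_logits(
    pred_bits_logits, target_bits, reduction='none'
)
weighted_bce = (bce_per_bit * freq_weights).mean(dim=-1)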

(d) Decoding bits back to float64

Decoding numbers

def binary_tensor_to_float64_tensor(bits_int64: torch.Tensor) -> torch.Tensor:
    # Weight each bit by its power of two; the shift by 63 sets the int64 sign bit
    weights = torch.tensor(1, dtype=torch.int64) << float64_bit_shifts
    reconstructed_int = torch.sum(bits_int64 * weights, dim=-1)
    # Reinterpret the int64 bit pattern as float64
    return reconstructed_int.view(torch.float64)

# Simulate noisy logits: map bits to -1/+1 and add uniform noise from [-0.5, 0.5)
noisy_pred_logits = bits * 2 - 1 + torch.rand_like(bits.float()).sub_(0.5)
x_base_digits_pred: torch.LongTensor = (noisy_pred_logits > 0).to(torch.int64)
num_preds = binary_tensor_to_float64_tensor(x_base_digits_pred)
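The same noisy round trip can be reproduced without PyTorch, which makes the robustness argument easy to inspect. The sketch below is a standard-library analogue of the snippet above, under the same assumption that noise stays within half the distance between the two bit levels; the helper names and the fixed seed are illustrative choices.

```python
import random
import struct

def float_to_bits(v):
    """64 IEEE 754 bits of a float64, most significant bit first."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", v))
    return [(raw >> shift) & 1 for shift in range(63, -1, -1)]

def bits_to_float(bits):
    """Inverse of float_to_bits."""
    raw = 0
    for b in bits:
        raw = (raw << 1) | b
    (v,) = struct.unpack(">d", struct.pack(">Q", raw))
    return v

def noisy_roundtrip(v, rng):
    # Map bits to -1/+1 pseudo-logits, add uniform noise from [-0.5, 0.5), threshold at 0.
    logits = [2 * b - 1 + (rng.random() - 0.5) for b in float_to_bits(v)]
    return bits_to_float([1 if z > 0 else 0 for z in logits])

rng = random.Random(0)  # fixed seed for reproducibility
values = [1566.28, -56.228, 0.04561093096763438]
decoded = [noisy_roundtrip(v, rng) for v in values]
```

Because each noisy logit stays on the correct side of zero, every value is recovered bit for bit, which is the kind of behavior desideratum D8 asks for.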

Takeaway 6: BitTokens can easily be integrated into existing LLMs with minimal changes. The binary vector of a number can be efficiently constructed via type reinterpretation and bit shifts.

Results

We compare BitTokens to traditional single-digit and triple-digit (subword) tokenizers, as well as xVal and FoNE.

Bar chart comparing BitTokens to baselines on multitask numeracy

Takeaway 7: BitTokens outperforms all other methods and achieves near-perfect performance on comparison and single-step calculation tasks. Low perplexity shows that language understanding is not impaired.

Conclusion

In brief, we answer the questions from above as follows:

How do frontier LLMs perform on arithmetic tasks?

Frontier models solve basic comparisons reliably, but strongly rely on reasoning chains with a large number of tokens to solve calculations.

What attributes should a single-token number encoding have?

Number encodings should jointly offer a large range of input magnitudes and precisions, be compatible with (low precision) gradient-based training, and allow for the learning of core arithmetic operations.

How do we build a better single-token number encoding?

BitTokens fuses a dedicated number token with an IEEE 754 float64 bit embedding and allows even small language models to learn algorithms to solve basic arithmetic operations.

Citation

@inproceedings{
    kreitner2026bittokens,
    title={Efficient numeracy in language models through single-token number embeddings},
    author={Linus Kreitner and Paul Hager and Jonathan Mengedoht and Georgios Kaissis and Daniel Rueckert and Martin J. Menten},
    booktitle={Forty-third International Conference on Machine Learning},
    year={2026},
    url={https://openreview.net/forum?id=Bh4Ubk80M8}
}

Acknowledgements

This work was partially funded by ERC Grant Deep4MI (Grant No. 884622).

The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universitaet Erlangen-Nuernberg (FAU) under the NHR project b247bb. NHR funding is provided by federal and Bavarian state authorities.

Martin J. Menten is funded by the German Research Foundation under project 532139938.