TL;DR Frontier LLMs often spend thousands of tokens reasoning through basic arithmetic. Ordinary tokenizers split numbers into many pieces. We propose BitTokens, a novel tokenization strategy for LLMs that embeds numbers using their IEEE 754 binary floating-point representation, which allows for efficient numeracy in language models.
Figure 1. LLMs perform poorly on arithmetic tasks, requiring excessive reasoning tokens to achieve good performance. Our BitTokens tokenization strategy allows language models to solve arithmetic tasks both effectively and efficiently.
Many researchers share the common vision that large language models (LLMs) will not only alleviate routine work but also drive scientific and technological innovation. In many fields, such as physics and engineering, solving such complex tasks requires the processing of large amounts of numerical data and extensive calculations. Thus, to aid advancements in these fields, LLMs must possess efficient and effective numeracy, defined as the ability to represent and compute numbers.
Research to improve the numeracy of LLMs has predominantly focused on two strategies:
1. Arithmetic tool use
2. Reasoning chains
We argue that both tooling and reasoning chains are merely crutches, allowing LLMs to solve arithmetic calculations but preventing them from obtaining the intrinsic numeric computation skills required to efficiently solve complex tasks in advanced technical domains. We hypothesize that addressing this problem requires rethinking the way LLMs tokenize and encode numbers.
Takeaway 1: External tools and long reasoning are partial fixes: they do not give the model an intrinsic, efficient numeric representation for latent computation and long numeric-heavy contexts.
In our work, we answer the following questions:
How do frontier LLMs perform on arithmetic tasks?
What attributes should a single-token number encoding have?
How do we build a better single-token number encoding?
Popular math benchmarks often mix language understanding with computation. We built a controlled benchmark that isolates numeric skill: comparisons and ordering, core arithmetic, and short multi-step statistics, under diverse magnitudes and precisions aligned with float64 reality.
What is the minimum of the list [1566.28, 1571.08, 2329.9, 1190.28, 878.242]?
What interval does x=57.667 belong to? A: x < 13.626, B: 13.626 <= x < 26.239, C: 26.239 <= x < 29.012, D: 29.012 <= x
Sort the list [6539.243, 2263.345, 14708.01, 15953.11] in ascending order.
What is 9.6 − 77.96?
What is −56.228 × 37,485,561?
What is 0.04561093096763438 ÷ −0.00563?
What is $0.548309577^{-7}$?
What is the mean of the list [−54733, −4216.9, 38145, 20819]?
What is the std of the list [−26975.4, 86144.6, −51820.1, −7862.07]?
We design our tasks to isolate individual aspects of numeracy as well as core mathematical operations. Numbers contain at most 15 significant decimal digits. Number magnitudes, differences between numbers, and precision are all densely and uniformly sampled.
We report exact match where appropriate. For continuous predictions we use a logarithmically scaled view of the symmetric mean absolute percentage error, $\mathrm{sMAPE}(\hat y, y) = \dfrac{\lvert \hat y - y\rvert}{\lvert y\rvert + \lvert \hat y\rvert + \varepsilon}$ with $\varepsilon=10^{-100}$ for stability, so that errors in later significant digits are visible.
In the log-scaled score, $M=15$ denotes the maximum number of significant digits tested.
Takeaway 2: Think of log-sMAPE as summarizing how many of the 15 significant digits are trustworthy before the first mistake.
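The exact log-scaling is not reproduced in this text, so the following is only a sketch of one plausible form. It assumes the score is $-\log_{10}(\mathrm{sMAPE})/M$ clipped to $[0, 1]$; the function names `smape` and `log_smape_score` are hypothetical and the paper's definition may differ.

```python
import math

EPS = 1e-100  # stability constant from the text
M = 15        # maximum number of significant digits tested


def smape(y_pred: float, y_true: float) -> float:
    """Symmetric relative error in [0, 1], as defined in the text."""
    return abs(y_pred - y_true) / (abs(y_true) + abs(y_pred) + EPS)


def log_smape_score(y_pred: float, y_true: float) -> float:
    """ASSUMED log-scaled view: roughly the fraction of the M digits that agree."""
    err = smape(y_pred, y_true)
    if err == 0.0:
        return 1.0  # exact match
    return min(1.0, max(0.0, -math.log10(err) / M))


score_close = log_smape_score(123.456, 123.457)  # error in the 6th significant digit
score_wrong_sign = log_smape_score(1.0, -1.0)    # maximal sMAPE, lowest score
```

Under this assumed form, a prediction that is wrong from the sixth significant digit onward scores roughly a third, which matches the "trustworthy digits" reading of Takeaway 2.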
Figure 2. While simple tasks such as addition and ordering are almost perfectly solved by frontier LLMs, other tasks such as multiplication, division, calculating the standard deviation, or exponentiation remain difficult and require extensive reasoning tokens to solve.
Figure 3. Difficult numeracy tasks such as multiplication, division, exponentiation, and standard deviation can only be solved by frontier models using excessive reasoning tokens.
Immediately apparent is the perfect accuracy of almost all models on the basic number comparison tasks of MinMax, Sorting, and Interval, which are effectively solved. When considering the basic arithmetic tasks, there is a clear separation between high reasoning and non- or minimal-reasoning models. While both perform excellently on addition and mean calculation, all non-reasoning models perform very poorly on multiplication, division, exponentiation, and standard deviation tasks, with none achieving above 60% $\mathrm{log\textnormal{-}sMAPE}$. Only once we allow a large number of reasoning tokens do we see improved performance on these tasks. This separation between thinking and non-thinking models is underscored in Figure 3. LLMs use between 5 and 30 thousand tokens to solve a single calculation. While future advances in post-training could reduce the number of reasoning tokens, it is clear that current LLMs rely heavily on reasoning chains to compensate for their poor native arithmetic performance.
Takeaway 3: Current LLMs rely heavily on reasoning chains to compensate for their poor native arithmetic performance.
Our goal is to develop a number encoding strategy that generates representations which both maximize token efficiency and allow for neural networks to learn algorithms that perfectly execute numeric calculations.
D1 Token efficiency (representation): Every number is represented by a single token.
D2 Uniqueness (representation): Each value has exactly one valid encoding, with a unique inverse mapping.
D3 Structured (representation): The encoding geometry reflects numeric order and distance, facilitating generalizable algorithms.
D4 Scale invariance (representation): The desired range of input magnitudes and precisions can be represented.
D5 Normalization (training): Encodings are bounded and information preserving under standard normalization functions used in language models (e.g., LayerNorm, RMSNorm).
D6 Numerical stability (training): Representations remain accurate when using low-precision activations (e.g., FP8).
D7 Continuity (training): Encodings vary smoothly with the underlying value, making them compatible with gradient-based optimization.
D8 Robustness (training): Values can be decoded reliably under stochastic noise, allowing for stochastic training.
D9 Arithmetic (arithmetic): Encodings admit learnable algorithms for core mathematical operations.
xVal learns a numeric token scaled by value, which is conceptually simple and injective on reasonable ranges. To survive normalization, magnitudes are squeezed (e.g., toward a narrow band), which limits representable range and precision (D4, D6) and ultimately arithmetic flexibility (D9).
Fourier-style embeddings behave well on several geometric and robustness criteria. Addition maps to a simple operation in frequency space; multiplication, however, does not admit a cheap local operator. Networks need to decode → multiply → re-encode (D9 bottleneck).
xVal attaches a trainable [NUM] embedding and multiplies it by the scalar being encoded.
When the model emits a number token, a dedicated head reads the continuous value from hidden states.
That design is injective and smooth enough to satisfy D1, D2, D3, and D7. However, as shown by its authors, xVal must map all inputs to the range [-5, 5] to satisfy D5; otherwise layer normalization interferes with the information carried by the magnitude of the encoding.
This severely limits the range and precision of the numbers it can encode, violating D4 and D6 and, by extension, D9.
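The interaction with normalization can be illustrated with a simplified sketch. It assumes a bare scaled embedding passed through an RMSNorm without learned gain; `num_embedding` is a random stand-in for the learned [NUM] vector, and a real model adds other terms (such as positional information) before normalizing.

```python
import torch


def rms_norm(v: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """RMSNorm without learned gain: rescale to unit root-mean-square."""
    return v / torch.sqrt(v.pow(2).mean(dim=-1, keepdim=True) + eps)


torch.manual_seed(0)
num_embedding = torch.randn(16, dtype=torch.float64)  # stand-in for the learned [NUM] vector

small = rms_norm(num_embedding * 2.0)
large = rms_norm(num_embedding * 300.0)
negative = rms_norm(num_embedding * -300.0)

# After normalization, 2 and 300 are indistinguishable; only the sign survives.
same_after_norm = torch.allclose(small, large)
sign_survives = torch.allclose(negative, -large)
```

In this idealized setting the magnitude is erased completely, which is why a value-scaled embedding has to keep its inputs in a narrow band relative to everything else in the residual stream.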
FoNE builds a vector of sines and cosines at multiple base-10 frequencies (covering integer and fractional digit positions), then adds this to a learned [NUM] token.
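A sinusoidal encoding $\mathcal F:\mathbb R\to \mathbb T^{|\Phi|}$ maps real numbers to a $|\Phi|$-dimensional torus, which forms a compact abelian Lie group. Given frequencies $b^\phi$ with base $b>1$ and $\phi\in\Phi\subset\mathbb Z$, $\mathcal F$ is defined as:
$$\mathcal F(x) := \big(\mathcal F_\phi(x)\big)_{\phi\in\Phi}, \qquad \mathcal F_\phi(x) := e^{2\pi i\, b^{\phi} x},$$
where each complex component is realized as the pair $\big(\cos(2\pi b^{\phi} x),\ \sin(2\pi b^{\phi} x)\big)$.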
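Leaving the learned [NUM] token aside, using sine and cosine components in equal number guarantees a constant RMS norm (since $\sin^2\theta+\cos^2\theta=1$ at every frequency), satisfying D5.
FoNE decodes digit by digit, choosing for each position the digit whose reference encoding is most similar to the prediction. This digit-wise maximum-similarity decoding is robust to noise, satisfying D8.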
The key limitation of sinusoidal encodings is that they are poorly suited for solving certain arithmetic operations with neural networks, most notably multiplication.
In order to satisfy condition D9, we require a learnable mapping $o'$ that computes the encoding, $\xi$, of the result of an operation $o$ from the encodings of its operands:
$$o'\big(\xi(x_1), \xi(x_2)\big) = \xi\big(o(x_1, x_2)\big).$$
The encoding of the sum of two numbers $x_1$ and $x_2$ can be computed by a simple component-wise multiplication of their respective encodings.
The encoding map $\mathcal F$ is a group homomorphism from the additive group of real numbers $(\mathbb R, +)$ to the multiplicative torus $(\mathbb T^{|\Phi|}, \odot)$, where $\odot$ denotes the element-wise (Hadamard) product.
Proof.
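The claim follows directly from Euler's formula. Writing each component as $\mathcal F_\phi(x) = e^{2\pi i\, b^{\phi} x}$, we have for every $\phi\in\Phi$:
$$\mathcal F_\phi(x_1+x_2) = e^{2\pi i\, b^{\phi}(x_1+x_2)} = e^{2\pi i\, b^{\phi} x_1}\, e^{2\pi i\, b^{\phi} x_2} = \mathcal F_\phi(x_1)\,\mathcal F_\phi(x_2).$$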
This homomorphism elegantly transforms addition in the number domain into a simple, local operation in the encoded domain. Notably, this operation does not require carry-over logic. However, for multiplication such a simple mapping does not exist.
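The homomorphism can be checked numerically. The sketch below assumes the complex form $e^{2\pi i\, b^{\phi} x}$ with base 10 and an arbitrarily chosen exponent range; `sinusoidal_encode` is a hypothetical helper, not code from FoNE or the BitTokens repository.

```python
import math

import torch


def sinusoidal_encode(x: float, base: float = 10.0, exponents=range(-3, 3)) -> torch.Tensor:
    """One unit-circle component exp(2*pi*i * base**phi * x) per exponent phi."""
    freqs = torch.tensor([base ** phi for phi in exponents], dtype=torch.float64)
    angles = 2.0 * math.pi * freqs * x
    return torch.polar(torch.ones_like(angles), angles)  # complex128, magnitude 1


x1, x2 = 12.25, 7.5
hadamard = sinusoidal_encode(x1) * sinusoidal_encode(x2)  # element-wise product

# The Hadamard product of the encodings is the encoding of the sum ...
addition_is_local = bool((sinusoidal_encode(x1 + x2) - hadamard).abs().max() < 1e-9)
# ... and therefore not the encoding of the product.
encodes_product = bool((sinusoidal_encode(x1 * x2) - hadamard).abs().max() < 1e-9)
```

No carry logic appears anywhere: each frequency component is combined independently of all others.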
Let $\mathcal X:=\{\varepsilon,\dots,U\}$ be the set of input numbers with resolution $\varepsilon=b^{m}$ and choose $\Phi = \{m,\dots,n\}$ so that $\mathcal F$ uniquely encodes the entire number range. Assume each encoding component has a finite precision $P$ (i.e., can represent $P$ distinct states). Suppose there exists an operator $\otimes_\phi:\mathbb T^{|\Phi|}\times\mathbb T^{|\Phi|}\mapsto\mathbb T$ such that $\otimes_\phi(\mathcal F(x),\mathcal F(y)) := \mathcal F_\phi(xy)$ for each output frequency $\phi\in\Phi$. Let $S_\phi^x,S_\phi^y\subseteq\Phi$ be the subsets of input frequencies that $\otimes_\phi$ is required to take as input from $\mathcal F(x)$ and $\mathcal F(y)$, respectively. Then:
1. Non-locality. Each output component needs at least $\lceil\log_P|\mathcal X|\rceil$ input frequencies from each operand, i.e., $|S_\phi^x|\ge\lceil\log_P|\mathcal X|\rceil$ and $|S_\phi^y|\ge\lceil\log_P|\mathcal X|\rceil$. The proof is a counting argument. Assume for contradiction that $|S_\phi^x|<\log_P|\mathcal X|$. Then the restriction $\mathcal{F}(x)|_{S_\phi^x}$, which is the only information about $x$ available to $\otimes_\phi$, can take at most $P^{|S_\phi^x|} < |\mathcal X|$ distinct states. By the pigeonhole principle, there exist $x_1, x_2 \in \mathcal{X}$ with $x_1 \neq x_2$ that are indistinguishable to the operator, i.e., $\mathcal{F}(x_1)|_{S_\phi^x} = \mathcal{F}(x_2)|_{S_\phi^x}$. Let $\Delta:=x_1-x_2\neq 0$. The set $Y^*:=\{y\in \mathcal X:\; b^\phi \Delta y\in\mathbb Z\}$ is a proper subset of $\mathcal X$, so we can pick $y^\star\in\mathcal X \setminus Y^*$. Since $b^\phi \Delta y^\star\notin\mathbb Z$, we have $\mathcal F_\phi(x_1 y^\star)\neq \mathcal F_\phi(x_2 y^\star)$, so by definition $\otimes_\phi$ must produce two different outputs. However, its inputs for $x_1$ and $x_2$ are identical, and as a function it must return the same output for both. This is a contradiction, so the assumption is false and $|S_\phi^x|\ge\lceil\log_P|\mathcal X|\rceil$. The same argument applies to $|S_\phi^y|$.
2. Computational complexity.
Multiplication in any positional system is equivalent to the discrete convolution of the coefficients $k$ and $l$, followed by carry propagation.
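Write $x=\sum_{i} k_i b^i$ and $y=\sum_{j} l_j b^j$ in base $b$ with digit coefficients $k_i$ and $l_j$. Then
$$xy = \sum_{i}\sum_{j} k_i l_j\, b^{i+j} = \sum_{s}\Big(\sum_{i+j=s} k_i l_j\Big) b^{s},$$
where the inner sum is the discrete convolution $(k * l)_s$. Its entries can exceed $b-1$, so a carry propagation step is needed to recover valid digits.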
The preceding proposition demonstrates that performing multiplication in sinusoidal encoding space requires a transformation that is both computationally intensive and prone to precision errors. Any network implementing such an operation is forced to first decode, then calculate, and finally re-encode the encoding. This leads us to conclude that sinusoidal encodings alone are not well-suited as a general-purpose number representation.
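To make the "convolution followed by carry propagation" view concrete, here is a minimal pure-Python sketch on little-endian digit lists. The helper names are hypothetical and the code illustrates only the positional arithmetic, not anything a network would literally execute.

```python
def to_digits(n: int, base: int = 10) -> list[int]:
    """Little-endian digits: n == sum(d[i] * base**i)."""
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(r)
    return digits or [0]


def from_digits(digits: list[int], base: int = 10) -> int:
    return sum(d * base**i for i, d in enumerate(digits))


def multiply_digits(k_digits: list[int], l_digits: list[int], base: int = 10) -> list[int]:
    # Step 1: discrete convolution; every output position mixes many input positions.
    conv = [0] * (len(k_digits) + len(l_digits) - 1)
    for i, ki in enumerate(k_digits):
        for j, lj in enumerate(l_digits):
            conv[i + j] += ki * lj
    # Step 2: carry propagation from low to high positions.
    out, carry = [], 0
    for c in conv:
        carry, digit = divmod(c + carry, base)
        out.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        out.append(digit)
    return out


x, y = 56228, 37485561
product = from_digits(multiply_digits(to_digits(x), to_digits(y)))
```

Both steps are non-local: a single output digit depends on many digits of both operands and on the carries generated below it, in contrast to the purely component-wise rule for addition.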
In log space, $\ln(x_1 x_2) = \ln x_1 + \ln x_2$, so multiplication becomes another additive homomorphism—at the price that addition is no longer local (classic log-number-system trade-off). Recovering $\mathcal{F}_{\log}(x_1 + x_2)$ from $\mathcal{F}_{\log}(x_1)$ and $\mathcal{F}_{\log}(x_2)$ needs something like exp–sum–log, reintroducing the same mixing difficulties (see paper discussion).
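The trade-off is easy to see on plain scalars. The sketch below works with natural logarithms of positive numbers only and uses the numerically stable "log-sum-exp" form for addition; it is an illustration of the log-number-system idea, not an encoding proposed in the paper.

```python
import math

x1, x2 = 3.5, 12.0
log_x1, log_x2 = math.log(x1), math.log(x2)

# Multiplication: a plain sum of the two log-encodings.
log_product = log_x1 + log_x2

# Addition: requires exp -> sum -> log, here in the stable log-sum-exp form.
hi, lo = max(log_x1, log_x2), min(log_x1, log_x2)
log_sum = hi + math.log1p(math.exp(lo - hi))
```

Multiplication stays a single addition, while addition needs two transcendental functions that mix information across the whole representation.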
Takeaway 4: No prior single-token recipe simultaneously satisfies engineering constraints and supports learning general and robust arithmetic algorithms.
Guided by the desiderata introduced in the previous section, we propose BitTokens, a novel numeric encoding algorithm.
BitTokens uses a dedicated, learnable [NUM] token to which a numeric encoding is added that is based on the IEEE 754 double-precision binary floating-point format (i.e., float64).
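The floating-point format writes a signed real value $v$ as
$$v = (-1)^{s}\cdot 2^{\,e-1023}\cdot\Big(1+\sum_{i=1}^{52} m_i\,2^{-i}\Big),$$
with sign bit $s$, 11-bit biased exponent $e$, and 52 significand bits $m_i$ (shown here for normal float64 values).
Each of the 64 IEEE bits becomes one coordinate in the additive encoding, covering the full range and the special values ($\pm0$, $\pm\infty$, NaN).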
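The three fields of a float64 can be inspected with the standard library alone. The following sketch uses hypothetical helper names and reconstructs only normal values; subnormals, infinities, and NaN follow separate rules in IEEE 754.

```python
import struct


def decompose_float64(v: float) -> tuple[int, int, int]:
    """Return (sign bit, 11-bit biased exponent, 52-bit fraction) of a float64."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", v))
    sign = raw >> 63
    exponent = (raw >> 52) & 0x7FF
    fraction = raw & ((1 << 52) - 1)
    return sign, exponent, fraction


def recompose_normal(sign: int, exponent: int, fraction: int) -> float:
    """Value of a normal number: (-1)^s * 2^(e - 1023) * (1 + f / 2^52)."""
    return (-1.0) ** sign * 2.0 ** (exponent - 1023) * (1.0 + fraction / 2**52)


sign, exponent, fraction = decompose_float64(-6.25)  # -6.25 = -1.5625 * 2^2
bit_string = f"{sign:01b}{exponent:011b}{fraction:052b}"  # 64 coordinates, MSB first
```

The 64 characters of `bit_string` correspond one-to-one to the coordinates of the encoding described above.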
The checklist below matches the paper’s desiderata analysis: what each single-token design achieves in principle, where it breaks, and where BitTokens still involves a deliberate trade-off.
The IEEE 754 double-precision binary floating-point format (float64).
The rows use the same taxonomy as § Desiderata; the columns follow the paper’s comparison of xVal, FoNE, and BitTokens.
| Desideratum | xVal | FoNE | BitTokens |
|---|---|---|---|
| D1 Token efficiency | Scales a learned [NUM] embedding by the encoded value. | Packs Fourier features of the value into a single [NUM] token. | Each parsed number becomes one [NUM] token. |
| D2 Uniqueness | On the chosen rescaling, distinct values map to distinct scaled embeddings, and the number head targets a single continuous decode, so the mapping is injective on the training range. | With frequencies covering the modeled digit grid, phases identify a unique value inside that range; decoding compares against reference encodings digit-wise. | Every finite float64 has a canonical IEEE bit pattern; the model predicts that 64-bit vector, so (up to float semantics) there is a one-to-one link between value and target bits. |
| D3 Structured | Because the embedding direction is fixed and only the scalar scales magnitude, larger numbers move the vector along a predictable ray, which is useful for monotonic comparisons and simple regressions. | Base-10 frequencies tie phases to digit places, so local geometry reflects positional structure; addition becomes a simple operation in Fourier (Hadamard) space. | Bits expose sign, biased exponent, and significand in the same order hardware uses, so algorithms can mirror textbook floating-point steps (align, add, round, normalize). |
| D4 Scale invariance | To keep RMSNorm from erasing magnitude, inputs must be squeezed (e.g. into a narrow band). That rescaling collapses many orders of magnitude and fine distinctions, so the method cannot faithfully cover wide scientific ranges at full precision. | Multiple decades of frequencies capture both very large and very small magnitudes (and fractional digits) within the chosen digit budget, so the design targets wide dynamic range by construction. | Float64 spans about $10^{\pm 308}$ with roughly 15–17 decimal digits of precision, matching IEEE range; specials like infinities and NaN are native cases of the format. |
| D5 Normalization | After aggressive input rescaling, activations feeding LayerNorm/RMSNorm stay in a well-behaved magnitude band, so normalization layers no longer wipe out the encoded number, at the cost of the D4 trade-off. | Sine and cosine pairs keep each frequency’s energy bounded; combined with the shared [NUM] token, the encoding keeps a constant RMS norm. | Raw bits are mapped linearly to $[-1,1]$ (optionally with reciprocal bits concatenated), giving predictable norms for RMSNorm while preserving discrete meaning per dimension. |
| D6 Numerical stability | The narrow dynamic range implied by rescaling means tiny changes from FP8/BF16 matmuls or activations can already move the decoded value materially, so low-precision training is fragile relative to the intended magnitudes. | Smooth periodic features live on a compact torus; small numerical noise perturbs phases continuously rather than snapping entire exponents, which plays well with mixed-precision training. | Each channel represents a single binary digit after scaling; activations may jitter but the supervised target stays in $\{0,1\}$, so quantization noise does not collapse mantissa semantics the way tiny continuous deltas can. |
| D7 Continuity | The encoding and regression head are smooth in the underlying real; gradients flow through scalar multiplication and MSE-style objectives without hard discontinuities from thresholding. | Sinusoids are differentiable in the input, so end-to-end optimization sees smooth landscapes (modulo periodic wrapping), which is ideal for gradient-based fitting. | Training uses bit-wise BCE with sigmoid logits: nearby reals can flip many mantissa bits, so the loss landscape is not strictly continuous in the value. In practice the model still fits well because latent layers smooth those discontinuities. |
| D8 Robustness | Decoding is a direct regression into $\mathbb{R}$; there is no built-in thresholding step analogous to hardened bits, so moderate noise in late layers can shift the predicted scalar more ambiguously than digit-wise voting schemes. | Outputs are read by cosine-similarity against reference digit prototypes, which averages out small perturbations per digit position and makes decoding deliberately robust under stochastic depth/dropout noise. | The number head emits logits per bit; applying a fixed 0.5 threshold after the sigmoid snaps predictions to valid binary strings, hardening decoding against small logit jitter during inference. |
| D9 Arithmetic | Because representable values are tightly compressed, the model lacks room to stage intermediate wide-dynamic-range results; learning general multiply/divide routines that mirror arbitrary-precision hardware is therefore impractical. | Multiplication is not a local operation on the torus: implementing it forces an implicit decode of mixed phases, coefficient convolution with carries, and a re-encode, which is precisely the bottleneck formalized for sinusoidal encodings in the paper. | Addition and multiplication decompose into standard IEEE algorithms on independent bit tracks; transformers can allocate width to emulate carries, exponent handling, and significand updates in parallel. |
Takeaway 5: BitTokens deliberately mirrors how computers store numbers, which makes encoding and decoding efficient. It addresses all nine desiderata for single-token number encodings, with continuity (D7) as a deliberate trade-off.
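The normalization property (D5) can be illustrated with a short sketch. It assumes the linear map from bits to $[-1,1]$ is `bit * 2 - 1` and omits the optional reciprocal bits; `bit_encoding` is a hypothetical helper rather than the repository's implementation.

```python
import torch


def bit_encoding(values: torch.Tensor) -> torch.Tensor:
    """float64 values -> 64 coordinates in {-1, +1}, most significant bit first."""
    shifts = torch.arange(63, -1, -1, dtype=torch.int64)
    bits = (values.view(torch.int64).unsqueeze(-1) >> shifts) & 1
    return bits.to(torch.float32) * 2.0 - 1.0  # ASSUMED mapping of {0, 1} to {-1, +1}


values = torch.tensor([1e-300, 0.5, 1.0, 37485561.0, -1e300], dtype=torch.float64)
enc = bit_encoding(values)
rms = enc.pow(2).mean(dim=-1).sqrt()  # RMS norm of each 64-dimensional encoding
```

Every coordinate has magnitude one, so the RMS norm is identical for values spanning 600 orders of magnitude, and rescaling by a normalization layer leaves the sign pattern, and hence the number, intact.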
Full training code, tokenizers, datasets, and evaluation live in the official repository: github.com/KreitnerL/BitTokens
(a) Parsing numeric spans with a regex
```python
import re
import torch

sample_text = "What is 9.6 - 77.96?"  # placeholder input for illustration
NUMERIC_SPAN_REGEX = re.compile(r"[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))")
raw_numbers = [m.group(0).strip() for m in NUMERIC_SPAN_REGEX.finditer(sample_text)]
nums = torch.tensor([float(n) for n in raw_numbers], dtype=torch.float64)
```
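One way to use such a regex is to replace every numeric span with a placeholder token and keep the parsed values on the side. The sketch below assumes this workflow and a hypothetical helper `split_text_and_numbers`; how the repository wires spans to [NUM] positions may differ.

```python
import re

NUMERIC_SPAN_REGEX = re.compile(r"[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))")


def split_text_and_numbers(text: str, placeholder: str = "[NUM]") -> tuple[str, list[float]]:
    """Replace each numeric span with a placeholder and collect the parsed values."""
    values = [float(m.group(0)) for m in NUMERIC_SPAN_REGEX.finditer(text)]
    return NUMERIC_SPAN_REGEX.sub(placeholder, text), values


masked, values = split_text_and_numbers(
    "What is the mean of the list [-54733, -4216.9, 38145, 20819]?"
)
```

Each number then costs exactly one position in the sequence, regardless of how many digits it has.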
(b) Forward pass — IEEE bits from float tensors + combine at token positions
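```python
# Bit positions 63..0, most significant bit (the sign) first
float64_bit_shifts = torch.arange(63, -1, -1, dtype=torch.int64)

def float64_tensor_to_binary_tensor(tensor_in: torch.DoubleTensor) -> torch.LongTensor:
    # Reinterpret the float64 storage as int64, then extract each bit by shifting
    int_representation = tensor_in.view(torch.int64).unsqueeze(-1)
    bits = (int_representation >> float64_bit_shifts) & 1
    return bits

# IEEE 754 bit pattern
bits = float64_tensor_to_binary_tensor(nums)
# Optional reciprocal encoding
reciprocal_bits = float64_tensor_to_binary_tensor(nums.reciprocal())
full_encoding = torch.cat([bits, reciprocal_bits], dim=-1)
# Placeholder inputs so the snippet runs on its own (illustrative only):
# embedding width equal to the encoding width, every position marked as a number.
inputs_embeds = torch.zeros(nums.numel(), full_encoding.shape[-1])
number_mask = torch.ones(nums.numel(), dtype=torch.bool)
combined = inputs_embeds.clone()
# Combine with token embeddings (sum)
combined[number_mask] = inputs_embeds[number_mask] + full_encoding
```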
(c) Number head, prediction, and BCE loss
```python
hidden_states = combined  # Dummy assignment: stands in for the transformer output states
output_size = bits.shape[-1]  # 64 target bits per number
target_bits = bits.to(torch.float32)
freq_weights = torch.ones(output_size).unsqueeze(0)  # uniform per-bit loss weights
# Linear number head: one logit per bit
num_head_linear = torch.nn.Linear(hidden_states.shape[-1], output_size)
pred_bits_logits = num_head_linear(hidden_states[number_mask])
bce_per_bit = torch.nn.functional.binary_cross_entropy_with_logits(
    pred_bits_logits, target_bits, reduction='none'
)
weighted_bce = (bce_per_bit * freq_weights).mean(dim=-1)
```
(d) Decoding bits back to float64
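```python
def binary_tensor_to_float64_tensor(bits_int64: torch.Tensor) -> torch.Tensor:
    # Place each bit at its position (MSB first) and sum into one int64 per number
    weights = torch.tensor(1, dtype=torch.int64) << float64_bit_shifts
    reconstructed_int = torch.sum(bits_int64 * weights, dim=-1)
    # Reinterpret the int64 storage as float64
    return reconstructed_int.view(torch.int64).view(torch.float64)

# Simulate imperfect logits: correct sign plus uniform noise in [-0.5, 0.5)
noisy_pred_logits = bits * 2 - 1 + torch.rand_like(bits.float()).sub_(0.5)
# Threshold at logit 0 (sigmoid probability 0.5) to recover hard bits
x_base_digits_pred: torch.LongTensor = (noisy_pred_logits > 0).to(torch.int64)
num_preds = binary_tensor_to_float64_tensor(x_base_digits_pred)
```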
Takeaway 6: BitTokens can easily be integrated into existing LLMs with minimal changes. The binary vector of a number can be efficiently constructed via type reinterpretation and bit shifts.
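As a self-contained check of the reinterpret-and-shift idea, the following sketch round-trips a few values, including a signed zero, a subnormal, and infinity. The helpers `to_bits` and `from_bits` are written for this illustration and are not the repository functions.

```python
import torch

SHIFTS = torch.arange(63, -1, -1, dtype=torch.int64)  # bit positions, MSB first


def to_bits(x: torch.Tensor) -> torch.Tensor:
    """float64 tensor -> int64 tensor of shape (..., 64) with entries in {0, 1}."""
    return (x.view(torch.int64).unsqueeze(-1) >> SHIFTS) & 1


def from_bits(bits: torch.Tensor) -> torch.Tensor:
    """Inverse of to_bits: shift each bit into place, sum, reinterpret as float64."""
    packed = (bits << SHIFTS).sum(dim=-1)  # no two shifted bits overlap
    return packed.view(torch.float64)


values = torch.tensor([0.0, -0.0, 1.5, -56.228, 1e-310, float("inf")], dtype=torch.float64)
restored = from_bits(to_bits(values))
```

The comparison in a test should be done on the integer views, since `-0.0 == 0.0` under floating-point equality would hide a lost sign bit.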
We compare BitTokens to traditional single-digit and triple-digit (subword) tokenizers, as well as xVal and FoNE.
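To give a feel for the token cost of each strategy, here is a rough sketch that counts tokens for one number. It assumes the single-digit scheme emits one token per character and the triple-digit scheme chunks digits greedily from left to right; the tokenizers actually used in the experiments may split differently.

```python
def single_digit_tokens(number: str) -> list[str]:
    """One token per character (digits, sign, decimal point)."""
    return list(number)


def triple_digit_tokens(number: str) -> list[str]:
    """Greedy left-to-right chunks of up to three digits; other characters stay separate."""
    tokens, run = [], ""
    for ch in number:
        if ch.isdigit():
            run += ch
            if len(run) == 3:
                tokens.append(run)
                run = ""
        else:
            if run:
                tokens.append(run)
                run = ""
            tokens.append(ch)
    if run:
        tokens.append(run)
    return tokens


number = "37485561.25"
counts = {
    "single-digit": len(single_digit_tokens(number)),
    "triple-digit": len(triple_digit_tokens(number)),
    "single-token": 1,  # xVal, FoNE, and BitTokens use one token per number
}
```

For a list of such numbers the gap multiplies, which is where single-token encodings save most of their context.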
Figure 4. Our BitTokens outperforms all other methods in the multi-task setting and is the superior single-token strategy for the multi-step tasks. Between the two multi-token strategies, single-digit tokenization performs best, albeit with the highest token cost.
Takeaway 7: BitTokens outperform all other methods and achieve near-perfect performance on comparison and single-step calculation tasks. Low perplexity shows that language understanding is not impacted.
In brief, we answer the questions from above as follows:
How do frontier LLMs perform on arithmetic tasks?
Frontier models solve basic comparisons reliably, but strongly rely on reasoning chains with a large number of tokens to solve calculations.
What attributes should a single-token number encoding have?
Number encodings should jointly offer a large range of input magnitudes and precisions, be compatible with (low precision) gradient-based training, and allow for the learning of core arithmetic operations.
How do we build a better single-token number encoding?
BitTokens fuses a dedicated number token with an IEEE 754 float64 bit embedding and allows even small language models to learn algorithms to solve basic arithmetic operations.
@inproceedings{
kreitner2026bittokens,
title={Efficient numeracy in language models through single-token number embeddings},
author={Linus Kreitner and Paul Hager and Jonathan Mengedoht and Georgios Kaissis and Daniel Rueckert and Martin J. Menten},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=Bh4Ubk80M8}
}
This work was partially funded by ERC Grant Deep4MI (Grant No. 884622).
The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universitaet Erlangen-Nuernberg (FAU) under the NHR project b247bb. NHR funding is provided by federal and Bavarian state authorities.
Martin J. Menten is funded by the German Research Foundation under project 532139938.