Embeddings Architecture

Detailed specification of ATOME LM's embedding layer, utilizing 1.58-bit ternary weights and per-tensor scaling for extreme memory efficiency.

✦

1.58-bit Ternary Weights

ATOME LM completely abandons traditional FP16/BF16 embedding tables. Instead, it utilizes 1.58-bit ternary weights {-1, 0, 1}. This reduces the memory footprint of the embedding layer by approximately 90% compared to standard architectures, crucial for deployment on edge devices.

Quantization Formulation

# Forward pass for ternary embedding lookup
W_ternary =
round(clip(W_fp16 / scale, -1, 1))

output =
lookup(W_ternary, input_ids) * scale

Memory Efficiency

BF16 Baseline:256 MB

ATOME Ternary:28 MB

⌁

Absence of Positional Embeddings

A defining characteristic of ATOME LM's input processing is the strict avoidance of explicit positional embeddings (such as ROPE or absolute positional encodings).

Position information is instead implicitly learned through the recurrent nature of the ternary state-space models deployed deeper in the network. This architectural decision eliminates the sequence-length scaling bottleneck inherent in typical Transformer positional mechanisms.

Tokens

RoPE

↓

Pure Ternary Tokens

◌

Zero-Heap Allocation Constraint

To ensure deterministic execution times and prevent memory fragmentation on edge devices, the embedding lookup process strictly adheres to a zero-heap allocation constraint during inference.

Implementation Constraints

Memory Pool:

Static arena pre-allocated at initialization.

Lookup:

Direct index mapping; no intermediate buffers.

Scaling:

Per-tensor scale factor applied via in-place SIMD multiplication.

C++ Primitive

// Zero-alloc lookup
void tern_embed(
  const int8_t* W,
  const int32_t* ids,
  float* out,
  float scale,
  int seq_len,
  int dim
) {
  for(int i=0; i<seq_len; ++i) {
    const int8_t* row = W + (ids[i] * dim);
    simd_mul_add(out + (i*dim), row, scale);
  }
}

▣

Embedding Metrics

Parameter	Value	Notes
Vocab Size	32,000	Standard BPE layout.
Hidden Dimension	4,096	Aligned to 256-byte cache lines.
Weight Format	int2 (packed)	Stores 4 ternary values per byte.
Scale Factor	FP32	Single scalar per embedding tensor.
Lookup Latency	< 0.1ms	Measured on ARM Cortex-A76 (batch size 1).