ATOME LM

Embeddings Architecture

Detailed specification of ATOME LM's embedding layer, utilizing 1.58-bit ternary weights and per-tensor scaling for extreme memory efficiency.

1.58-bit Ternary Weights

ATOME LM completely abandons traditional FP16/BF16 embedding tables. Instead, it uses ternary weights drawn from {-1, 0, 1}, each of which carries log2(3) ≈ 1.58 bits of information. This reduces the memory footprint of the embedding layer by approximately 90% compared to standard architectures, which is crucial for deployment on edge devices.

Quantization Formulation
# Forward pass for ternary embedding lookup
W_ternary = round(clip(W_fp16 / scale, -1, 1))
output = lookup(W_ternary, input_ids) * scale
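
The formulation above can be sketched in C++ as follows. This is a minimal sketch, not the ATOME LM implementation: the source does not say how `scale` is derived, so the mean-absolute-value rule in `absmean_scale` is an assumption, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Per-tensor scale. ASSUMPTION: mean absolute value of the weights;
// the source does not specify how the scale is chosen.
float absmean_scale(const float* w, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += std::fabs(w[i]);
    float scale = n ? static_cast<float>(sum / static_cast<double>(n)) : 0.0f;
    return scale > 0.0f ? scale : 1.0f;  // fall back to 1 for an all-zero tensor
}

// W_ternary = round(clip(W / scale, -1, 1)), stored one value per int8_t.
void ternarize(const float* w, std::size_t n, float scale, std::int8_t* q) {
    assert(scale > 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float v = std::max(-1.0f, std::min(1.0f, w[i] / scale));
        q[i] = static_cast<std::int8_t>(std::lround(v));
    }
}
```

Multiplying a looked-up ternary row by the same `scale` gives the dequantized output, matching the second line of the formulation.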
Memory Efficiency
BF16 Baseline: 256 MB
ATOME Ternary: 28 MB
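
As a rough cross-check of these figures, the table sizes can be computed from the vocabulary size (32,000) and hidden dimension (4,096) listed under Embedding Metrics. The sketch below assumes no padding and a 2-bit packed layout (4 values per byte); `table_bytes` and the constant names are hypothetical. The results land in the same range as the quoted 256 MB and 28 MB but not exactly on them; the source does not state the unit convention or exact packing behind those numbers.

```cpp
#include <cstdint>

// Bytes needed for a vocab x dim table at a given bit width (no padding assumed).
constexpr std::uint64_t table_bytes(std::uint64_t vocab, std::uint64_t dim,
                                    std::uint64_t bits_per_weight) {
    return (vocab * dim * bits_per_weight + 7) / 8;
}

constexpr std::uint64_t kVocab = 32000;
constexpr std::uint64_t kDim = 4096;

constexpr std::uint64_t kBf16Bytes = table_bytes(kVocab, kDim, 16);   // 262,144,000 bytes
constexpr std::uint64_t kPackedBytes = table_bytes(kVocab, kDim, 2);  // 32,768,000 bytes

// 2-bit packing is exactly 8x smaller than BF16, i.e. an 87.5% reduction.
static_assert(kBf16Bytes == 8 * kPackedBytes, "2-bit packing is 1/8 of BF16");
```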

Absence of Positional Embeddings

A defining characteristic of ATOME LM's input processing is the strict avoidance of explicit positional embeddings (such as RoPE or absolute positional encodings).

Position information is instead implicitly learned through the recurrent nature of the ternary state-space models deployed deeper in the network. This architectural decision eliminates the sequence-length scaling bottleneck inherent in typical Transformer positional mechanisms.
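
To illustrate why a recurrence can carry order information without positional encodings, here is a toy scalar recurrence. It is a hypothetical illustration only, not the ATOME LM state-space layer; `run_recurrence` and its `decay` parameter are made up for this sketch.

```cpp
#include <cstddef>

// Toy recurrence h_t = decay * h_{t-1} + x_t, starting from h = 0.
// Earlier inputs are attenuated more than later ones, so the final
// state depends on the order of the inputs, not just their values.
float run_recurrence(const float* x, std::size_t n, float decay) {
    float h = 0.0f;
    for (std::size_t t = 0; t < n; ++t) {
        h = decay * h + x[t];
    }
    return h;
}
```

With `decay = 0.5`, the inputs `{1, 2}` end in state 2.5 while `{2, 1}` end in state 2.0, so the same tokens in a different order produce different states even though no position signal was added.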

[Diagram: conventional input (Tokens + RoPE) contrasted with ATOME LM input (Pure Ternary Tokens)]

Zero-Heap Allocation Constraint

To ensure deterministic execution times and prevent memory fragmentation on edge devices, the embedding lookup process strictly adheres to a zero-heap allocation constraint during inference.

Implementation Constraints
Memory Pool: Static arena pre-allocated at initialization.
Lookup: Direct index mapping; no intermediate buffers.
Scaling: Per-tensor scale factor applied via in-place SIMD multiplication.
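
One common way to satisfy the memory-pool constraint above is a bump allocator over a fixed buffer. The sketch below is an assumption about how such an arena might look, not the ATOME LM allocator; the `StaticArena` class and its methods are hypothetical, and it assumes the backing buffer itself is already suitably aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bump allocator over caller-owned storage; never calls new or malloc.
class StaticArena {
public:
    StaticArena(std::uint8_t* buffer, std::size_t capacity)
        : base_(buffer), capacity_(capacity), used_(0) {}

    // Returns nullptr when the arena is exhausted.
    // `alignment` must be a power of two and is applied to the offset,
    // so `buffer` is assumed to be aligned at least that strictly.
    void* allocate(std::size_t bytes, std::size_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        std::size_t aligned = (used_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || bytes > capacity_ - aligned) return nullptr;
        used_ = aligned + bytes;
        return base_ + aligned;
    }

    void reset() { used_ = 0; }  // reuse the whole arena, e.g. per request
    std::size_t used() const { return used_; }

private:
    std::uint8_t* base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

Backing the arena with a `static` or otherwise pre-reserved buffer sized at initialization keeps every inference-time allocation a pointer bump with a fixed worst-case cost.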
C++ Primitive
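// Zero-alloc lookup: writes seq_len * dim floats into the caller-provided `out`.
// W holds one ternary value (-1, 0, or 1) per int8_t, row-major [vocab][dim].
#include <cstddef>
#include <cstdint>

void tern_embed(
  const std::int8_t* W,
  const std::int32_t* ids,
  float* out,
  float scale,
  int seq_len,
  int dim
) {
  for (int i = 0; i < seq_len; ++i) {
    // Widen before multiplying so large vocab * dim offsets cannot overflow int.
    const std::int8_t* row = W + static_cast<std::size_t>(ids[i]) * dim;
    float* dst = out + static_cast<std::size_t>(i) * dim;
    // Scalar loop; a SIMD kernel can replace it without changing the result.
    for (int j = 0; j < dim; ++j) {
      dst[j] = static_cast<float>(row[j]) * scale;
    }
  }
}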

Embedding Metrics

Parameter | Value | Notes
Vocab Size | 32,000 | Standard BPE layout.
Hidden Dimension | 4,096 | Aligned to 256-byte cache lines.
Weight Format | int2 (packed) | Stores 4 ternary values per byte.
Scale Factor | FP32 | Single scalar per embedding tensor.
Lookup Latency | < 0.1 ms | Measured on ARM Cortex-A76 (batch size 1).
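
The packed int2 weight format in the table stores 4 ternary values per byte. The sketch below shows one way to pack and decode such a row. The bit encoding (0b00 for 0, 0b01 for +1, 0b10 for -1, least-significant pair first) is an assumption, since the source does not define the layout, and all function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// ASSUMED 2-bit encoding: 0b00 -> 0, 0b01 -> +1, 0b10 -> -1 (0b11 unused).
inline std::uint8_t encode_trit(int t) {
    return static_cast<std::uint8_t>(t == 0 ? 0 : (t > 0 ? 1 : 2));
}
inline int decode_trit(std::uint8_t code) {
    return code == 1 ? 1 : (code == 2 ? -1 : 0);
}

// Pack n ternary values, 4 per byte, lowest bit pair first.
// `packed` must hold (n + 3) / 4 bytes.
void pack_trits(const std::int8_t* trits, std::size_t n, std::uint8_t* packed) {
    for (std::size_t b = 0; b < (n + 3) / 4; ++b) packed[b] = 0;
    for (std::size_t i = 0; i < n; ++i) {
        unsigned shifted = static_cast<unsigned>(encode_trit(trits[i])) << ((i % 4) * 2);
        packed[i / 4] = static_cast<std::uint8_t>(packed[i / 4] | shifted);
    }
}

// Decode `dim` packed weights into caller-provided `out`, applying the scale.
void unpack_row_scaled(const std::uint8_t* packed, std::size_t dim,
                       float scale, float* out) {
    for (std::size_t i = 0; i < dim; ++i) {
        std::uint8_t code = static_cast<std::uint8_t>((packed[i / 4] >> ((i % 4) * 2)) & 0x3u);
        out[i] = static_cast<float>(decode_trit(code)) * scale;
    }
}
```

Because decoding writes straight into the output buffer, this layout stays compatible with a lookup that uses no intermediate buffers.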