Embeddings Architecture
Detailed specification of ATOME LM's embedding layer, utilizing 1.58-bit ternary weights and per-tensor scaling for extreme memory efficiency.
1.58-bit Ternary Weights
ATOME LM replaces traditional FP16/BF16 embedding tables with 1.58-bit ternary weights {-1, 0, 1}. At the information-theoretic 1.58 bits (log2 3) per weight, this is roughly a 90% reduction in the embedding layer's memory footprint relative to a 16-bit table; with the 2-bit packed storage listed under Embedding Metrics, the reduction is 87.5%. This saving is crucial for deployment on edge devices.
# Forward pass for ternary embedding lookup
W_ternary = round(clip(W_fp16 / scale, -1, 1))
output = lookup(W_ternary, input_ids) * scale
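The quantization step in the pseudocode above can be sketched in C as follows. This is a minimal illustration, not ATOME LM's implementation: the function names are hypothetical, the tie-breaking rule (round half away from zero) is an assumption, and the source does not say how `scale` is chosen. The sketch uses the mean absolute weight, a convention borrowed from BitNet b1.58-style quantization, purely as a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: per-tensor scale as the mean absolute weight.
 * ASSUMPTION: the source does not specify how the scale is derived. */
float absmean_scale(const float *w, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += (w[i] < 0.0f) ? -w[i] : w[i];
    }
    /* Fall back to 1.0 for an empty or all-zero tensor to avoid dividing by zero. */
    return (n > 0 && sum > 0.0f) ? sum / (float)n : 1.0f;
}

/* Quantize n weights to ternary codes: q = round(clip(w / scale, -1, 1)). */
void ternary_quantize(const float *w, int8_t *q, size_t n, float scale) {
    assert(scale > 0.0f);
    for (size_t i = 0; i < n; ++i) {
        float x = w[i] / scale;
        if (x > 1.0f) x = 1.0f;    /* clip to [-1, 1] */
        if (x < -1.0f) x = -1.0f;
        /* Round to the nearest of {-1, 0, +1}; ties go away from zero (assumed). */
        q[i] = (int8_t)((x >= 0.5f) - (x <= -0.5f));
    }
}

/* Dequantize a single ternary code back to a float. */
float ternary_dequantize(int8_t q, float scale) {
    return (float)q * scale;
}
```

With weights {0.9, -0.7, 0.1, -0.05}, the placeholder scale is 0.4375 and the codes come out as {1, -1, 0, 0}: large-magnitude weights saturate to ±1, while small ones collapse to 0.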
Absence of Positional Embeddings
A defining characteristic of ATOME LM's input processing is that it uses no explicit positional embeddings (such as RoPE or absolute positional encodings).
Position information is instead implicitly learned through the recurrent nature of the ternary state-space models deployed deeper in the network. This architectural decision eliminates the sequence-length scaling bottleneck inherent in typical Transformer positional mechanisms.
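To illustrate how a recurrence can carry order information without any positional embedding, consider a toy scalar recurrence. This is not ATOME LM's state-space layer, whose exact form the source does not give; the function name and the decay coefficient `a` are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Final state of the toy linear recurrence h_t = a * h_{t-1} + x_t, with h_0 = 0.
 * Each earlier input is multiplied by `a` once per later step, so the
 * contribution of an input depends on where it occurs in the sequence. */
float scan_final_state(const float *x, size_t n, float a) {
    assert(x != NULL || n == 0);
    float h = 0.0f;
    for (size_t t = 0; t < n; ++t) {
        h = a * h + x[t];
    }
    return h;
}
```

With `a = 0.5`, the sequence {1, 0, 0} ends in state 0.25 while {0, 0, 1} ends in state 1.0. The same inputs in a different order produce a different state, which is the sense in which position is encoded implicitly.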
Zero-Heap Allocation Constraint
To ensure deterministic execution times and prevent memory fragmentation on edge devices, the embedding lookup process strictly adheres to a zero-heap allocation constraint during inference.
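// Zero-alloc lookup: writes into the caller-provided `out` buffer.
#include <stddef.h>
#include <stdint.h>

void tern_embed(
    const int8_t* W,     // ternary weights, row-major, `dim` values per row
    const int32_t* ids,  // token ids, length seq_len
    float* out,          // output buffer, seq_len * dim floats
    float scale,         // per-tensor scale factor
    int seq_len,
    int dim
) {
    for (int i = 0; i < seq_len; ++i) {
        // Widen before multiplying so the row offset cannot overflow int.
        const int8_t* row = W + ((size_t)ids[i] * (size_t)dim);
        simd_mul_add(out + ((size_t)i * (size_t)dim), row, scale);
    }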
}
Embedding Metrics
| Parameter | Value | Notes |
|---|---|---|
| Vocab Size | 32,000 | Standard BPE layout. |
| Hidden Dimension | 4,096 | Aligned to 256 bytes (a packed row is 1,024 bytes). |
| Weight Format | int2 (packed) | Stores 4 ternary values per byte. |
| Scale Factor | FP32 | Single scalar per embedding tensor. |
| Lookup Latency | < 0.1 ms | Measured on ARM Cortex-A76 (batch size 1). |
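The table lists the weight format as packed int2 with four ternary values per byte. A lookup over that layout, writing only into caller-provided buffers, might look like the sketch below. The 2-bit encoding (0 for 0, 1 for +1, 2 for -1), the bit order within a byte, and the names `tern_pack`, `tern_embed_packed`, and `tern_table_bytes` are all assumptions; the source does not specify the packed layout. The inner loop is scalar for clarity, where a real kernel would presumably use SIMD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED 2-bit encoding: 0 -> 0, 1 -> +1, 2 -> -1, 3 unused (decoded as 0). */
static const float TERN_DECODE[4] = {0.0f, 1.0f, -1.0f, 0.0f};

/* Store ternary value v at element j of a packed row (lowest bits first). */
void tern_pack(uint8_t *packed, size_t j, int8_t v) {
    assert(v >= -1 && v <= 1);
    uint8_t code = (v == 1) ? 1u : (v == -1) ? 2u : 0u;
    unsigned shift = (unsigned)(j % 4) * 2u;
    packed[j / 4] = (uint8_t)((packed[j / 4] & ~(3u << shift)) | ((unsigned)code << shift));
}

/* Packed lookup: no heap allocation, output goes to the caller's buffer,
 * which must hold seq_len * dim floats. */
void tern_embed_packed(const uint8_t *W, const int32_t *ids, float *out,
                       float scale, int seq_len, int dim) {
    assert(dim % 4 == 0);
    size_t row_bytes = (size_t)dim / 4;
    for (int i = 0; i < seq_len; ++i) {
        const uint8_t *row = W + (size_t)ids[i] * row_bytes;
        float *dst = out + (size_t)i * (size_t)dim;
        for (int j = 0; j < dim; ++j) {
            unsigned code = (row[j / 4] >> ((j % 4) * 2)) & 3u;
            dst[j] = TERN_DECODE[code] * scale;
        }
    }
}

/* Size in bytes of a packed table: 2 bits per weight. */
size_t tern_table_bytes(size_t vocab, size_t dim) {
    return vocab * dim / 4;
}
```

For the table's values (vocabulary 32,000, hidden dimension 4,096), `tern_table_bytes` gives 32,768,000 bytes, or 31.25 MiB, compared with 262,144,000 bytes (250 MiB) for the same table in FP16.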