Training Integration¶

SeqPacker provides framework-specific utilities for converting packing results into training-ready formats. Both modules are optional -- import only what you need.

Which integration?¶

Workflow	Module	Dependency	Import
HF Trainer / SFTTrainer / TRL	`hf_utils`	`datasets`	`from seqpacker.hf_utils import pack_dataset`
Custom PyTorch training loop	`torch_utils`	`torch`	`from seqpacker.torch_utils import packed_collate_fn`
PyTorch + Flash Attention	`torch_utils`	`torch`	`from seqpacker.torch_utils import pack_result_to_tensors`
Framework-agnostic	core API	none	`from seqpacker import pack_sequences`

HuggingFace Trainer¶

seqpacker.hf_utils converts packing results directly into a HuggingFace datasets.Dataset with correctly computed input_ids, attention_mask, labels (shifted with boundary masking), and position_ids (per-sequence reset).

One-call packing¶

from seqpacker.hf_utils import pack_dataset

tokenized = tokenizer(texts, truncation=True, max_length=2048)
ds = pack_dataset(
    token_ids=tokenized["input_ids"],
    capacity=2048,
    padding_value=tokenizer.pad_token_id or tokenizer.eos_token_id,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    args=SFTConfig(..., dataset_text_field=None),
    tokenizer=tokenizer,
)
trainer.train()

pack_dataset handles the full pipeline: packing sequences into bins, concatenating token IDs, padding, and generating all metadata columns. See examples/sft_trainer.py for a complete fine-tuning script.

Pre-computed packing¶

If you want to inspect metrics before building the dataset:

from seqpacker import pack_sequences
from seqpacker.hf_utils import pack_dataset_from_result

lengths = [len(ids) for ids in tokenized["input_ids"]]
result = pack_sequences(lengths=lengths, capacity=2048)
print(f"Efficiency: {result.efficiency:.2%}")
print(f"Packed {len(lengths)} sequences into {result.num_bins} bins")

ds = pack_dataset_from_result(
    result=result,
    token_ids=tokenized["input_ids"],
    padding_value=tokenizer.pad_token_id or tokenizer.eos_token_id,
)

Dataset columns¶

The returned datasets.Dataset contains these columns (each row is one packed bin):

Column	Type	Description
`input_ids`	`list[int]`	Padded token IDs, length = capacity
`attention_mask`	`list[int]`	1 for real tokens, 0 for padding
`labels`	`list[int]`	Shifted labels with -100 at sequence boundaries and padding
`position_ids`	`list[int]`	Per-sequence reset position IDs

Parameters¶

`pack_dataset`¶

Parameter	Type	Default	Description
`token_ids`	`Sequence[Sequence[int]]`	required	Token ID sequences to pack
`capacity`	`int`	required	Maximum bin capacity in tokens
`strategy`	`str`	`"obfd"`	Packing algorithm short name
`seed`	`int \\| None`	`None`	Random seed for shuffle-based algorithms
`padding_value`	`int`	`0`	Value for padding positions in `input_ids`
`label_padding_value`	`int`	`-100`	Value for masked label positions

`pack_dataset_from_result`¶

Parameter	Type	Default	Description
`result`	`PackResult`	required	Packing result from `pack_sequences` or `Packer.pack`
`token_ids`	`Sequence[Sequence[int]]`	required	Token ID sequences indexed by original sequence ID
`padding_value`	`int`	`0`	Value for padding positions in `input_ids`
`label_padding_value`	`int`	`-100`	Value for masked label positions

PyTorch DataLoader¶

seqpacker.torch_utils provides helpers for converting pack results into GPU-ready tensors for packed LLM training.

Collate function¶

The simplest way to integrate with a custom PyTorch training loop:

from seqpacker.torch_utils import packed_collate_fn
from torch.utils.data import DataLoader

collate = packed_collate_fn(capacity=2048, strategy="obfd")
loader = DataLoader(dataset, collate_fn=collate, batch_size=256)

for batch in loader:
    outputs = model(
        input_ids=batch.input_ids,
        position_ids=batch.position_ids,
        labels=batch.labels,
    )

The collate function:

Extracts lengths from each sample
Packs them using the specified algorithm
Returns a PackedBatch with all tensors ready for training

See examples/pytorch_training.py for a complete training loop.

Manual conversion¶

For more control, convert a PackResult directly:

from seqpacker import pack_sequences
from seqpacker.torch_utils import pack_result_to_tensors

result = pack_sequences(lengths, capacity=2048)
batch = pack_result_to_tensors(result=result, token_ids=token_ids)

PackedBatch¶

The PackedBatch dataclass contains all tensors needed for packed training:

Field	Shape	Description
`input_ids`	`(num_packs, capacity)`	Padded token IDs
`cu_seqlens`	list of `(num_seqs + 1,)`	Cumulative sequence lengths (Flash Attention varlen API)
`max_seqlen`	`int`	Maximum individual sequence length
`position_ids`	`(num_packs, capacity)`	Per-sequence reset position IDs
`labels`	`(num_packs, capacity)`	Shifted labels with -100 at boundaries/padding
`attention_mask`	`(num_packs, capacity)`	Binary mask (1=real, 0=padding)

Parameters¶

`pack_result_to_tensors`¶

Parameter	Type	Default	Description
`result`	`PackResult`	required	Packing result from `pack_sequences` or `Packer.pack`
`token_ids`	`Sequence[Sequence[int]]`	required	Token ID sequences indexed by original sequence ID
`padding_value`	`int`	`0`	Value for padding positions in `input_ids`
`label_padding_value`	`int`	`-100`	Value for masked label positions
`device`	`torch.device \\| str \\| None`	`None`	Device for output tensors

`packed_collate_fn`¶

Parameter	Type	Default	Description
`capacity`	`int`	required	Maximum bin capacity in tokens
`strategy`	`str`	`"obfd"`	Packing algorithm short name
`seed`	`int \\| None`	`None`	Random seed for shuffle-based algorithms
`padding_value`	`int`	`0`	Value for padding positions
`label_padding_value`	`int`	`-100`	Value for masked label positions

Flash Attention¶

Use cu_seqlens from PackedBatch with Flash Attention's variable-length API:

from flash_attn import flash_attn_varlen_func

for i, pack_cu in enumerate(batch.cu_seqlens):
    # pack_cu = [0, len_0, len_0+len_1, ..., total]
    output = flash_attn_varlen_func(
        q=q[i],
        k=k[i],
        v=v[i],
        cu_seqlens_q=pack_cu,
        cu_seqlens_k=pack_cu,
        max_seqlen_q=batch.max_seqlen,
        max_seqlen_k=batch.max_seqlen,
    )

This is only available via torch_utils since Flash Attention requires raw tensors. The cu_seqlens tensor tells Flash Attention where each sequence starts and ends within the packed input, preventing cross-sequence attention.

Training Integration¶

Which integration?¶

HuggingFace Trainer¶

One-call packing¶

Pre-computed packing¶

Dataset columns¶

Parameters¶

pack_dataset¶

pack_dataset_from_result¶

PyTorch DataLoader¶

Collate function¶

Manual conversion¶

PackedBatch¶

Parameters¶

pack_result_to_tensors¶

packed_collate_fn¶

Flash Attention¶

`pack_dataset`¶

`pack_dataset_from_result`¶

`pack_result_to_tensors`¶

`packed_collate_fn`¶