Qwen3-4B-Instruct-2507 supports a 256K-token context. When I fine-tune this model with SFTTrainer, setting max_length=256K in the training args and using DeepSpeed (via Accelerate), OOM errors still occur during training. The DeepSpeed configuration is as follows:
{
  "train_batch_size": "auto",
  "gradient_accumulation_steps": 2,
  "fp16": { "enabled": false },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "contiguous_gradients": true,
    "overlap_comm": true,
    "allgather_bucket_size": 2e8,
    "reduce_bucket_size": 2e8,
    "stage3_prefetch_bucket_size": 2e8,
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  },
  "pipeline": {
    "pipeline_parallel_size": 1
  },
  "wall_clock_breakdown": false,
  "gradient_clipping": 1.0
}
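For context on where the memory goes: ZeRO-3 shards parameters, gradients, and optimizer states across all ranks, and the offload settings above additionally move the parameter and optimizer shards to host memory, so those components should not be the bottleneck for a 4B-parameter model. A back-of-envelope sketch (all numbers are assumptions: ~4B parameters, bf16 weights/gradients, Adam with fp32 master weights and moments):

```python
# Rough per-rank memory for the ZeRO-3 sharded states of a ~4B-parameter model.
# These are back-of-envelope assumptions, not measured values.
params = 4e9          # assumed parameter count
ranks = 8             # num_processes in the accelerate config below
BF16, FP32 = 2, 4     # bytes per element

param_gib = params * BF16 / ranks / 1024**3   # sharded bf16 weights
grad_gib  = params * BF16 / ranks / 1024**3   # sharded bf16 gradients
# Adam optimizer state: fp32 master weights + first and second moments
optim_gib = params * 3 * FP32 / ranks / 1024**3

print(f"per-rank params    ~ {param_gib:.2f} GiB")
print(f"per-rank gradients ~ {grad_gib:.2f} GiB")
print(f"per-rank optimizer ~ {optim_gib:.2f} GiB")
```

These terms total only a few GiB per rank (and the param/optimizer shards sit in host memory anyway with offload enabled), which suggests the OOM comes from activations, which ZeRO does not shard.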
The accelerate configuration is as follows:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /path/to/deepspeed.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
I wonder whether it is possible to train a long-context model on a single machine with 8x Ascend 910B NPUs (64 GB each). Thanks!
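For a rough sense of scale: even with full activation checkpointing, roughly one hidden-state tensor per transformer layer is kept for backward recomputation, and at 256K tokens that alone approaches the 64 GB of one device. A sketch using assumed Qwen3-4B dimensions (hidden size 2560, 36 layers; check the model's config.json for the real values):

```python
# Checkpointed-activation memory for ONE 256K-token sequence in bf16.
# hidden and layers are assumptions about Qwen3-4B; verify against config.json.
seq_len = 256 * 1024   # 256K tokens
hidden  = 2560         # assumed hidden size
layers  = 36           # assumed layer count
BF16 = 2               # bytes per element

# With gradient checkpointing, roughly one [seq_len, hidden] tensor is
# saved per layer boundary for recomputation in the backward pass.
ckpt_gib = seq_len * hidden * layers * BF16 / 1024**3
print(f"checkpointed activations ~ {ckpt_gib:.1f} GiB per sequence")
```

If this estimate is in the right ballpark, long-context training at this length likely also needs the sequence itself split across devices (context/sequence parallelism, e.g. DeepSpeed-Ulysses or ring attention), which divides this activation term by the parallel degree.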