Title: MINEWORLD: A REAL-TIME AND OPEN-SOURCE INTERACTIVE WORLD MODEL ON MINECRAFT
Authors: Junliang Guo*, Yang Ye*, Tianyu He*, Haoyu Wu*, Yushu Jiang, Tim Pearce, Jiang Bian
Affiliation: Microsoft Research
Contact: {junliangguo, v-yangye, tianyuhe, v-haoywu}@microsoft.com
Link: https://aka.ms/mineworld
Publication date: April 2025 (arXiv: 2504.08388)
(AI used - GPT-4 - to summarize the analysis I had written)
1. Problem
Visual tokenizers for video generation produce very large token sequences (40k–160k for 16 frames), which makes autoregressive decoding slow, energy-consuming, and expensive. Current models are not efficient for real-time interaction.
2. Main Contribution
MineWorld introduces an autoregressive transformer-based world model for Minecraft. It adds a novel parallel decoding method based on spatial dependencies to speed up inference while maintaining output quality.
3. Method
Based on an autoregressive Transformer (Vaswani et al., 2017).
Uses two tokenizers: one for world states (visual frames), one for player actions (mouse, keyboard).
Each frame is downsampled from 256×256 to 16×16 (256 tokens).
Actions are quantized into 11 discrete tokens, following Baker et al. (2022).
State and action tokens are interleaved into one sequence.
Model is trained autoregressively to predict next tokens based on previous ones.
Parallel decoding groups spatially close tokens and predicts them in parallel.
A modified causal mask allows intra-group attention and past-group attention, but not future-group attention.
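The interleaving of state and action tokens described above can be sketched as follows (a minimal illustration using the token counts from the paper; function and variable names are hypothetical, not the authors' code):

```python
# Sketch: interleave per-frame state tokens (256 each) with per-step
# action tokens (11 each) into one autoregressive sequence.
def interleave(state_tokens, action_tokens):
    """state_tokens: list of per-frame token lists;
    action_tokens: list of per-step token lists."""
    seq = []
    for frame, action in zip(state_tokens, action_tokens):
        seq.extend(frame)   # tokens of the current world state
        seq.extend(action)  # tokens of the action taken in that state
    return seq

# Two frames of 256 visual tokens, two actions of 11 tokens each:
frames = [[0] * 256, [1] * 256]
actions = [[100] * 11, [101] * 11]
seq = interleave(frames, actions)
assert len(seq) == 2 * (256 + 11)  # 534 tokens for two steps
```

Each (state, action) pair therefore costs 267 tokens, which is why sequence length becomes the dominant inference cost.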
4. Results
Trained on 10M video clips (160M frames) from VPT dataset.
Total: 55B tokens, using 32 × 40GB A100 GPUs.
Evaluation metrics: FVD, PSNR, LPIPS, SSIM.
Parallel decoding provides up to 3× faster inference.
Achieves 2–6 frames per second, compatible with real-time control.
5. Limitations
Trained only on Minecraft data, with a fixed downsampled resolution.
Temporal context is limited by transformer input size; long-term dependencies may be lost.
No external memory mechanism.
Visual detail reduced due to spatial compression.
6. What I retain
Merges policy learning and world modeling in a single autoregressive framework.
Interleaved sequences of states and actions let the model learn causality.
Spatial grouping allows efficient parallel decoding.
Adjusted attention mask helps balance training-inference mismatch.
7. Personal note
This model is far from a learning world model. The methods and tricks used to handle complexity are very interesting, but they are not transposable to other situations: it is a fixed environment with fixed possibilities, and the model learns nothing about the outside world without retraining.
It is a pertinent exercise, but sadly of little use going forward. What's more, they used Oasis and its inference code to benchmark against it, and the results are better but not tremendously so (2.58 fps for Oasis at 500M parameters, against 3.18 fps at 700M parameters or 5.91 fps at 300M parameters for MineWorld).
Pertinent, but it does not advance my work much.
-------------------------------------------------------------------------------------------------------
MINEWORLD: A REAL-TIME AND OPEN-SOURCE INTERACTIVE WORLD MODEL ON MINECRAFT
What is it?
Video generative models' ability to learn commonsense knowledge from raw video --> Brooks et al., 2024
Problem bottleneck for MineWorld: the latent video representation encoded by visual tokenizers consists of a large number of tokens, 40k to 160k for 16 frames.
Not efficient at all: consumes a lot of energy and is costly.
Visual and action tokenizers convert game states and actions into discrete tokens; they are concatenated (combining several things into a single sequence) and fed into the transformer decoder as input.
The transformer is then trained with an autoregressive objective.
-> which is a training strategy where the model learns to predict the next element in a sequence, as in NLP (GPT, etc.)
P(x) = P(x₁) × P(x₂|x₁) × P(x₃|x₁, x₂) × ... × P(xₙ|x₁, ..., xₙ₋₁)
It is powerful because it matches the causal flow of many real-world sequences, and it is compatible with transformers.
Autoregressive is NOT autoencoding:
the latter encodes the full input
(BERT models, VAEs, etc.)
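The factorization above translates directly into a generation loop: each new token is conditioned on everything generated so far. A toy sketch (the "model" here is a stand-in rule, not a transformer; all names are hypothetical):

```python
# Minimal sketch of autoregressive generation: the next token is picked
# from a model conditioned on all previously generated tokens.
def toy_model(prefix):
    # Toy stand-in for P(x_n | x_1..x_{n-1}): next token = last token + 1 mod 4.
    return (prefix[-1] + 1) % 4

def generate(prompt, n_steps):
    seq = list(prompt)
    for _ in range(n_steps):
        seq.append(toy_model(seq))  # condition on everything generated so far
    return seq

out = generate([0], 5)
assert out == [0, 1, 2, 3, 0, 1]
# Each prediction depends on the model's own earlier outputs,
# so an early mistake would propagate through the rest of the sequence.
```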
Back to subject
It is costly and energy-consuming, so it can take time. BUT to accelerate the process, they add parallel decoding via SPATIAL DEPENDENCIES.
Instead of decoding each token sequentially, one by one, they identify spatial groups of tokens that can be predicted in parallel.
How to exploit spatial dependencies?
We can imagine that in an image, nearby pixels have correlated patterns - these are put into groups,
such as
Group 1: [t1, t2, t3] => decode in parallel
Group 2: [t4, t5, t6]
It respects the causal structure globally but relaxes it locally.
This strategy speeds up the process by about 3x with no significant loss of output quality.
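One plausible way to form such groups is by anti-diagonals of the token grid: tokens whose row+column indices sum to the same value sit on one diagonal and can be decoded together (this mirrors the paper's "diagonal decoding" idea; the exact grouping may differ):

```python
# Sketch: group tokens of an h x w latent grid by anti-diagonal.
# All tokens in one group can be predicted in the same decoding step.
def diagonal_groups(h, w):
    groups = [[] for _ in range(h + w - 1)]
    for r in range(h):
        for c in range(w):
            groups[r + c].append((r, c))  # same r+c => same diagonal
    return groups

g = diagonal_groups(3, 3)
# 3x3 grid -> 5 sequential steps instead of 9:
assert len(g) == 3 + 3 - 1
assert g[1] == [(0, 1), (1, 0)]       # decoded together in step 2
assert sum(len(x) for x in g) == 9    # every token covered exactly once
```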
So to sum up: efficiency + controllability.
code released and model weights
CONTRIBUTION announced
- new benchmark for world modeling
- Novel parallel decoding algorithm - speeds up inference
- evaluation metrics for assessing controllability
2. framework
We build MineWorld on the autoregressive Transformer (Vaswani et al., 2017),
this framework needs to deal with two different types of data
-> 2 tokenizers
1. the state of the video, the world
2. The action of the player, which consists basically of mouse movements and keyboard presses
Both need to be converted into a sequence of discrete tokens, like words for an LLM but differently
They do not use a video tokenizer; they may try it in the future but not now.
So each frame is compressed by a factor of 16 per side: a resolution of 256 by 256 becomes 16 by 16, so 256 tokens per frame.
Why not use a video tokenizer? Easier and faster to process is my guess.
A video tokenizer would offer temporal compression - a kind of tokenizer that compresses several frames at the same time.
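A quick back-of-envelope check of these sequence lengths (plain arithmetic, not the authors' code):

```python
# Each 256x256 frame is spatially downsampled 16x per side
# to a 16x16 latent grid of discrete tokens.
side = 256 // 16                        # 16 latent positions per side
tokens_per_frame = side * side
assert tokens_per_frame == 256

# With 11 action tokens per step, one (state, action) pair costs:
tokens_per_step = tokens_per_frame + 11
assert tokens_per_step == 267
```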
Encoding action sequences as a single vector:
1. They followed the practice of Baker et al., 2022 by quantizing camera angles into discrete bins - they turn the continuous movement into a discrete, more tractable form for prediction.
2. For discrete actions, they categorize them into 7 exclusive classes, each represented by a unique token.
So each action is represented by a sequence of 11 tokens.
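Camera quantization can be sketched like this (a hypothetical illustration in the spirit of Baker et al., 2022; the bin count and angle range here are illustrative, not the paper's actual values):

```python
# Hypothetical sketch: map a continuous camera angle to a discrete bin index.
def quantize(angle, lo=-180.0, hi=180.0, n_bins=8):
    angle = max(lo, min(hi, angle))          # clamp to the valid range
    bin_width = (hi - lo) / n_bins
    return min(int((angle - lo) / bin_width), n_bins - 1)

assert quantize(-180.0) == 0   # leftmost bin
assert quantize(0.0) == 4      # middle bin
assert quantize(180.0) == 7    # rightmost bin (clamped into range)
```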
They also added special tokens to mark the boundaries of an action sequence - specifically where the action starts and where it stops.
This also helps the model distinguish the state of the world from the action.
Generally, the model uses autoregressive prediction as in GPT-2 or GPT-4 - using all previous tokens to predict the next one.
The combined approach of states + actions is interesting: it treats the states (pictures) and the actions as equivalent, token-wise.
So the model can learn the dynamics of the game and the behavior of the player in the same sequence.
why does it work?
Taking both as tokens helps the model create links between actions AND states.
By doing this, it recreates a kind of causality that helps the model predict: if I do that, this is the likely outcome.
So this model is at once a policy model and a world model: it predicts future actions from the current state and predicts future states based on current actions.
They call this combination the interleaved structure.
Parallel Decoding
real time = matching the player's number of actions per minute
2 frames per second would be needed from their model for amateur real-time interaction
Visual autoregressive models can provide high-fidelity results for image and video generation tasks - but their sequential nature is a bottleneck.
To deal with this, they use diagonal decoding: they leverage spatial dependencies between adjacent image tokens to speed up decoding.
r = (h × w)/(h + w − 1)
Discrepancy between training and inference
during training => the model sees the true sequence of tokens
It is predicting the next token using the ground-truth past tokens.
During inference, also called generation
The model must predict token by token, using its own previous predictions.
So if a mistake is made early, it will propagate or amplify, because future predictions rely on previous ones.
To face that issue, they changed the regular mask (a matrix of 1s and 0s that tells the model which positions it is allowed to attend to - for example, it forbids attention to future tokens).
Here the mask is adjusted so that each group of tokens predicted at the same time can attend to each other and to the previous groups, but NOT to future groups of tokens.
This configuration mitigates the issues caused by the parallel decoding algorithm, whose structure is unusual for the standard causal attention mask that goes one token at a time.
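Such a block-causal mask can be built like this (an assumed sketch, not the released code; `True` means attention is allowed):

```python
# Sketch of the adjusted attention mask: tokens in the same decoding group
# attend to each other and to all past groups, never to future groups.
def block_causal_mask(group_sizes):
    gid = []
    for g, size in enumerate(group_sizes):
        gid.extend([g] * size)          # group id of each token position
    n = len(gid)
    # allowed[i][j] is True when token i may attend to token j
    return [[gid[i] >= gid[j] for j in range(n)] for i in range(n)]

mask = block_causal_mask([1, 2, 3])     # three groups of 1, 2, 3 tokens
assert mask[1][2] and mask[2][1]        # intra-group attention allowed
assert mask[5][0]                       # attending to past groups allowed
assert not mask[0][5]                   # attending to future groups forbidden
```

With group sizes of 1 (one token per group) this reduces to the ordinary causal mask, which is why the change is a relaxation rather than a redesign.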
They finetuned an existing model, because the existing methods were based on training a model from scratch - which is more time-consuming and they chose not to (Lformer as an example).
More recently, ZipAR did something quite similar, but for image generation and not video generation. The latter presents more complex challenges due to error accumulation across frames.
EVALUATION
To assess the performance of world models in the Minecraft environment, they evaluate the video quality and the controllability of the generation results.
1. dataset construction and preprocessing
They used the VPT dataset (Baker et al., 2022) - pairs of recorded gameplay videos and corresponding actions.
They filtered it:
result: 10M video clips - 160M frames - validated and tested on 1k clips
resized to reduce computation cost,
same aspect ratio as the original
55B tokens total in training set
Evaluation metrics: Fréchet Video Distance (FVD) (Unterthiner et al., 2018), Peak Signal-to-Noise Ratio (PSNR) (Hore & Ziou, 2010), Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018), and Structural Similarity Index Measure (SSIM) (Wang et al., 2004).
Trained on 32 NVIDIA 40GB A100 GPUs.
Effect of parallel decoding
Larger models perform better with parallel decoding, even without finetuning - up to 3x speedup.
Training a model from scratch with the parallel attention mask may lead to faster convergence and lower overall training costs.
Conclusion
first open-source interactive world model for Minecraft
using an open dataset of game states and corresponding actions
tokenize both modalities and train an AR transformer decoder using the interleaved sequence of state and action tokens as input,
To make it more efficient, they exploited the redundancy between adjacent image tokens and introduced a parallel decoding algorithm,
and 2-6 frames per second enables real-time interaction with professional game players.
Limitations announced
- exclusively trained on Minecraft data, at a fixed downsampled resolution, limiting its ability to generalize to other video domains.
- the downsampling process leads to a loss of fine-grained details in game states - somewhat overstated, in my view
Fundamental limitation of transformers
maximum input length = x k tokens
IT MEANS that the transformer model has a context window capped at x k tokens - anything beyond that is cut off.
IF related game states are far apart, the model can't see the earlier one - this can break temporal consistency - it can't remember what was built, can't track long-term goals.
Here attention is computed over the entire sequence, with no external memory.
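The hard cutoff can be sketched in two lines (an illustration of the limitation, not MineWorld code; the window size is arbitrary):

```python
# With a fixed maximum input length, only the most recent tokens are visible
# to the model; everything earlier is simply dropped (no external memory).
def visible_context(sequence, max_len):
    return sequence[-max_len:]

history = list(range(1000))             # 1000 tokens of gameplay history
ctx = visible_context(history, 256)
assert len(ctx) == 256 and ctx[0] == 744   # tokens 0..743 are lost forever
```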