Part 1 covers character-level tokenization, encoding/decoding, train/test splits, batching and transforming input.txt into tensors ready for a transformer.
Author: Kevin Thomas
Part 2 HERE
import torch
Now let's read the file and see what we're working with. Understanding your data is crucial before building any model!
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text
Output:
'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. 
These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mechanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'
Tokenization is the process of converting raw text into numbers that a neural network can process.
chars = sorted(list(set(text)))
chars
Output:
['\n',
' ',
',',
'.',
'A',
'C',
'D',
'E',
'I',
'L',
'P',
'R',
'T',
'W',
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z']
vocab_size = len(chars)
vocab_size
Output:
40
for i, ch in enumerate(chars):
    if ch == '\n':
        print(f" Index {i:2d}: '\\n' (newline)")
    elif ch == ' ':
        print(f" Index {i:2d}: ' ' (space)")
    else:
        print(f" Index {i:2d}: '{ch}'")
Output:
Index 0: '\n' (newline)
Index 1: ' ' (space)
Index 2: ','
Index 3: '.'
Index 4: 'A'
Index 5: 'C'
Index 6: 'D'
Index 7: 'E'
Index 8: 'I'
Index 9: 'L'
Index 10: 'P'
Index 11: 'R'
Index 12: 'T'
Index 13: 'W'
Index 14: 'a'
Index 15: 'b'
Index 16: 'c'
Index 17: 'd'
Index 18: 'e'
Index 19: 'f'
Index 20: 'g'
Index 21: 'h'
Index 22: 'i'
Index 23: 'j'
Index 24: 'k'
Index 25: 'l'
Index 26: 'm'
Index 27: 'n'
Index 28: 'o'
Index 29: 'p'
Index 30: 'q'
Index 31: 'r'
Index 32: 's'
Index 33: 't'
Index 34: 'u'
Index 35: 'v'
Index 36: 'w'
Index 37: 'x'
Index 38: 'y'
Index 39: 'z'
Next we build two functions to convert between human-readable text and model-readable numbers.
- Encoder: text → list of integers
- Decoder: list of integers → text
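Before building the tables entry by entry with explicit loops, note that the same mappings can be written as dict comprehensions, a common Python idiom. A minimal sketch using a hypothetical toy string in place of input.txt:

```python
# Hypothetical toy vocabulary for illustration (the notebook builds its
# tables from input.txt with explicit loops).
toy_chars = sorted(set("hello world"))

# string-to-index and index-to-string lookup tables as dict comprehensions
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

print(toy_stoi)
```

Both forms produce identical tables; the explicit loops below just make each step visible.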
stoi = {}
for i, char in enumerate(chars):
    stoi[char] = i
stoi
Output:
{'\n': 0,
' ': 1,
',': 2,
'.': 3,
'A': 4,
'C': 5,
'D': 6,
'E': 7,
'I': 8,
'L': 9,
'P': 10,
'R': 11,
'T': 12,
'W': 13,
'a': 14,
'b': 15,
'c': 16,
'd': 17,
'e': 18,
'f': 19,
'g': 20,
'h': 21,
'i': 22,
'j': 23,
'k': 24,
'l': 25,
'm': 26,
'n': 27,
'o': 28,
'p': 29,
'q': 30,
'r': 31,
's': 32,
't': 33,
'u': 34,
'v': 35,
'w': 36,
'x': 37,
'y': 38,
'z': 39}
itos = {}
for i, char in enumerate(chars):
    itos[i] = char
itos
Output:
{0: '\n',
1: ' ',
2: ',',
3: '.',
4: 'A',
5: 'C',
6: 'D',
7: 'E',
8: 'I',
9: 'L',
10: 'P',
11: 'R',
12: 'T',
13: 'W',
14: 'a',
15: 'b',
16: 'c',
17: 'd',
18: 'e',
19: 'f',
20: 'g',
21: 'h',
22: 'i',
23: 'j',
24: 'k',
25: 'l',
26: 'm',
27: 'n',
28: 'o',
29: 'p',
30: 'q',
31: 'r',
32: 's',
33: 't',
34: 'u',
35: 'v',
36: 'w',
37: 'x',
38: 'y',
39: 'z'}
def encode(string):
    """Encode a string into a list of token indices.
    Args:
        string (str): input text to encode
    Returns:
        list[int]: list of integer token indices
    """
    encoded = []
    for char in string:
        encoded.append(stoi[char])
    return encoded

def decode(lst):
    """Convert a list of token indices into a string.
    Args:
        lst (list[int]): list of integer token indices
    Returns:
        str: the decoded string
    """
    chars = []
    for i in lst:
        chars.append(itos[i])
    return ''.join(chars)

encoded_text = encode('hello world')
encoded_text
Output:
[21, 18, 25, 25, 28, 1, 36, 28, 31, 25, 17]
decoded_text = decode(encoded_text)
decoded_text
Output:
'hello world'
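One caveat worth knowing: this encoder only understands characters that appear in input.txt, so anything outside the 40-character vocabulary raises a KeyError. A sketch with a hypothetical toy table (`toy_stoi` and `toy_encode` are stand-ins, not names from the notebook):

```python
# Toy vocabulary built from a lowercase-only string
toy_stoi = {ch: i for i, ch in enumerate(sorted(set("hello world")))}

def toy_encode(string):
    """Encode a string, raising KeyError on out-of-vocabulary characters."""
    return [toy_stoi[ch] for ch in string]

print(toy_encode("hello"))   # every character is in the vocabulary
try:
    toy_encode("Hello!")     # 'H' and '!' are not in the vocabulary
except KeyError as err:
    print(f"unknown character: {err}")
```

Real tokenizers handle this with a byte-level or subword vocabulary; for this tutorial the vocabulary covers the whole training file, so the issue never arises.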
Now we'll encode the entire text and store it as a PyTorch tensor.
encoded_text = encode(text)
encoded_text
Output:
[4, 1, 17, 22, 26, 1, 20, 25, 28, 36, 1, 31, 22, 32, 18, 32, 1, 15, 18, 21, 22, 27, 17, 1, 33, 21, 18, ...]
data = torch.tensor(encoded_text, dtype=torch.long)
data
Output:
tensor([ 4, 1, 17, ..., 25, 25, 3])
data.dtype
Output:
torch.int64
data.shape
Output:
torch.Size([2114])
data.ndim
Output:
1
decode(data.tolist())
Output:
'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. 
These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mechanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'
Next we split the encoded data into two sets.
- Training Set (90%): used to train the model.
- Test Set (10%): used to check whether the model is overfitting.
Overfitting = Model memorizes training data but can't generalize to new data.
By checking performance on test data (which the model never trained on), we can detect this.
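The split below is done by index, not by shuffling: train and test are two contiguous, non-overlapping slices of the tensor. A quick sanity-check sketch on a hypothetical toy tensor (`toy_data` is a stand-in for the encoded text):

```python
import torch

# Toy tensor standing in for the encoded text
toy_data = torch.arange(100)

split = int(0.9 * len(toy_data))
toy_train = toy_data[:split]
toy_test = toy_data[split:]

# The two slices partition the data exactly: disjoint, and together
# they reconstruct the original tensor.
assert len(toy_train) + len(toy_test) == len(toy_data)
assert torch.equal(torch.cat([toy_train, toy_test]), toy_data)
print(len(toy_train), len(toy_test))  # 90 10
```

Keeping the split contiguous (rather than shuffled) preserves whole passages in each set, which matters for character-level language modeling.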
split_point = int(0.9 * len(data))
split_point
Output:
1902
training_data = data[:split_point]
training_data
Output:
tensor([ 4, 1, 17, ..., 26, 18, 16])
decode(training_data.tolist())
Output:
'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mec'
test_data = data[split_point:]
test_data
Output:
tensor([21, 14, 27, 22, 32, 26, 1, 29, 14, 34, 32, 18, 1, 14, 27, 17, 1, 14,
32, 24, 1, 36, 21, 14, 33, 1, 32, 21, 28, 34, 25, 17, 1, 16, 21, 14,
27, 20, 18, 2, 1, 14, 1, 15, 34, 20, 1, 33, 28, 1, 15, 18, 1, 19,
22, 37, 18, 17, 2, 1, 14, 1, 34, 32, 18, 31, 1, 33, 28, 1, 15, 18,
1, 36, 14, 31, 27, 18, 17, 2, 1, 14, 1, 17, 18, 32, 22, 20, 27, 1,
33, 28, 1, 15, 18, 1, 14, 26, 18, 27, 17, 18, 17, 3, 1, 12, 21, 18,
1, 33, 31, 34, 18, 1, 26, 14, 24, 18, 31, 1, 28, 19, 1, 26, 14, 16,
21, 22, 27, 18, 32, 1, 25, 18, 14, 35, 18, 32, 1, 33, 21, 22, 27, 20,
32, 1, 15, 18, 33, 33, 18, 31, 1, 19, 28, 31, 1, 21, 14, 35, 22, 27,
20, 1, 25, 28, 28, 24, 18, 17, 2, 1, 27, 28, 33, 1, 32, 22, 26, 29,
25, 38, 1, 19, 28, 31, 1, 21, 14, 35, 22, 27, 20, 1, 16, 31, 14, 16,
24, 18, 17, 1, 33, 21, 18, 1, 32, 21, 18, 25, 25, 3])
decode(test_data.tolist())
Output:
'hanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'
Block size (also called context length) is the maximum number of previous characters the model can "see" when making a prediction.
For example, with block_size = 8:
- The model sees at most 8 characters of history.
- Given 'hello wo', it predicts the next character ('r' for 'world').
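The idea can be sketched with a toy string in place of the encoded tensor: every length-8 window is one context, and the character just after it is the target. (`toy_text` is a hypothetical stand-in, not data from the notebook.)

```python
toy_text = "hello world"
block_size = 8

# Slide a fixed-length window across the sequence; each window of
# block_size characters predicts the character that follows it.
windows = []
for start in range(len(toy_text) - block_size):
    context = toy_text[start:start + block_size]
    target = toy_text[start + block_size]
    windows.append((context, target))
    print(f"{context!r} -> {target!r}")
```

This prints three (context, target) pairs, ending with 'hello wo' predicting 'r', exactly the example above.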
block_size = 8  # context window
block_size
Output:
8
chunk = training_data[:block_size+1]
chunk
Output:
tensor([ 4, 1, 17, 22, 26, 1, 20, 25, 28])
decode(chunk.tolist())
Output:
'A dim glo'
From one chunk of 9 chars, we actually get 8 training examples.
Each example has the following.
- Input (x): Some context chars (1 to 8 chars).
- Target (y): The next char after the context.
This is how language models learn; given context, predict the next token.
x = training_data[:block_size]  # chars 0,1,2,3,4,5,6,7
x
Output:
tensor([ 4, 1, 17, 22, 26, 1, 20, 25])
decode(x.tolist())
Output:
'A dim gl'
y = training_data[1:block_size+1]  # chars 1,2,3,4,5,6,7,8
y
Output:
tensor([ 1, 17, 22, 26, 1, 20, 25, 28])
decode(y.tolist())
Output:
' dim glo'
print(f"{'Example':<8} {'Context (Input)':<30} {'Target':<10} {'Decoded'}")
print("=" * 70)
for token in range(block_size):
    context = x[:token+1]
    target_char = y[token]
    context_decoded = decode(context.tolist())
    target_char_decoded = decode([target_char.item()])
    print(f" {token+1:<6} {str(context.tolist()):<30} {target_char.item():<10} '{context_decoded}' → '{target_char_decoded}'")
Output:
Example Context (Input) Target Decoded
======================================================================
1 [4] 1 'A' → ' '
2 [4, 1] 17 'A ' → 'd'
3 [4, 1, 17] 22 'A d' → 'i'
4 [4, 1, 17, 22] 26 'A di' → 'm'
5 [4, 1, 17, 22, 26] 1 'A dim' → ' '
6 [4, 1, 17, 22, 26, 1] 20 'A dim ' → 'g'
7 [4, 1, 17, 22, 26, 1, 20] 25 'A dim g' → 'l'
8 [4, 1, 17, 22, 26, 1, 20, 25] 28 'A dim gl' → 'o'
Processing one example at a time is inefficient. GPUs excel at parallel processing, so we group multiple examples into a batch.
| Concept | Value | Meaning |
|---|---|---|
| batch_size | 4 | process 4 independent sequences in parallel |
| block_size | 8 | each sequence is 8 tokens long |
So one batch processes 4 × 8 = 32 total predictions!
torch.manual_seed(42)
Output:
<torch._C.Generator at 0x117e8e4f0>
batch_size = 4  # number of independent sequences per batch
batch_size
Output:
4
block_size = 8  # max context length
block_size
Output:
8
def get_batch(split):
    """Generate a batch of training data.
    Args:
        split (str): 'train' or 'test' - which dataset to sample from
    Returns:
        x: Input tensor of shape (batch_size, block_size)
        y: Target tensor of shape (batch_size, block_size)
    """
    # choose the dataset based on `split`
    data = training_data if split == 'train' else test_data
    # make sure we have room to sample a full block
    if len(data) <= block_size:
        raise ValueError('Dataset is too small for the configured block_size!')
    # highest valid start index so that start + block_size is in-range
    max_start_index = len(data) - block_size
    # sample `batch_size` random start positions in [0, max_start_index)
    start_positions = torch.randint(0, max_start_index, (batch_size,))
    # build the input (x) and target (y) batches
    x = torch.stack([data[pos:pos + block_size] for pos in start_positions])
    y = torch.stack([data[pos + 1: pos + block_size + 1] for pos in start_positions])
    # return the input and target batches
    return x, y

x_batch, y_batch = get_batch('train')
x_batch
Output:
tensor([[33, 32, 1, 31, 21, 18, 33, 28],
[17, 1, 34, 27, 17, 18, 31, 32],
[ 1, 36, 21, 18, 31, 18, 1, 22],
[25, 1, 33, 21, 22, 27, 20, 32]])
for row in x_batch.tolist():
    print(decode(row))
Output:
ts rheto
d unders
where i
l things
x_batch.dtype
Output:
torch.int64
x_batch.shape
Output:
torch.Size([4, 8])
x_batch.ndim
Output:
2
y_batch
Output:
tensor([[32, 1, 31, 21, 18, 33, 28, 31],
[ 1, 34, 27, 17, 18, 31, 32, 33],
[36, 21, 18, 31, 18, 1, 22, 33],
[ 1, 33, 21, 22, 27, 20, 32, 2]])
for row in y_batch.tolist():
    print(decode(row))
Output:
s rhetor
underst
where it
things,
y_batch.dtype
Output:
torch.int64
y_batch.shape
Output:
torch.Size([4, 8])
y_batch.ndim
Output:
2
f'{x_batch.shape} → x_batch (inputs): {batch_size} sequences x {block_size} tokens each'
Output:
'torch.Size([4, 8]) → x_batch (inputs): 4 sequences x 8 tokens each'
f'{y_batch.shape} → y_batch (targets): {batch_size} sequences x {block_size} tokens each'
Output:
'torch.Size([4, 8]) → y_batch (targets): 4 sequences x 8 tokens each'
print('Batch Inputs (x_batch):')
print('-' * 60)
for batch in range(batch_size):
    tokens = x_batch[batch].tolist()
    decoded = decode(tokens)
    print(f'Sequence {batch}: {tokens}')
    print(f'  → "{decoded}"')
Output:
Batch Inputs (x_batch):
------------------------------------------------------------
Sequence 0: [33, 32, 1, 31, 21, 18, 33, 28]
→ "ts rheto"
Sequence 1: [17, 1, 34, 27, 17, 18, 31, 32]
→ "d unders"
Sequence 2: [1, 36, 21, 18, 31, 18, 1, 22]
→ " where i"
Sequence 3: [25, 1, 33, 21, 22, 27, 20, 32]
→ "l things"
print('Batch Targets (y_batch):')
print('-' * 60)
for batch in range(batch_size):
    tokens = y_batch[batch].tolist()
    decoded = decode(tokens)
    print(f'Sequence {batch}: {tokens}')
    print(f'  → "{decoded}"')
Output:
Batch Targets (y_batch):
------------------------------------------------------------
Sequence 0: [32, 1, 31, 21, 18, 33, 28, 31]
→ "s rhetor"
Sequence 1: [1, 34, 27, 17, 18, 31, 32, 33]
→ " underst"
Sequence 2: [36, 21, 18, 31, 18, 1, 22, 33]
→ "where it"
Sequence 3: [1, 33, 21, 22, 27, 20, 32, 2]
→ " things,"
print('All 32 Training Examples in this Batch.')
print('=' * 60)
example_num = 0
# loop over sequences in batch
for batch in range(batch_size):
    # loop over positions in sequence
    for token in range(block_size):
        example_num += 1
        # context: all tokens up to token position
        context = x_batch[batch, :token+1]
        # target: the next token
        target = y_batch[batch, token]
        context_decoded = decode(context.tolist())
        target_decoded = decode([target.item()])
        # truncate long contexts for display
        context_string = context_decoded if len(context_decoded) <= 10 else '...' + context_decoded[-7:]
        print(f'Example {example_num:2d} (seq {batch}, pos {token}): "{context_string}" → "{target_decoded}"')
    print()  # blank line between sequences
print('=' * 60)
print(f'Total: {batch_size} x {block_size} = {batch_size * block_size} training examples per batch processed in parallel!')
Output:
All 32 Training Examples in this Batch.
============================================================
Example 1 (seq 0, pos 0): "t" → "s"
Example 2 (seq 0, pos 1): "ts" → " "
Example 3 (seq 0, pos 2): "ts " → "r"
Example 4 (seq 0, pos 3): "ts r" → "h"
Example 5 (seq 0, pos 4): "ts rh" → "e"
Example 6 (seq 0, pos 5): "ts rhe" → "t"
Example 7 (seq 0, pos 6): "ts rhet" → "o"
Example 8 (seq 0, pos 7): "ts rheto" → "r"
Example 9 (seq 1, pos 0): "d" → " "
Example 10 (seq 1, pos 1): "d " → "u"
Example 11 (seq 1, pos 2): "d u" → "n"
Example 12 (seq 1, pos 3): "d un" → "d"
Example 13 (seq 1, pos 4): "d und" → "e"
Example 14 (seq 1, pos 5): "d unde" → "r"
Example 15 (seq 1, pos 6): "d under" → "s"
Example 16 (seq 1, pos 7): "d unders" → "t"
Example 17 (seq 2, pos 0): " " → "w"
Example 18 (seq 2, pos 1): " w" → "h"
Example 19 (seq 2, pos 2): " wh" → "e"
Example 20 (seq 2, pos 3): " whe" → "r"
Example 21 (seq 2, pos 4): " wher" → "e"
Example 22 (seq 2, pos 5): " where" → " "
Example 23 (seq 2, pos 6): " where " → "i"
Example 24 (seq 2, pos 7): " where i" → "t"
Example 25 (seq 3, pos 0): "l" → " "
Example 26 (seq 3, pos 1): "l " → "t"
Example 27 (seq 3, pos 2): "l t" → "h"
Example 28 (seq 3, pos 3): "l th" → "i"
Example 29 (seq 3, pos 4): "l thi" → "n"
Example 30 (seq 3, pos 5): "l thin" → "g"
Example 31 (seq 3, pos 6): "l thing" → "s"
Example 32 (seq 3, pos 7): "l things" → ","
============================================================
Total: 4 x 8 = 32 training examples per batch processed in parallel!
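To recap, everything in Part 1 can be condensed into one self-contained sketch. It uses a short inline string instead of input.txt so it runs anywhere; the string and the lambda-style encode/decode are stand-ins for the notebook's versions, not the notebook's exact code.

```python
import torch

torch.manual_seed(42)

# Inline corpus standing in for input.txt
text = "A dim glow rises behind the glass of a screen and the machine exhales."

# 1. Build the character vocabulary and lookup tables
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

# 2. Encode the text as a tensor and split it 90/10
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
training_data, test_data = data[:n], data[n:]

# 3. Sample a batch of (context, target) pairs
batch_size, block_size = 4, 8

def get_batch(split):
    src = training_data if split == 'train' else test_data
    starts = torch.randint(0, len(src) - block_size, (batch_size,))
    x = torch.stack([src[i:i + block_size] for i in starts])
    y = torch.stack([src[i + 1:i + block_size + 1] for i in starts])
    return x, y

x, y = get_batch('train')
print(x.shape, y.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

These (x, y) batches are exactly what the transformer in Part 2 consumes.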
