mytechnotalent/HackingGPT-1


FREE Reverse Engineering Self-Study Course HERE


HackingGPT

Part 1

Part 1 covers character-level tokenization, encoding and decoding, train/test splits, and batching: transforming input.txt into tensors ready for a transformer.

Author: Kevin Thomas


Part 2 HERE



import torch

Step 1: Load and Inspect the Data

Now let's read the file and see what we're working with. Understanding your data is crucial before building any model!

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text

Output:

'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. 
These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mechanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'
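The plain open call above raises FileNotFoundError if input.txt is absent. A slightly defensive variant (the fallback string here is a made-up placeholder, not part of the course) might look like:

```python
from pathlib import Path

path = Path('input.txt')  # the corpus file used throughout this lesson
if path.exists():
    text = path.read_text(encoding='utf-8')
else:
    # hypothetical fallback so later cells still run while experimenting
    text = 'A dim glow rises behind the glass of a screen.'
print(f'{len(text)} characters loaded')
```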

Step 2: Tokenization

Tokenization is the process of converting raw text into numbers that a neural network can process.

chars = sorted(list(set(text)))
chars

Output:

['\n',
 ' ',
 ',',
 '.',
 'A',
 'C',
 'D',
 'E',
 'I',
 'L',
 'P',
 'R',
 'T',
 'W',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']
vocab_size = len(chars)
vocab_size

Output:

40
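The vocabulary build is just set for deduplication and sorted for a stable ordering; a toy string makes the behavior obvious:

```python
sample = 'banana'
vocab = sorted(set(sample))  # dedupe, then order deterministically
print(vocab)       # ['a', 'b', 'n']
print(len(vocab))  # 3
```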
for i, ch in enumerate(chars):
    if ch == '\n':
        print(f"   Index {i:2d}: '\\n' (newline)")
    elif ch == ' ':
        print(f"   Index {i:2d}: ' ' (space)")
    else:
        print(f"   Index {i:2d}: '{ch}'")

Output:

   Index  0: '\n' (newline)
   Index  1: ' ' (space)
   Index  2: ','
   Index  3: '.'
   Index  4: 'A'
   Index  5: 'C'
   Index  6: 'D'
   Index  7: 'E'
   Index  8: 'I'
   Index  9: 'L'
   Index 10: 'P'
   Index 11: 'R'
   Index 12: 'T'
   Index 13: 'W'
   Index 14: 'a'
   Index 15: 'b'
   Index 16: 'c'
   Index 17: 'd'
   Index 18: 'e'
   Index 19: 'f'
   Index 20: 'g'
   Index 21: 'h'
   Index 22: 'i'
   Index 23: 'j'
   Index 24: 'k'
   Index 25: 'l'
   Index 26: 'm'
   Index 27: 'n'
   Index 28: 'o'
   Index 29: 'p'
   Index 30: 'q'
   Index 31: 'r'
   Index 32: 's'
   Index 33: 't'
   Index 34: 'u'
   Index 35: 'v'
   Index 36: 'w'
   Index 37: 'x'
   Index 38: 'y'
   Index 39: 'z'

Step 3: Create Encoder and Decoder

This is how we convert between human-readable text and model-readable numbers with two functions.

  1. Encoder: text → list of integers
  2. Decoder: list of integers → text
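The loop-based builders below are perfectly clear; for reference, equivalent dict comprehensions (sketched on a toy string so they don't disturb the real stoi/itos) give the same mappings:

```python
sample = 'hello world'
sample_chars = sorted(set(sample))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}  # char → index
sample_itos = {i: ch for ch, i in sample_stoi.items()}      # index → char

encoded = [sample_stoi[c] for c in sample]
decoded = ''.join(sample_itos[i] for i in encoded)
assert decoded == sample  # encode then decode is a lossless round trip
```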
stoi = {}
for i, char in enumerate(chars):
    stoi[char] = i
stoi

Output:

{'\n': 0,
 ' ': 1,
 ',': 2,
 '.': 3,
 'A': 4,
 'C': 5,
 'D': 6,
 'E': 7,
 'I': 8,
 'L': 9,
 'P': 10,
 'R': 11,
 'T': 12,
 'W': 13,
 'a': 14,
 'b': 15,
 'c': 16,
 'd': 17,
 'e': 18,
 'f': 19,
 'g': 20,
 'h': 21,
 'i': 22,
 'j': 23,
 'k': 24,
 'l': 25,
 'm': 26,
 'n': 27,
 'o': 28,
 'p': 29,
 'q': 30,
 'r': 31,
 's': 32,
 't': 33,
 'u': 34,
 'v': 35,
 'w': 36,
 'x': 37,
 'y': 38,
 'z': 39}
itos = {}
for i, char in enumerate(chars):
    itos[i] = char
itos

Output:

{0: '\n',
 1: ' ',
 2: ',',
 3: '.',
 4: 'A',
 5: 'C',
 6: 'D',
 7: 'E',
 8: 'I',
 9: 'L',
 10: 'P',
 11: 'R',
 12: 'T',
 13: 'W',
 14: 'a',
 15: 'b',
 16: 'c',
 17: 'd',
 18: 'e',
 19: 'f',
 20: 'g',
 21: 'h',
 22: 'i',
 23: 'j',
 24: 'k',
 25: 'l',
 26: 'm',
 27: 'n',
 28: 'o',
 29: 'p',
 30: 'q',
 31: 'r',
 32: 's',
 33: 't',
 34: 'u',
 35: 'v',
 36: 'w',
 37: 'x',
 38: 'y',
 39: 'z'}
def encode(string):
    """Encode a string into a list of token indices.

    Args:
        string (str): input text to encode

    Returns:
        list[int]: list of integer token indices
    """
    encoded = []
    for char in string:
        encoded.append(stoi[char])
    return encoded
def decode(lst):
    """Convert a list of token indices into a string.

    Args:
        lst (list[int]): list of integer token indices

    Returns:
        str: the decoded string
    """
    chars = []
    for i in lst:
        chars.append(itos[i])
    return ''.join(chars)
encoded_text = encode('hello world')
encoded_text

Output:

[21, 18, 25, 25, 28, 1, 36, 28, 31, 25, 17]
decoded_text = decode(encoded_text)
decoded_text

Output:

'hello world'
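Note that encode raises KeyError for any character outside the vocabulary (this 40-character corpus has no digits, '!' or '?'). A tolerant variant, sketched here with a hypothetical unk substitute, might drop or remap unknowns:

```python
def encode_safe(string, stoi, unk=None):
    """Encode a string, substituting `unk` for out-of-vocabulary characters.

    If `unk` is None, unknown characters are silently dropped instead.
    """
    out = []
    for ch in string:
        if ch in stoi:
            out.append(stoi[ch])
        elif unk is not None:
            out.append(unk)
    return out

toy_stoi = {'a': 0, 'b': 1}  # tiny illustrative vocabulary
print(encode_safe('abc', toy_stoi))         # [0, 1]  ('c' dropped)
print(encode_safe('abc', toy_stoi, unk=2))  # [0, 1, 2]
```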

Step 4: Convert Entire Dataset to Tensors

Now we'll encode the entire text and store it as a PyTorch tensor.

encoded_text = encode(text)
encoded_text

Output:

[4, 1, 17, 22, 26, 1, 20, 25, 28, 36, 1, 31, 22, 32, 18, 32, 1, 15, 18, 21, 22, 27, 17, 1, 33, 21, 18, ...]

(output truncated; the full list holds 2,114 token indices)
data = torch.tensor(encoded_text, dtype=torch.long)
data

Output:

tensor([ 4,  1, 17,  ..., 25, 25,  3])
data.dtype

Output:

torch.int64
data.shape

Output:

torch.Size([2114])
data.ndim

Output:

1
decode(data.tolist())

Output:

'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. 
These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mechanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'
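A quick sanity check on what the tensor conversion preserves (torch.long is simply an alias for torch.int64):

```python
import torch

ids = [4, 1, 17, 22, 26]  # the first few token indices from above
t = torch.tensor(ids, dtype=torch.long)
assert t.dtype == torch.int64  # torch.long and torch.int64 are the same dtype
assert t.tolist() == ids       # the conversion is lossless
print(t.shape)                 # torch.Size([5])
```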

Step 5: Train/Test Split

  • Training Set (90%): Used to train the model.
  • Test Set (10%): Used to check if the model is overfitting.

Overfitting = Model memorizes training data but can't generalize to new data.

By checking performance on test data (which the model never trained on), we can detect this.
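The split itself is just integer arithmetic plus slicing; the same 90/10 logic on a toy tensor:

```python
import torch

data = torch.arange(20)             # toy stand-in for the encoded corpus
split_point = int(0.9 * len(data))  # 18
train, test = data[:split_point], data[split_point:]
print(len(train), len(test))        # 18 2
```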

split_point = int(0.9 * len(data))
split_point

Output:

1902
training_data = data[:split_point]
training_data

Output:

tensor([ 4,  1, 17,  ..., 26, 18, 16])
decode(training_data.tolist())

Output:

'A dim glow rises behind the glass of a screen and the machine exhales in binary tides. The hum is a language and one who listens leans close to catch the quiet grammar. Patterns fold like small maps and seams hint at how the thing holds itself together. Treat each blinking diode and each idle tick as a sentence in a story that asks to be read.\n\nThere is patience here, not of haste but of careful unthreading. Where others see a sealed box the curious hand traces the join and wonders which thought made it fit. Do not rush to break, coax the meaning out with questions, and watch how the logic replies in traces and errors and in the echoes of forgotten interfaces.\n\nTechnology is artifact and argument at once. It makes a claim about what should be simple, what should be hidden, and what should be trusted. Reverse the gaze and learn its rhetoric, see where it promises ease, where it buries complexity, and where it leaves a backdoor as a sigh between bricks. To read that rhetoric is to be a kind interpreter, not a vandal.\n\nThis work is an apprenticeship in humility. Expect bafflement and expect to be corrected by small things, a timing oddity, a mismatch of expectation, a choice that favors speed over grace. Each misstep teaches a vocabulary of trade offs. Each discovery is a map of decisions and not a verdict on worth.\n\nThere is a moral keeping in the craft. Let curiosity be tempered with regard for consequence. Let repair and understanding lead rather than exploitation. The skill that opens a lock should also know when to hold the key and when to hand it back, mindful of harm and mindful of help.\n\nCelebrate the quiet victories, a stubborn protocol understood, an obscure format rendered speakable, a closed device coaxed into cooperation. These are small reconciliations between human intent and metal will, acts of translation rather than acts of conquest.\n\nAfter decoding a mec'
test_data = data[split_point:]
test_data

Output:

tensor([21, 14, 27, 22, 32, 26,  1, 29, 14, 34, 32, 18,  1, 14, 27, 17,  1, 14,
        32, 24,  1, 36, 21, 14, 33,  1, 32, 21, 28, 34, 25, 17,  1, 16, 21, 14,
        27, 20, 18,  2,  1, 14,  1, 15, 34, 20,  1, 33, 28,  1, 15, 18,  1, 19,
        22, 37, 18, 17,  2,  1, 14,  1, 34, 32, 18, 31,  1, 33, 28,  1, 15, 18,
         1, 36, 14, 31, 27, 18, 17,  2,  1, 14,  1, 17, 18, 32, 22, 20, 27,  1,
        33, 28,  1, 15, 18,  1, 14, 26, 18, 27, 17, 18, 17,  3,  1, 12, 21, 18,
         1, 33, 31, 34, 18,  1, 26, 14, 24, 18, 31,  1, 28, 19,  1, 26, 14, 16,
        21, 22, 27, 18, 32,  1, 25, 18, 14, 35, 18, 32,  1, 33, 21, 22, 27, 20,
        32,  1, 15, 18, 33, 33, 18, 31,  1, 19, 28, 31,  1, 21, 14, 35, 22, 27,
        20,  1, 25, 28, 28, 24, 18, 17,  2,  1, 27, 28, 33,  1, 32, 22, 26, 29,
        25, 38,  1, 19, 28, 31,  1, 21, 14, 35, 22, 27, 20,  1, 16, 31, 14, 16,
        24, 18, 17,  1, 33, 21, 18,  1, 32, 21, 18, 25, 25,  3])
decode(test_data.tolist())

Output:

'hanism pause and ask what should change, a bug to be fixed, a user to be warned, a design to be amended. The true maker of machines leaves things better for having looked, not simply for having cracked the shell.'

Step 6: Understanding Context (Block Size)

Block size (also called context length) is the maximum number of previous characters the model can "see" when making a prediction.

For example, with block_size = 8:

  • The model sees at most 8 characters of history.
  • Given 'hello wo', it predicts the next character ('r' for 'world').
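The same idea on a toy string with a smaller window: each position predicts the next character from at most block_size characters of history.

```python
text = 'hello'
block_size = 3

# (context, target) pairs from a sliding window capped at block_size
pairs = [(text[max(0, t + 1 - block_size):t + 1], text[t + 1])
         for t in range(len(text) - 1)]
for context, target in pairs:
    print(repr(context), '->', repr(target))
# 'h' -> 'e', 'he' -> 'l', 'hel' -> 'l', 'ell' -> 'o'
```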
block_size = 8  # context window
block_size

Output:

8
chunk = training_data[:block_size+1]
chunk

Output:

tensor([ 4,  1, 17, 22, 26,  1, 20, 25, 28])
decode(chunk.tolist())

Output:

'A dim glo'

Step 7: Training Examples from One Chunk

From one chunk of 9 chars, we actually get 8 training examples.

Each example has two parts.

  • Input (x): some context characters (1 to 8 of them).
  • Target (y): the next character after the context.

This is how language models learn: given context, predict the next token.

x = training_data[:block_size]  # chars 0,1,2,3,4,5,6,7
x

Output:

tensor([ 4,  1, 17, 22, 26,  1, 20, 25])
decode(x.tolist())

Output:

'A dim gl'
y = training_data[1:block_size+1]  # chars 1,2,3,4,5,6,7,8
y

Output:

tensor([ 1, 17, 22, 26,  1, 20, 25, 28])
decode(y.tolist())

Output:

' dim glo'
print(f"{'Example':<8} {'Context (Input)':<30} {'Target':<10} {'Decoded'}")
print("=" * 70)
for token in range(block_size):
    context = x[:token+1]
    target_char = y[token]
    context_decoded = decode(context.tolist())
    target_char_decoded = decode([target_char.item()])
    print(f"  {token+1:<6} {str(context.tolist()):<30} {target_char.item():<10} '{context_decoded}' → '{target_char_decoded}'")

Output:

Example  Context (Input)                Target     Decoded
======================================================================
  1      [4]                            1          'A' → ' '
  2      [4, 1]                         17         'A ' → 'd'
  3      [4, 1, 17]                     22         'A d' → 'i'
  4      [4, 1, 17, 22]                 26         'A di' → 'm'
  5      [4, 1, 17, 22, 26]             1          'A dim' → ' '
  6      [4, 1, 17, 22, 26, 1]          20         'A dim ' → 'g'
  7      [4, 1, 17, 22, 26, 1, 20]      25         'A dim g' → 'l'
  8      [4, 1, 17, 22, 26, 1, 20, 25]  28         'A dim gl' → 'o'

Step 8: Batching

Processing one example at a time is inefficient. GPUs excel at parallel processing, so we group multiple examples into a batch.

Concept      Value   Meaning
batch_size   4       process 4 independent sequences in parallel
block_size   8       each sequence is 8 tokens long

So one batch processes 4 × 8 = 32 total predictions!

torch.manual_seed(42)

Output:

<torch._C.Generator at 0x117e8e4f0>
batch_size = 4  # number of independent sequences per batch
batch_size

Output:

4
block_size = 8  # max context length
block_size

Output:

8
def get_batch(split):
    """
    Generate a batch of training data (friendly names).

    Args:
        split (str): 'train' or 'val' - which dataset to sample from

    Returns:
        x: Input tensor of shape (batch_size, block_size)
        y: Target tensor of shape (batch_size, block_size)
    """
    # choose the dataset based on `split`
    data = training_data if split == 'train' else test_data

    # make sure we have room to sample a full block
    if len(data) <= block_size:
        raise ValueError('Dataset is too small for the configured block_size!')

    # highest valid start index so that start + block_size is in-range
    max_start_index = len(data) - block_size

    # sample `batch_size` random start positions in [0, max_start_index)
    start_positions = torch.randint(0, max_start_index, (batch_size,))

    # build the input (x) and target (y) batches
    x = torch.stack([data[pos:pos + block_size] for pos in start_positions])
    y = torch.stack([data[pos + 1: pos + block_size + 1] for pos in start_positions])

    # return the input and target batches
    return x, y
x_batch, y_batch = get_batch('train')
x_batch

Output:

tensor([[33, 32,  1, 31, 21, 18, 33, 28],
        [17,  1, 34, 27, 17, 18, 31, 32],
        [ 1, 36, 21, 18, 31, 18,  1, 22],
        [25,  1, 33, 21, 22, 27, 20, 32]])
for row in x_batch.tolist():
    print(decode(row))

Output:

ts rheto
d unders
 where i
l things

x_batch.dtype

Output:

torch.int64
x_batch.shape

Output:

torch.Size([4, 8])
x_batch.ndim

Output:

2
y_batch

Output:

tensor([[32,  1, 31, 21, 18, 33, 28, 31],
        [ 1, 34, 27, 17, 18, 31, 32, 33],
        [36, 21, 18, 31, 18,  1, 22, 33],
        [ 1, 33, 21, 22, 27, 20, 32,  2]])
for row in y_batch.tolist():
    print(decode(row))

Output:

s rhetor
 underst
where it
 things,

y_batch.dtype

Output:

torch.int64
y_batch.shape

Output:

torch.Size([4, 8])
y_batch.ndim

Output:

2
f'{x_batch.shape} → x_batch (inputs): {batch_size} sequences x {block_size} tokens each'

Output:

'torch.Size([4, 8]) → x_batch (inputs): 4 sequences x 8 tokens each'
f'{y_batch.shape} → y_batch (targets): {batch_size} sequences x {block_size} tokens each'

Output:

'torch.Size([4, 8]) → y_batch (targets): 4 sequences x 8 tokens each'
print('Batch Inputs (x_batch):')
print('-' * 60)
for batch in range(batch_size):
    tokens = x_batch[batch].tolist()
    decoded = decode(tokens)
    print(f'Sequence {batch}: {tokens}')
    print(f'             → "{decoded}"')

Output:

Batch Inputs (x_batch):
------------------------------------------------------------
Sequence 0: [33, 32, 1, 31, 21, 18, 33, 28]
             → "ts rheto"
Sequence 1: [17, 1, 34, 27, 17, 18, 31, 32]
             → "d unders"
Sequence 2: [1, 36, 21, 18, 31, 18, 1, 22]
             → " where i"
Sequence 3: [25, 1, 33, 21, 22, 27, 20, 32]
             → "l things"

print('Batch Targets (y_batch):')
print('-' * 60)
for batch in range(batch_size):
    tokens = y_batch[batch].tolist()
    decoded = decode(tokens)
    print(f'Sequence {batch}: {tokens}')
    print(f'             → "{decoded}"')

Output:

Batch Targets (y_batch):
------------------------------------------------------------
Sequence 0: [32, 1, 31, 21, 18, 33, 28, 31]
             → "s rhetor"
Sequence 1: [1, 34, 27, 17, 18, 31, 32, 33]
             → " underst"
Sequence 2: [36, 21, 18, 31, 18, 1, 22, 33]
             → "where it"
Sequence 3: [1, 33, 21, 22, 27, 20, 32, 2]
             → " things,"

print('All 32 Training Examples in this Batch.')
print('=' * 60)

example_num = 0
# loop over sequences in batch
for batch in range(batch_size):  
    # loop over positions in sequence
    for token in range(block_size):  
        example_num += 1
        # context: all tokens up to token position
        context = x_batch[batch, :token+1]  
        # target: the next token
        target = y_batch[batch, token]  
        context_decoded = decode(context.tolist())
        target_decoded = decode([target.item()])
        # truncate long contexts for display
        context_string = context_decoded if len(context_decoded) <= 10 else '...' + context_decoded[-7:]
        print(f'Example {example_num:2d} (seq {batch}, pos {token}): "{context_string}" → "{target_decoded}"')
    print()  # Blank line between sequences

print('=' * 60)
print(f'Total: {batch_size} x {block_size} = {batch_size * block_size} training examples per batch processed in parallel!')

Output:

All 32 Training Examples in this Batch.
============================================================
Example  1 (seq 0, pos 0): "t" → "s"
Example  2 (seq 0, pos 1): "ts" → " "
Example  3 (seq 0, pos 2): "ts " → "r"
Example  4 (seq 0, pos 3): "ts r" → "h"
Example  5 (seq 0, pos 4): "ts rh" → "e"
Example  6 (seq 0, pos 5): "ts rhe" → "t"
Example  7 (seq 0, pos 6): "ts rhet" → "o"
Example  8 (seq 0, pos 7): "ts rheto" → "r"

Example  9 (seq 1, pos 0): "d" → " "
Example 10 (seq 1, pos 1): "d " → "u"
Example 11 (seq 1, pos 2): "d u" → "n"
Example 12 (seq 1, pos 3): "d un" → "d"
Example 13 (seq 1, pos 4): "d und" → "e"
Example 14 (seq 1, pos 5): "d unde" → "r"
Example 15 (seq 1, pos 6): "d under" → "s"
Example 16 (seq 1, pos 7): "d unders" → "t"

Example 17 (seq 2, pos 0): " " → "w"
Example 18 (seq 2, pos 1): " w" → "h"
Example 19 (seq 2, pos 2): " wh" → "e"
Example 20 (seq 2, pos 3): " whe" → "r"
Example 21 (seq 2, pos 4): " wher" → "e"
Example 22 (seq 2, pos 5): " where" → " "
Example 23 (seq 2, pos 6): " where " → "i"
Example 24 (seq 2, pos 7): " where i" → "t"

Example 25 (seq 3, pos 0): "l" → " "
Example 26 (seq 3, pos 1): "l " → "t"
Example 27 (seq 3, pos 2): "l t" → "h"
Example 28 (seq 3, pos 3): "l th" → "i"
Example 29 (seq 3, pos 4): "l thi" → "n"
Example 30 (seq 3, pos 5): "l thin" → "g"
Example 31 (seq 3, pos 6): "l thing" → "s"
Example 32 (seq 3, pos 7): "l things" → ","

============================================================
Total: 4 x 8 = 32 training examples per batch processed in parallel!
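Everything in Part 1 condenses into one short, self-contained sketch; the toy text, seed, and sizes here are illustrative stand-ins for input.txt and the values above:

```python
import torch

text = 'a tiny corpus for a tiny model'  # stand-in for input.txt
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

split_point = int(0.9 * len(data))      # 90/10 train/test split
train_data, test_data = data[:split_point], data[split_point:]

block_size, batch_size = 4, 2           # small illustrative values
torch.manual_seed(0)
ix = torch.randint(0, len(train_data) - block_size, (batch_size,))
x = torch.stack([train_data[i:i + block_size] for i in ix])
y = torch.stack([train_data[i + 1:i + block_size + 1] for i in ix])
print(x.shape, y.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```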

MIT License

About

HackingGPT part 1 where we learn the foundations to create a custom GPT from absolute scratch.
