feat: codebook Compile — prose-in token abbreviation map #15

Merged
pythondatascrape merged 1 commit into main from feat/codebook-compile on May 6, 2026
Conversation

@pythondatascrape
Owner

Summary

Test plan

  • High-frequency tokens extracted and abbreviated
  • Short tokens (≤6 chars) excluded
  • Rare tokens (freq=1) excluded
  • Abbreviations are unique within the map
  • Empty input returns empty map
  • Case-insensitive counting (Authentication/AUTHENTICATION count as same token)

🤖 Generated with Claude Code

Scans prose for high-frequency tokens (len > 6, freq > 1) and returns
a substitution map of original → unique short abbreviation, enabling
the prose-in codebook gateway described in issue #7.
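
As a rough illustration of how a map like this might be consumed downstream, here is a minimal sketch. `applySubstitutions` is a hypothetical helper and not part of this PR; it assumes the same tokenization shown in the diff (whitespace-split fields, the same punctuation trim set, lowercased lookup) and collapses runs of whitespace to single spaces.

```go
package main

import "strings"

// punct mirrors the trim set used by Compile in the quoted diff.
const punct = ".,;:!?\"'()[]{}"

// applySubstitutions is a hypothetical helper (not part of this PR).
// It replaces each whitespace-delimited token whose trimmed, lowercased
// form is a key in subs, keeping surrounding punctuation in place.
// Runs of whitespace are collapsed to a single space.
func applySubstitutions(text string, subs map[string]string) string {
	fields := strings.Fields(text)
	for i, f := range fields {
		core := strings.Trim(f, punct)
		abbr, ok := subs[strings.ToLower(core)]
		if !ok {
			continue
		}
		// Number of leading punctuation bytes stripped by the trim.
		lead := len(f) - len(strings.TrimLeft(f, punct))
		fields[i] = f[:lead] + abbr + f[lead+len(core):]
	}
	return strings.Join(fields, " ")
}
```

Whether substitution should preserve the original whitespace and casing is a design decision for the gateway in issue #7; this sketch does neither.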

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 14:24

Copilot AI left a comment
Pull request overview

Adds a new internal/identity/codebook authoring utility that derives abbreviation substitutions from free-form prose, intended as a building block for the “prose-in” codebook workflow discussed in #7.

Changes:

  • Introduces codebook.Compile(text string) (map[string]string, error) to extract high-frequency tokens and generate unique abbreviations.
  • Adds unit tests covering frequency/length filters, case normalization, uniqueness, and empty input.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| internal/identity/codebook/compile.go | Implements token frequency scanning and abbreviation generation for substitution maps. |
| internal/identity/codebook/compile_test.go | Adds tests validating extraction rules, normalization, and uniqueness constraints. |
Comment on lines +25 to +46
	// Collect candidates: tokens that appear more than once.
	var candidates []string
	for tok, count := range freq {
		if count > 1 {
			candidates = append(candidates, tok)
		}
	}

	if len(candidates) == 0 {
		return map[string]string{}, nil
	}

	result := make(map[string]string, len(candidates))
	usedAbbrs := make(map[string]struct{})

	for _, tok := range candidates {
		abbr, err := uniqueAbbreviation(tok, usedAbbrs)
		if err != nil {
			return nil, err
		}
		result[tok] = abbr
		usedAbbrs[abbr] = struct{}{}
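
Go randomizes map iteration order, so `candidates` in the hunk above is built in a different order on each run. When two tokens compete for the same abbreviation, which one wins can therefore vary. One possible way to make the output reproducible is to sort the candidates before assigning abbreviations. The sketch below is an assumption, not the PR's code: `sortedCandidates` is a hypothetical helper, and the ordering (most frequent first, ties broken alphabetically) is just one reasonable choice.

```go
package main

import "sort"

// sortedCandidates is a hypothetical helper (not part of this PR).
// It returns the tokens with count > 1 in a deterministic order:
// higher frequency first, ties broken alphabetically.
func sortedCandidates(freq map[string]int) []string {
	candidates := make([]string, 0, len(freq))
	for tok, count := range freq {
		if count > 1 {
			candidates = append(candidates, tok)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		a, b := candidates[i], candidates[j]
		if freq[a] != freq[b] {
			return freq[a] > freq[b]
		}
		return a < b
	})
	return candidates
}
```

Ordering by frequency would give the most common tokens first pick of the shortest abbreviations, which seems to fit the compression goal, but a plain alphabetical sort would be equally deterministic.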
Comment on lines +62 to +70
	// Fall back: append a numeric suffix to a 3-char prefix.
	base := string(runes[:3])
	for i := 2; i <= len(runes); i++ {
		candidate := fmt.Sprintf("%s%d", base, i)
		if _, taken := used[candidate]; !taken {
			return candidate, nil
		}
	}
	return "", fmt.Errorf("could not generate unique abbreviation for %q", tok)
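
The fallback loop above tries suffixes 2 through `len(runes)`, so a seven-rune token gets only six attempts before the error is returned. In addition, `runes[:3]` would panic for a token shorter than three runes, which a byte-length filter does not rule out (assuming `runes` is `[]rune(tok)`; the earlier part of `uniqueAbbreviation` is not shown here). A minimal sketch of a more forgiving fallback follows. `numericFallback` and `maxSuffix` are hypothetical names, and the cap value is arbitrary.

```go
package main

import "fmt"

// maxSuffix is an arbitrary illustrative cap on suffix attempts.
const maxSuffix = 9999

// numericFallback is a hypothetical stand-in for the fallback branch
// (not the PR's code). It appends a numeric suffix to a prefix of up
// to three runes and decouples the attempt count from token length.
func numericFallback(tok string, used map[string]struct{}) (string, error) {
	runes := []rune(tok)
	n := 3
	if len(runes) < n {
		n = len(runes) // avoid slicing past the end of short tokens
	}
	base := string(runes[:n])
	for i := 2; i <= maxSuffix; i++ {
		candidate := fmt.Sprintf("%s%d", base, i)
		if _, taken := used[candidate]; !taken {
			return candidate, nil
		}
	}
	return "", fmt.Errorf("could not generate unique abbreviation for %q", tok)
}
```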
Comment on lines +8 to +21
// Compile scans prose for high-frequency multi-syllable tokens (len > 6, freq > 1)
// and returns a substitution map of original_token → short_abbreviation.
// The abbreviations are unique within the returned map.
func Compile(text string) (map[string]string, error) {
	if text == "" {
		return map[string]string{}, nil
	}

	// Count normalized (lowercased) token frequencies.
	freq := make(map[string]int)
	for _, tok := range strings.Fields(text) {
		tok = strings.ToLower(strings.Trim(tok, ".,;:!?\"'()[]{}"))
		if len(tok) > 6 {
			freq[tok]++
Comment on lines +18 to +22
	for _, tok := range strings.Fields(text) {
		tok = strings.ToLower(strings.Trim(tok, ".,;:!?\"'()[]{}"))
		if len(tok) > 6 {
			freq[tok]++
		}
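
Note that `len(tok)` in Go returns the number of bytes, not characters, so a six-character token containing a multi-byte rune (for example "façade", 7 bytes) passes the `len(tok) > 6` filter even though the test plan describes the threshold in characters. If a character count is what is intended, counting runes is one option. The sketch below is illustrative only: `countLongTokens` and its `minRunes` parameter are hypothetical and not part of the PR.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// countLongTokens is a hypothetical variant of the counting loop
// (not the PR's code). It keeps only tokens with more than minRunes
// runes, measuring length in Unicode code points rather than bytes.
func countLongTokens(text string, minRunes int) map[string]int {
	freq := make(map[string]int)
	for _, tok := range strings.Fields(text) {
		tok = strings.ToLower(strings.Trim(tok, ".,;:!?\"'()[]{}"))
		if utf8.RuneCountInString(tok) > minRunes {
			freq[tok]++
		}
	}
	return freq
}
```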
Comment on lines +42 to +53
func TestCompile_AbbreviationsAreUnique(t *testing.T) {
	prompt := strings.Repeat("authentication configuration authorization ", 5)
	subs, err := codebook.Compile(prompt)
	require.NoError(t, err)
	seen := make(map[string]string)
	for orig, abbr := range subs {
		if prev, exists := seen[abbr]; exists {
			t.Errorf("abbreviation %q used for both %q and %q", abbr, prev, orig)
		}
		seen[abbr] = orig
	}
}
@pythondatascrape merged commit cd6f117 into main on May 6, 2026
7 of 8 checks passed
Development

Successfully merging this pull request may close these issues.

feat: prose-in codebook gateway — compile raw prompts to compressed codebook entries

2 participants