feat: comprehensive Argilla user and workspace management #78
Add `gsma argilla add-users` command to create multiple annotator accounts
for a workspace with predictable naming and password patterns.
Features:
- Creates users with pattern: {workspace}-user-{number}
- Generates passwords with pattern: {username}-gsma
- Exports credentials to CSV for easy distribution
- Validates workspace exists before creating users
- Skips existing users automatically
- Adds created users to the workspace
This enables easy provisioning of annotation accounts for domain experts,
workshops, or specific annotation campaigns.
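The naming and export scheme described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: `plan_usernames` and `credentials_csv` are hypothetical helpers, the CSV column names and the numbering starting at 1 are assumptions, and "skips existing users" is read here as leaving already-taken names out of the creation list.

```python
import csv
import io


def plan_usernames(workspace: str, count: int, existing: set) -> list:
    """Return the {workspace}-user-{number} names that still need creating."""
    candidates = [f"{workspace}-user-{n}" for n in range(1, count + 1)]
    # Assumption: names that already exist are skipped, not renumbered.
    return [name for name in candidates if name not in existing]


def credentials_csv(rows: list) -> str:
    """Serialise (username, password) pairs to CSV text (column names assumed)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["username", "password"])
    writer.writerows(rows)
    return buffer.getvalue()


to_create = plan_usernames("tsg-wg", 3, existing={"tsg-wg-user-2"})
# Password pattern from this commit: {username}-gsma
rows = [(name, f"{name}-gsma") for name in to_create]
print(credentials_csv(rows))
```

Keeping the planning step separate from the API calls makes the skip logic easy to test without a running Argilla server.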
Add three new Argilla CLI commands for workspace administration:
1. `list-users`: Export workspace users with credentials to CSV
   - Lists all users in a workspace
   - Prints to CLI with masked passwords (first 3 chars visible)
   - Exports full credentials to CSV file
   - Example: gsma argilla list-users -w mantis -o users.csv
2. `track-progress`: Track annotation progress per annotator
   - Retrieves detailed annotation statistics per user
   - Shows submitted/draft/discarded counts for completed and pending records
   - Supports single dataset or all datasets in workspace
   - Exports to CSV with totals and breakdowns
   - Example: gsma argilla track-progress -w tsg-wg -o progress.csv
3. `delete-user`: Delete a user from Argilla
   - Permanently removes user and their data
   - Requires confirmation unless --force flag used
   - Example: gsma argilla delete-user tsg-wg-user-1
All commands support environment variable configuration for API credentials
and include comprehensive error handling.
- Make output CSV file optional (--output-csv parameter)
- Print progress summary to console with formatted output
- Group progress by dataset for cleaner visualization
- Show detailed breakdown of submitted/draft/discarded counts per user
- Maintain CSV export functionality when output path provided
Example usage:
    # Print to console only
    gsma argilla track-progress -w mantis
    # Print to console and export to CSV
    gsma argilla track-progress -w mantis -o progress.csv
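The behaviour listed above (console summary grouped by dataset, CSV written only when a path is supplied) might look roughly like the sketch below. The record shape, the `summarize` helper, and the sample rows are all hypothetical; the real command obtains its counts from the Argilla API, which is not shown here.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Optional

STATUSES = ("submitted", "draft", "discarded")


def summarize(records: list, output_csv: Optional[Path] = None) -> list:
    """Group per-user counts by dataset; write a CSV only when a path is given."""
    by_dataset = defaultdict(list)
    for record in records:
        by_dataset[record["dataset"]].append(record)

    lines = []
    for dataset in sorted(by_dataset):
        lines.append(f"{dataset}:")
        for record in sorted(by_dataset[dataset], key=lambda r: r["user"]):
            counts = ", ".join(f"{s}={record[s]}" for s in STATUSES)
            total = sum(record[s] for s in STATUSES)
            lines.append(f"  {record['user']}: {counts}, total={total}")

    if output_csv is not None:
        # Optional export, mirroring the optional --output-csv parameter.
        with open(output_csv, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=["dataset", "user", *STATUSES])
            writer.writeheader()
            writer.writerows(records)
    return lines


# Made-up sample rows, for illustration only.
records = [
    {"dataset": "ds-b", "user": "mantis-user-1", "submitted": 4, "draft": 1, "discarded": 0},
    {"dataset": "ds-a", "user": "mantis-user-2", "submitted": 2, "draft": 0, "discarded": 3},
]
print("\n".join(summarize(records)))
```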
…s bar
- Replace logging with print() for cleaner console output
- Add tabulate table formatting for better readability
- Add tqdm progress bar to show dataset fetching progress
- Include fallback simple format if tabulate unavailable
- Show all annotation details (submitted/draft/discarded) in columns
This makes the command more user-friendly and less likely to feel like it's hanging during long-running API calls.
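One way to implement the "fallback simple format if tabulate unavailable" item is an import guard, sketched below. `format_table` is a hypothetical helper name and the fallback layout (left-justified, padded columns) is an assumption; only the `tabulate(rows, headers=...)` call comes from the tabulate library itself.

```python
def format_table(rows: list, headers: list) -> str:
    """Render rows with tabulate when installed, else as padded plain columns."""
    try:
        from tabulate import tabulate  # optional third-party dependency
    except ImportError:
        # Fallback: left-justify every column to the width of its widest cell.
        table = [headers] + [[str(cell) for cell in row] for row in rows]
        widths = [max(len(r[i]) for r in table) for i in range(len(headers))]
        return "\n".join(
            "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
            for r in table
        )
    return tabulate(rows, headers=headers)


print(format_table(
    [["mantis-user-1", 3, 1, 0]],
    ["user", "submitted", "draft", "discarded"],
))
```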
Add multiple new commands for better Argilla administration:
1. `add-user`: Create user and add to workspace(s)
   - Create new user with username, password, and role
   - Add to one or more workspaces (multiple -w flags)
   - Validates all workspaces exist before creation
   - Supports annotator, admin, and owner roles
2. `add-to-workspace`: Add existing user to workspace(s)
   - Add already-created user to additional workspaces
   - Supports multiple workspaces (multiple -w flags)
   - Skips if user already in workspace
   - Continues processing if workspace not found
3. `list-users`: List workspace users with credentials
   - Pretty-prints users in table format with masked passwords
   - Optional CSV export
   - Shows exact case-sensitive usernames
4. `list-workspaces`: List all workspaces
   - Displays all available workspaces in table format
   - Alphabetically sorted
   - Helps identify correct workspace names (case-sensitive)
5. `track-progress`: Track annotation progress
   - Optimized workspace dataset fetching (was 37s, now instant)
   - Pretty table output with tabulate
   - Optional CSV export
   - Shows detailed breakdown of annotations by status
6. `delete-user`: Delete a user from Argilla
   - Removes user system-wide
   - Requires confirmation unless --force
These commands properly separate user creation from workspace assignment, allowing users to belong to multiple workspaces as expected in Argilla.
Replace predictable password pattern ({username}-gsma) with cryptographically
secure random passwords generated using the secrets module.
Changes:
- Add generate_random_password() using secrets.choice() for secure randomness
- Default password length: 8 characters (alphanumeric)
- Passwords only accessible during user creation (saved to CSV)
- Update list-users to only show usernames (passwords are hashed and cannot be retrieved)
- Add helpful messages about password availability
Security improvement: Predictable passwords are a security risk. Random passwords
ensure that each user has a unique, unguessable credential.
Note: Users MUST save the CSV file from add-users command to share passwords
with annotators, as passwords cannot be retrieved after creation.
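A minimal sketch of the generator described above, assuming an alphabet of ASCII letters and digits. The function name `generate_random_password` and the default length of 8 come from the commit message; the signature, the `ALPHABET` constant, and the length check are assumptions.

```python
import secrets
import string

# Assumed alphabet: "alphanumeric" read as ASCII letters plus digits.
ALPHABET = string.ascii_letters + string.digits


def generate_random_password(length: int = 8) -> str:
    """Return a random alphanumeric password drawn with secrets.choice()."""
    if length < 1:
        raise ValueError("length must be positive")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


password = generate_random_password()
```

With 62 possible characters, an 8-character password has 62^8 (about 2.2 × 10^14) possible values, so callers who need a larger margin can pass a longer `length`.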
Add second-tier annotation field to capture specific problems with incorrect or partially correct answers.
Changes:
- Add MultiLabelQuestion "quality_issues" with options:
  - "Bad acronym expansion"
  - "Incorrect spec version"
  - "Other"
- Field is optional (not required)
- Update annotation guidelines to explain when to use the field
- Test dataset created: test_quality_issues_tsg_20251028_191645_eea7602
Testing:
- 20 TSG samples uploaded to mantis workspace
- Dataset URL: https://mantisnlp-annotate.hf.space/dataset/c4806f1b-02d6-48da-8e25-e138d28d5438/annotation-mode
This allows annotators to provide more granular feedback on problematic Q&A pairs without requiring conditional field logic (which Argilla doesn't support).
1. Add list-datasets command:
   - Lists all datasets in a specified workspace
   - Displays dataset names and IDs in table format
   - Usage: gsma argilla list-datasets -w mantis
2. Update quality issues options:
   - Replace "Other" with "Missing context"
   - Final options: "Bad acronym expansion", "Incorrect spec version", "Missing context"
   - More specific feedback categories for annotators
The list-datasets command helps users navigate workspaces and verify dataset uploads.
…mands
Changes:
- Added -w shorthand to upload, delete, download commands
- Changed upload-by-subgroup --working-group shorthand from -w to -g
- Added -w shorthand to upload-by-subgroup --workspace parameter
- Updated test to validate random passwords instead of predictable pattern
All --workspace parameters now consistently support -w shorthand for improved CLI usability. The --working-group parameter now uses -g to avoid conflicts.
Comprehensive PR Review - PR #78: Bulk User Creation for Argilla Workspaces
Overview
This PR adds extensive Argilla user and workspace management functionality with 8 new commands and comprehensive test coverage. The implementation is well-structured and follows the project's patterns.
🟢 Strengths
1. Excellent Security Practices
2. Robust Error Handling
3. User Experience
4. Code Quality
5. Feature Completeness
🟡 Issues & Concerns
1. Critical: Breaking Change Not Highlighted Enough
Location: Line 261 in argilla_cli.py

    working_group: str | None = typer.Option(
        None,
        "--working-group",
        "-g",  # Changed from -w
        ...
    )

Issue: The …
Recommendation:

2. Security: Unused Import in add-users Command
Location: Lines 769-770

    from gsma_dataset_creation.validation.argilla_subgroup_uploader import \
        create_valid_password

Issue: The …
Impact:
Recommendation: Remove the unused import on lines 769-770.

3. Inconsistency: Password Strategy Divergence
Issue: Two different password strategies exist:
Impact:
Recommendation:

4. Documentation Gap: Command Not Listed
Issue: The …
Recommendation: Add all new commands to the CLAUDE.md CLI Commands list with brief descriptions.

5. Potential Bug: list-users Password Reconstruction
Location: Lines 880-892 (docstring)

    """
    This command retrieves all users associated with a workspace and exports their
    usernames and reconstructed passwords (following the {username}-gsma pattern).
    """

Issue: The docstring claims to export "reconstructed passwords" but:
Impact: Misleading documentation could lead users to expect password recovery functionality that doesn't exist.
Recommendation: Update the docstring to remove the misleading "reconstructed passwords" claim.

6. Code Quality: Import Style Inconsistency
Location: Lines 74-75, 104-105, 152-153, etc.

    from gsma_dataset_creation.validation.argilla_uploader import \
        upload_dataset_to_argilla

Issue: Most imports use backslash line continuation, but Python prefers implicit line continuation within parentheses.
Recommendation: Consider using:

    from gsma_dataset_creation.validation.argilla_uploader import (
        upload_dataset_to_argilla
    )

This is minor but follows PEP 8 best practices.

7. Performance: track-progress Could Be Slow
Location: Lines 1078-1110

    for dataset in datasets_to_track:
        progress = dataset.progress(with_users_distribution=True)

Issue: The …
Recommendation:
🔵 Suggestions for Future Enhancement
1. Password Length Configuration
The …

    password_length: int = typer.Option(8, "--password-length", help="Length of generated passwords (minimum 8)")

2. Bulk User Import from CSV
Consider adding a command to import users from CSV (reverse of export), useful for pre-planned annotation campaigns.
3. User Role Management
The …
4. Progress Export Format
The track-progress CSV export is excellent. Consider adding JSON export option for programmatic consumption.

🟢 Test Coverage Assessment
The test file (…
✅ Well-covered scenarios:
Recommendation: Add tests for the other 7 commands to maintain consistency with project's TDD approach.

📋 Action Items Summary
Must Fix Before Merge:
Should Fix Before Merge:
Nice to Have:

✅ Final Verdict
Status: Approve with Required Changes
This is a high-quality PR that adds valuable functionality with good security practices. The core implementation is solid, but there are 4 critical issues that must be addressed before merge:
Once these issues are resolved, this PR will be ready to merge. The additional test coverage can be added in a follow-up PR if needed.
Great work overall! The comprehensive command suite and attention to UX details (emojis, table formatting, CSV exports) show thoughtful design. 🎉
Changes:
- Remove unused create_valid_password import from add-users command
- Fix list-users docstring to clarify passwords cannot be retrieved
- Update CLAUDE.md to document all new CLI commands
- Update password documentation to reflect random secure passwords
- Add usage examples for add-user command
Addresses all documentation gaps and misleading claims identified in PR review.
Added comprehensive documentation for all Argilla user and workspace management commands from PR #78:
- User creation commands (add-users, add-user)
- Workspace management (add-to-workspace, list-workspaces, list-datasets)
- Monitoring (track-progress, list-users)
- Cleanup (delete-user)
Includes usage examples for common workflows like bulk user creation and multi-workspace user management.
Resolved CLAUDE.md delete/modify conflict by keeping deletion (file renamed to AGENTS.md in this branch).
Brings in new features from main:
- Argilla user/workspace management commands (PR #78)
- Updated README with simplified overview
- Quality issues field in annotations
- Test infrastructure for CLI commands
* feat: create consolidated PRD pipeline
Created unified pipelines/prd/dvc.yaml consolidating 5 separate pipelines (chunker, questions, similarity, filters, validation) into a single end-to-end pipeline.
Pipeline structure (15 stages; numbered 1-23 below because each of the two foreach stages is expanded into its 5 instances):
- Stage 1: process_documents (DOCX → Markdown)
- Stages 2-6: create_late_chunks (5 foreach: 500-4000 tokens)
- Stages 7-11: generate_questions (5 foreach: 5-40 questions per chunk)
- Stage 12: data_combiner (merge chunks + questions)
- Stage 13: similarity_hasher (SHA-256 hashes)
- Stage 14: similarity_ranker (FAISS IVFFlat top-K)
- Stage 15: overlap_detector (character offset overlaps)
- Stage 16: explode_questions (question-centric format)
- Stage 17: apply_question_filter (external reference classifier)
- Stage 18: apply_chunk_filter (procedures + keyword exclusion)
- Stage 19: filter_questions_by_chunk_quality (combined filtering)
- Stage 20: validate_requests (LLM validation with Qwen 235B)
- Stage 21: create_validation_dataset (dual format: embedding + QA)
- Stage 22: upload_embedding_dataset (HuggingFace Hub)
- Stage 23: upload_qa_dataset (HuggingFace Hub)
Configuration:
- Variables: data_prefix=data/prd, metrics_prefix=metrics/prd
- Min-similarity-score: 0.35 (validation pipeline setting)
- Question counts: 5/10/20/30/40 per chunk size
- Cerebras provider for question generation and validation
- Keyword filter: --exclude-matches 'prd@gsma.com'
Data migrated to data/prd/, metrics to metrics/prd/. Used dvc commit --force to register existing outputs, avoiding re-execution of expensive stages (chunking, questions, similarity).
* docs: rename CLAUDE.md to AGENTS.md and shorten documentation
Renamed CLAUDE.md → AGENTS.md and substantially shortened it (725 → 198 lines, 73% reduction).
Changes:
- Consolidated structure, removed duplicate sections
- Removed verbose API signatures and detailed breakdowns
- Focused on actionable info for AI agents
- Kept essential content: architecture, pipelines, CLI commands, env vars
Updated for consolidated PRD pipeline:
- Documented 15-stage unified pipeline structure
- Added data/prd and metrics/prd paths
- Listed deprecated pipelines (chunker, questions, similarity, filters, validation)
- Added pipeline consolidation to recent changes
This file serves as the project's living memory for AI agents.
* chore: remove deprecated pipeline directories
Removed 6 deprecated pipeline directories that have been consolidated into pipelines/prd/dvc.yaml:
Removed:
- pipelines/chunker/ → stages 1-2 in PRD pipeline (process + chunk)
- pipelines/questions/ → stage 3 in PRD pipeline (generate questions)
- pipelines/similarity/ → stages 4-7 in PRD pipeline (combine, hash, rank, overlap)
- pipelines/filters/ → stages 9-11 in PRD pipeline (question/chunk filters)
- pipelines/validation/ → stages 8, 12-15 in PRD pipeline (explode, validate, dataset)
- pipelines/datasets/ → legacy question-based dataset creation (superseded)
Remaining pipelines:
- pipelines/prd/ - Consolidated PRD pipeline (primary)
- pipelines/discover/ - Discover document pipeline
- pipelines/annotation/ - Human annotation workflow
* fix: update DVC cache metadata for data and model files
Recalculated MD5 hashes using 'dvc add' to fix cache mismatches:
- data/working_groups_mapping.json
- data/raw
- data/raw2
- data/raw3
- models/filters/chunk-filter-run-5000-2025-10-08_19-03-29
- models/filters/question-filter-run-5000-2025-10-08_22-47-46
This resolves 'not in cache' warnings for files that exist on disk but had outdated .dvc metadata.
* fix: unfreeze discover pipeline and update lock file
Unfroze all 22 discover pipeline stages (frozen: true → frozen: false). Updated dvc.lock with current code dependency hashes using 'dvc commit --force'.
Stages no longer need to be frozen since:
- Lock file now reflects current code state (cli.py, deduplicator.py, filters_cli.py)
- All outputs are properly registered in cache
- No 'not in cache' warnings remain
This allows DVC to properly track dependencies and only re-run stages when actual changes occur, rather than keeping everything permanently frozen.
* fix: register explode_questions stage with existing output
Used existing questions_with_candidates.parquet from GSMA-classifier cache (md5: 5c17bfdba81cc86d4289e8d8e33831c3, 214MB) to preserve data continuity with downstream stages.
Rationale: The explode_questions code has changed since this file was created. Re-running would produce different output and break compatibility with existing downstream filter/validation stages that depend on this data.
Created symlink to cached file and force-committed stage to lock file. Stages 1-8 now registered. Stages 9-15 (filtering, validation, dataset creation) were never run for PRD data and need to execute fresh.
* docs: add Argilla CLI commands to AGENTS.md
Added comprehensive documentation for all Argilla user and workspace management commands from PR #78:
- User creation commands (add-users, add-user)
- Workspace management (add-to-workspace, list-workspaces, list-datasets)
- Monitoring (track-progress, list-users)
- Cleanup (delete-user)
Includes usage examples for common workflows like bulk user creation and multi-workspace user management.
* docs: expand README with detailed pipeline structure
Expanded Data Structure and Pipeline Stages sections to provide comprehensive overview of the consolidated PRD pipeline:
Data Structure:
- Added detailed directory structure for prd/ and discover/ outputs
- Documented all intermediate stages (chunks, questions, similarity, etc.)
- Clarified data flow through pipeline stages
Pipeline Stages:
- Expanded from 2 stages to complete 15-stage PRD pipeline breakdown
- Added Discover and Annotation pipeline summaries
- Included technical details (chunk sizes, models, thresholds)
- Documented outputs and HuggingFace Hub datasets
This provides better onboarding for new developers and clearer understanding of the consolidated pipeline architecture.
* docs: expand CLI commands section with comprehensive command reference
Added complete CLI command documentation covering all pipeline stages:
- Document Processing: process, deduplicate, chunk
- Question Generation: generate-from-chunks, combine-questions
- Similarity Analysis: combine, hash, rank, detect-overlaps
- Quality Filtering: chunk filter, question filter, combined filtering
- Validation: explode-questions, validate-requests
- Dataset Creation: create-from-validation, upload to HuggingFace
- Argilla Management: upload, user/workspace management, progress tracking
- Subgroup Classification: add-subgroup-to-dataset
Each section includes practical examples with common options and flags. This provides a quick reference for all available pipeline operations.
* docs: add comprehensive overview paragraph
Added detailed overview paragraph explaining:
- Purpose: Transform GSMA documents into synthetic Q&A datasets for telecom LLMs
- Pipeline stages: document conversion, chunking, Q&A generation, similarity, filtering, validation
- Output formats: Contrastive learning (embeddings) and Q&A (RAG)
- Three main pipelines: PRD, Discover, and Annotation
This provides immediate context for new developers and stakeholders about what the repository does and its key components.
* fix: use uv sync for development mode installation
Changed from 'uv pip install -e .' to 'uv sync' which is the correct uv command for installing dependencies and the project in development mode.
* docs: Update README.md
Summary
Add comprehensive Argilla user and workspace management commands.
New Commands
- `add-users`: Bulk create users with random secure passwords
- `add-user`: Create single user with multi-workspace support
- `add-to-workspace`: Add existing user to multiple workspaces
- `list-users`: Show workspace users (usernames only)
- `list-workspaces`: Show all available workspaces
- `list-datasets`: Show datasets in a workspace
- `track-progress`: Monitor annotation progress with optimized performance
- `delete-user`: Remove user from Argilla
Usage Examples
Breaking Changes
- `upload-by-subgroup`: `--working-group` shorthand changed from `-w` to `-g`