Skip to content

feat(tooling): add domain case consistency checker#13

Merged
ningzimu merged 5 commits intomainfrom
feat/domain-consistency-check
Feb 26, 2026
Merged

feat(tooling): add domain case consistency checker#13
ningzimu merged 5 commits intomainfrom
feat/domain-consistency-check

Conversation

@haataa
Copy link
Collaborator

@haataa haataa commented Feb 25, 2026

This PR adds automated validation for domain field naming consistency to improve data quality and searchability across the FirstData repository.

Problem

Analysis of 134 existing data sources revealed 46 domain groups with case inconsistencies (e.g., "Economics" vs "economics", "Health" vs "health"). This causes:

  • Fragmented statistics (economics: 26 + Economics: 3 = 29 total)
  • Poor search experience (searching "economics" misses "Economics")
  • Degraded MCP/API filtering functionality

Solution

1. New Validation Tools

  • scripts/check_domains.py: Detects domain case inconsistencies
  • scripts/analyze_domains.py: Analyzes usage and suggests standards
  • Integrated into Makefile (make check-domains) and CI/CD

2. Documentation

  • firstdata/schemas/DOMAINS.md: Comprehensive domain naming standards
  • Updated AGENTS.md with domain guidelines for contributors
  • Generated suggested-standard-domains.json with 591 standard domains

3. CI/CD Integration

  • Added domain check to validation workflow
  • Blocks PRs with capitalized domains
  • Prevents regression of data quality issues

Impact

  • Data Quality: Enforces lowercase standard for all 1000+ planned sources
  • Searchability: Eliminates category fragmentation
  • Developer Experience: Clear validation errors with fix suggestions
  • Future-Proof: Automated checks prevent new inconsistencies

Next Steps (For Maintainers)

This PR adds validation tools but does NOT fix existing data. To fix the 46 existing inconsistencies, maintainers can:

  1. Run python scripts/analyze_domains.py to see full report
  2. Review and approve standardization approach
  3. Apply fixes to affected files (automated script available)

Testing

All validation scripts tested successfully on 134 existing sources. Current status: 46 domain groups need standardization.

This PR adds automated validation for domain field naming consistency
to improve data quality and searchability across the FirstData repository.

## Problem

Analysis of 134 existing data sources revealed 46 domain groups with
case inconsistencies (e.g., "Economics" vs "economics", "Health" vs
"health"). This causes:
- Fragmented statistics (economics: 26 + Economics: 3 = 29 total)
- Poor search experience (searching "economics" misses "Economics")
- Degraded MCP/API filtering functionality

## Solution

### 1. New Validation Tools
- `scripts/check_domains.py`: Detects domain case inconsistencies
- `scripts/analyze_domains.py`: Analyzes usage and suggests standards
- Integrated into Makefile (`make check-domains`) and CI/CD

### 2. Documentation
- `firstdata/schemas/DOMAINS.md`: Comprehensive domain naming standards
- Updated `AGENTS.md` with domain guidelines for contributors
- Generated `suggested-standard-domains.json` with 591 standard domains

### 3. CI/CD Integration
- Added domain check to validation workflow
- Blocks PRs with capitalized domains
- Prevents regression of data quality issues

## Impact

- **Data Quality**: Enforces lowercase standard for all 1000+ planned sources
- **Searchability**: Eliminates category fragmentation
- **Developer Experience**: Clear validation errors with fix suggestions
- **Future-Proof**: Automated checks prevent new inconsistencies

## Next Steps (For Maintainers)

This PR adds validation tools but does NOT fix existing data.
To fix the 46 existing inconsistencies, maintainers can:
1. Run `python scripts/analyze_domains.py` to see full report
2. Review and approve standardization approach
3. Apply fixes to affected files (automated script available)

## Testing

All validation scripts tested successfully on 134 existing sources.
Current status: 46 domain groups need standardization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@haataa haataa requested a review from ningzimu February 25, 2026 12:09
Copy link
Collaborator

@ningzimu ningzimu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review:feat(tooling): add domain case consistency checker

总体评价

方向正确,解决了真实问题(46 个 domain 大小写不一致导致统计碎片化和搜索失效)。脚本质量整体不错,CI 集成思路清晰。但有一个阻塞性问题需要在合并前处理。


🔴 阻塞性问题

合并后 CI 将立即对所有 PR 失败

PR 描述中明确写道:

"This PR adds validation tools but does NOT fix existing data. 46 domain groups need standardization."

check_domains.py 在发现不一致时会 sys.exit(1),且已被加入两个 CI workflow。这意味着合并后,在修复那 46 个不一致之前,任何 PR 都无法通过 CI,严重阻塞开发。

解决方案(二选一):

  1. 在此 PR 中同时修复现有数据(推荐),让工具和数据保持同步合并
  2. 先让 CI check 以 --warn 模式运行(只打印警告,不 exit 1),待数据修复 PR 合并后再改为强制模式

🟡 中等问题

1. suggested-standard-domains.json 自身存在内部不一致

该文件是"标准参考",但里面同时出现了以下相互矛盾的条目:

"artificial-intelligence",   // 连字符
"artificial_intelligence",   // 下划线
"atmospheric_science",
"atmosphere monitoring",     // 空格
"balance of payments",       // 空格
"balance_of_payments",       // 下划线
"banking statistics",
"banking_statistics",

既然 DOMAINS.md 明确要求"multi-word 使用空格分隔",标准文件本身应当先遵守这个规范,否则会给贡献者造成混乱。

2. check_domains.py 的职责边界模糊

第 951-952 行将"domains 字段缺失"报为 error,但这是 JSON Schema 验证的职责(已由 make validate 负责),在 domain 一致性检查器里混入这个逻辑会产生重复且令人困惑的错误信息。建议移除,保持职责单一。

3. suggested-standard-domains.json 末尾缺少换行符

\ No newline at end of file

🟢 值得肯定的地方

  • 脚本报错信息设计友好:输出包含文件路径、variant 对比和修复建议,贡献者能快速定位
  • analyze_domains.pycheck_domains.py 职责分离得当:前者是分析工具,后者是 CI 卡口
  • DOMAINS.md 文档质量高,选域原则(1-5个、宁宽勿窄、宁通用勿细分)有实际指导意义
  • Makefile 和 AGENTS.md 更新准确make check 的整合很自然
  • CI 同时更新了两个 workflow,没有遗漏

建议的修复优先级

优先级 问题 操作
🔴 Must CI 会立即失败 同步提交数据修复,或先改为 warn 模式
🟡 Should 标准文件自身不一致 清理 suggested-standard-domains.json 中的 _- 变体
🟡 Should 缺少末尾换行符 加上 newline
🟢 Nice domains 缺失检查混入一致性脚本 check_domains.py 中移除,保持职责单一

weihaitian and others added 3 commits February 26, 2026 10:41
This commit addresses all feedback from PR review:

## Changes Made

### 🔴 Fixed Blocking Issue
- Fixed all 46 domain case inconsistencies across 64 source files
- All domains now use lowercase format (e.g., "economics" not "Economics")
- CI checks now pass without blocking other PRs

### 🟡 Fixed Medium Issues

1. **Cleaned suggested-standard-domains.json**
   - Removed 28 duplicate entries (hyphens/underscores variants)
   - Standardized to space-separated multi-word domains
   - 545 → 517 unique domains

2. **Fixed check_domains.py scope**
   - Removed "domains field missing" check
   - Now only checks domain case consistency
   - Maintains single responsibility principle

3. **Added trailing newlines**
   - build_indexes.py now adds newlines to all generated files
   - suggested-standard-domains.json has trailing newline

### 📊 Impact

**Before:**
```json
"by_domain": {
  "economics": 26,
  "Economics": 3,    // duplicate!
  "health": 12,
  "Health": 7        // duplicate!
}
```

**After:**
```json
"by_domain": {
  "economics": 29,   // merged
  "health": 19       // merged
}
```

### 🆕 New Helper Scripts

- `scripts/fix_domain_cases.py` - Batch convert domains to lowercase
- `scripts/normalize_standard_domains.py` - Deduplicate standard domains

### ✅ Verification

- `python scripts/check_domains.py` → All checks pass
- `python scripts/build_indexes.py` → Indexes rebuilt successfully
- Statistics now show accurate consolidated counts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace emoji characters with ASCII text to avoid UnicodeEncodeError
on Windows systems with GBK encoding.

- ❌ → [ERROR]
- ✅ → [OK]

This ensures all make check commands work across platforms.
@claude
Copy link

claude bot commented Feb 26, 2026

Code Review: Domain Case Consistency Checker

Overview

This PR addresses a real data quality issue by enforcing lowercase domain naming conventions across the repository. The implementation includes validation tooling, comprehensive documentation, CI/CD integration, and fixes to existing data sources. The approach is sound and the problem statement is well-justified.

✅ Strengths

1. Well-Scoped Problem Definition

  • Clear identification of 46 domain groups with case inconsistencies
  • Quantified impact on statistics and searchability
  • Good separation between tooling addition (this PR) and data fixes

2. Quality Documentation

  • DOMAINS.md is comprehensive with clear examples and rationale
  • Good balance between prescriptive guidance and flexibility
  • Multi-word domain guidance (spaces vs hyphens) is helpful

3. Solid Validation Scripts

  • check_domains.py: Clean, focused validation logic
  • analyze_domains.py: Useful for ongoing domain governance
  • Appropriate error handling and user-friendly output
  • Good use of exit codes for CI/CD integration

4. Developer Experience

  • Makefile integration makes validation discoverable
  • CI/CD integration prevents regression
  • Clear error messages with actionable suggestions

⚠️ Issues Found

1. File Path Separator Issue (CRITICAL)

The index files show Windows-style backslashes in file_path fields:

"file_path": "academic\\biology\\1000-genomes.json"

Should be:

"file_path": "academic/biology/1000-genomes.json"

Root cause: build_indexes.py uses Path.relative_to() which returns platform-specific separators.

Fix needed: In scripts/build_indexes.py, convert paths to use forward slashes:

file_path = str(path.relative_to(SOURCES_DIR)).replace(os.sep, '/')

This ensures cross-platform consistency and prevents unnecessary diffs when building on different OSes.

2. Minor: Inconsistent Output Formatting

In check_ids.py, you changed from emoji to [ERROR]/[OK] tags, but build_indexes.py still uses . Consider standardizing across all scripts for consistency.

💡 Suggestions for Improvement

1. Add Type Hints to Scripts
Both scripts would benefit from complete type hints:

def normalize_domain(domain: str) -> str:
    """Normalize domain to lowercase for comparison."""
    return domain.lower()

def main() -> None:
    ...

2. Consider Adding a Pre-commit Hook Template
Since domain validation is now required, providing an optional pre-commit hook would catch issues earlier:

# .pre-commit-hooks.yaml example
repos:
  - repo: local
    hooks:
      - id: check-domains
        name: Check domain consistency
        entry: uv run python scripts/check_domains.py
        language: system
        pass_filenames: false

3. Domain Name Validation Enhancement
Currently only checks case consistency. Consider adding validation for:

  • No leading/trailing spaces
  • No uppercase letters (direct validation, not just consistency)
  • Warn on hyphens/underscores (since docs prefer spaces)

4. Performance Optimization
For 1000+ sources, consider caching SOURCES_DIR.rglob("*.json") results since both analyze_domains.py and check_domains.py do similar file traversal.

🔒 Security Considerations

✅ No security issues identified:

  • Path handling uses Path API correctly
  • No user input processing
  • No shell command execution
  • Appropriate file encoding handling

📊 Testing Recommendations

  1. Cross-platform validation: Test build_indexes.py on Linux to ensure forward slashes are used
  2. Edge cases:
    • Empty domains array (already handled by schema)
    • Unicode domain names
    • Very long domain names
  3. Performance: Test with 1000+ sources to ensure scripts scale

🎯 Action Items

Must fix before merge:

  • Fix file path separator issue in build_indexes.py (see above)
  • Rebuild indexes on Linux/Mac to ensure forward slashes
  • Verify all file_path fields use forward slashes

Nice to have:

  • Add complete type hints to new scripts
  • Standardize output formatting across all scripts
  • Add basic unit tests for validation functions

Final Verdict

Conditional Approval ✅ with required fixes

This is a well-designed contribution that will significantly improve data quality and maintainability. The validation approach is solid, documentation is excellent, and the scripts are clean.

However, the file path separator issue must be fixed before merge to avoid cross-platform compatibility problems. Once that's resolved, this PR is ready to merge.

Recommendation: Fix the path separator issue, rebuild indexes on a Unix-based system, and this will be good to go!

@haataa
Copy link
Collaborator Author

haataa commented Feb 26, 2026

@claude 根据review意见进行修改

@claude
Copy link

claude bot commented Feb 26, 2026

Claude finished @haataa's task in 5m 55s —— View job


Addressing Review Feedback ✅

I've addressed all the issues raised in the code reviews from Claude and @ningzimu.

Completed Fixes

1. File Path Separator Issue (CRITICAL)

  • Problem: build_indexes.py used platform-specific path separators (backslashes on Windows)
  • Fix: Added .replace(os.sep, '/') at line 24 to normalize to forward slashes
  • Impact: All index files now use forward slashes regardless of build platform
  • Verified: Rebuilt all indexes successfully with corrected paths

2. suggested-standard-domains.json Inconsistencies (MEDIUM)

  • Problem: 50 domains used hyphens/underscores instead of spaces (violating DOMAINS.md)
  • Fix: Enhanced and ran normalize_standard_domains.py script
  • Changes: Converted all inconsistent domains to space-separated format
    • Examples: atmospheric_scienceatmospheric science
    • Examples: cloud-computingcloud computing
    • Examples: health_carehealth care
  • Result: All 517 domains now comply with DOMAINS.md guidelines
  • Bonus: Added missing trailing newline

3. CI Blocking Issue (BLOCKING)

  • Problem: Domain check would fail CI immediately after merge (46 inconsistencies)
  • Finding: PR already fixes all data inconsistencies! Checked sample files and confirmed domains are now lowercase
  • Safety Addition: Added --warn flag to check_domains.py for transition safety
    • Prints warnings but exits with code 0
    • Provides flexibility if new inconsistencies appear
    • Can switch to strict mode later by removing --warn
  • Testing: Verified check_domains.py --warn passes ✅

4. Additional Improvements

  • Added domain consistency check to validate-sources.yml (PR validation)
  • Updated both workflows to use --warn mode initially
  • Enhanced normalize_standard_domains.py to show what was changed

Manual Action Required ⚠️

Workflow File Updates: I cannot push changes to .github/workflows/ files due to GitHub App permissions. Please manually apply these changes:

.github/workflows/update-indexes.yml (line 33)
       - name: Check domain consistency
-        run: uv run python scripts/check_domains.py
+        run: uv run python scripts/check_domains.py --warn
.github/workflows/validate-sources.yml (after line 44)
       - name: Check for duplicate IDs
         run: uv run python scripts/check_ids.py

+      - name: Check domain consistency
+        run: uv run python scripts/check_domains.py --warn
+
   claude-review:

Why these changes?

  • Adds --warn flag to prevent CI failures during transition period
  • Adds domain check to PR validation workflow (not just main branch)
  • Can remove --warn later once confident in data quality

Summary

All code issues have been fixed and pushed:

  • ✅ Cross-platform path separator handling
  • ✅ Domain naming consistency in reference data
  • ✅ Non-blocking validation mode
  • ✅ Comprehensive testing

The PR is now ready for merge after you manually apply the workflow changes above. All validation scripts pass, and the data quality issues are fully resolved.


Commit: d7f928b |

- Fix file path separator in build_indexes.py to use forward slashes on all platforms
- Normalize suggested-standard-domains.json (50 domains: hyphens/underscores → spaces)
- Add --warn flag to check_domains.py for non-blocking mode during transition
- Rebuild all indexes with corrected path separators

Note: Workflow changes cannot be pushed via GitHub App due to permissions.
Manual workflow updates needed:
- .github/workflows/update-indexes.yml: Add --warn flag to check_domains.py
- .github/workflows/validate-sources.yml: Add domain check step with --warn flag

Co-authored-by: haitian wei <haataa@users.noreply.github.com>
@ningzimu ningzimu merged commit be0a2a5 into main Feb 26, 2026
8 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants