diff --git a/doc/ANALYZER_RULES.md b/doc/ANALYZER_RULES.md
new file mode 100644
index 00000000..4cffdde5
--- /dev/null
+++ b/doc/ANALYZER_RULES.md
@@ -0,0 +1,73 @@
+# Coding Rules for migrate-confluence-rules
+
+## 1. Analyzer Processor Pattern
+
+Each XML entity processed by the Analyzer requires:
+
+### File Naming
+- Location: `src/Analyzer/Processor/`
+- Pattern: `{EntityName}.php` (PascalCase, typically singular)
+- Examples: `Page.php`, `BlogPost.php`, `Users.php`, `Comments.php`
+
+### Class Convention
+- Implements: `IAnalyzerProcessor`
+- Extends: `ProcessorBase`
+- Name: `{EntityName}` class in namespace `HalloWelt\MigrateConfluence\Analyzer\Processor`
+
+### Database Table Requirement
+Each processor must have corresponding table(s) in WorkspaceDB:
+- Primary table: `snake_case` plural form (e.g., `pages`, `blog_posts`)
+- Meta/auxiliary tables: `{primary}_meta`, `{primary}_additional`, etc.
+- Registration: Must be added to `WorkspaceDB::createTables()` and the `$allowedTables` whitelist
+
+## 2. WorkspaceDB Table Registration
+
+For any new processor, follow this checklist:
+
+1. Define the table schema in a `WorkspaceDB::createTableXxx()` method
+2. Add the table name to the `$allowedTables` array in `getAllData()`
+3. Register the creation call in the `createTables()` method
+4. Add indexes in `createIndexes()` if performance-critical
+5. Add an export method to the JSON export chain
+6. Create an add method: `add{EntityName}()` (e.g., `addPage()`, `addBlogPost()`, `addAttachment()`)
+   - Method signature: `public function add{EntityName}( ... ): void`
+   - Inserts a single object record into the corresponding table
+   - Example: `WorkspaceDB::addPage(...)` inserts into the `pages` table
+
+## 3. Filename Conventions
+
+| Component | Location | Pattern | Example |
+|-----------|----------|---------|---------|
+| Processor | `src/Analyzer/Processor/` | `{Entity}.php` | `Page.php` |
+| Composer Processor | `src/Composer/Processor/` | `{Entity}.php` | `Pages.php` |
+| Converter | `src/Converter/Processor/` | `{Operation}Macro.php` | `CodeMacro.php` |
+| Postprocessor | `src/Converter/Postprocessor/` | `{Fix/Operation}.php` | `FixLineBreaks.php` |
+| Preprocessor | `src/Converter/Preprocessor/` | Domain-specific | `HtmlPreprocessor.php` |
+
+## 4. Wiki Title Conventions
+
+- Wiki titles must be created using `HalloWelt\MigrateConfluence\Utility\TitleBuilder` or `HalloWelt\MediaWiki\Lib\Migration\TitleBuilder`
+
+## 5. Database Relationships
+
+Current entities and their tables:
+- **Spaces** → `spaces`, `spaces_descriptions`
+- **Pages** → `pages`, `pages_meta`
+- **Blog Posts** → `blog_posts`, `blog_posts_meta`
+- **Body Contents** → `body_contents`, `body_contents_bodies`
+- **Attachments** → `attachments`, `attachments_meta`, `page_attachments`, `additional_attachments`
+- **Users** → `users`
+- **Comments** → `comments`
+- **Labels** → `labels`, `labellings`
+- **Content Properties** → `content_properties`
+- **Gliffy** → `gliffy`
+- **PageTemplates** → `page_templates`, `page_template_contents`
+
+## 6. Adding a New Processor
+
+Steps:
+1. Create `src/Analyzer/Processor/{Entity}.php` extending `ProcessorBase`
+2. Add table creation to `WorkspaceDB`
+3. Register in `ConfluenceAnalyzer::processXML()`
+4. Create a corresponding Composer processor if needed
+5. Create a Converter processor if transformation is required
diff --git a/doc/COMPOSER_RULES.md b/doc/COMPOSER_RULES.md
new file mode 100644
index 00000000..027b0fb3
--- /dev/null
+++ b/doc/COMPOSER_RULES.md
@@ -0,0 +1,128 @@
+# Coding Rules for Composer Component
+
+The Composer assembles converted WikiText content and resources into an XML format that MediaWiki can import.
+
+## 1. Processor Pattern
+
+Composer processors build specific parts of the final MediaWiki XML.
+
+### File Naming & Location
+- Location: `src/Composer/Processor/{Entity}.php`
+- Pattern: Plural entity names (Pages, Files, Comments)
+- Examples: `Pages.php`, `Comments.php`, `Files.php`
+
+### Class Convention
+- Implements: `IConfluenceComposerProcessor`
+- Extends: `ProcessorBase`
+- Namespace: `HalloWelt\MigrateConfluence\Composer\Processor`
+- Method to implement: `process( Builder $builder, ... ): void`
+
+### Processor Responsibilities
+- Read converted data from workspace files
+- Read metadata from `WorkspaceDB`
+- Build XML elements using the `Builder` class
+- Add pages, files, or metadata to the MediaWiki XML output
+
+## 2. Processor Methods
+
+### Standard Methods in ProcessorBase
+- `__construct()`: Accept `Builder`, `DBComposerDataLookup`, `Workspace`, `Output`, etc.
+- `process()`: Main entry point for building XML elements
+- `getName()`: Return processor identifier string
+
+### Post-Processor File Naming & Location
+- Location: `src/Composer/Processor/{Name}ContentPostProcessor.php`
+- Example: `TemplateContentPostProcessor.php`
+
+### Post-Processor Class Convention
+- Implements: `IPageContentPostProcessor`
+- Namespace: `HalloWelt\MigrateConfluence\Composer\Processor`
+- Method to implement: `process( string $pageId, string $pageTitle, string $content ): string`
+
+### Post-Processor Responsibilities
+- Accept page content as a WikiText string and return the processed WikiText
+
+## 3. Processor Registration
+
+All processors must be registered in `ConfluenceComposer::buildXML()`:
+
+1. **Create processor instance** with required dependencies:
+   - `Builder` instance
+   - `DBComposerDataLookup` for data access
+   - `Workspace` for file access
+   - `Output` for progress reporting
+   - `MigrationConfig` for settings
+
+2. **Call processor** in appropriate order:
+   - Files: typically first (attachments, images)
+   - Pages: main content
+   - Comments: page comments
+   - Post-processors: applied per-page during processing
+
+### Example Registration Pattern
+```php
+$processors = [
+	new Files(
+		$builder, $composerDataLookup, $this->workspace,
+		$this->output, $this->dest, $this->migrationConfig,
+		$deploymentInfo
+	),
+	new Pages(
+		$builder, $composerDataLookup, $this->workspace,
+		$this->output, $this->dest, $this->migrationConfig,
+		$deploymentInfo
+	),
+];
+```
+
+## 4. Data Lookup Pattern
+
+### DBComposerDataLookup
+- Provides convenient access to composed data from the database
+- Methods like `getPageData()`, `getAttachmentData()`, etc.
+- Filters and caches results for performance
+
+## 5. Builder Integration
+
+### Required Data for Builder
+- **Pages**: title, content, timestamp, author, page_id
+- **Files**: filename, content (binary), description, upload_date
+
+## 6. Progress Reporting
+
+### Output Integration
+- Use `$this->output->writeln()` for progress messages
+- Report processing status per entity type
+- Indicate progress: "Processing 250/1000 pages..."
+
+### Logging
+- Use `DBLog` for errors or warnings
+- Log skipped items and reasons
+- Log final statistics
+
+## 7. Configuration & Deployment Info
+
+### MigrationConfig Usage
+- Access namespaces configuration
+- Access file extension whitelist
+- Access custom replacements or mappings
+- Passed to constructor, stored as instance variable
+
+### ComposerDeploymentInfo
+- Stores deployment-specific information
+- Passed to all processors for consistency
+- Used for namespace and prefix mapping
+
+## 8. Adding a New Processor
+
+Steps to add a new Composer Processor:
+
+1. Create `src/Composer/Processor/{Entity}.php`
+2. Implement `IConfluenceComposerProcessor` and extend `ProcessorBase`
+3. Implement the `process()` method:
+   - Accept `Builder` and required data sources
+   - Read from workspace/database as needed
+   - Call appropriate `Builder` methods
+4. Register in `ConfluenceComposer::buildXML()`
+5. Add appropriate data lookup methods to `DBComposerDataLookup` if needed
+6. Test end-to-end XML output
diff --git a/doc/CONVERTER_RULES.md b/doc/CONVERTER_RULES.md
new file mode 100644
index 00000000..d7f6ddb0
--- /dev/null
+++ b/doc/CONVERTER_RULES.md
@@ -0,0 +1,108 @@
+# Coding Rules for Converter Component
+
+The Converter transforms Confluence Storage XML content into MediaWiki WikiText format. It processes DOM documents through processors, preprocessors, and postprocessors.
+
+## 1. Processor Pattern
+
+Converter processors handle transformation of specific Confluence elements or macros.
+
+### File Naming & Location
+- **Macro Processors**: `src/Converter/Processor/{MacroName}Macro.php`
+  - Examples: `CodeMacro.php`, `TocMacro.php`, `PanelMacro.php`
+- **Content Processors**: `src/Converter/Processor/{ElementType}.php`
+  - Examples: `Image.php`, `PageLink.php`, `UserLink.php`, `Emoticon.php`
+- **Base Classes**: `src/Converter/Processor/{BaseType}Base.php`
+  - Examples: `MacroProcessorBase.php`, `StructuredMacroProcessorBase.php`, `LinkProcessorBase.php`
+
+### Class Convention
+- Implements: `IProcessor`
+- Extends: One of the base classes (`MacroProcessorBase`, `StructuredMacroProcessorBase`, `LinkProcessorBase`)
+- Namespace: `HalloWelt\MigrateConfluence\Converter\Processor`
+- Method to implement: `process( DOMDocument $dom ): void`
+  - Searches for target elements/macros in the DOM
+  - Transforms them using DOM manipulation
+
+### Pattern Specifics
+- For macro processors: implement `getMacroName(): string` to specify the target macro name
+- Locate elements in the DOM via `getElementsByTagName()`, `getElementsByTagNameNS()`, or `DOMXPath` queries
+- Replace or modify DOM nodes in place
+- Read macro parameters from `ac:parameter` child elements (Confluence storage format)
+
+## 2. Preprocessor Pattern
+
+Preprocessors prepare the HTML/DOM **before** macro conversion to fix structural issues.
+
+### File Naming & Location
+- HTML Preprocessors: `src/Converter/Preprocessor/html/{Name}.php`
+  - Example: `CDATAClosingFixer.php`
+- DOM Preprocessors: `src/Converter/Preprocessor/dom/{Name}.php`
+  - Examples: `HoistMacroFromHeading.php`, `SanitizeLinkContent.php`, `Table.php`
+
+### Class Convention
+- Implements: `IHtmlPreprocessor` or `IDomPreprocessor`
+- Namespace: `HalloWelt\MigrateConfluence\Converter\Preprocessor\{html|dom}`
+- Method to implement:
+  - `IHtmlPreprocessor`: `process( string $html ): string`
+  - `IDomPreprocessor`: `process( DOMDocument $dom ): void`
+
+## 3. Postprocessor Pattern
+
+Postprocessors fix content **after** macro conversion and the Pandoc HTML-to-WikiText transformation.
+
+### File Naming & Location
+- Location: `src/Converter/Postprocessor/{Fix|Operation}.php`
+- Examples: `FixLineBreakInHeadings.php`, `FixMultilineTable.php`, `NestedHeadings.php`
+- Use the `Fix` prefix for bug fixes and a descriptive name for enhancements
+
+### Class Convention
+- Implements: `IPostprocessor`
+- Namespace: `HalloWelt\MigrateConfluence\Converter\Postprocessor`
+- Method to implement: `process( string $output ): string`
+  - Takes a WikiText string as input
+  - Returns the modified WikiText string
+  - Use regex or string manipulation for text-level changes
+
+### Usage Pattern
+- Applied in sequence after HTML-to-WikiText conversion
+- Each postprocessor should handle one specific concern
+- Can be disabled/reordered via configuration
+
+## 4. Processor Registration
+
+All processors must be registered in `ConfluenceConverter::__construct()`:
+
+1. **Processors**: Add to processor instantiation list
+   - Order matters (executed in registration order)
+2. **Preprocessors**: Add to appropriate preprocessor chain
+   - HTML preprocessors before DOM preprocessing
+   - DOM preprocessors before macro conversion
+3. **Postprocessors**: Add to postprocessor chain
+   - Order: Fix issues bottom-up, i.e. register basic fixes first so that later ones can build on them
+
+## 5. DOM Processing Best Practices
+
+- Use `DOMXPath` for complex queries instead of `getElementsByTagName()`
+- Always iterate over a copy of the NodeList before modifying, because a `DOMNodeList` is live:
+  ```php
+  $nodes = [];
+  foreach ($dom->getElementsByTagName('macro') as $node) {
+      $nodes[] = $node;
+  }
+  foreach ($nodes as $node) {
+      // Safe to modify the DOM here: $nodes is a plain array, not the live list
+  }
+  ```
+- Replace nodes using `replaceChild()`, or `insertBefore()` followed by `removeChild()`
+- Set attributes with `setAttribute()`, get with `getAttribute()`
+- Create new elements with `createElement()`
+
+## 6. Naming Conventions Summary
+
+| Type | Location | Pattern | Example |
+|------|----------|---------|---------|
+| Macro Processor | `Processor/` | `{MacroName}Macro.php` | `CodeMacro.php` |
+| Content Processor | `Processor/` | `{ElementType}.php` | `Image.php` |
+| Processor Base | `Processor/` | `{Type}ProcessorBase.php` | `MacroProcessorBase.php` |
+| HTML Preprocessor | `Preprocessor/html/` | `{Name}.php` | `CDATAClosingFixer.php` |
+| DOM Preprocessor | `Preprocessor/dom/` | `{Name}.php` | `Table.php` |
+| Postprocessor | `Postprocessor/` | `{Fix\|Operation}.php` | `FixLineBreakInHeadings.php` |
diff --git a/doc/EXTRACTOR_RULES.md b/doc/EXTRACTOR_RULES.md
new file mode 100644
index 00000000..ebe48c00
--- /dev/null
+++ b/doc/EXTRACTOR_RULES.md
@@ -0,0 +1,88 @@
+# Coding Rules for Extractor Component
+
+The Extractor reads data from the analyzed workspace and extracts body contents, attachments, and other resources into the file system.
+
+## 1. Component Structure
+
+### Main Class
+- **Class**: `ConfluenceExtractor`
+- **Location**: `src/Extractor/ConfluenceExtractor.php`
+- **Extends**: `ExtractorBase`
+- **Implements**: `IDestinationPathAware`
+
+### Responsibilities
+- Read analyzed data from `WorkspaceDB`
+- Extract file contents and attachments from the source export
+- Organize extracted files in the destination workspace directory
+- Create structured output for downstream processing (Converter, Composer)
+
+## 2. Extraction Methods
+
+The Extractor provides one method per data type:
+
+### Method Naming Convention
+- Pattern: `extract{EntityType}()`
+- Examples: `extractBodyContents()`, `extractAttachments()`
+- Access level: `private` or `protected`
+- Called from the `doExtract()` method
+
+### Common Methods
+- `extractBodyContents()`: Extract page/blog post body content
+- `extractAttachments()`: Extract file attachments
+- `extractImages()`: Extract inline images
+- `extractAdditionalAttachments()`: Extract reference attachments
+
+## 3. WorkspaceDB Integration
+
+### Database Access
+- `WorkspaceDB` is initialized in the `initWorkspaceDB()` method
+- Use `getAllData(string $table)` to read analyzed records
+- Iterate over records to extract associated files
+
+## 4. File Organization
+
+### Adding a New Extraction Type
+
+1. Create an `extract{Type}()` method in `ConfluenceExtractor`
+2. Query the relevant table from `WorkspaceDB`
+3. Create the destination subdirectory structure
+4. Write extracted files using `FilenameResolver` / `FilenameBuilder`
+5. Call the method from `doExtract()` in the appropriate sequence
+
+## 5. Supporting Utilities
+
+### Filename Handling
+- **`FilenameResolver`**: Convert Confluence titles/paths to filesystem filenames
+- **`FilenameBuilder`**: Construct file paths for organized storage
+- **`DBLog`**: Log extraction progress and errors
+
+### Configuration
+- **`MigrationConfig`**: Access migration settings
+- Initialize via the `initMigrationConfig()` method
+
+## 6. Lifecycle & Dependencies
+
+### Initialization Order (in `doExtract()`)
+1. `initMigrationConfig()` - Load migration settings
+2. `initWorkspaceDB()` - Connect to analyzed data
+3. `initDBLog()` - Initialize logging
+4. Execute extraction methods in logical order
+
+### Important Notes
+- Extractor runs **after** Analyzer, **before** Converter
+- Uses data written by Analyzer to WorkspaceDB
+- Output files feed into Converter for content transformation
+- Must not modify or corrupt the source data
+
+## 7. Error Handling
+
+### Logging
+- Use `DBLog` to record extraction errors and warnings
+- Call `$this->dbLog->addLogEntry()` for significant events
+- Log file move/copy failures explicitly
+
+### Robustness
+- Verify destination directories exist before writing
+- Handle missing files gracefully
+- Handle encoding issues in filenames
+- Verify file permissions before extracting
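
The Extractor robustness and logging rules at the end of this patch could be sketched roughly as follows. This is only an illustration under stated assumptions: `safeCopyExtractedFile()` is a hypothetical helper that does not exist in the repository, and the `$log` callable merely stands in for the `DBLog` call, whose real signature is not shown in these documents.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical helper (not part of the repository): copies one extracted
 * file and reports every problem through a caller-supplied logger callback.
 */
function safeCopyExtractedFile( string $source, string $destPath, callable $log ): bool {
	// Handle missing files gracefully: skip and log instead of aborting the run
	if ( !is_file( $source ) || !is_readable( $source ) ) {
		$log( "Skipped, source missing or unreadable: $source" );
		return false;
	}

	// Verify the destination directory exists before writing; create it if needed
	$destDir = dirname( $destPath );
	if ( !is_dir( $destDir ) && !mkdir( $destDir, 0755, true ) && !is_dir( $destDir ) ) {
		$log( "Failed, cannot create directory: $destDir" );
		return false;
	}

	// Log copy failures explicitly
	if ( !@copy( $source, $destPath ) ) {
		$log( "Failed, could not copy $source to $destPath" );
		return false;
	}

	return true;
}
```

Inside an `extract{Type}()` method, the callable would presumably wrap whatever logging call the Extractor actually uses, so that skipped and failed files end up in the same log as other extraction events.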