Refactor: Modernize Codebase Architecture & Implement Full-Link Async I/O #690
Closed
wchiways wants to merge 5 commits into dataabc:master from
…th pydantic
- move date validation to datetime_util.is_valid_date
- implement SpiderConfig using pydantic for robust validation
- catch ValidationError in spider.py for user-friendly error reporting
- add unit tests for config and datetime utilities
- Update Writer base class to use async/await
- Integrate aiofiles for asynchronous file writing (txt, csv, json)
- Migrate PostWriter to use aiohttp for async network requests
- Update Spider to await writer methods
- Add async writer tests
Fix the error occurring during the build process
This PR was refactored against Python 3.10 and may not be compatible with earlier versions, so I have decided to close it.
Architecture Overview
graph TD
    subgraph "Sync"
        A[Spider] -->|Blocking| B(Downloader)
        A -->|Blocking| C(Writer)
        C --> D[File/DB]
        style A fill:#f9f,stroke:#333,stroke-width:2px
    end
    subgraph "Async"
        AA[Spider] -->|Await| BB(Downloader)
        AA -->|Await| CC(Writer)
        CC -->|Non-Blocking| DD[File/DB/API]
        style AA fill:#bbf,stroke:#333,stroke-width:2px
    end
Configuration Validation Flow
sequenceDiagram
    participant User
    participant Spider
    participant SpiderConfig
    participant Pydantic
    participant DateTimeUtil
    User->>Spider: Run with config.json
    Spider->>SpiderConfig: Load Dict
    SpiderConfig->>Pydantic: Validate Fields
    Pydantic->>DateTimeUtil: is_valid_date(date_str)
    alt Invalid
        DateTimeUtil-->>Pydantic: False
        Pydantic-->>Spider: ValidationError
        Spider-->>User: Exit with Error Message
    else Valid
        DateTimeUtil-->>Pydantic: True
        Pydantic-->>SpiderConfig: Valid Config Object
        SpiderConfig-->>Spider: Ready to Scrape
    end
PR 1: Modernize Path Handling and Formatting
Title:
refactor: replace os.path with pathlib and use f-strings
Description:
Improved code readability and cross-platform compatibility by modernizing file path handling and string formatting.
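A minimal sketch of what this kind of change looks like in practice. The function name `build_output_path` and the `weibo/<user_id>.<extension>` layout are hypothetical and chosen only for illustration; they are not taken from the project.

```python
from pathlib import Path


def build_output_path(base_dir: str, user_id: str, extension: str) -> Path:
    """Build <base_dir>/weibo/<user_id>.<extension> in an OS-independent way.

    Hypothetical helper: the directory layout is an assumption for this sketch.
    """
    # Old style, roughly:
    #   os.path.join(base_dir, "weibo", "%s.%s" % (user_id, extension))
    # New style: the / operator joins path parts, an f-string builds the name.
    return Path(base_dir) / "weibo" / f"{user_id}.{extension}"


path = build_output_path("output", "1234567890", "csv")
print(path.name)    # file name component of the path
print(path.suffix)  # extension, including the leading dot
```

Because `Path` objects choose the separator for the current platform, the same call works unchanged on Windows and POSIX systems.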
- Replaced `os.path` and string concatenation for file paths with the object-oriented `pathlib.Path` API. This ensures robust path handling across different operating systems.
- Replaced `%` string formatting with Python 3.6+ f-strings for better performance and cleaner syntax.
PR 2: Modernize Data Structures
Title:
refactor: use dataclasses for User/Weibo and introduce SpiderConfig
Description:
Transitioned core data structures to modern Python standards to reduce boilerplate and improve maintainability.
- Converted the `User` and `Weibo` classes to use the standard library `dataclasses` module. This simplifies class definitions while retaining memory efficiency via `slots=True`.
- Added a `SpiderConfig` class to encapsulate configuration parameters, replacing raw dictionary usage for better type hints and code clarity.
PR 3: Modernize Configuration and Eliminate Code Duplication
Title:
refactor: eliminate code duplication and enhance config validation with pydantic
Description:
This update focuses on improving the robustness and maintainability of the configuration management system.
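The commit messages name a shared `datetime_util.is_valid_date` helper, but its signature is not shown in this PR page. The sketch below is one plausible shape for such a helper, using only the standard library; the `fmt` parameter, the `check_since_date` function, and the `since_date` key are assumptions for illustration, and the Pydantic wiring itself is not reproduced here.

```python
from datetime import datetime


def is_valid_date(date_str, fmt: str = "%Y-%m-%d") -> bool:
    """Return True if date_str parses as a real calendar date in the given format."""
    try:
        datetime.strptime(date_str, fmt)
    except (TypeError, ValueError):
        # TypeError: not a string at all; ValueError: wrong format or impossible date.
        return False
    return True


def check_since_date(raw_config: dict) -> list[str]:
    """Collect human-readable problems instead of surfacing a stack trace.

    Hypothetical helper: a plain-Python stand-in for a model-level validator.
    """
    errors = []
    since_date = raw_config.get("since_date")
    if not is_valid_date(since_date):
        errors.append(f"since_date: {since_date!r} is not a valid YYYY-MM-DD date")
    return errors
```

Keeping the check in one function means every caller, whether a validator or a command-line entry point, rejects the same inputs for the same reason.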
- Replaced the `dataclass` implementation of `SpiderConfig` with a Pydantic `BaseModel`. This provides automatic type coercion, strict validation, and better IDE support.
- Moved date validation into `weibo_spider/datetime_util.py`. The redundant `_is_date` functions in `config.py` and `config_util.py` were removed.
- Updated `spider.py` to catch `pydantic.ValidationError` during startup, providing clean, human-readable error messages for configuration issues instead of generic stack traces.
- Added `tests/test_config.py` and `tests/test_datetime_util.py` to ensure the new validation logic and utilities are correct and protected against regressions.
PR 4: Full-Link Async I/O for Writers
Title:
feat: implement full-link async I/O for writers
Description:
This update completes the transition to a fully asynchronous architecture by removing blocking I/O operations from the data writing pipeline.
- Refactored the `Writer` abstract base class and all 8 implementations (`Txt`, `Csv`, `Json`, `Mongo`, `MySql`, `Sqlite`, `Kafka`, `Post`) to support `async`/`await`.
- Integrated the `aiofiles` library into `TxtWriter`, `CsvWriter`, and `JsonWriter`. This ensures that writing large amounts of data to disk does not block the `asyncio` event loop.
- Migrated `PostWriter` from the synchronous `requests` library to `aiohttp`, allowing for non-blocking API notifications.
- Updated the `Spider` class to `await` all writer operations, ensuring the scraper remains responsive even during heavy I/O tasks.
- Added `aiofiles` to `requirements.txt`.
- Added `tests/test_writers_async.py` with mock-based testing to verify the new asynchronous writing logic.
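The overall shape of an awaitable writer pipeline can be sketched with the standard library alone. This is not the PR's code: the method name `write_weibo`, the `run_spider` function, and the constructor arguments are assumptions, and `asyncio.to_thread` is used here as a stand-in for `aiofiles` so the example has no third-party dependencies.

```python
import asyncio
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Writer(ABC):
    """Async writer interface: every implementation exposes awaitable methods."""

    @abstractmethod
    async def write_weibo(self, weibos: list[str]) -> None: ...


class TxtWriter(Writer):
    def __init__(self, path: Path) -> None:
        self.path = path

    def _append(self, text: str) -> None:
        # Ordinary blocking file write.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(text)

    async def write_weibo(self, weibos: list[str]) -> None:
        # Stand-in for aiofiles: run the blocking write in a worker thread
        # so the event loop stays free while the disk is busy.
        await asyncio.to_thread(self._append, "".join(f"{w}\n" for w in weibos))


async def run_spider(writers: list[Writer], weibos: list[str]) -> None:
    # The spider awaits each writer instead of calling it synchronously.
    for writer in writers:
        await writer.write_weibo(weibos)


out_file = Path(tempfile.mkdtemp()) / "weibo.txt"
asyncio.run(run_spider([TxtWriter(out_file)], ["first post", "second post"]))
```

Once every writer shares the same awaitable interface, the spider can also fan out to several writers concurrently with `asyncio.gather` if ordering between them does not matter.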