**sandragjacinto** (Contributor) approved these changes · Apr 16, 2026
Pull request overview
This PR improves scraped-text normalization (notably for the ird_le_mag scraper) by adding new syntax-correction utilities, refactoring the scraper to use them, and introducing tests for the text-cleaning helpers. It also updates two workflow entrypoints to load local environment variables at runtime.
Changes:
- Add new text-normalization helpers and apply Unicode normalization in `remove_html_stuff`.
- Refactor `IRDLeMagCollector` to normalize both `full_content` and `description` via a shared `correct_text_syntax()` method.
- Add a new `scraping_utils` test suite and adjust an existing IRD Le Mag expectation string.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `welearn_datastack/utils_/scraping_utils.py` | Adds Unicode normalization and new spacing helpers; removes an unused license helper. |
| `welearn_datastack/plugins/scrapers/ird_le_mag.py` | Applies shared syntax correction to extracted content/description. |
| `welearn_datastack/nodes_workflow/QdrantSyncronizer/qdrant_syncronizer.py` | Loads local `.env` before running `main()`. |
| `welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py` | Loads local `.env` before running `main()`. |
| `tests/test_scraping_utils.py` | Introduces tests for scraping utility functions. |
| `tests/document_collector_hub/plugins_test/test_irl_le_mag.py` | Updates expected description string to match new normalization output. |
Review context from `tests/test_scraping_utils.py`:

```python
        self.assertEqual(ret, awaited_str)

    def test_remove_html_stuff(self):
        input_str = "<p>Lorem ipsum</p>"
```

Review context from `remove_html_stuff` in `welearn_datastack/utils_/scraping_utils.py`:

```python
    remover.feed(text + "\n")
    txt = remover.get_text()
    ret = unescape(txt)
    ret = unicodedata.normalize("NFKD", ret)
```
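As a hedged illustration of what the NFKD step in `remove_html_stuff` buys (the project's full behavior may differ), compatibility decomposition replaces lookalike characters such as non-breaking spaces and ligatures that frequently survive HTML extraction:

```python
import unicodedata

# Typical extraction debris: a non-breaking space (U+00A0) and an "fi"
# ligature (U+FB01) hiding inside otherwise ordinary text.
raw = "caf\u00e9\u00a0\ufb01ne print"

cleaned = unicodedata.normalize("NFKD", raw)

# NFKD maps the non-breaking space to a plain space and splits the
# ligature into "f" + "i"; accented letters decompose into a base
# letter plus a combining mark (still rendered as "é").
print("\u00a0" in cleaned)  # False
print("fine" in cleaned)    # True
```

Note that NFKD leaves combining marks in the string rather than stripping accents; downstream comparisons just need to normalize both sides the same way.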
Comment on lines +131 to +151
```python
def add_space_after_closing_sign(string: str) -> str:
    """
    Add a space after a closing sign if there is not already one
    Args:
        string (str): the string to clean
    Returns:
        str: the cleaned string
    """
    return re.sub(r"([.»\")\]}])(?=[^\s.,;:!?)»\]}])", r"\1 ", string)


def add_space_before_capital_letter(string: str) -> str:
    """
    Add a space before a capital letter if there is not already one
    Args:
        string (str): the string to clean
    Returns:
        str: the cleaned string
    """
    return re.sub(r"([a-zàâäéèêëîïôöùûüÿç])([A-ZÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ])", r"\1 \2", string)
```
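Since the two helpers are plain `re.sub` wrappers, their effect can be checked in isolation. This sketch copies the regexes from the diff and runs them on a glued-together sample string of the kind the PR targets (the sample input is invented for illustration):

```python
import re

def add_space_after_closing_sign(string: str) -> str:
    # Insert a space after ., », ", ), ], } when the next character
    # is not whitespace or another punctuation mark.
    return re.sub(r"([.»\")\]}])(?=[^\s.,;:!?)»\]}])", r"\1 ", string)

def add_space_before_capital_letter(string: str) -> str:
    # Insert a space between a lowercase letter (including French
    # accented letters) and an uppercase letter glued onto it.
    return re.sub(r"([a-zàâäéèêëîïôöùûüÿç])([A-ZÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ])", r"\1 \2", string)

glued = "Premier point.Deuxième pointSuite du texte"
step1 = add_space_after_closing_sign(glued)
step2 = add_space_before_capital_letter(step1)
print(step2)  # Premier point. Deuxième point Suite du texte
```

The lookahead in `add_space_after_closing_sign` deliberately excludes `.,;:!?` and closing brackets, so runs of punctuation such as `etc...` or `.)` are left untouched.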
Comment on lines +141 to +146
| """ | ||
| The content of the page is not well formatted, we need to clean it and add spaces after closing signs and before capital letters | ||
|
|
||
| :param content: the content of the page as a string | ||
| :return: the content of the page with the correct syntax | ||
| """ |
Comment on lines 137 to 139

```python
if __name__ == "__main__":
    load_dotenv_local()
    main()
```
Comment on lines 233 to 235

```python
if __name__ == "__main__":
    load_dotenv_local()
    main()
```
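`load_dotenv_local` is a project helper whose body is not shown in this diff. As a rough, assumed sketch of what such an entrypoint guard accomplishes (the name reuse is deliberate, but the body below is a guess, not the project's implementation), a minimal stdlib-only `.env` loader could look like:

```python
import os
from pathlib import Path

def load_dotenv_local(path: str = ".env") -> None:
    """Hypothetical sketch: read KEY=VALUE pairs from a local .env file
    into os.environ without overriding variables already set."""
    env_file = Path(path)
    if not env_file.is_file():
        return  # no local .env: rely on the real environment
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

def main() -> None:
    # Stand-in for the workflow's real main(); variable name is illustrative.
    print(os.environ.get("QDRANT_URL", "unset"))

if __name__ == "__main__":
    load_dotenv_local()  # populate the environment before main() reads config
    main()
```

Calling the loader inside the `__main__` guard keeps imports side-effect free: the environment is only mutated when the module runs as a workflow entrypoint, not when it is imported by tests.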
Comment on lines +32 to +37

```python
    def test_clean_return_to_line(self):
        input_str = "Lorem." "Ipsum"

        awaited_str = "Lorem.Ipsum"
        ret = clean_return_to_line(input_str)
        self.assertEqual(ret, awaited_str)
```
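Worth noting: `"Lorem." "Ipsum"` is implicit string concatenation, so the input already equals the expected output and no line break is ever exercised. A sketch of a variant that actually feeds a newline (assuming, from the expected string, that the helper strips line breaks outright; the real implementation is not shown in this diff):

```python
def clean_return_to_line(string: str) -> str:
    # Assumed behavior, inferred from the expected "Lorem.Ipsum":
    # drop raw line breaks left over from HTML extraction.
    return string.replace("\r\n", "").replace("\n", "")

input_str = "Lorem.\nIpsum"  # a real return-to-line this time
assert clean_return_to_line(input_str) == "Lorem.Ipsum"
```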
This pull request introduces several improvements focused on text cleaning, normalization, and code quality, particularly for the `ird_le_mag` scraper and utility functions. The main changes include the addition of new text-normalization utilities, refactoring of the `ird_le_mag` plugin to use these utilities, and enhanced test coverage for the new functions. There are also minor updates to environment variable loading and Unicode normalization.

Text normalization and cleaning improvements:
- Added new utility functions in `scraping_utils.py` for text normalization (`add_space_after_closing_sign`, `add_space_before_capital_letter`) and updated `remove_html_stuff` to use Unicode normalization (`unicodedata.normalize`). These changes help ensure that scraped text is consistently formatted and free from encoding issues.
- Refactored the `ird_le_mag` plugin to use the new text-normalization utilities by introducing a `correct_text_syntax` static method, ensuring that both content and description fields are cleaned and properly formatted.

Testing enhancements:
- Added a new test suite in `tests/test_scraping_utils.py` to cover the new and existing text-cleaning functions, improving confidence in the utility code.

Environment and configuration:
- Updated `document_vectorizer.py` and `qdrant_syncronizer.py` to load local environment variables at runtime, ensuring configuration is correctly set before execution.

Minor changes and clean-up:
- Removed the unused `get_url_license_from_dc_format` function from `scraping_utils.py` for codebase simplification.
- Updated the expected description string in `test_irl_le_mag.py` to use normalized Unicode characters, aligning with the new normalization logic.