**sandragjacinto** (Contributor) approved these changes · Apr 16, 2026
Pull request overview
This PR improves scraped-text normalization (notably for the ird_le_mag scraper) by adding new syntax-correction utilities, refactoring the scraper to use them, and introducing tests for the text-cleaning helpers. It also updates two workflow entrypoints to load local environment variables at runtime.
Changes:
- Add new text-normalization helpers and apply Unicode normalization in `remove_html_stuff`.
- Refactor `IRDLeMagCollector` to normalize both `full_content` and `description` via a shared `correct_text_syntax()` method.
- Add a new `scraping_utils` test suite and adjust an existing IRD Le Mag expectation string.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `welearn_datastack/utils_/scraping_utils.py` | Adds Unicode normalization and new spacing helpers; removes an unused license helper. |
| `welearn_datastack/plugins/scrapers/ird_le_mag.py` | Applies shared syntax correction to extracted content/description. |
| `welearn_datastack/nodes_workflow/QdrantSyncronizer/qdrant_syncronizer.py` | Loads local `.env` before running `main()`. |
| `welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py` | Loads local `.env` before running `main()`. |
| `tests/test_scraping_utils.py` | Introduces tests for scraping utility functions. |
| `tests/document_collector_hub/plugins_test/test_irl_le_mag.py` | Updates expected description string to match new normalization output. |
Review context from `tests/test_scraping_utils.py`:

```python
        self.assertEqual(ret, awaited_str)

    def test_remove_html_stuff(self):
        input_str = "<p>Lorem ipsum</p>"
```

Review context from `remove_html_stuff` in `welearn_datastack/utils_/scraping_utils.py`:

```python
    remover.feed(text + "\n")
    txt = remover.get_text()
    ret = unescape(txt)
    ret = unicodedata.normalize("NFKD", ret)
```
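As a hedged illustration of what the NFKD step in `remove_html_stuff` buys (the project's full behavior may differ), compatibility decomposition replaces lookalike characters such as non-breaking spaces and ligatures that frequently survive HTML extraction:

```python
import unicodedata

# Typical extraction debris: a non-breaking space (U+00A0) and an "fi"
# ligature (U+FB01) hiding inside otherwise ordinary text.
raw = "caf\u00e9\u00a0\ufb01ne print"

cleaned = unicodedata.normalize("NFKD", raw)

# NFKD maps the non-breaking space to a plain space and splits the
# ligature into "f" + "i"; accented letters decompose into a base
# letter plus a combining mark (still rendered as "é").
print("\u00a0" in cleaned)  # False
print("fine" in cleaned)    # True
```

Note that NFKD leaves combining marks in the string rather than stripping accents; downstream comparisons just need to normalize both sides the same way.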
Comment on lines +131 to +151
```python
def add_space_after_closing_sign(string: str) -> str:
    """
    Add a space after a closing sign if there is not already one
    Args:
        string (str): the string to clean
    Returns:
        str: the cleaned string
    """
    return re.sub(r"([.»\")\]}])(?=[^\s.,;:!?)»\]}])", r"\1 ", string)


def add_space_before_capital_letter(string: str) -> str:
    """
    Add a space before a capital letter if there is not already one
    Args:
        string (str): the string to clean
    Returns:
        str: the cleaned string
    """
    return re.sub(r"([a-zàâäéèêëîïôöùûüÿç])([A-ZÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ])", r"\1 \2", string)
```
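Since the two helpers are plain `re.sub` wrappers, their effect can be checked in isolation. This sketch copies the regexes from the diff and runs them on a glued-together sample string of the kind the PR targets (the sample input is invented for illustration):

```python
import re

def add_space_after_closing_sign(string: str) -> str:
    # Insert a space after ., », ", ), ], } when the next character
    # is not whitespace or another punctuation mark.
    return re.sub(r"([.»\")\]}])(?=[^\s.,;:!?)»\]}])", r"\1 ", string)

def add_space_before_capital_letter(string: str) -> str:
    # Insert a space between a lowercase letter (including French
    # accented letters) and an uppercase letter glued onto it.
    return re.sub(r"([a-zàâäéèêëîïôöùûüÿç])([A-ZÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ])", r"\1 \2", string)

glued = "Premier point.Deuxième pointSuite du texte"
step1 = add_space_after_closing_sign(glued)
step2 = add_space_before_capital_letter(step1)
print(step2)  # Premier point. Deuxième point Suite du texte
```

The lookahead in `add_space_after_closing_sign` deliberately excludes `.,;:!?` and closing brackets, so runs of punctuation such as `etc...` or `.)` are left untouched.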
Comment on lines +141 to +146
| """ | ||
| The content of the page is not well formatted, we need to clean it and add spaces after closing signs and before capital letters | ||
|
|
||
| :param content: the content of the page as a string | ||
| :return: the content of the page with the correct syntax | ||
| """ |
Comment on lines 137 to 139

```python
if __name__ == "__main__":
    load_dotenv_local()
    main()
```
Comment on lines 233 to 235

```python
if __name__ == "__main__":
    load_dotenv_local()
    main()
```
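`load_dotenv_local` is a project helper whose body is not shown in this diff. As a rough, assumed sketch of what such an entrypoint guard accomplishes (the name reuse is deliberate, but the body below is a guess, not the project's implementation), a minimal stdlib-only `.env` loader could look like:

```python
import os
from pathlib import Path

def load_dotenv_local(path: str = ".env") -> None:
    """Hypothetical sketch: read KEY=VALUE pairs from a local .env file
    into os.environ without overriding variables already set."""
    env_file = Path(path)
    if not env_file.is_file():
        return  # no local .env: rely on the real environment
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

def main() -> None:
    # Stand-in for the workflow's real main(); variable name is illustrative.
    print(os.environ.get("QDRANT_URL", "unset"))

if __name__ == "__main__":
    load_dotenv_local()  # populate the environment before main() reads config
    main()
```

Calling the loader inside the `__main__` guard keeps imports side-effect free: the environment is only mutated when the module runs as a workflow entrypoint, not when it is imported by tests.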
Comment on lines +32 to +37

```python
    def test_clean_return_to_line(self):
        input_str = "Lorem." "Ipsum"

        awaited_str = "Lorem.Ipsum"
        ret = clean_return_to_line(input_str)
        self.assertEqual(ret, awaited_str)
```
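Worth noting: `"Lorem." "Ipsum"` is implicit string concatenation, so the input already equals the expected output and no line break is ever exercised. A sketch of a variant that actually feeds a newline (assuming, from the expected string, that the helper strips line breaks outright; the real implementation is not shown in this diff):

```python
def clean_return_to_line(string: str) -> str:
    # Assumed behavior, inferred from the expected "Lorem.Ipsum":
    # drop raw line breaks left over from HTML extraction.
    return string.replace("\r\n", "").replace("\n", "")

input_str = "Lorem.\nIpsum"  # a real return-to-line this time
assert clean_return_to_line(input_str) == "Lorem.Ipsum"
```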
This pull request introduces several improvements focused on text cleaning, normalization, and code quality, particularly for the `ird_le_mag` scraper and utility functions. The main changes include the addition of new text-normalization utilities, refactoring of the `ird_le_mag` plugin to use these utilities, and enhanced test coverage for the new functions. There are also minor updates to environment variable loading and Unicode normalization.

Text normalization and cleaning improvements:
- Added new utility functions in `scraping_utils.py` for text normalization (`add_space_after_closing_sign`, `add_space_before_capital_letter`) and updated `remove_html_stuff` to use Unicode normalization (`unicodedata.normalize`). These changes help ensure that scraped text is consistently formatted and free from encoding issues.
- Refactored the `ird_le_mag` plugin to use the new text-normalization utilities by introducing a `correct_text_syntax` static method, ensuring that both content and description fields are cleaned and properly formatted.

Testing enhancements:
- Added a new test suite in `tests/test_scraping_utils.py` to cover the new and existing text-cleaning functions, improving confidence in the utility code.

Environment and configuration:
- Updated `document_vectorizer.py` and `qdrant_syncronizer.py` to load local environment variables at runtime, ensuring configuration is correctly set before execution.

Minor changes and clean-up:
- Removed the unused `get_url_license_from_dc_format` function from `scraping_utils.py` for codebase simplification.
- Updated the expected description string in `test_irl_le_mag.py` to use normalized Unicode characters, aligning with the new normalization logic.