Fix/ird le mag syntax #126

Open
lpi-tn wants to merge 5 commits into main from Fix/ird-le-mag-syntax

Conversation

Collaborator

lpi-tn commented Apr 16, 2026

This pull request introduces several improvements focused on text cleaning, normalization, and code quality, particularly for the ird_le_mag scraper and utility functions. The main changes include the addition of new text normalization utilities, refactoring of the ird_le_mag plugin to use these utilities, and enhanced test coverage for the new functions. There are also minor updates to environment variable loading and Unicode normalization.

Text normalization and cleaning improvements:

  • Added new text normalization utilities to scraping_utils.py (add_space_after_closing_sign and add_space_before_capital_letter) and updated remove_html_stuff to apply Unicode normalization (unicodedata.normalize). These changes help ensure that scraped text is consistently formatted and free from encoding issues.
  • Refactored the ird_le_mag plugin to use the new text normalization utilities by introducing a correct_text_syntax static method, ensuring that both the content and description fields are cleaned and properly formatted.
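Based on the regexes shown in the review hunks further down, the two new helpers and the correct_text_syntax wrapper can be sketched as follows. Only the helper bodies come from the diff; the chaining order and the NFKD call inside the wrapper are assumptions about how the plugin combines them:

```python
import re
import unicodedata


def add_space_after_closing_sign(string: str) -> str:
    # Insert a space after ., », ", ), ], } when the next character is
    # neither whitespace nor another punctuation mark.
    return re.sub(r"([.»\")\]}])(?=[^\s.,;:!?)»\]}])", r"\1 ", string)


def add_space_before_capital_letter(string: str) -> str:
    # Insert a space between a lowercase letter (including French accented
    # letters) and an uppercase letter glued directly onto it.
    return re.sub(r"([a-zàâäéèêëîïôöùûüÿç])([A-ZÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ])", r"\1 \2", string)


def correct_text_syntax(content: str) -> str:
    # Hypothetical chaining of the helpers, mirroring what the PR's static
    # method plausibly does; the exact order is an assumption.
    content = unicodedata.normalize("NFKD", content)
    content = add_space_after_closing_sign(content)
    return add_space_before_capital_letter(content)


print(add_space_after_closing_sign("Lorem.Ipsum"))    # -> "Lorem. Ipsum"
print(add_space_before_capital_letter("loremIpsum"))  # -> "lorem Ipsum"
```

Note that the lookahead in add_space_after_closing_sign deliberately skips runs of punctuation (e.g. ".)" or "»."), so only a sign directly followed by a word character gains a space.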

Testing enhancements:

  • Added a comprehensive new test suite in tests/test_scraping_utils.py to cover the new and existing text cleaning functions, improving confidence in the utility code.

Environment and configuration:

  • Updated document_vectorizer.py and qdrant_syncronizer.py to load local environment variables at runtime, ensuring configuration is correctly set before execution.
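The entrypoint change itself is just a call to load_dotenv_local() before main(), as visible in the review hunks below. A minimal sketch of what such a helper typically does, assuming a simple KEY=VALUE .env format (the real implementation in welearn_datastack may use python-dotenv or differ in details):

```python
import os
from pathlib import Path


def load_dotenv_local(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a local .env file into os.environ.

    Hypothetical stand-in for the helper referenced in the PR; existing
    environment variables are not overwritten.
    """
    env_file = Path(path)
    if not env_file.is_file():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())


if __name__ == "__main__":
    load_dotenv_local()
    # main() would follow here, mirroring the PR's workflow entrypoints.
```

Calling the loader inside the `if __name__ == "__main__"` guard keeps imports side-effect free while still guaranteeing the environment is populated before main() runs.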

Minor changes and clean-up:

  • Removed the unused get_url_license_from_dc_format function from scraping_utils.py for codebase simplification.
  • Fixed a test string in test_irl_le_mag.py to use normalized Unicode characters, aligning with the new normalization logic.

Contributor

Copilot AI left a comment


Pull request overview

This PR improves scraped-text normalization (notably for the ird_le_mag scraper) by adding new syntax-correction utilities, refactoring the scraper to use them, and introducing tests for the text-cleaning helpers. It also updates two workflow entrypoints to load local environment variables at runtime.

Changes:

  • Add new text-normalization helpers and apply Unicode normalization in remove_html_stuff.
  • Refactor IRDLeMagCollector to normalize both full_content and description via a shared correct_text_syntax() method.
  • Add a new scraping_utils test suite and adjust an existing IRD Le Mag expectation string.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
welearn_datastack/utils_/scraping_utils.py Adds Unicode normalization and new spacing helpers; removes an unused license helper.
welearn_datastack/plugins/scrapers/ird_le_mag.py Applies shared syntax correction to extracted content/description.
welearn_datastack/nodes_workflow/QdrantSyncronizer/qdrant_syncronizer.py Loads local .env before running main().
welearn_datastack/nodes_workflow/DocumentVectorizer/document_vectorizer.py Loads local .env before running main().
tests/test_scraping_utils.py Introduces tests for scraping utility functions.
tests/document_collector_hub/plugins_test/test_irl_le_mag.py Updates expected description string to match new normalization output.


self.assertEqual(ret, awaited_str)

def test_remove_html_stuff(self):
input_str = "<p>Lorem&nbspipsum</p>"
remover.feed(text + "\n")
txt = remover.get_text()
ret = unescape(txt)
ret = unicodedata.normalize("NFKD", ret)
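The fragment above is part of remove_html_stuff after the PR's change. A hedged, self-contained reconstruction follows; the parser class is an assumption, while the feed/unescape/normalize sequence comes from the diff (close() is added here so trailing text is flushed from the parser's buffer):

```python
import unicodedata
from html import unescape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical tag-stripping parser standing in for the PR's remover."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)

    def get_text(self) -> str:
        return "".join(self.parts)


def remove_html_stuff(text: str) -> str:
    remover = _TextExtractor()
    remover.feed(text + "\n")
    remover.close()  # flush any buffered trailing data
    txt = remover.get_text()
    ret = unescape(txt)
    # NFKD also folds compatibility characters such as the non-breaking
    # space produced by &nbsp; into their plain equivalents.
    return unicodedata.normalize("NFKD", ret)


print(remove_html_stuff("<p>Lorem&nbsp;ipsum</p>"))
```

With NFKD applied, the non-breaking space from &nbsp; decomposes to a regular space, which is exactly the encoding issue the PR description mentions.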
Comment on lines +131 to +151
def add_space_after_closing_sign(string: str) -> str:
"""
Add a space after a closing sign if there is not already one
Args:
string (str): the string to clean
Returns:
str: the cleaned string
"""
return re.sub(r"([.»\")\]}])(?=[^\s.,;:!?)»\]}])", r"\1 ", string)


def add_space_before_capital_letter(string: str) -> str:
"""
Add a space before a capital letter if there is not already one
Args:
string (str): the string to clean
Returns:
str: the cleaned string
"""
return re.sub(r"([a-zàâäéèêëîïôöùûüÿç])([A-ZÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ])", r"\1 \2", string)

Comment on lines +141 to +146
"""
The content of the page is not well formatted; we clean it by adding spaces after closing signs and before capital letters.

:param content: the content of the page as a string
:return: the content of the page with the correct syntax
"""
Comment on lines 137 to 139
if __name__ == "__main__":
load_dotenv_local()
main()
Comment on lines 233 to 235
if __name__ == "__main__":
load_dotenv_local()
main()
Comment on lines +32 to +37
def test_clean_return_to_line(self):
input_str = "Lorem.\n" "Ipsum"

awaited_str = "Lorem.Ipsum"
ret = clean_return_to_line(input_str)
self.assertEqual(ret, awaited_str)
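clean_return_to_line itself is not shown in this PR; inferred from the test expectation above, a minimal stand-in would simply drop line breaks (this is an assumption about the real helper in scraping_utils):

```python
import re


def clean_return_to_line(string: str) -> str:
    # Remove line breaks entirely, so "Lorem.\nIpsum" becomes "Lorem.Ipsum";
    # downstream helpers such as add_space_after_closing_sign then restore
    # spacing where needed.
    return re.sub(r"[\r\n]+", "", string)


print(clean_return_to_line("Lorem.\nIpsum"))  # -> "Lorem.Ipsum"
```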