Skip to content

fix(warehouses): Catch nan strings that look like floats to Python.#338

Merged
martyngigg merged 1 commit into
mainfrom
check-str-before-strip-html
Jun 8, 2026
Merged

fix(warehouses): Catch nan strings that look like floats to Python.#338
martyngigg merged 1 commit into
mainfrom
check-str-before-strip-html

Conversation

@martyngigg

@martyngigg martyngigg commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

An error:

File "/home/ubuntu/var/git/warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py", line 112, in to_text
    return html2text(x) if x is not None else x
           ~~~~~~~~~^^^
  File "/home/ubuntu/.cache/uv/environments-v2/opralogweb-60e3316910e64632/lib/python3.13/site-packages/html2text/__init__.py", line 1028, in html2text
    return h.handle(html)
           ~~~~~~~~^^^^^^
  File "/home/ubuntu/.cache/uv/environments-v2/opralogweb-60e3316910e64632/lib/python3.13/site-packages/html2text/__init__.py", line 149, in handle
    self.feed(data)
    ~~~~~~~~~^^^^^^
  File "/home/ubuntu/.cache/uv/environments-v2/opralogweb-60e3316910e64632/lib/python3.13/site-packages/html2text/__init__.py", line 144, in feed
    data = data.replace("</' + 'script>", "</ignore>")
           ^^^^^^^^^^^^
AttributeError: 'float' object has no attribute 'replace'

appeared in the Opralogweb extract logs. It came from a text containing 'nan' that had been converted to a Python float and html2text failed on it. This adds protection to only process strings.

Note that this was spotted by a user as we don't have the required visibility of the ELT pipelines.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved handling of additional comment conversion during markdown formatting.

@martyngigg martyngigg requested a review from a team as a code owner June 8, 2026 16:05
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8ab9d930-5d40-44ef-b18b-3f19ab1099e1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR updates the conditional logic in the additional_comment_to_markdown function within the opralogweb accelerator ingestion module. A single-line change modifies the lambda condition that determines whether to preserve original AdditionalComment values or apply HTML-to-text conversion.

Changes

AdditionalComment markdown conversion

Layer / File(s) Summary
AdditionalComment HTML-to-markdown conversion condition
warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py
The conditional check inside the apply lambda for converting AdditionalComment values is modified, changing the logic that determines when html2text conversion is applied versus when the original value is retained.

Poem

🐰 A tiny fix, though logic seems quite tight,
A clause revised to set the conversion right,
From None to strings, the condition takes its turn,
One line adjusted—a subtle markdown lesson to learn! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title describes catching 'nan' strings that look like floats, which aligns with the PR's objective to prevent html2text being called on float-converted values, matching the main change.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

An exception was beong thrown saying float did not
have a replace method. The value of the cell was nan.
Just leave this as it is.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py (1)

108-121: ⚠️ Potential issue | 🟠 Major

Add a test that covers the pandas NaN/"nan" edge case for AdditionalComment markdown conversion.
The existing opralogweb integration test (warehouses/facility_ops_landing/ingest/accelerator/opralogweb/tests/test_extract_and_load.py) only generates AdditionalComment as normal strings via the fake source (AdditionalComment=f"<p>{record['AdditionalComment']}</p>"), so it never exercises None, float('nan'), or the "nan" string case pandas may coerce.
Add a focused unit test for additional_comment_to_markdown covering: None, float('nan')/pandas-NaN, "nan" as a string, and a valid HTML string; assert the expected output for each.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py`
around lines 108 - 121, Add a focused unit test for the function
additional_comment_to_markdown that builds a pyarrow.Table with
AdditionalComment values [None, float('nan'), "nan", "<p>bold</p>"], calls
additional_comment_to_markdown, and asserts the results: the first two entries
are pandas.isna (to cover None and pandas/NumPy NaN), the third equals the
literal "nan", and the fourth equals html2text("<p>bold</p>") (use html2text to
compute the expected markdown). Place this new test alongside the existing tests
(e.g., in a new test_additional_comment_to_markdown test module) so it runs with
the existing test suite and references additional_comment_to_markdown and
html2text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py`:
- Line 117: The lambda's condition (x is None and isinstance(x, str)) is
impossible and causes html2text to be called on non-strings (e.g., NaN floats);
change the expression to only call html2text when x is actually a string—replace
the lambda with something like: use html2text(x) if isinstance(x, str) else x
(or, if you need to treat NaN specially, check math.isnan(x) for floats and
return x unchanged). Update the lambda where html2text is used so it guards with
isinstance(x, str) (and import math/numpy.isnan if you choose the NaN check).

---

Outside diff comments:
In `@warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py`:
- Around line 108-121: Add a focused unit test for the function
additional_comment_to_markdown that builds a pyarrow.Table with
AdditionalComment values [None, float('nan'), "nan", "<p>bold</p>"], calls
additional_comment_to_markdown, and asserts the results: the first two entries
are pandas.isna (to cover None and pandas/NumPy NaN), the third equals the
literal "nan", and the fourth equals html2text("<p>bold</p>") (use html2text to
compute the expected markdown). Place this new test alongside the existing tests
(e.g., in a new test_additional_comment_to_markdown test module) so it runs with
the existing test suite and references additional_comment_to_markdown and
html2text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c119d22b-67af-46f9-b8bd-6d987b7ba549

📥 Commits

Reviewing files that changed from the base of the PR and between fd4bb45 and 275c503.

📒 Files selected for processing (1)
  • warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py

Comment thread warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py Outdated
@martyngigg martyngigg force-pushed the check-str-before-strip-html branch from 275c503 to 1c11568 Compare June 8, 2026 16:11
@martyngigg martyngigg requested a review from a team as a code owner June 8, 2026 16:11
@martyngigg

Copy link
Copy Markdown
Member Author

Test failure is an error pulling container images. Fix needed urgently.

@martyngigg martyngigg merged commit b93399d into main Jun 8, 2026
2 of 4 checks passed
@martyngigg martyngigg deleted the check-str-before-strip-html branch June 8, 2026 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant