fix(warehouses): Catch nan strings that look like floats to Python. by martyngigg · Pull Request #338 · ISISNeutronMuon/analytics-data-platform

martyngigg · 2026-06-08T16:05:12Z

Summary

An error:

File "/home/ubuntu/var/git/warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py", line 112, in to_text
    return html2text(x) if x is not None else x
           ~~~~~~~~~^^^
  File "/home/ubuntu/.cache/uv/environments-v2/opralogweb-60e3316910e64632/lib/python3.13/site-packages/html2text/__init__.py", line 1028, in html2text
    return h.handle(html)
           ~~~~~~~~^^^^^^
  File "/home/ubuntu/.cache/uv/environments-v2/opralogweb-60e3316910e64632/lib/python3.13/site-packages/html2text/__init__.py", line 149, in handle
    self.feed(data)
    ~~~~~~~~~^^^^^^
  File "/home/ubuntu/.cache/uv/environments-v2/opralogweb-60e3316910e64632/lib/python3.13/site-packages/html2text/__init__.py", line 144, in feed
    data = data.replace("</' + 'script>", "</ignore>")
           ^^^^^^^^^^^^
AttributeError: 'float' object has no attribute 'replace'

appeared in the Opralogweb extract logs. It came from a text containing 'nan' that had been converted to a Python float and html2text failed on it. This adds protection to only process strings.

Note that this was spotted by a user as we don't have the required visibility of the ELT pipelines.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved handling of additional comment conversion during markdown formatting.

coderabbitai · 2026-06-08T16:07:49Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8ab9d930-5d40-44ef-b18b-3f19ab1099e1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

The PR updates the conditional logic in the additional_comment_to_markdown function within the opralogweb accelerator ingestion module. A single-line change modifies the lambda condition that determines whether to preserve original AdditionalComment values or apply HTML-to-text conversion.

Changes

AdditionalComment markdown conversion

Layer / File(s)	Summary
AdditionalComment HTML-to-markdown conversion condition `warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py`	The conditional check inside the apply lambda for converting AdditionalComment values is modified, changing the logic that determines when html2text conversion is applied versus when the original value is retained.

Poem

🐰 A tiny fix, though logic seems quite tight,
A clause revised to set the conversion right,
From None to strings, the condition takes its turn,
One line adjusted—a subtle markdown lesson to learn! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title describes catching 'nan' strings that look like floats, which aligns with the PR's objective to prevent html2text being called on float-converted values, matching the main change.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

An exception was beong thrown saying float did not have a replace method. The value of the cell was nan. Just leave this as it is.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py (1)
108-121: ⚠️ Potential issue | 🟠 Major

Add a test that covers the pandas NaN/"nan" edge case for AdditionalComment markdown conversion.
The existing opralogweb integration test (warehouses/facility_ops_landing/ingest/accelerator/opralogweb/tests/test_extract_and_load.py) only generates AdditionalComment as normal strings via the fake source (AdditionalComment=f"{record['AdditionalComment']}"), so it never exercises None, float('nan'), or the "nan" string case pandas may coerce.
Add a focused unit test for additional_comment_to_markdown covering: None, float('nan')/pandas-NaN, "nan" as a string, and a valid HTML string; assert the expected output for each.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py`
around lines 108 - 121, Add a focused unit test for the function
additional_comment_to_markdown that builds a pyarrow.Table with
AdditionalComment values [None, float('nan'), "nan", "bold"], calls
additional_comment_to_markdown, and asserts the results: the first two entries
are pandas.isna (to cover None and pandas/NumPy NaN), the third equals the
literal "nan", and the fourth equals html2text("bold") (use html2text to
compute the expected markdown). Place this new test alongside the existing tests
(e.g., in a new test_additional_comment_to_markdown test module) so it runs with
the existing test suite and references additional_comment_to_markdown and
html2text.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py`:
- Line 117: The lambda's condition (x is None and isinstance(x, str)) is
impossible and causes html2text to be called on non-strings (e.g., NaN floats);
change the expression to only call html2text when x is actually a string—replace
the lambda with something like: use html2text(x) if isinstance(x, str) else x
(or, if you need to treat NaN specially, check math.isnan(x) for floats and
return x unchanged). Update the lambda where html2text is used so it guards with
isinstance(x, str) (and import math/numpy.isnan if you choose the NaN check).

---

Outside diff comments:
In `@warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py`:
- Around line 108-121: Add a focused unit test for the function
additional_comment_to_markdown that builds a pyarrow.Table with
AdditionalComment values [None, float('nan'), "nan", "<p>bold</p>"], calls
additional_comment_to_markdown, and asserts the results: the first two entries
are pandas.isna (to cover None and pandas/NumPy NaN), the third equals the
literal "nan", and the fourth equals html2text("<p>bold</p>") (use html2text to
compute the expected markdown). Place this new test alongside the existing tests
(e.g., in a new test_additional_comment_to_markdown test module) so it runs with
the existing test suite and references additional_comment_to_markdown and
html2text.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c119d22b-67af-46f9-b8bd-6d987b7ba549

📥 Commits

Reviewing files that changed from the base of the PR and between fd4bb45 and 275c503.

📒 Files selected for processing (1)

warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py

martyngigg · 2026-06-08T16:15:54Z

Test failure is an error pulling container images. Fix needed urgently.

martyngigg requested a review from a team as a code owner June 8, 2026 16:05

Catch nan strings that look like floats to Python.

1c11568

An exception was beong thrown saying float did not have a replace method. The value of the cell was nan. Just leave this as it is.

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread warehouses/facility_ops_landing/ingest/accelerator/opralogweb/opralogweb.py Outdated

martyngigg force-pushed the check-str-before-strip-html branch from 275c503 to 1c11568 Compare June 8, 2026 16:11

martyngigg requested a review from a team as a code owner June 8, 2026 16:11

martyngigg merged commit b93399d into main Jun 8, 2026
2 of 4 checks passed

martyngigg deleted the check-str-before-strip-html branch June 8, 2026 16:16

martyngigg mentioned this pull request Jun 9, 2026

Implement Apache Airflow as ELT orchestration tool #320

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(warehouses): Catch nan strings that look like floats to Python.#338

fix(warehouses): Catch nan strings that look like floats to Python.#338
martyngigg merged 1 commit into
mainfrom
check-str-before-strip-html

martyngigg commented Jun 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

martyngigg commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

martyngigg commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

martyngigg commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

martyngigg commented Jun 8, 2026 •

edited

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading