Fix: Handle malformed CSV lines in WikiInfo parser #833

Open
Annu881 wants to merge 1 commit into dbpedia:master from Annu881:fix/wikiinfo-csv-parsing

Annu881 commented Feb 11, 2026

Fixes #831

Changes

  • Updated WikiInfo.scala to validate the field count before parsing each line.
  • Added regex validation for language codes to prevent invalid entries.
  • Replaced exceptions with logged warnings for malformed lines.
  • Modified fromLine to return None instead of throwing exceptions, so the extraction pipeline skips malformed lines and continues.
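
For readers without the diff at hand, the guard-and-skip pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the code in the PR: `WikiInfoSketch` is a hypothetical, stripped-down stand-in for the real class, the language-code pattern is an assumed placeholder for `ConfigUtils.LanguageRegex`, and the comma split is a simplification of the actual CSV handling. The minimum of 15 fields and the field index 2 come from the review comments on this PR.

```scala
import java.util.logging.Logger

// Hypothetical, simplified stand-in for the real WikiInfo class.
final case class WikiInfoSketch(wikiCode: String)

object WikiInfoSketch {
  private val logger = Logger.getLogger(getClass.getName)

  // Assumed placeholder; the PR validates against ConfigUtils.LanguageRegex.
  private val LanguageCode = "[a-z][a-z0-9-]+".r

  // Minimum field count, as stated in the review summary.
  private val MinFields = 15

  /** Returns Some(info) for a usable line, None (plus a warning) otherwise. */
  def fromLine(line: String): Option[WikiInfoSketch] = {
    // Naive split for illustration; limit -1 keeps trailing empty fields.
    val fields = line.split(",", -1)

    if (fields.length < MinFields) {
      logger.warning(s"expected at least [$MinFields] fields, found [${fields.length}] in line [$line] - skipping line")
      None
    } else {
      val wikiCode = fields(2)
      if (!LanguageCode.pattern.matcher(wikiCode).matches()) {
        logger.warning(s"expected language code in field with index [2], found line [$line] - skipping line")
        None
      } else {
        Some(WikiInfoSketch(wikiCode))
      }
    }
  }
}
```

Returning `Option` moves the decision about bad input to the caller, which can drop `None` results instead of catching exceptions.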

Testing

  • Tested using download.test.properties (Yiddish wiki).
  • Verified that data downloads and extraction complete successfully even with malformed CSV lines.
  • Confirmed that warnings are properly logged for problematic lines.
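
A small check along these lines could complement the manual run. It is only a sketch under assumptions: `parseWikiCode` is a hypothetical stand-in for `WikiInfo.fromLine`, the sample lines are made up, and the language-code pattern is a placeholder rather than the project's `ConfigUtils.LanguageRegex`.

```scala
object SkipMalformedLines {
  // Hypothetical stand-in for WikiInfo.fromLine:
  // Some(code) for a usable line, None for a malformed one.
  def parseWikiCode(line: String): Option[String] = {
    val fields = line.split(",", -1)
    if (fields.length < 15) None
    else Some(fields(2)).filter(_.matches("[a-z][a-z0-9-]+"))
  }

  // Made-up sample input: one valid line, one truncated line,
  // and one line with an invalid language code.
  val validLine: String =
    Seq.tabulate(15)(i => if (i == 2) "yi" else s"f$i").mkString(",")
  val sampleLines: Seq[String] =
    Seq(validLine, "truncated,line", validLine.replace("yi", "??"))

  // flatMap drops the None results, so a bad line is skipped
  // instead of aborting the whole run.
  def parseAll(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => parseWikiCode(line))

  def main(args: Array[String]): Unit =
    println(parseAll(sampleLines))
}
```

With the sample input, only the valid line survives; the other two are dropped without an exception.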

Summary by CodeRabbit

  • Bug Fixes
    • Improved data processing resilience. Invalid records are now handled gracefully with warning logs instead of triggering application crashes, increasing system stability and uptime when encountering malformed data.

coderabbitai bot commented Feb 11, 2026

Walkthrough

The WikiInfo.scala parser has been modified to gracefully handle malformed CSV lines by replacing exception-throwing logic with guarded early returns. When encountering lines with fewer than 15 fields or invalid language codes, the parser now logs warnings and returns None, allowing processing to continue rather than aborting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CSV Validation with Graceful Error Handling (core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala) | Replaced two exception-throwing code paths in the fromLine method with guard conditions that log warnings and return None. Added validation for field count (< 15 fields) and language code validity before parsing, preventing crashes on malformed CSV lines. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly summarizes the main change: replacing exceptions with graceful handling of malformed CSV lines in the WikiInfo parser, which matches the core modification. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #831 are met: field count validation prevents crashes on lines with fewer than 15 fields, language code validation prevents invalid entries, exceptions are replaced with logged warnings, and None is returned for malformed lines allowing pipeline continuation. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the stated objectives: field validation, language code validation, and exception-to-logging conversion in WikiInfo.scala align precisely with issue #831 requirements. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (1)

79-83: Use wikiCode instead of fields(2) in the regex check for consistency.

wikiCode is already assigned on Line 79; reuse it on Line 80 to avoid the redundant array access.

Proposed fix
      val wikiCode = fields(2)
-      if (! ConfigUtils.LanguageRegex.pattern.matcher(fields(2)).matches) {
+      if (! ConfigUtils.LanguageRegex.pattern.matcher(wikiCode).matches) {
        logger.warning("expected language code in field with index [2], found line ["+line+"] - skipping line")
        return None
      }

Development

Successfully merging this pull request may close these issues.

Crash during download due to malformed CSV lines in wikipedias.csv
