
refactor: Replace hardcoded URIs with ontology lookups and remove whitespaces in GenderExtractor #824

Open
vaibhav45sktech wants to merge 9 commits into dbpedia:master from vaibhav45sktech:fix-gender-extractor

Conversation

@vaibhav45sktech
Contributor

@vaibhav45sktech vaibhav45sktech commented Jan 24, 2026

Replaces hardcoded URI strings with context.ontology lookups and improves code quality.

Changes:

  • Use context.ontology.properties() and context.ontology.classes() instead of raw URIs
  • Fix pronoun regex: word boundaries + case-insensitive + proper escaping
  • Pre-instantiate langStringDatatype at class level
  • Handle division-by-zero in gender ratio calculation
  • Clean up whitespace and formatting
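
For reviewers, the pronoun regex change can be illustrated with a minimal sketch. It assumes the pattern quoted later in this review, `(?i)\b` + `Regex.quote(pronoun)` + `\b`; the object and method names (`PronounCountSketch`, `pronounRegex`, `count`) are hypothetical and do not appear in the PR.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of the PR: shows the effect of word
// boundaries, case-insensitivity and escaping on pronoun counting.
object PronounCountSketch {
  // (?i) makes the match case-insensitive, \b restricts it to whole words,
  // and Regex.quote escapes any regex metacharacters in the pronoun.
  def pronounRegex(pronoun: String): Regex =
    new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")

  // Number of whole-word occurrences of the pronoun in the text.
  def count(pronoun: String, text: String): Int =
    pronounRegex(pronoun).findAllIn(text).size
}
```

With this pattern, "He" and "he" are both counted, while "the", "theory" and "helped" are not, because the match must start and end on a word boundary.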

Resolves issue #825

Summary by CodeRabbit

  • Improvements

    • More accurate gender extraction using language-aware pronoun counting and stricter person-entity validation
    • Emits a single language-tagged string result when confidence thresholds are met; suppresses output otherwise
    • Applies minimum-count and dominance-difference thresholds to reduce false positives
  • Refactor

    • Consolidated processing to compute pronoun counts and decide output before emission
    • Uses ontology-driven properties and class checks for consistent validation

@coderabbitai

coderabbitai bot commented Jan 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Verifies the subject is a dbo:Person via context.ontology, counts language-specific pronouns in the page text, determines a dominant gender using minCount and minDifference thresholds, and emits a single foaf:gender quad with rdf:langString when thresholds are satisfied.
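
The threshold step described above might look roughly like the following sketch. It assumes that `minDifference` is an absolute gap between the top count and the runner-up; the PR may instead use a ratio, which is not visible in this conversation. The object and method names are hypothetical.

```scala
// Hypothetical sketch of the dominant-gender decision; the real extractor
// may compute the difference differently (for example as a ratio).
object DominantGenderSketch {
  // Returns the dominant gender only if it occurs at least minCount times
  // and leads the second-highest count by at least minDifference.
  def dominant(counts: Map[String, Int], minCount: Int, minDifference: Int): Option[String] =
    if (counts.isEmpty) None
    else {
      // Sort genders by descending count.
      val sorted = counts.toSeq.sortBy { case (_, n) => -n }
      val (topGender, topCount) = sorted.head
      // With a single gender configured, the runner-up count is treated as 0.
      val secondCount = sorted.drop(1).headOption.map(_._2).getOrElse(0)
      if (topCount >= minCount && topCount - secondCount >= minDifference) Some(topGender)
      else None
    }
}
```

Returning an `Option` keeps the "no output" branch explicit: the caller emits a quad only for `Some(gender)`.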

Changes

| Cohort / File(s) | Summary |
|---|---|
| Gender Extractor Logic: core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala | Replaced heuristic with pronoun-based extraction; added ontology-based rdf:type (Person) check; load language-aware pronoun map; read full page text and count pronouns with case-insensitive word-boundary matching; aggregate per-gender counts, compute dominant gender and threshold checks (minCount, minDifference); use context.ontology for URIs; emit single foaf:gender Quad with rdf:langString only when thresholds met; removed prior multi-pass/early-return flow. |

Sequence Diagram(s)

sequenceDiagram
    participant WikiPage
    participant GenderExtractor
    participant OntologyContext as Context/Ontology
    participant QuadSink as QuadWriter

    WikiPage->>GenderExtractor: provide Page (text, meta)
    GenderExtractor->>OntologyContext: query rdf:type for subject
    OntologyContext-->>GenderExtractor: rdf:type(s) (e.g., dbo:Person)
    alt is Person
        GenderExtractor->>GenderExtractor: load pronoun map for language
        GenderExtractor->>GenderExtractor: count pronouns (word-boundary, case-insensitive)
        GenderExtractor->>GenderExtractor: aggregate counts, compute max/second, apply minCount/minDifference
        alt thresholds met
            GenderExtractor->>QuadSink: emit foaf:gender Quad (rdf:langString)
        else thresholds not met
            GenderExtractor-->>QuadSink: no output
        end
    else not Person
        GenderExtractor-->>QuadSink: no output
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the primary refactoring focus: replacing hardcoded URIs with ontology lookups and removing whitespace in GenderExtractor, which aligns with the main changes described. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (2)

26-45: Inconsistent indentation throughout the class body.

Continuation lines for val declarations (e.g., Lines 28, 31, 34, 36, 38, 40) sit at the same indent level as their private val keywords, making them visually ambiguous—they look like standalone statements rather than value expressions. Additionally, Lines 43–45 inside extract are indented at the class level (2 spaces) instead of the method-body level (4 spaces), while Lines 47+ correctly use 4-space indentation.

🔧 Suggested indentation fix (representative excerpt)
   private val language: String =
-  context.language.wikiCode
+    context.language.wikiCode

   private val pronounMap: Map[String, String] =
-  GenderExtractorConfig.pronounsMap(language)
+    GenderExtractorConfig.pronounsMap(language)

   private val genderProperty =
-  context.ontology.properties("foaf:gender")
+    context.ontology.properties("foaf:gender")
   private val typeProperty =
-  context.ontology.properties("rdf:type")
+    context.ontology.properties("rdf:type")
   private val personClass =
-  context.ontology.classes("Person")
+    context.ontology.classes("Person")
   private val langStringDatatype =
-  new Datatype("rdf:langString")
+    new Datatype("rdf:langString")

   override def extract(node: PageNode, subjectUri: String): Seq[Quad] = {
-  /** First pass: extract mappings to detect rdf:type */
-  val mappingGraph: Seq[Quad] =
-  super.extract(node, subjectUri)
+    /** First pass: extract mappings to detect rdf:type */
+    val mappingGraph: Seq[Quad] =
+      super.extract(node, subjectUri)

60-73: Pre-compile pronoun regexes at the class level.

new Regex(...) is created for every pronoun on every extract() call. Since pronounMap is immutable and fixed at construction time, compile the regexes once to avoid repeated work per page.

♻️ Suggested refactor

Add a class-level val after pronounMap:

+  /** Pre-compiled pronoun regexes: (regex, gender) */
+  private val pronounRegexes: Seq[(Regex, String)] =
+    pronounMap.map { case (pronoun, gender) =>
+      (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender)
+    }.toSeq

Then simplify the counting loop inside extract:

     var genderCounts: Map[String, Int] =
       Map.empty.withDefaultValue(0)

-    for ((pronoun, gender) <- pronounMap) {
-      val regex =
-        new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
-
-      val count =
-        regex.findAllIn(wikiText).size
-
+    for ((regex, gender) <- pronounRegexes) {
+      val count = regex.findAllIn(wikiText).size
       genderCounts =
         genderCounts.updated(gender, genderCounts(gender) + count)
     }


@vaibhav45sktech
Contributor Author

Greetings @jimkont, kindly review my PR whenever you are available.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 33-34: The code calls GenderExtractorConfig.pronounsMap(language)
directly which throws NoSuchElementException for unsupported languages; update
GenderExtractor to validate the language before accessing the map by either
adding a require(GenderExtractorConfig.supportedLanguages.contains(language),
"unsupported language: "+language) at the start of the GenderExtractor
constructor/object or use a safe lookup like
GenderExtractorConfig.pronounsMap.getOrElse(language, Map.empty) when assigning
pronounMap so unsupported languages are handled gracefully; reference symbols:
pronounMap, GenderExtractorConfig.pronounsMap,
GenderExtractorConfig.supportedLanguages, GenderExtractor.
- Around line 36-43: Replace unsafe Map.apply lookups for genderProperty,
typeProperty and personClass with the safe Ontology getters: use
context.ontology.getOntologyProperty("foaf:gender"),
getOntologyProperty("rdf:type") and context.ontology.getOntologyClass("Person")
and handle the Option by calling .getOrElse(throw new
IllegalStateException(...)) with clear messages; update the private vals
genderProperty, typeProperty and personClass to use these safe getters so
missing ontology entries produce a controlled IllegalStateException instead of
NoSuchElementException.
🧹 Nitpick comments (4)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (4)

20-27: Residual blank lines contradict the "clean up whitespace" objective.

Lines 22–25 contain four consecutive blank lines between the constructor's closing paren and extends MappingExtractor(context). These appear to be leftover formatting artifacts.

♻️ Suggested cleanup
     def redirects : Redirects
   }
-
-
-
-
-)
-extends MappingExtractor(context)
+) extends MappingExtractor(context)
 {

45-48: Inconsistent indentation inside extract body.

Lines 46–48 (/** First pass … */, val mappingGraph, super.extract(…)) sit at 2-space indent — the same level as class members — while the rest of the method body (lines 50+) is indented at 4+ spaces. This applies to the class-level vals (lines 29–44) as well, where the assigned values appear at the class-body margin rather than indented from the =.

Please apply a consistent indentation style throughout the file.


64-76: Pre-compile pronoun regexes at the class level instead of inside extract.

new Regex(…) is called for every pronoun on every page. Since pronounMap is fixed at construction time, these regexes can be compiled once and reused, avoiding redundant compilation across potentially millions of pages.

♻️ Suggested refactor

Add a class-level field:

private val pronounRegexes: Seq[(Regex, String)] =
  pronounMap.toSeq.map { case (pronoun, gender) =>
    (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender)
  }

Then simplify the loop in extract:

-    var genderCounts: Map[String, Int] =
-      Map.empty.withDefaultValue(0)
-
-    for ((pronoun, gender) <- pronounMap) {
-      val regex =
-        new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
-
-      val count =
-        regex.findAllIn(wikiText).size
-
-      genderCounts =
-        genderCounts.updated(gender, genderCounts(gender) + count)
-    }
+    val genderCounts: Map[String, Int] =
+      pronounRegexes.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
+        case (acc, (regex, gender)) =>
+          val count = regex.findAllIn(wikiText).size
+          acc.updated(gender, acc(gender) + count)
+      }

78-78: genderCounts.isEmpty is only true when pronounMap is empty — consider clarifying intent.

Because the loop on lines 67–76 always inserts an entry (even when count is 0), genderCounts will only be empty when pronounMap itself is empty. The guard works as a safety net for unconfigured languages but reads as though it's checking "no pronouns found in the text." A brief inline comment clarifying the intent would help future readers.
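
A small sketch of why the guard behaves this way, and of a check that matches the "no pronouns found" reading. The names `EmptyCountsSketch`, `countGenders` and `noPronounsFound` are hypothetical; the counting mirrors the loop shown in this review rather than the exact code in the PR.

```scala
import scala.util.matching.Regex

// Hypothetical sketch: every configured pronoun contributes an entry for its
// gender, so the result is empty only when the pronoun map itself is empty.
object EmptyCountsSketch {
  def countGenders(pronounMap: Map[String, String], text: String): Map[String, Int] =
    pronounMap.foldLeft(Map.empty[String, Int]) { case (acc, (pronoun, gender)) =>
      val regex = new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
      val n = regex.findAllIn(text).size
      // An entry is written even when n is 0.
      acc.updated(gender, acc.getOrElse(gender, 0) + n)
    }

  // Expresses "no pronouns were found in the text" directly.
  def noPronounsFound(counts: Map[String, Int]): Boolean =
    counts.values.forall(_ == 0)
}
```

If the intent is to skip pages without any pronoun hits, a check along the lines of `noPronounsFound` states that more directly than `isEmpty`.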

Comment on lines +33 to +34
private val pronounMap: Map[String, String] =
GenderExtractorConfig.pronounsMap(language)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find GenderExtractor and GenderExtractorConfig files
find . -type f -name "*GenderExtractor*.scala" | head -20

Repository: dbpedia/extraction-framework

Length of output: 236


🏁 Script executed:

#!/bin/bash
# Search for GenderExtractorConfig to understand pronounsMap implementation
rg -n "pronounsMap" --type scala -B 3 -A 3

Repository: dbpedia/extraction-framework

Length of output: 2702


🏁 Script executed:

#!/bin/bash
# Look at the full context of GenderExtractor.scala around lines 33-34
fd "GenderExtractor.scala$" -x cat -n {} | head -60

Repository: dbpedia/extraction-framework

Length of output: 2341


🏁 Script executed:

#!/bin/bash
# Check where GenderExtractor is instantiated
rg -n "GenderExtractor" --type scala -B 2 -A 2 | grep -A 2 -B 2 "new\|GenderExtractor("

Repository: dbpedia/extraction-framework

Length of output: 604


🏁 Script executed:

#!/bin/bash
# Check if supportedLanguages is used anywhere for validation
rg -n "supportedLanguages" --type scala -B 2 -A 2

Repository: dbpedia/extraction-framework

Length of output: 8474


🏁 Script executed:

#!/bin/bash
# Look for tests related to GenderExtractor
fd ".*GenderExtractor.*Test.*" --type f

Repository: dbpedia/extraction-framework

Length of output: 54


Add language validation before accessing pronounsMap.

GenderExtractorConfig.pronounsMap(language) will throw NoSuchElementException for unconfigured languages since it uses unsafe Map.apply() without validation. Similar extractors in the codebase (MediaExtractor, PndExtractor, PersondataExtractor, ImageExtractor) guard with require() before accessing their config maps. Add a require(GenderExtractorConfig.supportedLanguages.contains(language), ...) or use getOrElse(language, Map.empty) to handle unsupported languages gracefully.
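
The two options can be compared in a self-contained sketch. `LanguageGuardSketch` is a hypothetical stand-in for GenderExtractorConfig, and its pronoun data is made up for illustration; only the `require` and `getOrElse` calls reflect the suggestion itself.

```scala
// Hypothetical stand-in for GenderExtractorConfig; the real pronoun data
// and field types may differ.
object LanguageGuardSketch {
  val pronounsMap: Map[String, Map[String, String]] =
    Map("en" -> Map("he" -> "male", "she" -> "female"))
  val supportedLanguages: Set[String] = pronounsMap.keySet

  // Option 1: fail fast with a descriptive IllegalArgumentException.
  def strictLookup(language: String): Map[String, String] = {
    require(supportedLanguages.contains(language), "unsupported language: " + language)
    pronounsMap(language)
  }

  // Option 2: fall back to an empty pronoun map, so the extractor
  // simply produces no output for unsupported languages.
  def lenientLookup(language: String): Map[String, String] =
    pronounsMap.getOrElse(language, Map.empty)
}
```

The strict variant surfaces a misconfiguration at construction time, while the lenient variant lets the extractor stay in a shared extractor list without failing for languages that have no pronoun data.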


Comment on lines +36 to +43
private val genderProperty =
context.ontology.properties("foaf:gender")
private val typeProperty =
context.ontology.properties("rdf:type")
private val personClass =
context.ontology.classes("Person")
private val langStringDatatype =
new Datatype("rdf:langString")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the Ontology class definition
fd -e scala -name "*.scala" | head -20 | xargs grep -l "class Ontology"

Repository: dbpedia/extraction-framework

Length of output: 304


🏁 Script executed:

# Look for the Ontology class file
rg -l "class Ontology" --type scala

Repository: dbpedia/extraction-framework

Length of output: 989


🏁 Script executed:

# Check how properties and classes are defined/accessed in the Ontology class
rg -n "class Ontology|def properties|def classes|val properties|val classes" --type scala -B2 -A5

Repository: dbpedia/extraction-framework

Length of output: 38353


🏁 Script executed:

# Check how GenderExtractor uses ontology
rg -n "context\.ontology\." --type scala -B2 -A2 | head -50

Repository: dbpedia/extraction-framework

Length of output: 5842


🏁 Script executed:

wc -l core/src/main/scala/org/dbpedia/extraction/ontology/Ontology.scala

Repository: dbpedia/extraction-framework

Length of output: 141


🏁 Script executed:

head -50 core/src/main/scala/org/dbpedia/extraction/ontology/Ontology.scala

Repository: dbpedia/extraction-framework

Length of output: 1999


🏁 Script executed:

# Check how the OntologyReader populates the maps
wc -l core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scala

Repository: dbpedia/extraction-framework

Length of output: 151


🏁 Script executed:

# Look for where maps are populated in OntologyReader
rg -n "new Ontology|Map\[|classes.*toMap|properties.*toMap" core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scala -A2 -B2

Repository: dbpedia/extraction-framework

Length of output: 6865


🏁 Script executed:

# Check GenderExtractor itself to see full context
wc -l core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 149


🏁 Script executed:

head -50 core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 1665


Use safe getter methods provided by the Ontology class.

Lines 36-43 use unsafe direct Map.apply calls on context.ontology.properties() and context.ontology.classes(), which will throw NoSuchElementException if the keys don't exist. The Ontology class provides safe getOntologyProperty() and getOntologyClass() methods that return Option types (see Ontology.scala lines 24-30). Use these instead:

private val genderProperty = context.ontology.getOntologyProperty("foaf:gender").getOrElse(throw new IllegalStateException("Ontology missing property 'foaf:gender'"))
private val typeProperty = context.ontology.getOntologyProperty("rdf:type").getOrElse(throw new IllegalStateException("Ontology missing property 'rdf:type'"))
private val personClass = context.ontology.getOntologyClass("Person").getOrElse(throw new IllegalStateException("Ontology missing class 'Person'"))

This pattern is also used unsafely throughout the codebase (AnchorTextExtractor, FileTypeExtractor, etc.), so consider addressing it more broadly.
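
If the same pattern is applied in several extractors, the repeated `getOrElse(throw ...)` could be factored into one helper. The sketch below works on plain maps and does not touch the real Ontology API; `SafeLookupSketch` and `required` are hypothetical names.

```scala
// Hypothetical helper over plain maps; it only illustrates turning a
// missing key into a controlled IllegalStateException.
object SafeLookupSketch {
  // getOrElse takes its default by name, so the exception is only
  // constructed and thrown when the key is actually missing.
  def required[A](table: Map[String, A], kind: String, name: String): A =
    table.getOrElse(name, throw new IllegalStateException(s"Ontology missing $kind '$name'"))
}
```

A call such as `SafeLookupSketch.required(classes, "class", "Person")` then yields either the entry or an error message that names the missing key.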



@vaibhav45sktech
Contributor Author

Greetings @jimkont, @Integer-Ctrl, could you please review my PR?

