refactor: Replace hardcoded URIs with ontology lookups and remove whitespaces in `GenderExtractor` by vaibhav45sktech · Pull Request #824 · dbpedia/extraction-framework

vaibhav45sktech · 2026-01-24T18:41:29Z

Replaces hardcoded URI strings with context.ontology lookups and improves code quality.

Changes:

- Use context.ontology.properties() and context.ontology.classes() instead of raw URIs
- Fix pronoun regex: word boundaries + case-insensitive + proper escaping
- Pre-instantiate langStringDatatype at class level
- Handle division-by-zero in gender ratio calculation
- Clean up whitespace and formatting
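
To illustrate the pronoun-regex item in the list above, here is a minimal, self-contained sketch of case-insensitive, word-boundary matching with escaped pronouns. The object name `PronounCountSketch` and the small English pronoun map are assumptions for illustration only; in the framework the map comes from `GenderExtractorConfig`, and the actual extractor code may differ.

```scala
import scala.util.matching.Regex

object PronounCountSketch {
  // Hypothetical pronoun -> gender map (illustration only; not the framework's config).
  val pronounMap: Map[String, String] = Map(
    "he" -> "male", "him" -> "male", "his" -> "male",
    "she" -> "female", "her" -> "female"
  )

  // (?i) makes the match case-insensitive, \b anchors it on word boundaries,
  // and Regex.quote escapes any regex metacharacters inside the pronoun.
  def pronounRegex(pronoun: String): Regex =
    new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")

  // Sum the matches of every pronoun into a per-gender total.
  def countByGender(text: String): Map[String, Int] =
    pronounMap.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
      case (acc, (pronoun, gender)) =>
        val count = pronounRegex(pronoun).findAllIn(text).size
        acc.updated(gender, acc(gender) + count)
    }
}
```

With word boundaries in place, "he" is not counted inside words such as "the" or "theme", and "her" is not counted inside "hers".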

Resolves issue #825

Summary by CodeRabbit

Improvements
- More accurate gender extraction using language-aware pronoun counting and stricter person-entity validation
- Emits a single language-tagged string result when confidence thresholds are met; suppresses output otherwise
- Applies minimum-count and dominance-difference thresholds to reduce false positives
Refactor
- Consolidated processing to compute pronoun counts and decide output before emission
- Uses ontology-driven properties and class checks for consistent validation

coderabbitai · 2026-01-24T18:42:06Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Verifies the subject is a dbo:Person via context.ontology, counts language-specific pronouns in the page text, determines a dominant gender using minCount and minDifference thresholds, and emits a single foaf:gender quad with rdf:langString when thresholds are satisfied.
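
The threshold step described in the walkthrough can be sketched roughly as follows. This is not the PR's code: the function name, the inclusive comparisons, and the reading of `minDifference` as a ratio between the two highest counts are all assumptions, chosen to match the "gender ratio" and division-by-zero wording in the PR description.

```scala
object DominantGenderSketch {
  // Returns the dominant gender if it clears both thresholds, otherwise None.
  // Assumption: minDifference is a minimum ratio between the top two counts.
  def dominantGender(
      counts: Map[String, Int],
      minCount: Int,
      minDifference: Double
  ): Option[String] = {
    // Rank genders by descending pronoun count.
    val ranked = counts.toSeq.sortBy { case (_, n) => -n }
    ranked.headOption.flatMap { case (gender, maxCount) =>
      val secondCount = ranked.drop(1).headOption.map(_._2).getOrElse(0)
      // Guard the ratio: when no competing pronouns were found, treat the
      // dominance as unbounded instead of dividing by zero.
      val ratio =
        if (secondCount == 0) Double.PositiveInfinity
        else maxCount.toDouble / secondCount
      if (maxCount >= minCount && ratio >= minDifference) Some(gender) else None
    }
  }
}
```

Returning an `Option` keeps the "emit one quad or nothing" decision in a single place: the caller emits a quad for `Some(gender)` and an empty sequence for `None`.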

Changes

Cohort / File(s)	Summary
Gender Extractor Logic `core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`	Replaced heuristic with pronoun-based extraction; added ontology-based rdf:type (Person) check; load language-aware pronoun map; read full page text and count pronouns with case-insensitive word-boundary matching; aggregate per-gender counts, compute dominant gender and threshold checks (`minCount`, `minDifference`); use `context.ontology` for URIs; emit single `foaf:gender` Quad with `rdf:langString` only when thresholds met; removed prior multi-pass/early-return flow.

Sequence Diagram(s)

sequenceDiagram
    participant WikiPage
    participant GenderExtractor
    participant OntologyContext as Context/Ontology
    participant QuadSink as QuadWriter

    WikiPage->>GenderExtractor: provide Page (text, meta)
    GenderExtractor->>OntologyContext: query rdf:type for subject
    OntologyContext-->>GenderExtractor: rdf:type(s) (e.g., dbo:Person)
    alt is Person
        GenderExtractor->>GenderExtractor: load pronoun map for language
        GenderExtractor->>GenderExtractor: count pronouns (word-boundary, case-insensitive)
        GenderExtractor->>GenderExtractor: aggregate counts, compute max/second, apply minCount/minDifference
        alt thresholds met
            GenderExtractor->>QuadSink: emit foaf:gender Quad (rdf:langString)
        else thresholds not met
            GenderExtractor-->>QuadSink: no output
        end
    else not Person
        GenderExtractor-->>QuadSink: no output
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the primary refactoring focus: replacing hardcoded URIs with ontology lookups and removing whitespace in GenderExtractor, which aligns with the main changes described.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (2)
26-45: Inconsistent indentation throughout the class body.

Continuation lines for val declarations (e.g., Lines 28, 31, 34, 36, 38, 40) sit at the same indent level as their private val keywords, making them visually ambiguous—they look like standalone statements rather than value expressions. Additionally, Lines 43–45 inside extract are indented at the class level (2 spaces) instead of the method-body level (4 spaces), while Lines 47+ correctly use 4-space indentation.
🔧 Suggested indentation fix (representative excerpt)
   private val language: String =
-  context.language.wikiCode
+    context.language.wikiCode

   private val pronounMap: Map[String, String] =
-  GenderExtractorConfig.pronounsMap(language)
+    GenderExtractorConfig.pronounsMap(language)

   private val genderProperty =
-  context.ontology.properties("foaf:gender")
+    context.ontology.properties("foaf:gender")
   private val typeProperty =
-  context.ontology.properties("rdf:type")
+    context.ontology.properties("rdf:type")
   private val personClass =
-  context.ontology.classes("Person")
+    context.ontology.classes("Person")
   private val langStringDatatype =
-  new Datatype("rdf:langString")
+    new Datatype("rdf:langString")

   override def extract(node: PageNode, subjectUri: String): Seq[Quad] = {
-  /** First pass: extract mappings to detect rdf:type */
-  val mappingGraph: Seq[Quad] =
-  super.extract(node, subjectUri)
+    /** First pass: extract mappings to detect rdf:type */
+    val mappingGraph: Seq[Quad] =
+      super.extract(node, subjectUri)
60-73: Pre-compile pronoun regexes at the class level.

new Regex(...) is created for every pronoun on every extract() call. Since pronounMap is immutable and fixed at construction time, compile the regexes once to avoid repeated work per page.
♻️ Suggested refactor

Add a class-level val after pronounMap:
+  /** Pre-compiled pronoun regexes: (regex, gender) */
+  private val pronounRegexes: Seq[(Regex, String)] =
+    pronounMap.map { case (pronoun, gender) =>
+      (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender)
+    }.toSeq
Then simplify the counting loop inside extract:
     var genderCounts: Map[String, Int] =
       Map.empty.withDefaultValue(0)

-    for ((pronoun, gender) <- pronounMap) {
-      val regex =
-        new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
-
-      val count =
-        regex.findAllIn(wikiText).size
-
+    for ((regex, gender) <- pronounRegexes) {
+      val count = regex.findAllIn(wikiText).size
       genderCounts =
         genderCounts.updated(gender, genderCounts(gender) + count)
     }

vaibhav45sktech · 2026-01-27T05:03:45Z

Greetings @jimkont, kindly review my PR whenever you are available.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 33-34: The code calls GenderExtractorConfig.pronounsMap(language)
directly which throws NoSuchElementException for unsupported languages; update
GenderExtractor to validate the language before accessing the map by either
adding a require(GenderExtractorConfig.supportedLanguages.contains(language),
"unsupported language: "+language) at the start of the GenderExtractor
constructor/object or use a safe lookup like
GenderExtractorConfig.pronounsMap.getOrElse(language, Map.empty) when assigning
pronounMap so unsupported languages are handled gracefully; reference symbols:
pronounMap, GenderExtractorConfig.pronounsMap,
GenderExtractorConfig.supportedLanguages, GenderExtractor.
- Around line 36-43: Replace unsafe Map.apply lookups for genderProperty,
typeProperty and personClass with the safe Ontology getters: use
context.ontology.getOntologyProperty("foaf:gender"),
getOntologyProperty("rdf:type") and context.ontology.getOntologyClass("Person")
and handle the Option by calling .getOrElse(throw new
IllegalStateException(...)) with clear messages; update the private vals
genderProperty, typeProperty and personClass to use these safe getters so
missing ontology entries produce a controlled IllegalStateException instead of
NoSuchElementException.

🧹 Nitpick comments (4)

core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (4)
20-27: Residual blank lines contradict the "clean up whitespace" objective.

Lines 22–25 contain four consecutive blank lines between the constructor's closing paren and extends MappingExtractor(context). These appear to be leftover formatting artifacts.
♻️ Suggested cleanup
     def redirects : Redirects
   } 
-     
-   
-  
-
 ) 
-extends MappingExtractor(context) 
+) extends MappingExtractor(context)
 {
45-48: Inconsistent indentation inside extract body.

Lines 46–48 (/** First pass … */, val mappingGraph, super.extract(…)) sit at 2-space indent — the same level as class members — while the rest of the method body (lines 50+) is indented at 4+ spaces. This applies to the class-level vals (lines 29–44) as well, where the assigned values appear at the class-body margin rather than indented from the =.

Please apply a consistent indentation style throughout the file.

64-76: Pre-compile pronoun regexes at the class level instead of inside extract.

new Regex(…) is called for every pronoun on every page. Since pronounMap is fixed at construction time, these regexes can be compiled once and reused, avoiding redundant compilation across potentially millions of pages.
♻️ Suggested refactor

Add a class-level field:
private val pronounRegexes: Seq[(Regex, String)] =
  pronounMap.toSeq.map { case (pronoun, gender) =>
    (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender)
  }
Then simplify the loop in extract:
-    var genderCounts: Map[String, Int] =
-      Map.empty.withDefaultValue(0)
-
-    for ((pronoun, gender) <- pronounMap) {
-      val regex =
-        new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b")
-
-      val count =
-        regex.findAllIn(wikiText).size
-
-      genderCounts =
-        genderCounts.updated(gender, genderCounts(gender) + count)
-    }
+    val genderCounts: Map[String, Int] =
+      pronounRegexes.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
+        case (acc, (regex, gender)) =>
+          val count = regex.findAllIn(wikiText).size
+          acc.updated(gender, acc(gender) + count)
+      }
78-78: genderCounts.isEmpty is only true when pronounMap is empty — consider clarifying intent.

Because the loop on lines 67–76 always inserts an entry (even when count is 0), genderCounts will only be empty when pronounMap itself is empty. The guard works as a safety net for unconfigured languages but reads as though it's checking "no pronouns found in the text." A brief inline comment clarifying the intent would help future readers.
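
A small sketch of the behaviour described above, using a hypothetical `aggregate` helper in place of the extractor's counting loop: because every pronoun contributes an entry, the map is non-empty even when every count is zero, so a "no pronouns found" check would need to look at the totals rather than at `isEmpty`.

```scala
object EmptyCountsSketch {
  // Hypothetical stand-in for the per-gender aggregation in extract():
  // each (gender, count) pair is folded into the map, including zero counts.
  def aggregate(pronounCounts: Seq[(String, Int)]): Map[String, Int] =
    pronounCounts.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
      case (acc, (gender, count)) => acc.updated(gender, acc(gender) + count)
    }

  // True only when no pronoun matched anywhere in the text.
  def noPronounsFound(counts: Map[String, Int]): Boolean =
    counts.values.sum == 0
}
```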

coderabbitai · 2026-02-08T18:00:00Z

core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala

+  private val pronounMap: Map[String, String] =
+  GenderExtractorConfig.pronounsMap(language)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find GenderExtractor and GenderExtractorConfig files
find . -type f -name "*GenderExtractor*.scala" | head -20

Repository: dbpedia/extraction-framework

Length of output: 236

🏁 Script executed:

#!/bin/bash
# Search for GenderExtractorConfig to understand pronounsMap implementation
rg -n "pronounsMap" --type scala -B 3 -A 3

Repository: dbpedia/extraction-framework

Length of output: 2702

🏁 Script executed:

#!/bin/bash
# Look at the full context of GenderExtractor.scala around lines 33-34
fd "GenderExtractor.scala$" -x cat -n {} | head -60

Repository: dbpedia/extraction-framework

Length of output: 2341

🏁 Script executed:

#!/bin/bash
# Check where GenderExtractor is instantiated
rg -n "GenderExtractor" --type scala -B 2 -A 2 | grep -A 2 -B 2 "new\|GenderExtractor("

Repository: dbpedia/extraction-framework

Length of output: 604

🏁 Script executed:

#!/bin/bash
# Check if supportedLanguages is used anywhere for validation
rg -n "supportedLanguages" --type scala -B 2 -A 2

Repository: dbpedia/extraction-framework

Length of output: 8474

🏁 Script executed:

#!/bin/bash
# Look for tests related to GenderExtractor
fd ".*GenderExtractor.*Test.*" --type f

Repository: dbpedia/extraction-framework

Length of output: 54

Add language validation before accessing pronounsMap.

GenderExtractorConfig.pronounsMap(language) will throw NoSuchElementException for unconfigured languages since it uses unsafe Map.apply() without validation. Similar extractors in the codebase (MediaExtractor, PndExtractor, PersondataExtractor, ImageExtractor) guard with require() before accessing their config maps. Add a require(GenderExtractorConfig.supportedLanguages.contains(language), ...) or use getOrElse(language, Map.empty) to handle unsupported languages gracefully.
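
The two options mentioned above can be compared in a short sketch. The object below is a hypothetical stand-in for `GenderExtractorConfig` with a single configured language; only the `require` and `getOrElse` patterns are the point, and the real config's contents and types are not reproduced here.

```scala
object LanguageLookupSketch {
  // Hypothetical stand-in for the framework's per-language pronoun config.
  val pronounsMap: Map[String, Map[String, String]] = Map(
    "en" -> Map("he" -> "male", "she" -> "female")
  )
  val supportedLanguages: Set[String] = pronounsMap.keySet

  // Option A: fail fast with a descriptive IllegalArgumentException.
  def strictLookup(language: String): Map[String, String] = {
    require(supportedLanguages.contains(language), "unsupported language: " + language)
    pronounsMap(language)
  }

  // Option B: degrade gracefully to an empty pronoun map.
  def lenientLookup(language: String): Map[String, String] =
    pronounsMap.getOrElse(language, Map.empty)
}
```

Option A surfaces a misconfigured extractor at construction time, while Option B lets extraction run and simply produce no gender output for that language; which is preferable depends on how the extractor is wired into a given run.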

coderabbitai · 2026-02-08T18:00:00Z

core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala

+  private val genderProperty =
+  context.ontology.properties("foaf:gender")
+  private val typeProperty =
+  context.ontology.properties("rdf:type")
+  private val personClass =
+  context.ontology.classes("Person")
+  private val langStringDatatype =
+  new Datatype("rdf:langString")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the Ontology class definition
fd -e scala -name "*.scala" | head -20 | xargs grep -l "class Ontology"

Repository: dbpedia/extraction-framework

Length of output: 304

🏁 Script executed:

# Look for the Ontology class file
rg -l "class Ontology" --type scala

Repository: dbpedia/extraction-framework

Length of output: 989

🏁 Script executed:

# Check how properties and classes are defined/accessed in the Ontology class
rg -n "class Ontology|def properties|def classes|val properties|val classes" --type scala -B2 -A5

Repository: dbpedia/extraction-framework

Length of output: 38353

🏁 Script executed:

# Check how GenderExtractor uses ontology
rg -n "context\.ontology\." --type scala -B2 -A2 | head -50

Repository: dbpedia/extraction-framework

Length of output: 5842

🏁 Script executed:

wc -l core/src/main/scala/org/dbpedia/extraction/ontology/Ontology.scala

Repository: dbpedia/extraction-framework

Length of output: 141

🏁 Script executed:

head -50 core/src/main/scala/org/dbpedia/extraction/ontology/Ontology.scala

Repository: dbpedia/extraction-framework

Length of output: 1999

🏁 Script executed:

# Check how the OntologyReader populates the maps
wc -l core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scala

Repository: dbpedia/extraction-framework

Length of output: 151

🏁 Script executed:

# Look for where maps are populated in OntologyReader
rg -n "new Ontology|Map\[|classes.*toMap|properties.*toMap" core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scala -A2 -B2

Repository: dbpedia/extraction-framework

Length of output: 6865

🏁 Script executed:

# Check GenderExtractor itself to see full context
wc -l core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 149

🏁 Script executed:

head -50 core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 1665

Use safe getter methods provided by the Ontology class.

Lines 36-43 use unsafe direct Map.apply calls on context.ontology.properties() and context.ontology.classes(), which will throw NoSuchElementException if the keys don't exist. The Ontology class provides safe getOntologyProperty() and getOntologyClass() methods that return Option types (see Ontology.scala lines 24-30). Use these instead:

private val genderProperty =
  context.ontology.getOntologyProperty("foaf:gender")
    .getOrElse(throw new IllegalStateException("Ontology missing property 'foaf:gender'"))
private val typeProperty =
  context.ontology.getOntologyProperty("rdf:type")
    .getOrElse(throw new IllegalStateException("Ontology missing property 'rdf:type'"))
private val personClass =
  context.ontology.getOntologyClass("Person")
    .getOrElse(throw new IllegalStateException("Ontology missing class 'Person'"))

This pattern is also used unsafely throughout the codebase (AnchorTextExtractor, FileTypeExtractor, etc.), so consider addressing it more broadly.

sonarqubecloud · 2026-02-11T14:47:54Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

vaibhav45sktech · 2026-02-11T14:51:46Z

Greetings @jimkont, @Integer-Ctrl, could you please review my PR?

vaibhav45sktech · 2026-02-22T06:17:15Z

Greetings @jimkont, @Integer-Ctrl, could you please review my PR?

vaibhav45sktech · 2026-02-23T13:18:13Z

Greetings @jimkont, @Integer-Ctrl, could you please review my PR?

updated chnages

d5867cd

This was referenced Jan 24, 2026

GenderExtractor: Replace hardcoded URI strings with ontology lookups #810

Closed

GenderExtractor: Replace hardcoded URI strings with ontology lookups #822

Open

vaibhav45sktech added 5 commits February 8, 2026 17:35

smallchnage

693d5f9

change

c533fec

indentation

d5cc749

schange

e6312c3

sschnage

436543b

coderabbitai bot reviewed Feb 8, 2026

View reviewed changes

vaibhav45sktech added 3 commits February 11, 2026 20:04

Merge branch 'dbpedia:master' into fix-gender-extractor

1b2f860

removedspaces

a9017ca

spaces

62cdd3d

vaibhav45sktech wants to merge 9 commits into dbpedia:master from vaibhav45sktech:fix-gender-extractor
