refactor: Replace hardcoded URIs with ontology lookups and remove whitespaces in GenderExtractor#824
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughVerifies the subject is a dbo:Person via Changes
Sequence Diagram(s)sequenceDiagram
participant WikiPage
participant GenderExtractor
participant OntologyContext as Context/Ontology
participant QuadSink as QuadWriter
WikiPage->>GenderExtractor: provide Page (text, meta)
GenderExtractor->>OntologyContext: query rdf:type for subject
OntologyContext-->>GenderExtractor: rdf:type(s) (e.g., dbo:Person)
alt is Person
GenderExtractor->>GenderExtractor: load pronoun map for language
GenderExtractor->>GenderExtractor: count pronouns (word-boundary, case-insensitive)
GenderExtractor->>GenderExtractor: aggregate counts, compute max/second, apply minCount/minDifference
alt thresholds met
GenderExtractor->>QuadSink: emit foaf:gender Quad (rdf:langString)
else thresholds not met
GenderExtractor-->>QuadSink: no output
end
else not Person
GenderExtractor-->>QuadSink: no output
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Greetings @jimkont ,Kindly review my pr whenever available |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`:
- Around line 33-34: The code calls GenderExtractorConfig.pronounsMap(language)
directly which throws NoSuchElementException for unsupported languages; update
GenderExtractor to validate the language before accessing the map by either
adding a require(GenderExtractorConfig.supportedLanguages.contains(language),
"unsupported language: "+language) at the start of the GenderExtractor
constructor/object or use a safe lookup like
GenderExtractorConfig.pronounsMap.getOrElse(language, Map.empty) when assigning
pronounMap so unsupported languages are handled gracefully; reference symbols:
pronounMap, GenderExtractorConfig.pronounsMap,
GenderExtractorConfig.supportedLanguages, GenderExtractor.
- Around line 36-43: Replace unsafe Map.apply lookups for genderProperty,
typeProperty and personClass with the safe Ontology getters: use
context.ontology.getOntologyProperty("foaf:gender"),
getOntologyProperty("rdf:type") and context.ontology.getOntologyClass("Person")
and handle the Option by calling .getOrElse(throw new
IllegalStateException(...)) with clear messages; update the private vals
genderProperty, typeProperty and personClass to use these safe getters so
missing ontology entries produce a controlled IllegalStateException instead of
NoSuchElementException.
🧹 Nitpick comments (4)
core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala (4)
20-27: Residual blank lines contradict the "clean up whitespace" objective.Lines 22–25 contain four consecutive blank lines between the constructor's closing paren and
extends MappingExtractor(context). These appear to be leftover formatting artifacts.♻️ Suggested cleanup
def redirects : Redirects } - - - - ) -extends MappingExtractor(context) +) extends MappingExtractor(context) {
45-48: Inconsistent indentation insideextractbody.Lines 46–48 (
/** First pass … */,val mappingGraph,super.extract(…)) sit at 2-space indent — the same level as class members — while the rest of the method body (lines 50+) is indented at 4+ spaces. This applies to the class-level vals (lines 29–44) as well, where the assigned values appear at the class-body margin rather than indented from the=.Please apply a consistent indentation style throughout the file.
64-76: Pre-compile pronoun regexes at the class level instead of insideextract.
new Regex(…)is called for every pronoun on every page. SincepronounMapis fixed at construction time, these regexes can be compiled once and reused, avoiding redundant compilation across potentially millions of pages.♻️ Suggested refactor
Add a class-level field:
private val pronounRegexes: Seq[(Regex, String)] = pronounMap.toSeq.map { case (pronoun, gender) => (new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b"), gender) }Then simplify the loop in
extract:- var genderCounts: Map[String, Int] = - Map.empty.withDefaultValue(0) - - for ((pronoun, gender) <- pronounMap) { - val regex = - new Regex("(?i)\\b" + Regex.quote(pronoun) + "\\b") - - val count = - regex.findAllIn(wikiText).size - - genderCounts = - genderCounts.updated(gender, genderCounts(gender) + count) - } + val genderCounts: Map[String, Int] = + pronounRegexes.foldLeft(Map.empty[String, Int].withDefaultValue(0)) { + case (acc, (regex, gender)) => + val count = regex.findAllIn(wikiText).size + acc.updated(gender, acc(gender) + count) + }
78-78:genderCounts.isEmptyis only true whenpronounMapis empty — consider clarifying intent.Because the loop on lines 67–76 always inserts an entry (even when
countis 0),genderCountswill only be empty whenpronounMapitself is empty. The guard works as a safety net for unconfigured languages but reads as though it's checking "no pronouns found in the text." A brief inline comment clarifying the intent would help future readers.
| private val pronounMap: Map[String, String] = | ||
| GenderExtractorConfig.pronounsMap(language) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find GenderExtractor and GenderExtractorConfig files
find . -type f -name "*GenderExtractor*.scala" | head -20Repository: dbpedia/extraction-framework
Length of output: 236
🏁 Script executed:
#!/bin/bash
# Search for GenderExtractorConfig to understand pronounsMap implementation
rg -n "pronounsMap" --type scala -B 3 -A 3Repository: dbpedia/extraction-framework
Length of output: 2702
🏁 Script executed:
#!/bin/bash
# Look at the full context of GenderExtractor.scala around lines 33-34
fd "GenderExtractor.scala$" -x cat -n {} | head -60Repository: dbpedia/extraction-framework
Length of output: 2341
🏁 Script executed:
#!/bin/bash
# Check where GenderExtractor is instantiated
rg -n "GenderExtractor" --type scala -B 2 -A 2 | grep -A 2 -B 2 "new\|GenderExtractor("Repository: dbpedia/extraction-framework
Length of output: 604
🏁 Script executed:
#!/bin/bash
# Check if supportedLanguages is used anywhere for validation
rg -n "supportedLanguages" --type scala -B 2 -A 2Repository: dbpedia/extraction-framework
Length of output: 8474
🏁 Script executed:
#!/bin/bash
# Look for tests related to GenderExtractor
fd ".*GenderExtractor.*Test.*" --type fRepository: dbpedia/extraction-framework
Length of output: 54
Add language validation before accessing pronounsMap.
GenderExtractorConfig.pronounsMap(language) will throw NoSuchElementException for unconfigured languages since it uses unsafe Map.apply() without validation. Similar extractors in the codebase (MediaExtractor, PndExtractor, PersondataExtractor, ImageExtractor) guard with require() before accessing their config maps. Add a require(GenderExtractorConfig.supportedLanguages.contains(language), ...) or use getOrElse(language, Map.empty) to handle unsupported languages gracefully.
🤖 Prompt for AI Agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`
around lines 33 - 34, The code calls GenderExtractorConfig.pronounsMap(language)
directly which throws NoSuchElementException for unsupported languages; update
GenderExtractor to validate the language before accessing the map by either
adding a require(GenderExtractorConfig.supportedLanguages.contains(language),
"unsupported language: "+language) at the start of the GenderExtractor
constructor/object or use a safe lookup like
GenderExtractorConfig.pronounsMap.getOrElse(language, Map.empty) when assigning
pronounMap so unsupported languages are handled gracefully; reference symbols:
pronounMap, GenderExtractorConfig.pronounsMap,
GenderExtractorConfig.supportedLanguages, GenderExtractor.
| private val genderProperty = | ||
| context.ontology.properties("foaf:gender") | ||
| private val typeProperty = | ||
| context.ontology.properties("rdf:type") | ||
| private val personClass = | ||
| context.ontology.classes("Person") | ||
| private val langStringDatatype = | ||
| new Datatype("rdf:langString") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find the Ontology class definition
fd -e scala -name "*.scala" | head -20 | xargs grep -l "class Ontology"Repository: dbpedia/extraction-framework
Length of output: 304
🏁 Script executed:
# Look for the Ontology class file
rg -l "class Ontology" --type scalaRepository: dbpedia/extraction-framework
Length of output: 989
🏁 Script executed:
# Check how properties and classes are defined/accessed in the Ontology class
rg -n "class Ontology|def properties|def classes|val properties|val classes" --type scala -B2 -A5Repository: dbpedia/extraction-framework
Length of output: 38353
🏁 Script executed:
# Check how GenderExtractor uses ontology
rg -n "context\.ontology\." --type scala -B2 -A2 | head -50Repository: dbpedia/extraction-framework
Length of output: 5842
🏁 Script executed:
wc -l core/src/main/scala/org/dbpedia/extraction/ontology/Ontology.scalaRepository: dbpedia/extraction-framework
Length of output: 141
🏁 Script executed:
head -50 core/src/main/scala/org/dbpedia/extraction/ontology/Ontology.scalaRepository: dbpedia/extraction-framework
Length of output: 1999
🏁 Script executed:
# Check how the OntologyReader populates the maps
wc -l core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scalaRepository: dbpedia/extraction-framework
Length of output: 151
🏁 Script executed:
# Look for where maps are populated in OntologyReader
rg -n "new Ontology|Map\[|classes.*toMap|properties.*toMap" core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scala -A2 -B2Repository: dbpedia/extraction-framework
Length of output: 6865
🏁 Script executed:
# Check GenderExtractor itself to see full context
wc -l core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scalaRepository: dbpedia/extraction-framework
Length of output: 149
🏁 Script executed:
head -50 core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scalaRepository: dbpedia/extraction-framework
Length of output: 1665
Use safe getter methods provided by the Ontology class.
Lines 36-43 use unsafe direct Map.apply calls on context.ontology.properties() and context.ontology.classes(), which will throw NoSuchElementException if the keys don't exist. The Ontology class provides safe getOntologyProperty() and getOntologyClass() methods that return Option types (see Ontology.scala lines 24-30). Use these instead:
private val genderProperty = context.ontology.getOntologyProperty("foaf:gender").getOrElse(throw new IllegalStateException("Ontology missing property 'foaf:gender'"))
private val typeProperty = context.ontology.getOntologyProperty("rdf:type").getOrElse(throw new IllegalStateException("Ontology missing property 'rdf:type'"))
private val personClass = context.ontology.getOntologyClass("Person").getOrElse(throw new IllegalStateException("Ontology missing class 'Person'"))This pattern is also used unsafely throughout the codebase (AnchorTextExtractor, FileTypeExtractor, etc.), so consider addressing it more broadly.
🤖 Prompt for AI Agents
In `@core/src/main/scala/org/dbpedia/extraction/mappings/GenderExtractor.scala`
around lines 36 - 43, Replace unsafe Map.apply lookups for genderProperty,
typeProperty and personClass with the safe Ontology getters: use
context.ontology.getOntologyProperty("foaf:gender"),
getOntologyProperty("rdf:type") and context.ontology.getOntologyClass("Person")
and handle the Option by calling .getOrElse(throw new
IllegalStateException(...)) with clear messages; update the private vals
genderProperty, typeProperty and personClass to use these safe getters so
missing ontology entries produce a controlled IllegalStateException instead of
NoSuchElementException.
|
|
Greetings @jimkont , @Integer-Ctrl Could you please review my pr . |
|
|



Replaces hardcoded URI strings with
context.ontologylookups and improves code quality.Changes:
context.ontology.properties()andcontext.ontology.classes()instead of raw URIslangStringDatatypeat class levelResolves issue #825
Summary by CodeRabbit
Improvements
Refactor