Conversation
These methods used to have quadratic runtime, O(|profane words| * |input text|) (possibly even slower once you account for the regex operations inside the censor method). Using censor to implement has_bad_word is fundamentally inefficient. I wanted to use ProfanityFilter on a large dataset (millions of YouTube comments) and it was prohibitively slow. My new implementation uses a dictionary to run in linear time and quits early as soon as it finds a profane word, instead of doing a lot of unnecessary computation. The old implementation made little progress in an hour on my dataset, whereas mine processed the whole dataset in under 2 minutes. Please accept this change.
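For reference, a minimal sketch of the dictionary-based approach described above: normalize case and punctuation, then do an O(1) set lookup per token, returning on the first hit. The names here (PROFANE_WORDS, has_bad_word) are illustrative stand-ins, not the library's actual API.

```python
import string

PROFANE_WORDS = {"darn", "heck"}  # stand-in word list for illustration

def has_bad_word(text: str) -> bool:
    """Linear-time check: one pass to normalize, then O(1) set
    lookups per token, quitting early on the first profane word."""
    strip_punct = str.maketrans("", "", string.punctuation)
    for token in text.lower().translate(strip_punct).split():
        if token in PROFANE_WORDS:
            return True  # early exit: no further tokens are examined
    return False
```

The total work is one pass over the input plus one hash lookup per token, independent of the size of the word list once the set is built.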
Hmm, I suppose the weakness of my approach is that it doesn't play as nicely with your self._no_word_boundaries flag. I tried to make it insensitive to punctuation and case, but maybe you have a suggestion for improving that. Do you have a suggestion for implementing this with regex while keeping the linear runtime of the dictionary approach?
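One possible answer to the question above is to compile all the words into a single alternation pattern once, so matching is a single scan of the input rather than one regex pass per profane word. This is a hedged sketch (Python's re engine is backtracking, so it is not strictly linear in the worst case, but a flat alternation of escaped literals behaves like a single pass in practice); the names PROFANE_WORDS and no_word_boundaries are illustrative.

```python
import re

PROFANE_WORDS = {"darn", "heck"}  # stand-in word list

def build_matcher(no_word_boundaries: bool = False) -> re.Pattern:
    """Compile one combined, case-insensitive pattern for all words.
    With word boundaries, sub-word occurrences are ignored; without
    them, the pattern also matches inside longer tokens."""
    alternation = "|".join(re.escape(w) for w in PROFANE_WORDS)
    if no_word_boundaries:
        pattern = f"(?:{alternation})"
    else:
        pattern = rf"\b(?:{alternation})\b"
    return re.compile(pattern, re.IGNORECASE)

matcher = build_matcher()
```

The compile cost is paid once per word list, and each has_bad_word call then becomes a single matcher.search(text).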
Is this still being updated to be made compatible?
Basically, my new implementation of has_bad_word runs in time linear in the number of bad words plus the size of the input string, while the old one is at least quadratic and impractical for sufficiently large datasets. However, my implementation doesn't match sub-words: for example, "asdfuckasdf" would be missed by my implementation but caught by yours. My recommendation is to split this into two functions: keep yours as it is, and use mine as a much faster but less sensitive check.
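The two-function split recommended above could be sketched like this, assuming hypothetical names (has_bad_word_fast, has_bad_word_thorough) rather than the library's real API:

```python
import string

PROFANE_WORDS = {"darn", "heck"}  # stand-in word list

def has_bad_word_fast(text: str) -> bool:
    """Linear-time, word-boundary-only check: misses sub-word
    matches like 'asdheckasdf' but runs in a single pass."""
    strip_punct = str.maketrans("", "", string.punctuation)
    return any(tok in PROFANE_WORDS
               for tok in text.lower().translate(strip_punct).split())

def has_bad_word_thorough(text: str) -> bool:
    """Slower O(|words| * |text|) substring scan that also catches
    profane words embedded inside longer tokens."""
    lowered = text.lower()
    return any(word in lowered for word in PROFANE_WORDS)
```

A caller processing a large dataset could run the fast check everywhere and reserve the thorough check for cases where sub-word matches matter.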