
Improved runtime of has_bad_word#19

Open
aarashy wants to merge 4 commits into areebbeigh:master from aarashy:patch-2

Conversation


aarashy commented May 26, 2019

These methods used to have quadratic runtime of O(|Profane Words| * |input_text|) (possibly even slower once you account for the regex operations inside the censor method). Using censor to implement has_bad_word is fundamentally inefficient. I wanted to use ProfanityFilter on my large dataset (millions of YouTube comments) and it was prohibitively slow. My new implementation leverages a dictionary to run in linear time and quits early when it finds a profane word, rather than doing tons of unnecessary computation. The old implementation made little progress in an hour on my dataset, whereas my implementation processed the whole dataset in under 2 minutes.
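The dictionary-based check described above can be sketched as follows. This is a hedged illustration of the idea, not the PR's actual diff: function and variable names here are my own, and the punctuation/case handling is one plausible reading of "insensitive to punctuation and case".

```python
import string

def has_bad_word(text, profane_words):
    """Sketch of the dictionary-based check: build a set of profane
    words once, then test each token of the input in O(1) expected
    time, quitting as soon as one matches."""
    bad = {w.lower() for w in profane_words}  # O(|Profane Words|) setup
    for token in text.lower().split():
        # Strip surrounding punctuation so "darn!" still matches "darn".
        if token.strip(string.punctuation) in bad:
            return True  # early exit on the first profane word
    return False
```

Each comment is then a single pass over its tokens with constant-time set lookups, which is what makes the millions-of-comments dataset tractable.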

aarashy added 4 commits May 26, 2019 14:03

aarashy commented May 26, 2019

Hmm, I suppose the weakness of my approach is that it doesn't play as nicely with your self._no_word_boundaries flag. I tried to make it insensitive to punctuation and case, but maybe it can be improved. Do you have a suggestion for how to use regex here while keeping the linear runtime of the dictionary approach?
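To make the tension concrete: a set lookup only matches whole tokens, while the self._no_word_boundaries behavior matches substrings, which a single dictionary lookup cannot do. The natural substring fallback reintroduces the original cost. A minimal sketch (function name is illustrative, not the library's API):

```python
def has_bad_substring(text, profane_words):
    """Substring semantics (what no_word_boundaries enables).

    Each profane word must be scanned against the whole text, so this
    is O(|profane_words| * |text|) again -- there is no single set
    lookup that answers a substring query."""
    lowered = text.lower()
    return any(w.lower() in lowered for w in profane_words)
```

This is why the two behaviors are hard to unify: whole-token matching is linear with a dictionary, but substring matching needs per-word scans (or a more elaborate multi-pattern matcher).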

@DonaldTsang

Is this still being updated to be made compatible?


aarashy commented Jan 17, 2020

> Is this still being updated to be made compatible?

Basically, my new implementation of has_bad_word runs in time linear in the number of bad words plus the size of the input string, while the old one is at least quadratic and impractical for sufficiently large datasets. But my implementation doesn't match sub-words: for example, "asdfuckasdf" would be missed by my implementation but caught by yours.

My recommendation is to split into two functions, keeping yours the way it is, and using mine as a much faster but less sensitive check.
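The two-function split recommended above might look like the following sketch. The class and method names are hypothetical, chosen only to illustrate the trade-off; they are not the library's actual API:

```python
import string

class ProfanityChecker:
    """Illustrative split: a fast whole-token check and a thorough
    substring check, per the recommendation above."""

    def __init__(self, profane_words):
        self._bad = {w.lower() for w in profane_words}

    def has_bad_word_fast(self, text):
        # Fast path: one linear pass, whole tokens only.
        # Sub-words like "asdfuckasdf" are intentionally missed.
        return any(t.strip(string.punctuation) in self._bad
                   for t in text.lower().split())

    def has_bad_word_thorough(self, text):
        # Slow path: substring semantics, O(|bad| * |text|),
        # catches sub-words at the cost of the original runtime.
        lowered = text.lower()
        return any(w in lowered for w in self._bad)
```

Callers with large datasets can use the fast path as a pre-filter and reserve the thorough path for the (usually small) set of borderline inputs.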


duttonw commented Nov 25, 2024

Hi @aarashy ,

Are you able to add a unit test for what you have suggested?

Regards,

@duttonw
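A minimal unit test along the lines requested might look like this. The has_bad_word below is inlined from the sketch earlier in the thread so the test is self-contained; in the real PR it would be imported from the library instead, and the test names are my own:

```python
import string
import unittest

def has_bad_word(text, profane_words):
    # Inlined sketch of the PR's dictionary-based check, for a
    # self-contained test; the real code would live in the library.
    bad = {w.lower() for w in profane_words}
    return any(t.strip(string.punctuation) in bad
               for t in text.lower().split())

class TestHasBadWord(unittest.TestCase):
    def test_detects_whole_word(self):
        self.assertTrue(has_bad_word("well darn!", ["darn"]))

    def test_is_case_insensitive(self):
        self.assertTrue(has_bad_word("DARN it", ["darn"]))

    def test_clean_text(self):
        self.assertFalse(has_bad_word("hello world", ["darn"]))

    def test_known_limitation_subword(self):
        # Documented trade-off: the fast path does not match sub-words.
        self.assertFalse(has_bad_word("xdarnx", ["darn"]))
```

Run with `python -m unittest` in the test module's directory.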
