These programs measure document similarity using word embeddings and cosine similarity. Word embeddings are obtained with TF-IDF, a co-occurrence matrix, Word2Vec, fastText, and GloVe. The repository also contains a program that checks a PDF file for plagiarism against a local corpus. Programs that use GloVe need the glove.6B.50d.txt file in the working directory; it is not provided in the repository and must be downloaded separately. The programs are grouped into three levels. The basic level checks the similarity of short sentences. The medium level checks the similarity of two PDF files. The advanced level checks the similarity of documents in the 20newsgroups dataset.
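To illustrate the general idea behind the TF-IDF variant, here is a minimal sketch that uses only the Python standard library. It is not the repository's code: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), the whitespace tokenizer, the smoothed IDF formula, and the sample sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real programs may use a proper tokenizer.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF (an illustrative choice): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample sentences, in the spirit of the basic-level programs.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_close = cosine_similarity(vecs[0], vecs[1])  # sentences sharing several words
sim_far = cosine_similarity(vecs[0], vecs[2])    # sentences sharing no words
print(sim_close, sim_far)
```

The first two sentences share several words, so their score is higher than the score for the first and third, which share none. The same `cosine_similarity` step applies when the vectors come from Word2Vec, fastText, or GloVe instead of TF-IDF; only the way each document is turned into a vector changes.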
Reginasabs/NLP-DocSimilarity