add argument for wikidump path + use IndexedRowMatrix to track docIds #108
Open
derlin wants to merge 1 commit into sryza:master from
…atrix to keep track of document Ids
In chapter 6, the book says: "creating a mapping of row IDs to document titles is a little more difficult. To achieve it, we can use the zipWithUniqueId function ...". A less "hackish" way to keep track of docIds is to use IndexedRowMatrix instead of RowMatrix. This way, doc ids are embedded in the SVD model. There are many advantages, one of which is that it becomes possible to save the SVD model for later use.
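To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes Spark MLlib's RDD-based API (`IndexedRow`, `IndexedRowMatrix`, `computeSVD`); the object name `IndexedSvdSketch`, the three sample documents, and the choice of `k = 2` are made up for the example.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: names and sample data are illustrative only.
object IndexedSvdSketch {

  // Returns (title, row of U) pairs, joined through the ids carried by the matrix.
  def run(spark: SparkSession): Seq[(String, Vector)] = {
    val sc = spark.sparkContext

    // Stand-in for the real document-term vectors (assumption).
    val docVectors: RDD[(String, Vector)] = sc.parallelize(Seq(
      ("Doc A", Vectors.dense(1.0, 0.0, 2.0)),
      ("Doc B", Vectors.dense(0.0, 3.0, 1.0)),
      ("Doc C", Vectors.dense(4.0, 1.0, 0.0))
    ))

    // Assign each document an id once and reuse it for both the
    // id -> title mapping and the matrix rows.
    val withIds = docVectors.zipWithUniqueId().cache()
    val docIds: Map[Long, String] =
      withIds.map { case ((title, _), id) => (id, title) }.collectAsMap().toMap

    val matrix = new IndexedRowMatrix(
      withIds.map { case ((_, vec), id) => IndexedRow(id, vec) })

    // With an IndexedRowMatrix, U is itself an IndexedRowMatrix,
    // so every row of U still carries its document id.
    val svd = matrix.computeSVD(2, computeU = true)

    svd.U.rows.collect().toSeq.map(row => (docIds(row.index), row.vector))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("indexed-svd-sketch")
      .master("local[*]") // local master assumed for the sketch
      .getOrCreate()
    run(spark).foreach { case (title, vec) => println(s"$title -> $vec") }
    spark.stop()
  }
}
```

Because each row of `svd.U` keeps its `index`, the lookup from a row back to a title no longer relies on the row order being preserved.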
To generate doc ids, I still use zipWithUniqueId, which is available for RDDs only. A better way would be to use the SQL function "monotonically_increasing_id":
import org.apache.spark.sql.functions._
docTermMatrix.withColumn("id", monotonically_increasing_id())
but this generates huge ids (about 10 digits long), which are harder to read. Hence the "addNiceRowId" method.
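For context on why the ids get so large: as far as I understand the documented layout, `monotonically_increasing_id` puts the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits, while `zipWithUniqueId` gives the k-th item of partition i (out of n partitions) the id k * n + i. The plain-Scala sketch below only reproduces that arithmetic; it does not call Spark, and the object and function names are hypothetical.

```scala
// Hypothetical helper names; this mirrors the documented id layouts only.
object IdLayoutSketch {

  // monotonically_increasing_id: partition index in the upper 31 bits,
  // record number within the partition in the lower 33 bits.
  def monotonicId(partitionIndex: Int, rowInPartition: Long): Long =
    (partitionIndex.toLong << 33) + rowInPartition

  // RDD.zipWithUniqueId: k-th item of partition i (of n partitions) gets k * n + i.
  def zipUniqueId(partitionIndex: Int, rowInPartition: Long, numPartitions: Int): Long =
    rowInPartition * numPartitions + partitionIndex

  def main(args: Array[String]): Unit = {
    println(monotonicId(0, 0L))    // 0
    println(monotonicId(1, 0L))    // 8589934592 (already 10 digits)
    println(monotonicId(2, 5L))    // 17179869189
    println(zipUniqueId(1, 2L, 4)) // 9
  }
}
```

Both schemes depend on how the data is partitioned; only the second one keeps the ids small, which is one reason to prefer short, readable ids when inspecting results by hand.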
Collaborator
This looks like a good suggestion. The book has just gone to press though, so I'm not sure we can add this for the 2nd edition. But it can stay here as a note and suggestion.
About path argument: I found it easier to be able to pass the path to the wikidump file as an argument instead of recompiling every time I want to use another dump.
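A rough sketch of what taking the dump path from the command line could look like, assuming the path is the first program argument. The object name, the helper `dumpPath`, and the fallback file name are all hypothetical and not taken from this PR.

```scala
// Hypothetical sketch: names and the default path are illustrative only.
object WikiDumpPathSketch {

  // Assumed fallback used when no argument is given.
  val DefaultDumpPath = "wikidump.xml"

  // Use the first command-line argument as the dump path, if present.
  def dumpPath(args: Array[String]): String =
    args.headOption.getOrElse(DefaultDumpPath)

  def main(args: Array[String]): Unit =
    println(s"Reading Wikipedia dump from ${dumpPath(args)}")
}
```

Switching to another dump then only means changing the argument passed at launch, with no rebuild.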
About docIds: In chapter 6, it says "creating a mapping of row IDs to document titles is a little more difficult. To achieve it, we can use the zipWithUniqueId function ...".

Another way to keep track of docIds is to use `IndexedRowMatrix` instead of `RowMatrix`. This way, document ids are embedded in the SVD model and don't depend on the partitioning anymore. This technique has many advantages, one of which is that it is now possible to save the SVD model for later use.

To generate doc ids, I still use the `zipWithUniqueId` available for RDDs only. A better way would be to use the SQL function `monotonically_increasing_id`, but this generates huge ids (about 10 digits long), which are harder to read. Hence the `addNiceRowId` method.

(By the way, loved your book, nice work!)