Skip to content

LeBoldus-Lab/Clustering-Algorithms-Tutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

Clustering-Algorithms-Tutorial

The following PLINK command can convert your VCF to raw, which has rows for samples and columns for SNPs, just like the data you showed.

If the factors are 0/1 only, you have the dominance effect. If they are 0/1/2, it is additive (most common). It's not clear to me what you mean by saying a 0 represents no SNP at a position, because the VCFs I'm used to working with only have a row for a position if there is a SNP there... Do you mean 0 represents both alleles are the reference allele and there is no alternative allele at the position? This would be the dominance effect.

If it is additive:

plink --vcf my_vcf_path.vcf --recode A --out my_output_path

If dominant:

plink --vcf my_vcf_path.vcf --recode D --out my_output_path

This will produce the file my_output_path.raw

I suggest using fread from the data.table package to quickly read this file into R, but if it is small enough you can probably get away with base R functions like read.csv. If it is very massive (millions of SNPs), you'll need to read read.big.matrix from the bigmemory package.

Hopefully this will work :

my_data_in_R <- fread("my_output_path.raw")

The raw file has 6 extra columns at the front that you can get rid of with:

my_data_in_R <- my_data_in_R[, 7:ncol(my_data_in_R)]

I think it would be easiest to do the conversion on a whole-genome VCF, load the whole thing into R and then divide as needed after loading into R.

Reference for PLINK raw file format: https://www.cog-genomics.org/plink/1.9/formats#raw

Please let me know if anything is unclear or too slow... The PLINK conversion is not very fast but I believe PLINK is still the fastest tool for this purpose.

Michael Nagle

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages