r - How to improve this code for getting pairwise? -


it question build upon previous question (http://stackoverflow.com/questions/6538448/r-how-to-write-a-loop-to-get-a-matrix).

it different previous one, more details provided, , libraries , example file provided according comments dwin. so, submitted new question. mind teach me how modify code further?

to load necessary libraries:

source("http://bioconductor.org/bioclite.r") bioclite() 

my protseq.fasta file has following contents:

>drugbank_target|1 peptidoglycan synthetase ftsi (db00303)   mvkfnssrksgkskktirkltapetvkqnkpqkvfekcfmrgrymlstvlillglcalvaraayvqsinadtlsneadkr slrkdevlsvrgsildrngqllsvsvpmsaivadpktmlkensladkeriaalaeelgmtendlvkkieknsksgylyla rqvelskanyirrlkikgiiletehrrfyprveeaahvvgytdidgngiegieksfnsllvgkdgsrtvrkdkrgnivah isdekkydaqdvtlsideklqsmvyreikkavsennaesgtavlvdvrtgevlamatapsynpnnrvgvkselmrnrait dtfepgstvkpfvvltalqrgvvkrdeiidttsfklsgkeivdvapraqqtldeilmnssnrgvsrlalrmppsalmety qnaglskptdlgligeqvgilnanrkrwadieratvaygygitatplqiarayatlgsfgvyrplsitkvdppvigkrvf sekitkdivgilekvaiknkramvegyrvgvktgtarkienghyvnkyvaftagiapisdpryalvvlindpkageyygg avsapvfsnimgyalranaipqdaeaaentttksakrivyigehknqkvn >drugbank_target|3 histidine decarboxylase (db00114; db00117)   mmepeeyrergremvdyicqylstvrerrvtpdvqpgylraqlpesapedpdswdsifgdieriimpgvvhwqsphmhay ypaltswpsllgdmladainclgftwasspactelemnvmdwlakmlglpehflhhhpssqgggvlqstvsestlialla arknkilemktsepdadesclnarlvayasdqahssvekaglislvkmkflpvddnfslrgealqkaieedkqrglvpvf vcatlgttgvcafdclselgpicareglwlhidaayagtaflcpefrgflkgieyadsftfnpskwmmvhfdctgfwvkd kyklqqtfsvnpiylrhansgvatdfmhwqiplsrrfrsvklwfvirsfgvknlqahvrhgtemakyfeslvrndpsfei pakrhlglvvfrlkgpncltenvlkeiakagrlflipatiqdkliirftvtsqfttrddilrdwnlirdaatlilsqhct sqpsprvgnlisqirgarawacgtslqsvsgagddpvqarkiikqpqrvgagpmkrenglhletlldpvddcfseeapda tkhklssflfsylsvqtkkktvrslscnsvpvsaqkplpteasvknggssrvrifsrfpedmmmlkksafkklikfysvp sfpecssqcglqlpccplqamv >drugbank_target|5 glutaminase liver isoform, mitochondrial (db00130; db00142)   mrsmkalqkalsragshcgrggwghpsrspllgggvrhhlseaaaqgretphshqpqhqdhdssesgmlsrlgdllfyti aegqertpihkfttalkatglqtsdprlrdcmsemhrvvqesssgglldrdlfrkcvsssivlltqafrkkfvipdfeef tghvdrifedvkeltggkvaayipqlaksnpdlwgvslctvdgqrhsvghtkipfclqscvkpltyaisistlgtdyvhk fvgkepsglrynklsldeegiphnpmvnagaivvsslikmdcnkaekfdfvlqylnkmagneymgfsnatfqseketgdr nyaigyyheekkcfpkgvdmmaaldlyfqlcsvevtcesgsvmaatlanggicpitgesvlsaeavrntlslmhscgmyd fsgqfafhvglpaksavsgaillvvpnvmgmmclsppldklgnshrgtsfcqklvslfnfhnydnlrhcarkldprrega eirnktvvnllfaaysgdvsalrrfalsamdmeqkdydsrtalhvaaaeghievvkflieackvnpfakdrwgnipldda vqfnhlevvkllqdyqdsytlsetqaeaaaealskenlesmv >drugbank_target|6 coagulation factor xiii chain (db00130; db01839; db02340)   setsrtafggrravppnnsnaaeddlptvelqgvvprgvnlqeflnvtsvhlfkerwdtnkvdhhtdkyennklivrrgq sfyvqidfsrpydprrdlfrveyvigrypqenkgtyipvpivselqsgkwgakivmredrsvrlsiqsspkcivgkfrmy vavwtpygvlrtsrnpetdtyilfnpwceddavyldnekereeyvlndigvifygevndiktrswsygqfedgildtcly vmdraqmdlsgrgnpikvsrvgsamvnakddegvlvgswdniyaygvppsawtgsvdilleyrssenpvrygqcwvfagv fntflrclgiparivtnyfsahdndanlqmdifleedgnvnskltkdsvwnyhcwneawmtrpdlpvgfggwqavdstpq ensdgmyrcgpasvqaikhghvcfqfdapfvfaevnsdliyitakkdgthvvenvdathigklivtkqiggdgmmditdt ykfqegqeeerlaletalmygakkplntegvmksrsnvdmdfevenavlgkdfklsitfrnnshnrytitaylsanitfy tgvpkaefkketfdvtleplsfkkeavliqageymgqlleqaslhffvtarinetrdvlakqkstvltipeiiikvrgtq vvgsdmtvtvqftnplketlrnvwvhldgpgvtrpmkkmfreirpnstvqweevcrpwvsghrkliasmssdslrhvyge ldvqiqrrpsm 

to load data r analysis, have done:

require("biostrings") data(blosum100) seqs <- readfasta("./protseq.fasta", strip.descs=true) 

to the pairwise numbers, there total of 4 sequences, have done:

number <-c(1:4); dat <- expand.grid(number,number, stringsasfactors=false) datr <- dat[dat[,1] > dat[,2] , ] 

in order calculate score 1 one, can this:

 score(pairwisealignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionmatrix=blosum100, gapopening=0, gapextension=-5)) 

however, have problem add new column "score" include score each pairs of proteins. tried this, did not work.

datr$score <- lapply(datr, 1, function(i) { x <- datr[i,1]; y<- datr[i,2]; score(pairwisealignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionmatrix=blosum100, gapopening=0, gapextension=-5))}) 

could mind comments how further improve it? dwin , diliop wonderful solutions previous question.

try:

datr$score <- sapply(1:nrow(datr), function(i) {     x <- datr[i,1]     y <- datr[i,2]     score(pairwisealignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionmatrix=blosum100,gapopening=0, gapextension=-5)) }) 

to able reference sequences better using names, might want tidy datr doing following:

colnames(datr) <- c("seq1id", "seq2id", "score") datr$seq1name <- sapply(datr$seq1id, function(i) seqs[[i]]$desc) datr$seq2name <- sapply(datr$seq2id, function(i) seqs[[i]]$desc) 

or if want extract accession ids i.e. contents of parentheses, use stringr such:

library(stringr) datr$seq1name <- sapply(datr$seq2id, function(i) str_extract(seqs[[i]]$desc, "db[0-9\\ ;db]+")) 

hope helps!


Comments

Popular posts from this blog

c# - SharpSVN - How to get the previous revision? -

c++ - Is it possible to compile a VST on linux? -

url - Querystring manipulation of email Address in PHP -