r - How to improve this code for getting pairwise? -
it question build upon previous question (http://stackoverflow.com/questions/6538448/r-how-to-write-a-loop-to-get-a-matrix).
it different previous one, more details provided, , libraries , example file provided according comments dwin. so, submitted new question. mind teach me how modify code further?
to load necessary libraries:
source("http://bioconductor.org/bioclite.r") bioclite()
my protseq.fasta file has following contents:
>drugbank_target|1 peptidoglycan synthetase ftsi (db00303) mvkfnssrksgkskktirkltapetvkqnkpqkvfekcfmrgrymlstvlillglcalvaraayvqsinadtlsneadkr slrkdevlsvrgsildrngqllsvsvpmsaivadpktmlkensladkeriaalaeelgmtendlvkkieknsksgylyla rqvelskanyirrlkikgiiletehrrfyprveeaahvvgytdidgngiegieksfnsllvgkdgsrtvrkdkrgnivah isdekkydaqdvtlsideklqsmvyreikkavsennaesgtavlvdvrtgevlamatapsynpnnrvgvkselmrnrait dtfepgstvkpfvvltalqrgvvkrdeiidttsfklsgkeivdvapraqqtldeilmnssnrgvsrlalrmppsalmety qnaglskptdlgligeqvgilnanrkrwadieratvaygygitatplqiarayatlgsfgvyrplsitkvdppvigkrvf sekitkdivgilekvaiknkramvegyrvgvktgtarkienghyvnkyvaftagiapisdpryalvvlindpkageyygg avsapvfsnimgyalranaipqdaeaaentttksakrivyigehknqkvn >drugbank_target|3 histidine decarboxylase (db00114; db00117) mmepeeyrergremvdyicqylstvrerrvtpdvqpgylraqlpesapedpdswdsifgdieriimpgvvhwqsphmhay ypaltswpsllgdmladainclgftwasspactelemnvmdwlakmlglpehflhhhpssqgggvlqstvsestlialla arknkilemktsepdadesclnarlvayasdqahssvekaglislvkmkflpvddnfslrgealqkaieedkqrglvpvf vcatlgttgvcafdclselgpicareglwlhidaayagtaflcpefrgflkgieyadsftfnpskwmmvhfdctgfwvkd kyklqqtfsvnpiylrhansgvatdfmhwqiplsrrfrsvklwfvirsfgvknlqahvrhgtemakyfeslvrndpsfei pakrhlglvvfrlkgpncltenvlkeiakagrlflipatiqdkliirftvtsqfttrddilrdwnlirdaatlilsqhct sqpsprvgnlisqirgarawacgtslqsvsgagddpvqarkiikqpqrvgagpmkrenglhletlldpvddcfseeapda tkhklssflfsylsvqtkkktvrslscnsvpvsaqkplpteasvknggssrvrifsrfpedmmmlkksafkklikfysvp sfpecssqcglqlpccplqamv >drugbank_target|5 glutaminase liver isoform, mitochondrial (db00130; db00142) mrsmkalqkalsragshcgrggwghpsrspllgggvrhhlseaaaqgretphshqpqhqdhdssesgmlsrlgdllfyti aegqertpihkfttalkatglqtsdprlrdcmsemhrvvqesssgglldrdlfrkcvsssivlltqafrkkfvipdfeef tghvdrifedvkeltggkvaayipqlaksnpdlwgvslctvdgqrhsvghtkipfclqscvkpltyaisistlgtdyvhk fvgkepsglrynklsldeegiphnpmvnagaivvsslikmdcnkaekfdfvlqylnkmagneymgfsnatfqseketgdr nyaigyyheekkcfpkgvdmmaaldlyfqlcsvevtcesgsvmaatlanggicpitgesvlsaeavrntlslmhscgmyd fsgqfafhvglpaksavsgaillvvpnvmgmmclsppldklgnshrgtsfcqklvslfnfhnydnlrhcarkldprrega eirnktvvnllfaaysgdvsalrrfalsamdmeqkdydsrtalhvaaaeghievvkflieackvnpfakdrwgnipldda vqfnhlevvkllqdyqdsytlsetqaeaaaealskenlesmv >drugbank_target|6 coagulation factor xiii chain (db00130; db01839; db02340) setsrtafggrravppnnsnaaeddlptvelqgvvprgvnlqeflnvtsvhlfkerwdtnkvdhhtdkyennklivrrgq sfyvqidfsrpydprrdlfrveyvigrypqenkgtyipvpivselqsgkwgakivmredrsvrlsiqsspkcivgkfrmy vavwtpygvlrtsrnpetdtyilfnpwceddavyldnekereeyvlndigvifygevndiktrswsygqfedgildtcly vmdraqmdlsgrgnpikvsrvgsamvnakddegvlvgswdniyaygvppsawtgsvdilleyrssenpvrygqcwvfagv fntflrclgiparivtnyfsahdndanlqmdifleedgnvnskltkdsvwnyhcwneawmtrpdlpvgfggwqavdstpq ensdgmyrcgpasvqaikhghvcfqfdapfvfaevnsdliyitakkdgthvvenvdathigklivtkqiggdgmmditdt ykfqegqeeerlaletalmygakkplntegvmksrsnvdmdfevenavlgkdfklsitfrnnshnrytitaylsanitfy tgvpkaefkketfdvtleplsfkkeavliqageymgqlleqaslhffvtarinetrdvlakqkstvltipeiiikvrgtq vvgsdmtvtvqftnplketlrnvwvhldgpgvtrpmkkmfreirpnstvqweevcrpwvsghrkliasmssdslrhvyge ldvqiqrrpsm
to load data r analysis, have done:
require("biostrings") data(blosum100) seqs <- readfasta("./protseq.fasta", strip.descs=true)
to the pairwise numbers, there total of 4 sequences, have done:
number <-c(1:4); dat <- expand.grid(number,number, stringsasfactors=false) datr <- dat[dat[,1] > dat[,2] , ]
in order calculate score 1 one, can this:
score(pairwisealignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionmatrix=blosum100, gapopening=0, gapextension=-5))
however, have problem add new column "score" include score each pairs of proteins. tried this, did not work.
datr$score <- lapply(datr, 1, function(i) { x <- datr[i,1]; y<- datr[i,2]; score(pairwisealignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionmatrix=blosum100, gapopening=0, gapextension=-5))})
could mind comments how further improve it? dwin , diliop wonderful solutions previous question.
try:
datr$score <- sapply(1:nrow(datr), function(i) { x <- datr[i,1] y <- datr[i,2] score(pairwisealignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionmatrix=blosum100,gapopening=0, gapextension=-5)) })
to able reference sequences better using names, might want tidy datr
doing following:
colnames(datr) <- c("seq1id", "seq2id", "score") datr$seq1name <- sapply(datr$seq1id, function(i) seqs[[i]]$desc) datr$seq2name <- sapply(datr$seq2id, function(i) seqs[[i]]$desc)
or if want extract accession ids i.e. contents of parentheses, use stringr
such:
library(stringr) datr$seq1name <- sapply(datr$seq2id, function(i) str_extract(seqs[[i]]$desc, "db[0-9\\ ;db]+"))
hope helps!
Comments
Post a Comment