The README contains change logs for recent updates. It was updated with the Funseq v2.1.2 source code.

Change logs:
Updated gencode coding, promoter, enhancer, lincRNA (May-12-2015)
Recalculated score cut for all Motifs
Get Motifs under TF peaks and DHS peaks as 'ENCODE.tf.bound.union.bed' (May-12-2015)

-----------------
Directory
-----------------
* gencode/ : processed GENCODE annotations (these files can be replaced with another GENCODE version)
    Intron, promoter, cds, utr files can be produced by '3.gencode.process.pl' under '1. Building Data Context'.
    cds.fa, cds.interval are used for VAT (variants annotation tool) and can be produced following the instructions in 'http://vat.gersteinlab.org/'.

* gene_lists/ : folder for gene prioritization lists.
    (the program automatically reads all files in the folder and uses the first field separated by '.' to tag genes. Users can put more gene lists in this folder; please use the desired tag in the file name).
    Currently, we have two files:
    1. cancer.gene (known cancer genes)
    2. actionable.gene ('druggable' genes)

* networks/ : folder for network centralities.
    (the program automatically reads all files in the folder and uses the first field separated by '.' to tag networks. Users can put more networks in this folder; please use the desired tag in the file name).
    Currently, we have three files (two columns: gene_name, centrality):
    1. PPI.degree (protein-protein interaction network, degree centrality)
    2. REG.degree (regulatory network, degree centrality)
    3. PHOS.degree (phosphorylation network, degree centrality)

* cancer_recurrence/ : folder for cancer recurrence data.
    (the program automatically reads all files in the folder and uses the first field separated by '.' to tag cancer type. This folder will be updated with more cancer whole-genome sequencing data).
    Currently, we have 10 types of cancers.

* user_annotations/ : folder for user-specific annotation sets.
    (the program automatically reads all files in the folder and uses the first field separated by '.' to tag the annotation type. Please prepare your annotations in BED format, using the 4th column for additional information, if needed).

------------------
Data Files
------------------
* 1kg.phase1.snp.bed.gz (bed format)
    Content : all 1KG phase I SNVs in bed format.
    Columns : chromosome, SNV start position (0-based), SNV end position, MAF (minor allele frequency)
    Purpose : to filter out variants against 1KG SNVs based on allele frequencies.

* 1kg.phase1.snp.bed.gz.tbi
    Index file of 1kg.phase1.snp.bed.gz

* ENCODE.annotation.gz (bed format)
    Content : compiled annotation files from ENCODE, GENCODE v7 and others; includes DHS, TF peak, Pseudogene, ncRNA, enhancers
    Columns : chromosome, annotation start position (0-based), annotation end position, annotation name.
    Purpose : to find SNVs in annotated regions.

* ENCODE.tf.bound.union.bed (bed format)
    Content : transcription factor (TF) binding motifs under ENCODE TF peaks.
    Columns : chromosome, start position (0-based), end position, motif name, , strand, TF name

* GENE.strong.selection
    Content : genes under strong negative selection (fraction of rare SNVs among non-synonymous variants).

* drm.gene.bed
    Content : distal regulatory modules with gene information, generated with new algorithm (~769K elements with ~17K genes).
    Purpose : to associate noncoding SNVs with genes

* ultra.conserved.hg19.bed
    Content : regions defined as 'ultra-conserved' regions (Bejerano, et al., 2004).

* sensitive.nc.bed
    Content : sensitive regions defined in (Khurana, et al., 2013)
    Users can generate novel sensitive regions using scripts under '1. Building Data Context'.

* All_hg19_RS.bw
    File downloaded from UCSC. GERP score file.

* hot_regions.bed
    Content : highly occupied regions defined in (Yip, et al., 2012)

* motif.PFM
    Content : position frequency matrix
    Purpose : used for motif breaking & motif gaining calculation

* motif.score.cut
    File used to speed up the motif-gaining analysis. Can be generated by '5.PWM.score.cut.pl' under '1. Building Data Context'.

* regulatory.network
    Content : two columns, (TF, genes_regulated_by_TF)
    Purpose : find TFs regulating known cancer genes.

* human_ancestor_GRCh37_e59.fa
    Human ancestral genome (hg19)
    Purpose : for motif breaking calculation in personal or germ-line genome.
    * Note : for somatic-only analysis, these files are not needed.

* weighted.score.txt
    Weighted scores for features
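
------------------
Example: file-name tags
------------------
The tagging convention described under Directory (the first field of each file name, separated by '.', is used as the tag) can be sketched in Python as below. This is an illustration only, not the code Funseq uses. The function names 'tag_from_filename' and 'load_gene_tags' are hypothetical, and the sketch assumes each gene list holds one gene name per line in its first whitespace-separated column, which this README does not state.

```python
import os
from collections import defaultdict


def tag_from_filename(filename):
    """Return the text before the first '.' of a file name, e.g. 'cancer.gene' -> 'cancer'."""
    return os.path.basename(filename).split(".", 1)[0]


def load_gene_tags(folder):
    """Map each gene name to the sorted tags of the files in `folder` that list it.

    Assumption (not from the README): one gene per line, gene name in the
    first whitespace-separated column.
    """
    gene_tags = defaultdict(set)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        # Skip sub-directories and hidden files, which carry no usable tag.
        if not os.path.isfile(path) or name.startswith("."):
            continue
        tag = tag_from_filename(name)
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if fields:  # ignore blank lines
                    gene_tags[fields[0]].add(tag)
    return {gene: sorted(tags) for gene, tags in gene_tags.items()}
```

With the two files currently shipped in gene_lists/, a gene present in both cancer.gene and actionable.gene would be mapped to the tags 'actionable' and 'cancer'. The same file-name rule applies to networks/, cancer_recurrence/ and user_annotations/, so a file added as, say, 'mytag.anything' would be tagged 'mytag'.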