Monday, 5 May 2014


How to download a list of the gene length of all genes from the UCSC genome browser?

Some basic information, such as length, can be derived from the GTF file (you can get the latest ENCODE GTF from the UCSC Genome Browser). Other information is also available from the UCSC Genome Browser's Table Browser. Select the group and track that you want from the corresponding dropdown boxes. You can export the data as a text file or send it to Galaxy for analysis.
If you are familiar with some programming, you can compute the lengths yourself (they are not given explicitly in the output or in the GTF). Also, the gene length may differ from the mRNA length because the gene includes introns. The third field in the GTF file denotes the feature (gene, transcript, exon, etc.); you can parse the GTF to pick up only the desired features, let's say the transcripts. The fourth and fifth fields denote the start and end positions respectively; all you have to do is compute end - start + 1 to obtain the length (the +1 is needed because GTF coordinates are 1-based and inclusive). Note that the span of a transcript line still covers its introns; the spliced mRNA length is the sum of its exon lengths.
A small scriptlet such as this would work in a Linux/Cygwin terminal (I won't go into detail because this is beyond the scope of this forum):
awk -F "\t" '$3=="transcript"{print $9, $5-$4+1}' file.gtf > output.txt
Here $9 refers to the ninth field, which holds the attributes of that feature, including the gene name, gene ID, etc.
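To illustrate the difference between gene length and mRNA length mentioned above, here is a minimal sketch that sums exon lengths per transcript instead of taking the start-to-end span of a single line. The function name gtf_mrna_lengths and the helper attr are hypothetical names chosen for this example. It assumes the ninth field contains transcript_id "..." and gene_id "..." attributes, as in GENCODE/Ensembl-style GTF files, and adds 1 to each exon because GTF coordinates are 1-based and inclusive.

```shell
# Hypothetical helper: print transcript_id, gene_id and summed exon length
# (tab-separated) for every transcript in the GTF file given as $1.
# Assumption: field 9 holds gene_id "..." and transcript_id "..." attributes.
gtf_mrna_lengths() {
  awk -F '\t' '
    # Return the quoted value of attribute `key`, or "NA" if it is absent.
    function attr(s, key,    re) {
      re = key " \"[^\"]+\""
      if (match(s, re) == 0) return "NA"
      # Skip past `key "` and drop the closing quote.
      return substr(s, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
    }
    /^#/ { next }                     # skip header/comment lines
    $3 == "exon" {
      tid = attr($9, "transcript_id")
      gene[tid] = attr($9, "gene_id")
      len[tid] += $5 - $4 + 1         # 1-based, inclusive coordinates
    }
    END {
      for (tid in len) print tid "\t" gene[tid] "\t" len[tid]
    }
  ' "$1" | LC_ALL=C sort              # awk array order is unspecified, so sort
}
```

It could be called as gtf_mrna_lengths file.gtf > mrna_lengths.txt. Summing exons gives the spliced length of each transcript, whereas the one-liner above reports the genomic span, introns included.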
