I've uploaded data from the Human Genome Project and some annotations. These include the genes and their locations on the genome (when known). The Gene Ontology groups and hierarchy are also now online, with membership info for human genes and evidence codes. Better links for citations will be added soon (e.g. links out to public web pages for genes, PubMed for publications).
Very much looking forward to seeing how this small kernel grows. Transcripts, disease annotations, and known chromosomal aberrations would be valuable, and of course links to the protein schema. With the addition of other species' genomes, notions of synteny and orthology will be logical directions to explore.
I would be interested in hearing the thoughts of others on the schema, data, and ideas on how to proceed.
Initial human genomics data loaded
-
This looks good. Here's a link to /biology/gene for anyone who would like to start there.
One question - is the NCBI ID the same as an Entrez ID?
-
Yes, the NCBI ID and Entrez ID should be the same. I'm not sure which name for the identifier is more appropriate (or perhaps some other name).
I originally had some confusion when doing a cursory search to make sure the IDs were the same. Some things match up nicely, but others do not. Here's an example. The protein col13a1 has an Entrez ID of 12817. The gene has an ID of 1305. There was no protein on Freebase with ID 1305 and no gene on Freebase with ID 12817. Going to the Entrez Gene site, it looks like the ID 12817 may correspond to a mouse protein rather than a human one.
It might be worthwhile to see if we can programmatically compare the two data sets in whole instead of trying to do it piecemeal.
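As a rough sketch of what a whole-set comparison could look like: the snippet below assumes both data sets have already been exported as simple symbol-to-Entrez-ID mappings (that export format, the function name `compare_entrez_ids`, and the report keys are all hypothetical, not anything Freebase provides). It matches entries by case-insensitive symbol and sorts them into agreeing, conflicting, and unmatched groups.

```python
from typing import Dict, List, Tuple


def compare_entrez_ids(
    genes: Dict[str, int], proteins: Dict[str, int]
) -> Dict[str, list]:
    """Compare Entrez IDs of genes and proteins that share a symbol.

    Both arguments map a symbol to an Entrez ID. Symbols are compared
    case-insensitively, which is an assumption of this sketch.
    """
    gene_ids = {symbol.lower(): entrez for symbol, entrez in genes.items()}
    protein_ids = {symbol.lower(): entrez for symbol, entrez in proteins.items()}

    shared = sorted(gene_ids.keys() & protein_ids.keys())
    matching: List[str] = []
    conflicting: List[Tuple[str, int, int]] = []
    for symbol in shared:
        if gene_ids[symbol] == protein_ids[symbol]:
            matching.append(symbol)
        else:
            # Same symbol, different IDs: possibly a different species.
            conflicting.append((symbol, gene_ids[symbol], protein_ids[symbol]))

    return {
        "matching": matching,
        "conflicting": conflicting,
        "genes_only": sorted(gene_ids.keys() - protein_ids.keys()),
        "proteins_only": sorted(protein_ids.keys() - gene_ids.keys()),
    }


if __name__ == "__main__":
    # The col13a1 IDs are the ones quoted above; nothing else is real data.
    report = compare_entrez_ids({"COL13A1": 1305}, {"col13a1": 12817})
    for group, entries in report.items():
        print(group, entries)
```

Anything that lands in the "conflicting" group would be a candidate for the species mix-up described above and would still need a manual check against Entrez Gene.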
-
This is great. Looking forward to playing with the information.
-
The protein information from the Signaling Gateway is species-specific and includes lots of mouse proteins. However, when Patrick imported the data, there was no "organism" or "species" object to link to, so this information was not captured (which may lead to confusion). For example, the Cyclin D1 entry (http://www.freebase.com/view/%239202a8c04000641f80000000051757c6) with Entrez ID 12443 is for Mus musculus, but that's not clear from the entry.
I'd love to be able to link up the protein information with corresponding organism/species information. I'm assuming the appropriate class is "Organism Classification"? Should we (I?) modify the Protein type appropriately?
-
In terms of linking proteins to species, we originally had a species property on the Genome type, but decided to punt on that until taxonomy was figured out. The model then would be that a gene links to its genome, which then links to the species. For orthologs, we could add specific forward and reverse properties between the genes of two species. If this is a good way to go for genes, then a similar model would work for proteins (and a Proteome) as well.
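To make the proposal concrete, here is a minimal sketch of the gene -> genome -> species chain with a paired forward/reverse ortholog link. It is plain Python rather than an actual Freebase schema; the class names, the `orthologs` property, and the `link_orthologs` helper are all made up for illustration. Treating the two col13a1 entries mentioned earlier in the thread as an ortholog pair is also an assumption used only as sample data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Species:
    name: str


@dataclass
class Genome:
    name: str
    # Optional because the species link is deferred until taxonomy is settled.
    species: Optional[Species] = None


@dataclass(eq=False)  # identity comparison avoids recursing through orthologs
class Gene:
    symbol: str
    entrez_id: int
    genome: Genome
    orthologs: List["Gene"] = field(default_factory=list, repr=False)

    @property
    def species(self) -> Optional[Species]:
        # A gene reaches its species only through its genome.
        return self.genome.species


def link_orthologs(a: Gene, b: Gene) -> None:
    """Set the forward and reverse ortholog links together."""
    if b not in a.orthologs:
        a.orthologs.append(b)
    if a not in b.orthologs:
        b.orthologs.append(a)


# Sample data; genome names are placeholders.
human_genome = Genome("Human genome", Species("Homo sapiens"))
mouse_genome = Genome("Mouse genome", Species("Mus musculus"))
human_gene = Gene("COL13A1", 1305, human_genome)
mouse_gene = Gene("Col13a1", 12817, mouse_genome)
link_orthologs(human_gene, mouse_gene)
```

A Protein/Proteome pair could follow the same shape, with the species again reachable only through the container type.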
-

