Jg, I have been following your progress on the Organism Classification type in the sandbox. It's looking good. I have a question: how will one select all species from a certain kingdom, e.g. select all species in the Plantae kingdom? Thanks.
// Frank
Discussions on Biology
-
-
-
Hi Frank.
The plan is to connect the taxa in a hierarchy, with parents and children. So to get everything in Plantae you would start there and then follow the lower_classifications all the way down.
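A minimal sketch of that traversal in Python. The in-memory dictionary and the function name are hypothetical stand-ins for whatever the real Freebase query returns; only the idea of following lower_classifications comes from this thread, and the toy hierarchy skips most ranks.

```python
from collections import deque

# Hypothetical toy data: each taxon maps to (rank, lower_classifications).
TAXA = {
    "Plantae": ("kingdom", ["Crocus", "Erythroxylum"]),
    "Crocus": ("genus", ["Crocus vernus"]),
    "Erythroxylum": ("genus", ["Erythroxylum ellipticum"]),
    "Crocus vernus": ("species", []),
    "Erythroxylum ellipticum": ("species", []),
    "Animalia": ("kingdom", ["Mus"]),
    "Mus": ("genus", ["Mus musculus"]),
    "Mus musculus": ("species", []),
}


def species_under(root, taxa):
    """Follow lower_classifications from root and collect every species."""
    found = []
    seen = {root}          # guards against accidental cycles in the data
    queue = deque([root])
    while queue:
        taxon = queue.popleft()
        rank, children = taxa[taxon]
        if rank == "species":
            found.append(taxon)
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return found


plant_species = species_under("Plantae", TAXA)
```

In this sketch the work is proportional to the number of taxa below the starting point, so starting at a kingdom touches far more nodes than starting at a genus.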
-
That makes sense, thanks. Would this be an expensive (slow?) query for the Freebase system?
-
Also sorry for the double post but will the 'Also known as' field contain a plant's common names from the ITIS data? I'm working on a web application to incorporate this data and because users search a lot by common name it would be very useful to have. Double thanks!
-
Can you give an example? Usually the common name (if known) is the topic title and the scientific name is in the Scientific Name field.
-
Frank, we don't expect the query to be slow, but as soon as we have the data loaded I'll try it out. Some databases (e.g. Species 2000) have a field which is the flattened complete hierarchy, which we could fall back to if we need it.
As for common names, I agree with Jeff that they should be the name field, and there is alias if we need it. Also ITIS has many names in Spanish, which I plan to import.
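For the flattened-hierarchy fallback mentioned above, here is a rough Python sketch. The parent_of mapping, the "/" separator and the function name are all assumptions for illustration; this is not the actual Species 2000 field format.

```python
# Hypothetical toy data: each taxon points at its parent (None for a root).
# The hierarchy is truncated; real data has many more ranks.
parent_of = {
    "Plantae": None,
    "Crocus": "Plantae",
    "Crocus vernus": "Crocus",
    "Animalia": None,
    "Mus": "Animalia",
    "Mus musculus": "Mus",
}


def flatten_lineages(parent_of):
    """Return {taxon: 'Root/.../taxon'} by walking parent links upward."""
    lineages = {}
    for taxon in parent_of:
        path = [taxon]
        current = taxon
        while parent_of.get(current) is not None:
            current = parent_of[current]
            path.append(current)
        lineages[taxon] = "/".join(reversed(path))
    return lineages


lineages = flatten_lineages(parent_of)

# Selecting everything in a kingdom becomes a single pass over flat strings.
in_plantae = [t for t, line in lineages.items() if line.split("/")[0] == "Plantae"]
```

The trade-off is that the flattened field has to be recomputed whenever a taxon is moved in the hierarchy.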
-
>>> Can you give an example? Usually the common name (if known) is the topic title and the scientific name is in the Scientific Name field. <<<
Sure. For example Crocus vernus ( http://sandbox.freebase.com/view/crocus_vernus ) has two common names, Dutch Crocus and Spring Crocus. If I have followed correctly the final format would be that one of the two common names would be the Name, 'Crocus vernus' is in the Scientific Name field and the remaining common name can be added as an alias?
>>> As for common names, I agree with Jeff that they should be the name field, and there is alias if we need it. Also ITIS has many names in Spanish, which I plan to import. <<<
Above all I was curious as to how common names would be handled because I didn't see a common name field per se. It never occurred to me that the Name field may be used :)
Thanks for the prompt replies guys.
-
>>> Sure. For example Crocus vernus ( http://sandbox.freebase.com/view/crocus_vernus ) has two common names, Dutch Crocus and Spring Crocus. If I have followed correctly the final format would be that one of the two common names would be the Name, 'Crocus vernus' is in the Scientific Name field and the remaining common name can be added as an alias? <<<
That sounds right.
>>> Thanks for the prompt replies guys. <<<
Now that they've got the RSS feeds working, the discussion boards work a lot better.
-
For that specific example, Spring Crocus may be too generic. For example this database only has the other common name http://plants.usda.gov/java/profile?symbol=CRVE4
-
Here's a better example http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=34342 has 5 common names listed.
-
OK, let me look at the actual ITIS database and see if we can distinguish one as primary, which would allow the others to be aliases....
-
Well it turns out that they are not distinguished in ITIS. http://www.itis.gov/vernac.pdf
I propose that if the first name matches the Wikipedia name we leave it alone, and we load all the vernaculars as /common/topic/alias. For new topics from ITIS we use the first name, capitalize the first word and use the remainder (if any) as aliases.
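Roughly, in Python, the rule could look like the sketch below. The function name and arguments are hypothetical, and I am assuming that for an existing topic "leave it alone" means the current name is kept and is not repeated in the alias list.

```python
def assign_name_and_aliases(vernaculars, existing_name=None):
    """Pick a topic name and alias list from ITIS vernacular names.

    vernaculars: common names in the order ITIS lists them.
    existing_name: name of an already existing (Wikipedia-derived) topic,
                   or None when ITIS creates a new topic.
    """
    if existing_name is not None:
        # Existing topic: keep its name, load the vernaculars as aliases.
        aliases = [v for v in vernaculars if v.lower() != existing_name.lower()]
        return existing_name, aliases
    if not vernaculars:
        # No common name known; the caller would fall back to something else.
        return None, []
    first, rest = vernaculars[0], list(vernaculars[1:])
    # New topic: first vernacular becomes the name, first word capitalized.
    name = first[:1].upper() + first[1:]
    return name, rest


new_topic = assign_name_and_aliases(["dutch crocus", "spring crocus"])
existing = assign_name_and_aliases(["Spring Crocus", "Dutch Crocus"],
                                   existing_name="Spring Crocus")
```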
-
Sounds like a good plan of action jg :-) Looking forward to updates.
-
-
-
I have been working for some time on a large-scale import of taxonomic data, reconciled with the topics we already have. I would like to briefly summarize where I'm at and solicit feedback on what is or is not important in the first load.
There are about 38,000 taxa already in the system as topics. These include plants and fungi. For example Erythroxylum ellipticum. Step one is to type these things. Step two is to add them into a hierarchy with their upper and lower taxa and their rank. I have written tools to do this based on the ITIS database.
This brings us to the topic of datasets. There are several efforts in the world at a comprehensive database of life. The ones I have studied (in no particular order) include:
ITIS
Species 2000
DiscoverLife
NCBI
EOL
International Plant Names Index
Wikispecies
Wikipedia-en taxoboxes
If there are others that folks are aware of please let me know. There are a number of issues related to licensing, style of attribution, quality, and size of these databases.
The ITIS database currently seems like the best one to start with, with the addition of foreign keys for Species 2000 and NCBI. It's a high-quality 'core' set and the data is indeed unencumbered.
DiscoverLife and Species 2000 are both interesting and larger datasets. They both combine approx 50 databases in one place. This greatly complicates licensing since they are republishing data that comes from domain experts such as Fishbase. Both SP2000 and DiscoverLife aim to get to the approximately 1.8 million species mark in a few years. I think we should be just as concerned with the richness of the interlinking of the data on freebase as we are with completeness. For example being able to query across genes, diseases and species requires interlinking them all, not just having 20,000 Scarabaeidae unlinked to much else.
I'm also very interested in sources of CC or GFDL images of organisms if anyone has researched that world.
-
If you look at the Species page in the right column there are already many topics that have been filled in with taxonomies all the way up to Domain, using the type Organism Classification. I'm worried that data load will overwrite the work already done on this type.
-
Hi Jeff.
What I'm proposing to load would not break any data loaded by users, but add to it. It's not feasible to add several hundred thousand things by hand, so we need to work together on this. I will contact you by email and we can chat about how best to collaborate.
-
As you suggested, I'll follow up on the Organism Classification discussion page.
-
I can't find the Organism Classification discussion page, but wondered if there was any progress on this. I see there are about 212 organism classification entries so far and wanted to know the progress of the proposed bulk upload, especially in the Animal kingdom. I have some of this data available but perhaps someone else is going to upload it? Thoughts?
-
On the type page for Organism Classification, in the Actions window click Discuss "Organism Classification". I don't know the status of the data upload that jg is working on.
-
Hi Hilary
I have made some progress on the bulk upload. There were 70,000+ Wikipedia topics to reconcile with, but I'm almost done, so I expect to load something on sandbox this week. I'll post more when it's uploaded for inspection. After the 70,000 load I'll do 400,000 more from ITIS, again on sandbox.
-
> This brings us to the topic of datasets. There are several efforts in the world at a comprehensive database of life. The ones I have studied (in no particular order) include:
> ITIS
> Species 2000
> DiscoverLife
> NCBI
> EOL
> International Plant Names Index
> Wikispecies
> Wikipedia-en taxoboxes
There's Tree of Life, which I've mentioned before but which apparently has a licensing issue. http://www.tolweb.org/tree/home.pages/downloadtree.html
( http://www.dbfordummies.com/Example/Ex710.asp )
( http://paste.uni.cc/11838 )
For butterflies and moths there is BAMONA and All Leps.
(BAMONA isn't easily accessible online; however, the maintainer Thomas Naberhaus is willing to create an extract for use elsewhere. I'm working with him on that just now, in fact, for a Flickr project to create a leps field guide.)
http://www.lepbarcoding.org/files/nth_am_lep_full_checklist.xls
I have a question which may be more appropriate in some techie thread but I'll ask it here anyway since the specific example that first brought it to mind was the TOL dump; will there be facilities for importing via XML? It seems at least as useful as spreadsheet imports.
Spreadsheets seem fine for flat data typical of a relational database, but I hope that Freebase has support for hierarchical data as well.
-
This question is probably better discussed on the developers mailing list.
However, the reason that spreadsheets are supported first is that it is easy for non-technical people with domain expertise to understand a spreadsheet, and a mapping is fairly straightforward.
Freebase data is not hierarchical; it is a general graph, which can represent hierarchies, but isn’t constrained the way XML is. That means that the mappings are more complex, and probably better handled with a custom application, at least until we come up with some superduper UI assistant.
If the data you want to import can be flattened, then a flat import tool can be used, but otherwise, you will probably need some kind of application that comprehends and maps your data.
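As a rough illustration of the "flatten it first" route, here is a small Python sketch that turns a nested XML tree into (name, parent) rows that a spreadsheet-style import could take. The element and attribute names are made up for the example and are not the actual TOL dump format.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested XML; the real dump will have a different layout.
SAMPLE_XML = """
<node name="Life">
  <node name="Eukaryotes">
    <node name="Plants"/>
    <node name="Animals"/>
  </node>
</node>
"""


def flatten(element, parent_name=None):
    """Return a list of (name, parent_name) rows for element and its descendants."""
    name = element.get("name")
    rows = [(name, parent_name)]
    for child in element.findall("node"):   # direct children only
        rows.extend(flatten(child, name))
    return rows


rows = flatten(ET.fromstring(SAMPLE_XML.strip()))
```

Each row names its parent, so the hierarchy survives the flattening as a set of links, which is closer to how a graph stores it than to how XML nests it.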
-
-
-
Additional properties: CRC64 & SHA-1 checksums
Would it be possible to add two properties to the Protein type? Unfortunately, there is no single reliable identifier to use when searching for information across databases. In Freebase there is an Entrez gene ID property, which is great, but what if you are basing your search on UniProt or some other identifier? UniProt computes the CRC64 checksum for each sequence in their database, making cross-database searches (for exact matches) quick and simple. Recently a proposal was made to use SHA-1 as a checksum to avoid possible "collisions".
See the Nature Precedings article at: http://precedings.nature.com/tags/SEGUID
I'm currently working on an application that may benefit from such an addition to the Protein type. The idea is that my application would search Freebase for exact protein matches based on a checksum and display/return the comments/annotations to the end user.
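To make the idea concrete, here is a hedged Python sketch of both checksums and a lookup keyed on one of them. The CRC-64 polynomial is my assumption of the ISO 3309 variant and should be checked against UniProt's published values before relying on it. Upper-casing the sequence and the base64 form of the SHA-1 digest (with padding stripped) follow my understanding of the SEGUID proposal. The function names and the example index are hypothetical.

```python
import base64
import hashlib

# Assumption: reflected ISO 3309 polynomial; verify against UniProt output.
CRC64_POLY = 0xD800000000000000


def _build_crc64_table():
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC64_POLY
            else:
                crc >>= 1
        table.append(crc)
    return table


_CRC64_TABLE = _build_crc64_table()


def crc64_checksum(sequence):
    """CRC-64 of an amino-acid sequence as 16 hex digits."""
    crc = 0
    for byte in sequence.upper().encode("ascii"):
        crc = _CRC64_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return "%016X" % crc


def sha1_checksum(sequence):
    """Base64-encoded SHA-1 digest of the sequence, '=' padding removed."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Hypothetical index: checksum -> annotation, for exact-match lookups.
made_up_sequence = "MKVLAAGIVGL"   # not a real protein
index = {sha1_checksum(made_up_sequence): "example annotation"}
hit = index.get(sha1_checksum("mkvlaagivgl"))
```

Either checksum only finds exact sequence matches; a single differing residue gives an unrelated value.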
Thanks.
-
-
-
I've uploaded data from the human genome project and some annotations. These include the genes and their locations on the genome (when known). The Gene Ontology groups and hierarchy are also now online, with membership info for human genes and evidence codes. Better links for citations will be added soon (e.g. links out to public web pages for genes, Pubmed for publications).
Very much looking forward to seeing how this small kernel grows. Transcripts, disease annotations, and known chromosomal aberrations would be valuable, and of course links to the protein schema. With the addition of other species' genomes, notions of synteny and orthology will be logical directions to explore.
I would be interested in hearing the thoughts of others on the schema, data, and ideas on how to proceed.
-
This looks good. Here's a link to /biology/gene for anyone who would like to start there.
One question - is the NCBI ID the same as an Entrez ID?
-
Yes, the NCBI ID and Entrez IDs should be the same. I'm not sure which name for the identifier is more appropriate (or perhaps some other name).
I originally had some confusion when doing a cursory search to make sure the IDs were the same. Some things match up nicely, but others do not. Here's an example. The protein col13a1 has an Entrez ID of 12817. The gene has an ID of 1305. There was no protein on Freebase with ID 1305 and no gene on Freebase with ID 12817. Going to the Entrez gene site, it looks like the ID 12817 may correspond to a mouse protein rather than a human one.
It might be worthwhile to see if we can programmatically compare the two data sets in whole instead of trying to do it piecemeal.
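A whole-set comparison could be as simple as the Python sketch below. The function name is hypothetical, and the ID 595 is a made-up placeholder for an ID present on both sides; 1305 and 12817 are the two IDs from the example above.

```python
def compare_id_sets(gene_ids, protein_ids):
    """Split Entrez IDs into those on both sides and those on one side only."""
    gene_ids, protein_ids = set(gene_ids), set(protein_ids)
    return {
        "shared": gene_ids & protein_ids,
        "genes_only": gene_ids - protein_ids,
        "proteins_only": protein_ids - gene_ids,
    }


# 595 is a placeholder for an ID that appears in both data sets.
report = compare_id_sets(gene_ids=[1305, 595], protein_ids=[12817, 595])
```

The two "only" buckets are the ones worth inspecting by hand, since that is where cross-species mismatches like the col13a1 case would show up.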
-
This is great. Looking forward to playing with the information.
-
The protein information from the Signaling Gateway is species-specific and includes lots of mouse proteins. However, when Patrick imported the data, there was no "organism" or "species" object to match that link to, so this information is not captured (and may lead to confusion). For example, the Cyclin D1 (http://www.freebase.com/view/%239202a8c04000641f80000000051757c6) entry with Entrez ID (12443) is for Mus musculus, but that's not clear.
I'd love to be able to link up the protein information with corresponding organism/species information. I'm assuming the appropriate class is "Organism Classification"? Should we (I?) modify the Protein type appropriately?
-
In terms of linking proteins to species, we originally had a species property on the Genome type, but decided to punt on that until taxonomy was figured out. The model then would be that a gene links to its genome, which then links to the species. For orthologs we could add specific forward and reverse properties between the genes of two species. If this were a good way to go for genes, then a similar model would work for proteins (and a Proteome) as well.
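A toy Python sketch of that model, just to show the shape of the links. The class and function names are hypothetical and are not the actual Freebase schema; the gene symbols are illustrative labels only.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Species:
    name: str


@dataclass(eq=False)
class Genome:
    species: Species


@dataclass(eq=False)
class Gene:
    symbol: str
    genome: Genome
    # Kept out of repr because ortholog links point at each other.
    orthologs: list = field(default_factory=list, repr=False)


def species_of(gene):
    """A gene reaches its species by going through its genome."""
    return gene.genome.species


def link_orthologs(gene_a, gene_b):
    """Record the ortholog relation in both directions."""
    if gene_b not in gene_a.orthologs:
        gene_a.orthologs.append(gene_b)
    if gene_a not in gene_b.orthologs:
        gene_b.orthologs.append(gene_a)


human = Species("Homo sapiens")
mouse = Species("Mus musculus")
human_gene = Gene("COL13A1", Genome(human))
mouse_gene = Gene("Col13a1", Genome(mouse))
link_orthologs(human_gene, mouse_gene)
```

A Protein/Proteome pair could follow the same pattern, with the protein reaching its species through the proteome.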
-
-
-
An article that might be of interest to the community here since it covers trends in the sharing of biology data:
Key biology databases go wiki
Jim Giles
News @ Nature
Published online: 14 February 2007 | doi:10.1038/445691a
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=17301755&ordinalpos=42&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum
or
http://www.nature.com/news/2007/070212/full/445691a.html
-
-
-
Hi Biology data fans. I'm working with Hilary on a simple first model for proteins. We've got a dataset of ~4000 signaling proteins, and I am creating a schema for these on Sandbox today. I'll post a link when this is done, either later today or some time tomorrow.
-
OK, I have created a very spare schema for proteins over on Sandbox. Here's a link to the schema:
http://sandbox.freebase.com/view/schema?id=%2Fbiology%2Fprotein
And here's an example instance of the type:
http://sandbox.freebase.com/view?id=%239202a8c04000641f8000000000007cb1
Each of the 4000 or so proteins from the Signaling Gateway will have a URL pointing back to their Gateway entry, along with (if it exists) the text blurb from the matching Wikipedia article. I've also created a field for the Entrez database key.
Things I've left off for the moment:
- Organism - other people are working on the model for this type, and I'd like to wait and see what they do before we introduce this property.
- Protein Function - I suspect there will be a large number of different protein functions, as we get instances of proteins from other databases, and so I want to model this in a clean way. I'm not sure how to do this yet, and so I think we should stick with the spartan schema we have now until we have data from a second source to help prove the model.
Feedback appreciated. This schema should be on Sandbox until Monday May 28 (the next Sandbox refresh).
-
-
-
Has anyone already heard of the Encyclopedia of Life (http://www.eol.org)?
-
