Discussions on Please talk to us before uploading large data sets

What's the best way to do that?

Is the best way to do that just by sending an email to alpha@metaweb.com?

either is fine. Thanks!

Economic Time Series Data

I've just signed on with an interest in quantitative economic data (and other stuff, but let's start there). First, I don't see any such data available; a query only produced a definition. Since much of this is government-prepared, it is public domain, and I'd like to begin uploading it, starting with currently active time series and some interesting historical series. My vision was that if we had a fair number of others involved, we could easily allocate the tasks of regularly uploading current data as well. I'm not a developer, so I need guidance, but I do know data and where it is available in a public domain format. Let me know how to proceed (and when, if it is still too soon). Of course, if I've missed it and this kind of data is already there, let me know. Thanks, Bob A

Hi Bob- The best way to get something like this going is to create the data types in your private domain, then try loading a small subset of the data to see if your model is working or not (you don't need to dump it all up there). Getting the model right is actually the hard part. :-) Once you're happy with the model, add a discussion post to the domain you think your data would be most appropriate for, and the admins can help you take it from there. And feel free to post any questions that come up during the modeling process. Good luck - and have fun!

Hi Bob, I'm interested in time series as well, and have initiated a discussion of freebase support for them associated with the Data Modelling entry in the help pages. fyi, I'm developing a wiki environment for collaborative analysis of public domain data, especially time series. I'm investigating the possibility of using Freebase as the data engine. Regards, Mike

Nonprofit Organizations

I'd like to start work on the IRS tax-exempt organizations, but Sandbox is not allowing me to log in. Do I have to wait for Monday, when Sandbox is refreshed?

If you're a newly registered user, then unfortunately, yes, you've guessed correctly. User registration data is copied over from freebase.com to sandbox.freebase.com once a week (usually on Monday afternoon). Sorry - we hope to improve this in the future.

census data

We would like to use freebase to store census data of India. There are over 120 fields and many thousands of records. Say over 300,000 records. Our idea is to also add GIS info for these records eventually. For now, it would be great if we can upload a few thousand records, which are available as xls files.

dinesh - This sounds like a very interesting data set. As you may have noticed we have been somewhat selective in our use of the US Census data, trying to identify the level of detail that is useful to people as well as integrating the Census data across Freebase domains. I'd be very interested to work with you to select and map the India Census fields as well as cull the appropriate level of detail from the records (for instance, we have deliberately omitted the "block" level of detail in the US Census.)

I did not get a notification of this response, even though I am watching this thread. Thus my delay in response.

Good to know you can assist. Look forward. What next?
Let me know how we can communicate further. Shall I upload a few files of census data somewhere and make it available to you?

BTW, I have not found a discussion on US census data on freebase. Can you give me a link? In fact, I wanted to look at any census data types but could not locate them (on freebase).

d

ps: cc replies to my email if you can

Food Recipes

Hello, I have found that there are a lot of recipes available in the public domain in mealmaster format or, many times, as RML (recipe XML files). I have written a script to convert those formats and insert them in my sql database. I am sure this would be very valuable information for Freebase, and it would also help me write my application without needing to store all the recipes myself.

The list of recipes I am talking about is here: http://dsquirrel.tripod.com/recipeml/indexrecipes2.html
Please take a look at one of the files and tell me if this is valuable information for Freebase, and I will start experimenting in sandbox.freebase.com to see how the data will be stored. Also, I did not see a category for recipes here on Freebase; are you planning to include one at a later stage?

Thanks,
Kiril

Hi Kiril-

There was some discussion in the "Food" domain awhile back about adding recipes. You might want to post something there and see if anyone else is starting to work on it. You can read that discussion here.

The link is broken or leads to nothing?! Is there any solution for adding recipes to Freebase?

sebastian 

Hi Sebastian,

User skud was looking for reviewers on her food domain. I would think that would be a good starting point for developing a recipe type that could use those types. Perhaps you may want to collaborate with other users on creating a recipe type? You could start a discussion on the data-modeling list (if you're not part of the list, you can join here).

Currency exchange rate

I've just set up an Exchange rate compound value and an Exchanged currency type for handling exchange rates for currency. I've currently associated one value to US $ and would like to do a much larger import of exchange rate data from the Federal Reserve Bank.

Can you give us an estimate of how large a data-set you're talking about?

Right now, I've got USD-AUD from 1990 to 1999, which is 2561 records. I'm getting the data from http://www.federalreserve.gov/releases/h10/Hist/default1999.htm and I have a script ready to go to do the import of the AUD-USD dataset. Afterwards I'd like to import more recent data, and then start to import other currencies from the reserve bank's datasets.

You've generated a lot of interest here over your dataset, and we'd like to work with you to put your types in the top-level /finance domain. What we probably would do is hook the "exchange rate" type to the existing "currency" type, rather than keeping the "exchanged currency" type. What do you think?

That was what I originally wanted to do, but realized I obviously couldn't. Having made exchanged currency though, I'm not 100% sure that it's not a better option now. The basic idea is that not all currencies are exchanged. On the other hand, I think just having a blank target/source field may be enough to let you know it's not exchanged :)

I've connected "exchange rate" to the currency type; it was more complicated to do than I expected, so please take a look at the types and let me know whether I did it correctly. If it's set up correctly, we can move the "exchange rate" type to the finance domain.

I looked at the currency pages and they look fine, and the schema looks fine, so I'm going to assume it was connected correctly. I've been trying to figure out whether we want to do reciprocal exchange rates. Since with AUD/USD I can calculate USD/AUD easily, would it be a good idea to add both records, or to just let people do the calculations themselves?

I've moved "exchange rate" to the finance domain. This change will be copied to sandbox tonight, so you will be able to try out your import there. I'm not sure about the reciprocal exchange rates, but I'll ask around.

I think that because the source and target are explicitly modeled there is no need to add the reciprocal rates since an application can just reverse them.
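
To illustrate the point about reversing rates, here is a minimal Python sketch. The ExchangeRate record, its field names, and the sample value are hypothetical (they are not the Freebase schema or real Federal Reserve figures); the only assumption is that a rate is stored as units of target currency per one unit of source currency.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExchangeRate:
    # Hypothetical record shape, not the actual Freebase type.
    source: str   # currency being converted from
    target: str   # currency being converted to
    day: date     # date the rate applies to
    rate: float   # units of target per one unit of source


def reciprocal(r: ExchangeRate) -> ExchangeRate:
    """Swap source and target and invert the rate."""
    if r.rate <= 0:
        raise ValueError("rate must be positive")
    return ExchangeRate(source=r.target, target=r.source, day=r.day, rate=1.0 / r.rate)


# Made-up value chosen only to keep the arithmetic obvious.
aud_usd = ExchangeRate("AUD", "USD", date(1999, 1, 4), 0.5)
usd_aud = reciprocal(aud_usd)
```

Because the reverse direction is a one-line computation, storing only one direction per currency pair keeps the dataset half the size without losing information.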

I attempted an import on sandbox today. It seems not to have worked out so well. After doing a lot of reformatting on the date to get it into a form that the API would accept, I ended up getting a 503 every time I tried to do the import. I didn't see anything on sandbox's web interface, so I kept trying. Much later I looked and there were records, but it didn't look like they'd all been imported, so I made my import script do the imports in 100-unit segments. Then I started to get a "2 entries match" error, so I looked that up and discovered that I had duplicate records even though I'd used unless_exists in my create clause! I'm not exactly sure what to do at this point. Is there a way to delete all the records in a type on sandbox?

There's no quick way to delete anything. You could de-type them, though, which would get them out of the way of your scripts, at least. Questions about scripts, MQL, and the API can be posted to the Freebase developers' list, also, where you'll probably get a quicker response than the 17 hours it took for this (sorry! my RSS reader lost this message for some reason): http://lists.freebase.com/mailman/listinfo/developers.

I've subscribed to developers now. Thanks for the detyping tip. I de-typed all the entries and imported 3 datasets (1971-1989, 1990-1999, 2000-2008) for USD-AUD into sandbox and everything looks ok. I did get a 400 error about midway into doing 1971-1989 but I just restarted it from the last entry that was added and it worked out.
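
The pattern described in this thread (import in segments of 100, and restart from the last entry added after an error) could be sketched roughly as follows. This is not the poster's script: import_in_batches and send_batch are hypothetical names, and send_batch stands in for whatever call actually writes a batch to the API.

```python
from typing import Callable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chunks(records: Sequence[T], size: int, start: int = 0) -> Iterator[Tuple[int, Sequence[T]]]:
    """Yield (offset, batch) pairs of at most `size` records, beginning at index `start`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for offset in range(start, len(records), size):
        yield offset, records[offset:offset + size]


def import_in_batches(records: Sequence[T],
                      send_batch: Callable[[Sequence[T]], None],
                      size: int = 100,
                      start: int = 0) -> int:
    """Send records batch by batch; return the index of the first record not yet sent.

    If send_batch raises, stop and return the position to resume from.
    """
    done = start
    for offset, batch in chunks(records, size, start):
        try:
            send_batch(batch)
        except Exception:
            return done  # caller can retry from this index
        done = offset + len(batch)
    return done


# Demonstration with a fake sender that fails once, on its second call.
sent = []
calls = {"n": 0}


def flaky_send(batch):
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated error response")
    sent.extend(batch)


records = list(range(250))
pos = import_in_batches(records, flaky_send)             # stops after the first batch
pos = import_in_batches(records, flaky_send, start=pos)  # resumes where it stopped
```

Note that resuming by index does not by itself prevent duplicates when a request that reported an error was in fact applied on the server, which appears to be what happened with the 503 responses described above.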

'Similar artists' data for musical artists

Hi there, 

I have a data set that lists, for about 38,000 artists with a MusicBrainz ID, the artists that are most related to each one according to last.fm.
I have about 60,000 more without a MusicBrainz ID, and I can obviously retrieve more data through the last.fm web services.

My plan is to add all artists with a MusicBrainz ID that are more than 80% similar to an artist (about two or three artists, usually) as a 'similar artist' relation. Is that ok?

As an example, I added links for about 100 artists to the sandbox. I do my lookups based on the MusicBrainz ID, and do not create new artists. See: Édith Piaf

Any comments? Should I somehow link the artists to their last.fm page after processing? Add their last.fm urlname as a key, perhaps?

Thanks,

- Jeroen

Thanks, Jeroen. First, please be sure that the data can be contributed legally; we did not collect the artist similarity information from MusicBrainz, for instance, because it is licensed CreativeCommons-Non-Commercial, and as a commercial enterprise, we can’t legally use that data.

If you have permission to load that data, then this sounds like a great idea! You could also add the last.fm page as a Web link. For lookup, since last.fm and Freebase both use MusicBrainz keys, a key is probably not needed.

Ugh, right, I forgot to check the NonCommercial part. Never mind...

NNDB Links for People

I would like to add about 20,000 links from people to their NNDB pages. To do this I've created the NNDB Profile Page type.

I've gone through all the people on Freebase and matched them to their NNDB page by name. In the cases where several people share the same name, I simply ignore those pages.

Would it be appropriate to upload this data to the sandbox?
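
For readers who want to try the same approach, here is a rough Python sketch of name-based matching that skips ambiguous names. It is not the poster's code: the function name, the (name, id) input shape, the sample names and ids, and the lowercase/strip normalization are all assumptions made for illustration.

```python
from collections import defaultdict


def unique_name_matches(freebase_people, nndb_pages):
    """Match records by name, ignoring any name shared by several records.

    Both arguments are iterables of (name, id) pairs.
    Returns {freebase_id: nndb_id} for names that occur exactly once on each side.
    """
    def index(pairs):
        by_name = defaultdict(list)
        for name, ident in pairs:
            # Assumed normalization: trim whitespace and ignore case.
            by_name[name.strip().lower()].append(ident)
        return by_name

    fb = index(freebase_people)
    nndb = index(nndb_pages)
    return {
        ids[0]: nndb[name][0]
        for name, ids in fb.items()
        if len(ids) == 1 and len(nndb.get(name, [])) == 1
    }


# Made-up sample data: "John Smith" is ambiguous on the Freebase side and is skipped.
people = [("Alice Example", "fb1"), ("John Smith", "fb2"),
          ("John Smith", "fb3"), ("Bob Sample", "fb4")]
pages = [("alice example", "n1"), ("John Smith", "n2"), ("Carol Other", "n3")]
matches = unique_name_matches(people, pages)
```

Skipping shared names trades coverage for precision: a unique name can still refer to two different people, so spot-checking the results on the sandbox remains worthwhile.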

It would indeed. Please go ahead.

OK, the complete set of links has been written to the sandbox. Please check out the results and let me know if I can write them to the main database.

The data looks good!

However, I lied when I said earlier, on the mailing list, that the IMDB Profile Page model was the way to go. We now have the ability to use keys into an external database as a way to generate URIs, which further provides uniqueness checking. I am working on converting our IMDb references into this form. It would be great if you could wait on the final NNDB load and model it that way; I will be happy to show you how once I figure it out myself. (-:

Sounds like a great way to model these things. I look forward to learning how to use this new technique.

Postsecondary Schools

I'd like to get the list of all postsecondary schools loaded.

The nasty format is here: http://www.ed.gov/offices/OSFAP/PEPS/dataextracts.html and I can get it into a better format but the current definition of "school" in freebase leaves a lot to be desired.

 

What should I do?

The types you should be looking at are /education/institution and /education/university. Between them, they should have all the properties for post-secondary schools. (Educational Institution has properties that are common to schools of all levels.)

If there is data in your dataset that you want to load, and which doesn't match any existing properties, we can talk about adding new properties. The first place to start, though, would be to create a type in your private domain to hold those properties so you can test them out and so that people can review them before adding them to the types in the education domain.

This looks like a great dataset. Feel free to keep posting questions while you work on it.

Link airports as containedby for appropriate locations

Airport is co-typed as location, so I would like to set airport as the containedby value for the correct locations.

For example Amsterdam Schiphol should be contained by Amsterdam and Netherlands.

I have a script and have run it on the sandbox. It gets each airport's serves location and adds the airport as a contains value for that location and for all locations that contain it.

For example, Amsterdam Schiphol serves Amsterdam, so Schiphol is added as a contains value for Amsterdam; Amsterdam is contained by Netherlands, so Schiphol is also added as a contains value for Netherlands.
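
The propagation described in this post can be sketched in Python roughly as below. This is an illustration, not the poster's script: the function name and the two dictionaries (serves and contained_by) are hypothetical stand-ins for data that would really come from Freebase queries.

```python
def locations_containing_airport(airport, serves, contained_by):
    """Return every location that would get the airport as a contains value.

    serves:       {airport: [locations it serves]}
    contained_by: {location: [locations that directly contain it]}
    The result is the served locations plus all of their ancestors.
    """
    result = set()
    stack = list(serves.get(airport, []))
    while stack:
        loc = stack.pop()
        if loc in result:
            continue  # already handled; also guards against cycles
        result.add(loc)
        stack.extend(contained_by.get(loc, []))
    return result


# The example from the post above.
serves = {"Amsterdam Schiphol": ["Amsterdam"]}
contained_by = {"Amsterdam": ["Netherlands"]}
targets = locations_containing_airport("Amsterdam Schiphol", serves, contained_by)
```

The sketch treats "serves" as if it implied containment, which is a modeling assumption rather than a fact about any particular airport.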

Is Schiphol actually contained by Amsterdam?  I am not that familiar with Dutch location containment, but it seems like Schiphol is contained by the municipality of Haarlemmermeer and possibly contained by the city Schiphol-Rijk.

Are you just adding containment by country, or containment by administrative division also?  If it is also the latter, there could also be the situation where an airport may serve one location, but be contained by a different administrative division.  The example I'm thinking of is Newark Airport is contained by New Jersey, but serves New York City.

You have a point. I think no airport is actually in the city it serves; it would be just outside the city. But nobody says they're flying to Schiphol-Rijk or whatever (unknown, small) town is closest to the airport or contains the airport.

I guess that's why serves is there. In the case of Schiphol, the administrative division Amsterdam is responsible for Schiphol.

I will only add the countries containing the airport.

FYI - We are trying to get our hands on some country/airport data also, so if licensing is compatible, we may be able to add country containment to a large percentage of airport topics, and perhaps add new ones!