Convenient, public data for sample applications and examples

Dave_Hartwick · May 5, 2014, 9:04am

Hi,

I’m thinking of adapting a training project I did for a previous employer and making it available for free online. One of the reasons I haven’t bothered is that the data in it belongs to that employer, and I don’t want to make up a few hundred names and associated information. I thought about using sports statistics, but those are probably the property of, for example, Major League Baseball, and I could get in trouble if I used them without their express written permission.

What do people use for stuff like this? It would help if there are at least names and start and end dates for this project.

bob_2 · May 5, 2014, 10:12am

I typed “list of random names” into a search engine and it came up with many choices, one of which was this: http://listofrandomnames.com/

Unless the data you have is so specific to your employer that it could be identified, I can’t see the problem with that.

md2000 · May 5, 2014, 2:20pm

Another option is to randomize, i.e. shuffle last names, addresses, etc. so that John Smith and Donald Jones become Donald Smith and John Jones. As long as you avoid the obvious (e.g. keep gender and first name together to avoid the “huh?” response) and as long as you don’t need valid data (e.g. if I swap 123 Morningside Lane with 21 Main St. Apt 13, the example cares not if 123 Main St. does not have apartments, or if the zip code does not match the address, or the building does not exist…) then nobody can complain.

Don’t make it so obvious that it’s reversible, i.e. don’t just shuffle everything down one record. Better yet, shuffle every first name down one record, then randomize, so that if the random algorithm misses one, it’s still not valid. Ditto for street number: shuffle down 3 or 4, so that last name and street number don’t match, then apply a randomizer over and over (for A = 1 to number of lines, swap Line A with Line RND(B)). For phone numbers, randomly replace the odd digits maybe (PHONE = PHONE + Rnd(X)*1000).

Then do a SQL comparison against the original set (if that’s your tech level) to make sure you did not randomize back to any originals (i.e. check whether first and last names match, or last name and street address match).
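For anyone who would rather do the shuffle and the check in one script instead of SQL, here is a rough Python sketch of the idea. It is only an illustration: the field names (`first`, `last`, `street`), the sample rows, and the helper names `shuffle_columns`, `surviving_pairs`, and `scramble` are all made up for this example, and it uses a plain in-memory set comparison in place of the SQL comparison described above.

```python
import random


def shuffle_columns(records, columns, rng):
    """Return copies of the records with each listed column shuffled independently."""
    shuffled = [dict(r) for r in records]
    for col in columns:
        values = [r[col] for r in records]
        rng.shuffle(values)
        for row, value in zip(shuffled, values):
            row[col] = value
    return shuffled


def surviving_pairs(original, candidate, col_a, col_b):
    """List (col_a, col_b) value pairs in candidate that also occur in original."""
    known = {(r[col_a], r[col_b]) for r in original}
    return [(r[col_a], r[col_b]) for r in candidate if (r[col_a], r[col_b]) in known]


def scramble(records, columns, checked_pairs, seed=None, max_tries=1000):
    """Reshuffle until none of the checked column pairs matches an original record."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = shuffle_columns(records, columns, rng)
        if not any(surviving_pairs(records, candidate, a, b) for a, b in checked_pairs):
            return candidate
    raise RuntimeError("no clean shuffle found; the data may have too many duplicates")


# Made-up sample rows, purely for illustration.
people = [
    {"first": "John", "last": "Smith", "street": "123 Morningside Lane"},
    {"first": "Donald", "last": "Jones", "street": "21 Main St. Apt 13"},
    {"first": "Maria", "last": "Lopez", "street": "7 Oak Ave."},
    {"first": "Wei", "last": "Chen", "street": "400 River Rd."},
    {"first": "Anna", "last": "Kowalski", "street": "58 Hill Ct."},
]

scrambled = scramble(
    people,
    columns=["last", "street"],
    checked_pairs=[("first", "last"), ("last", "street")],
    seed=42,
)
for row in scrambled:
    print(row["first"], row["last"], "-", row["street"])
```

First names stay in place here (so first name and gender would stay together), while last names and streets are shuffled separately. If the real data has many repeated surnames, the check may never pass, which is why the sketch gives up after `max_tries` rather than looping forever.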

You’ll end up with a few funny ones, Guido Nakamura or Rajesh Schmidt, but certainly nobody will be personally identifiable.

Raza · May 5, 2014, 5:31pm

I imported a whole heap of US Census name data, plus an entire city’s worth of street names; I can generate a set of basic random name/address data for you (random first name with random last name, not an actual person), provided that the addresses don’t have to be real (they won’t be geocode-able).
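The general approach is easy to reproduce at home. The following is a minimal Python sketch, not the actual tool described above: the short name and street lists are placeholders standing in for much longer imported lists, the function names are invented for the example, and the start/end dates are an add-on (with an arbitrary date range) because the original question asked for them.

```python
import random
from datetime import date, timedelta

# Tiny placeholder lists; a real run would load much longer lists from files.
FIRST_NAMES = ["Alice", "Rajesh", "Guido", "Mei", "Carlos", "Fatima"]
LAST_NAMES = ["Nakamura", "Schmidt", "Okafor", "Garcia", "Novak", "Singh"]
STREET_NAMES = ["Main St.", "Morningside Lane", "Oak Ave.", "River Rd."]


def random_person(rng):
    """Build one fictional record from independently chosen parts."""
    # Assumed date range: starts fall within roughly four years of 2010-01-01.
    start = date(2010, 1, 1) + timedelta(days=rng.randrange(1461))
    end = start + timedelta(days=rng.randint(30, 730))
    return {
        "first": rng.choice(FIRST_NAMES),
        "last": rng.choice(LAST_NAMES),
        "address": f"{rng.randint(1, 9999)} {rng.choice(STREET_NAMES)}",
        "start_date": start,
        "end_date": end,
    }


def random_people(count, seed=None):
    """Generate `count` fictional records; a fixed seed gives repeatable output."""
    rng = random.Random(seed)
    return [random_person(rng) for _ in range(count)]


sample = random_people(300, seed=2014)
for person in sample[:3]:
    print(person)
```

Because every field is picked independently, the addresses are not real and any resemblance to an actual person is coincidental. Passing a seed keeps the sample data identical between runs, which is handy for a training project.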

Quercus · May 5, 2014, 7:18pm

Facts, for instance baseball statistics, generally can’t be someone’s property. You can copyright an expression (such as a beautiful chart you made of some data), but you can’t copyright the data itself. Nor can baseball plausibly claim that basic statistics are trade secrets or anything else.
You might remember that the stuff that needs express written consent is using the images or descriptions of the game produced by the broadcaster; there’s nothing there that says you can’t talk about the game yourself.

So go nuts with sports data, if you find a decent source.
