I’m thinking of adapting a training project I did for a previous employer and making it available for free online. One of the reasons I haven’t bothered is that the data in it belongs to that employer, and I don’t want to make up a few hundred names and associated information. I thought about using sports statistics, but those are probably the property of, for example, Major League Baseball, and I could get in trouble if I use them without their express written permission.
What do people use for stuff like this? It would help if there are at least names and start and end dates for this project.
Another option is to randomize, i.e. shuffle last names, addresses, etc. so that John Smith and Donald Jones become Donald Smith and John Jones. As long as you avoid the obvious (i.e. keep gender and first name together to avoid the “huh?” response) and as long as you don’t need valid data (i.e. if I swap 123 Morningside Lane with 21 Main St. Apt 13, the example cares not if 123 Main St. does not have apartments, or if the zip code does not match the address, or the building does not exist…) then nobody can complain.
Don’t make the scramble so obvious that it’s reversible, i.e. don’t just shift everything down one record. Better yet, shift every first name down one record, then randomize, so that if the random algorithm misses one, it’s still not valid. Ditto for street number: shift it down 3 or 4 records, so that last name and street number don’t match, then apply a randomizer over and over (for A = 1 to number of lines, swap line A with line RND(B)). For phone numbers, randomly replace the odd digits maybe (PHONE = PHONE + Rnd(X)*1000).
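For what it’s worth, here is a rough Python sketch of that kind of scramble. It is not the exact shift-then-swap loop described above; it swaps in `random.shuffle` on each column separately, which has the same effect of breaking the link between a last name and its street. The function name `shuffle_columns`, the field names, and the two extra sample people are made up for illustration. First names are left alone so they stay with whatever gender column you have.

```python
import random

def shuffle_columns(records, columns, seed=None):
    """Return copies of the records with each named column shuffled on its own."""
    rng = random.Random(seed)
    shuffled = [dict(r) for r in records]  # copy so the originals stay untouched
    for col in columns:
        values = [r[col] for r in records]
        rng.shuffle(values)  # each column gets a different random order
        for row, value in zip(shuffled, values):
            row[col] = value
    return shuffled

# Hypothetical sample rows; real data would come from your own table.
people = [
    {"first": "John", "last": "Smith", "street": "123 Morningside Lane"},
    {"first": "Donald", "last": "Jones", "street": "21 Main St. Apt 13"},
    {"first": "Maria", "last": "Lopez", "street": "8 Oak Ave."},
    {"first": "Wei", "last": "Chen", "street": "400 River Rd."},
]

scrambled = shuffle_columns(people, ["last", "street"], seed=42)
for row in scrambled:
    print(row["first"], row["last"], "-", row["street"])
```

A plain shuffle can leave some values where they started, so on its own this does not guarantee that every record changed; that is why a comparison against the original set is still worth doing.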
Then do a SQL comparison against the original set (if that’s your tech level) to make sure you did not randomize back to any originals (i.e. check whether first and last names still match, or last name and street address still match).
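As a sketch of what that comparison could look like, here is one way to do it with Python’s built-in `sqlite3` module and an in-memory database. The table names, column names, and the function `find_surviving_pairs` are assumptions for the example, and the sample rows are invented; adapt the join to whatever columns you actually scrambled.

```python
import sqlite3

def find_surviving_pairs(original, scrambled):
    """Return scrambled rows that still share a first+last or last+street pair
    with some row of the original data."""
    conn = sqlite3.connect(":memory:")
    for table in ("original", "scrambled"):
        conn.execute(f"CREATE TABLE {table} (first TEXT, last TEXT, street TEXT)")
    conn.executemany("INSERT INTO original VALUES (?, ?, ?)", original)
    conn.executemany("INSERT INTO scrambled VALUES (?, ?, ?)", scrambled)
    query = """
        SELECT DISTINCT s.first, s.last, s.street
        FROM scrambled AS s
        JOIN original AS o
          ON (s.first = o.first AND s.last = o.last)
          OR (s.last = o.last AND s.street = o.street)
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

# Hypothetical data: the third scrambled row kept its real first+last pair.
original = [
    ("John", "Smith", "123 Morningside Lane"),
    ("Donald", "Jones", "21 Main St. Apt 13"),
    ("Maria", "Lopez", "8 Oak Ave."),
]
scrambled = [
    ("John", "Jones", "8 Oak Ave."),
    ("Donald", "Lopez", "123 Morningside Lane"),
    ("Maria", "Lopez", "21 Main St. Apt 13"),
]

leaks = find_surviving_pairs(original, scrambled)
print(leaks)  # any rows listed here need to be scrambled again
```

If the result is empty, no first/last or last/street pair from the original survived; otherwise re-shuffle and check again.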
You’ll end up with a few funny ones, Guido Nakamura or Rajesh Schmidt, but certainly nobody will be personally identifiable.
I imported a whole heap of US Census name data, plus an entire city’s worth of street names; I can generate a set of basic random name/address data for you (random first name with random last name, not an actual person), provided that the addresses don’t have to be real (they won’t be geocode-able).
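If you would rather generate this kind of data yourself, a minimal sketch in Python might look like the following. The short name and street lists are placeholders standing in for real Census and street-name files, the date range and field names are arbitrary assumptions, and `make_record` is just an illustrative helper. Start and end dates are included because the original question asked for them.

```python
import random
from datetime import date, timedelta

# Placeholder lists; in practice these would be loaded from larger name/street files.
FIRST_NAMES = ["John", "Donald", "Guido", "Rajesh", "Maria", "Aiko"]
LAST_NAMES = ["Smith", "Jones", "Nakamura", "Schmidt", "Lopez", "Chen"]
STREETS = ["Main St.", "Morningside Lane", "Oak Ave.", "River Rd."]

def make_record(rng):
    """Build one fake person with a random address and a start/end date pair."""
    start = date(2015, 1, 1) + timedelta(days=rng.randrange(3650))
    end = start + timedelta(days=rng.randrange(30, 1500))  # end is always after start
    return {
        "first": rng.choice(FIRST_NAMES),
        "last": rng.choice(LAST_NAMES),
        "address": f"{rng.randrange(1, 999)} {rng.choice(STREETS)}",
        "start": start,
        "end": end,
    }

rng = random.Random(7)  # fixed seed so the same data set can be regenerated
rows = [make_record(rng) for _ in range(300)]
print(rows[0])
```

With lists this small you will get repeated full names; larger source lists make that much less likely.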
Facts, such as baseball statistics, generally can’t be anyone’s property. You can copyright an expression (such as a beautiful chart you made of some data), but you can’t copyright the data itself. Nor can baseball plausibly claim that basic statistics are trade secrets or anything else.
You might remember the stuff that needs express written consent is using the images or descriptions of the game produced by the broadcaster; there’s nothing there that says you can’t talk about the game yourself.
So go nuts with sports data, if you find a decent source.