Maybe my terms are off; I am not an accountant. My point is that every company in America keeps generating these sorts of accounting documents as part of its normal operation, so presumably millions of them exist somewhere.
So suppose we wanted to build a “CPA AI” or maybe just a “lowly bookkeeper AI”. Usually, building an AI for a new problem requires analysis of a large corpus of inputs and outputs, where the outputs may be existing human solutions to the problem. So perhaps some of these documents can be considered “inputs” (e.g. a list of employee salaries, a list of sales transactions, a tax schedule, etc.) and others “outputs”. Then we could build an AI and test it for the ability to generate predicted outputs sufficiently close to the observed (human) ones based on the same inputs.
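To make the "test it against the human outputs" idea a bit more concrete, here is a rough sketch of what such a check might look like. Everything in it is my own assumption for illustration: the field names (`sales`, `salaries`, `net_income`), the 1% tolerance, and the toy `baseline_model` are made up, and real accounting documents would be far messier than a flat dictionary of numbers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    # Hypothetical "input" documents reduced to numbers (e.g. totals of salaries, sales).
    inputs: Dict[str, float]
    # Hypothetical "output" figures taken from the human-prepared document.
    human_output: Dict[str, float]


def relative_error(predicted: float, observed: float) -> float:
    """Relative difference between a predicted and an observed figure."""
    if observed == 0:
        return abs(predicted)
    return abs(predicted - observed) / abs(observed)


def score(model: Callable[[Dict[str, float]], Dict[str, float]],
          examples: List[Example],
          tolerance: float = 0.01) -> float:
    """Fraction of examples where every human figure is matched within tolerance."""
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        predicted = model(ex.inputs)
        ok = all(
            field in predicted
            and relative_error(predicted[field], value) <= tolerance
            for field, value in ex.human_output.items()
        )
        hits += ok
    return hits / len(examples)


def baseline_model(inputs: Dict[str, float]) -> Dict[str, float]:
    # Toy stand-in for the "AI": net income is simply sales minus salaries.
    return {"net_income": inputs["sales"] - inputs["salaries"]}


examples = [
    Example({"sales": 1000.0, "salaries": 600.0}, {"net_income": 400.0}),
    Example({"sales": 500.0, "salaries": 450.0}, {"net_income": 45.0}),
]
accuracy = score(baseline_model, examples)
```

With these two made-up examples the toy model matches the first human figure and misses the second, so `accuracy` comes out as 0.5. The point is only that "sufficiently close" has to be pinned down as some measurable tolerance before a corpus of documents is of any use.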
Ok, so suppose we wanted to launch such a project tomorrow. Where could we get all those documents? Are some of the most essential ones confidential and not disclosed to anybody? Are most of them disclosed to the government on a regular basis, so that, hypothetically, they could be requested from the government in anonymized form? Or would we need to go around to various companies asking for goodwill donations of data, after anonymization?