Where could one get a corpus (or corpora) of balance sheets, tax reports, and other accounting documents?

Maybe my terms are off; I am not an accountant. My point is that all the companies in America keep generating these sorts of accounting documents as part of their normal operation. Presumably millions of them exist somewhere.

So suppose we wanted to build a “CPA AI” or maybe just a “lowly bookkeeper AI”. Usually, building an AI for a new problem requires analyzing a large corpus of inputs and outputs, where the outputs may be existing human solutions to the problem. So perhaps some of these documents can be considered “inputs” (e.g. a list of employee salaries, a list of sales transactions, a tax schedule, etc.) and others “outputs”. Then we could build an AI and test it for the ability to generate predicted outputs sufficiently close to the observed (human) ones based on the same inputs.
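To make the "sufficiently close" test concrete, here is a minimal Python sketch of how a predicted statement might be scored against a human-prepared one. Everything in it is an assumption for illustration: the function name `line_item_agreement`, the 1% relative tolerance, the flat name-to-amount representation of a statement, and the figures themselves.

```python
from math import isclose


def line_item_agreement(predicted, observed, rel_tol=0.01):
    """Return the fraction of observed line items that the prediction
    reproduces within a relative tolerance (hypothetical metric)."""
    if not observed:
        raise ValueError("observed statement has no line items")
    matches = sum(
        1
        for name, value in observed.items()
        if name in predicted and isclose(predicted[name], value, rel_tol=rel_tol)
    )
    return matches / len(observed)


# Made-up figures, for illustration only.
human_prepared = {
    "total_assets": 120_000.0,
    "total_liabilities": 45_000.0,
    "equity": 75_000.0,
}
model_output = {
    "total_assets": 120_500.0,      # within 1% of the human figure
    "total_liabilities": 45_000.0,  # exact match
    "equity": 70_000.0,             # off by more than 1%
}

score = line_item_agreement(model_output, human_prepared)
print(f"{score:.2f}")  # 0.67
```

A real project would probably need something richer than a per-line tolerance (for example, checking that the statement still balances), but a score like this would at least allow comparing models on the same inputs.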

OK, so suppose we wanted to launch such a project tomorrow. Where could we get all those documents? Are some of the most essential ones confidential and not disclosed to anybody? Are most of them disclosed to the government on a regular basis so that, hypothetically, they could be requested from the government in anonymized form? Would we need to go around to various companies asking for goodwill donations of data, after anonymization?

The SEC maintains a database of all registered companies, where you can find balance sheets, income statements, and the like. The system is called EDGAR.
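If you want to pull figures programmatically rather than browse filings, the following Python sketch shows one way it might look. It assumes the SEC's JSON "company facts" endpoint on data.sec.gov and a response layout of `facts -> us-gaap -> concept -> units -> USD`; both the URL pattern and the layout should be checked against the SEC's current developer documentation before relying on them. The helper names and the sample payload are made up, and 320193 is, to my knowledge, Apple's CIK.

```python
import json
import urllib.request


def company_facts_url(cik: int) -> str:
    # Assumed endpoint pattern; the CIK is zero-padded to 10 digits.
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"


def fetch_company_facts(cik: int, user_agent: str) -> dict:
    # The SEC asks automated clients to identify themselves, so pass
    # something like "Your Name your.email@example.com" as user_agent.
    request = urllib.request.Request(
        company_facts_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def annual_values(facts: dict, concept: str, unit: str = "USD") -> dict:
    """Map period end date -> reported value, keeping only 10-K entries."""
    entries = (
        facts.get("facts", {})
        .get("us-gaap", {})
        .get(concept, {})
        .get("units", {})
        .get(unit, [])
    )
    return {e["end"]: e["val"] for e in entries if e.get("form") == "10-K"}


# Offline demo on a made-up payload shaped like the assumed response.
sample = {
    "facts": {
        "us-gaap": {
            "Assets": {
                "units": {
                    "USD": [
                        {"end": "2020-12-31", "val": 1000, "form": "10-K"},
                        {"end": "2021-06-30", "val": 1100, "form": "10-Q"},
                        {"end": "2021-12-31", "val": 1200, "form": "10-K"},
                    ]
                }
            }
        }
    }
}

print(company_facts_url(320193))
print(annual_values(sample, "Assets"))
```

Calling `fetch_company_facts(320193, "Your Name your.email@example.com")` would perform the actual download; the demo above deliberately stays offline.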

I do not believe that corporate tax returns are public record (unless the corporation is a non-profit), although there was some brouhaha over the last 6-8 years with inquiries from various senators to the SEC asking if the public would benefit if such returns were made public.