quoth http://groups.google.com/group/get-theinfo/browse_thread/thread/765e12ff2cbeaf13?pli=1
Most EDGAR docs (but not all) are available in a very poorly adhered-to markup language - no OCR required, but you do need to apply many heuristics to determine the intent of whoever wrote the document. For example, from memory, table columns are defined with a <C> tag, but this tag simply marks the tab stop that delimits the column; it is not an XML-style tag. So once you have found a table with columns, you need to do some guesswork to determine what is a header, versus units, versus actual column content.
…
Every financial firm that I know of just pays boatloads of cash to one of Reuters, Bloomberg, etc., who have armies of ‘encoders’ who manually enter the data into a normalized format. And even then, if you pony up the minimum of ~$10k/month for these feeds, you get to deal with the joy of an entirely different set of problems - ones you never get to scratch the surface of when you write your own parser, because you never get far enough along with that problem. The second-order problems come from company restatements, changing accounting standards, changing reporting periods, etc.
Can anybody comment on or elaborate on these claims, either about the market for these parsing services or about the “state of the art” mechanics of parsing the filing documents?
Do they really use lots of data-entry people? Have they tried building visual, human-controlled productivity tools that would grab table contents regardless of the ad hoc HTML formatting issues?
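
For reference, here is a minimal sketch of the <C> heuristic the quote describes, assuming the old SGML-style EDGAR convention where a template line of <S>/<C> tags marks the tab stops for the stub and data columns. The function name and the sample table are invented for illustration; real filings deviate from this constantly, so treat the output as a first guess, not ground truth.

import re

def parse_edgar_table(block):
    """Heuristically split an old-style EDGAR <TABLE> block into rows of cells.

    Assumes a template line made up only of <S>/<C> tags marks the column
    tab stops. Deciding which rows are headers, units, or data still takes
    per-filing guesswork, as the quote says.
    """
    stops = None
    rows = []
    for line in block.splitlines():
        # A template line consists solely of <S>/<C> tags and whitespace.
        if re.fullmatch(r"(?:\s*<[SC]>)+\s*", line):
            stops = [m.start() for m in re.finditer(r"<[SC]>", line)]
            continue
        # Skip everything before the template line, blanks, and other tags.
        if stops is None or not line.strip() or line.lstrip().startswith("<"):
            continue
        # Slice each data line at the tab stops the template defined.
        cells = [
            line[start:(stops[i + 1] if i + 1 < len(stops) else None)].strip()
            for i, start in enumerate(stops)
        ]
        rows.append(cells)
    return rows

sample = """<TABLE>
<CAPTION>
                           2008       2007
<S>                        <C>        <C>
Revenue                    1,204      1,100
Net income                   310        295
</TABLE>"""

for row in parse_edgar_table(sample):
    print(row)  # e.g. ['Revenue', '1,204', '1,100']

Even in this toy case, note how much is left unhandled: nothing distinguishes the year headers from data, numbers are right-aligned so they can drift across tab stops, and multi-line stubs break the slicing entirely - which is presumably why the quote calls it guesswork.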