I need a program to retrieve data from government sites

We have members/clients who need to know when their commercial accounts lose their business licenses. We would upload their commercial accounts receivable files to a company called SimpleVerity each month, and SimpleVerity would send them alerts whenever any of their accounts lost their business license. SimpleVerity was sold to another company, and then that company was sold to another company. The upshot is that one of our members/clients is not receiving the alerts they need to do business. The third company doesn’t seem to know what it’s doing, and they don’t seem to care about the product. Unfortunately, SimpleVerity was the only company providing this service. We’re thinking it might be feasible to provide the service ourselves. This is not something that I could do with Easytrieve, so we’d have to outsource the programming.

This is the concept of a plan I’ve come up with:

  • The member’s/client’s file is saved to a folder.
  • This new program would read the UBI number (or the state license number, or both) from the file on a daily/weekly/monthly basis and search the Washington State Dept of Labor & Industries database and the various county databases for expired/cancelled/invalid licenses.
  • When an invalid license is found, the program would write the UBI number, the business name, the business address, the license number, and the reason for cancellation to a file.
  • The output file would be emailed to the member/client.

So basically I want to save a file in a folder, and have a program that does everything else automatically.

The former president of our company, who retired but is on the Board, has a couple of people he can ask for such a program; but I said I’d ask you brainiacs at the Brain Trust for your opinions on what sort of program could do what we need.

ETA: It would also be nice to get the same information from Alaska and Hawaii, and maybe Oregon.

Is the data for all of the licenses available in one place? I’m guessing not and that it needs to be normalized and compiled into something.

Yeah…you’d need an API (application programming interface) which sits between you and the database you want to get info from. Chances are your database and theirs are not the same so the API will take info from the source database and put it where you want in yours (and maybe format it the way you like it).

You’d need one for every database you access (assuming they even allow you that access…they might or might not, and it will probably cost money).

In short, I think you need to hire someone to do this for you. Or, learn to do it yourself.

@Johnny_L.A, when you say government, is this State or County? I think you’re talking about all the counties in a state. Business licenses, I think, are county.

The information is on the state Dept. of Labor & Industries site, and more information is also in the databases of each of the counties.

Both. There are state licenses (Secretary of State, and Dept of Revenue), and counties also often require business licenses.

Negative. I’m only looking for ideas. TPTB will need to source it out.

This makes it sound like you have/get this data. Do these accounts receivable files have the date when the license will expire? Or did SimpleVerity take the business name and search license information across counties and state government?

We do not receive any license information. The client who is complaining sends UBI numbers, but I just checked another client’s file and that one doesn’t. So now I’m not entirely sure how the licenses were searched. But yes, they would search the state and county databases for business license information and then send alerts to the client when an expired license is found.

When you say “database”, what does that mean exactly? Is it a website, or something you have to download, or call them up for, or do they offer an API of their own (it’s rare)?

If it’s a website, generally speaking, the class of software you need is what developers call “web scrapers”. It’s a pretty standard category that many companies use to retrieve data from public sites (and they’re all the rage now because that’s how AI companies get their training data). Examples of commercial services in this category are https://webscraper.io/, https://www.octoparse.com/, and https://www.parsehub.com/ (not recommending any, just mentioning them as examples). If you’re hiring a programmer to do this, there are also many web scraping tools for whatever platform/language you need the finished app in.

When marketed directly to consumers, variations on the same tools are often called website change monitors; examples are https://visualping.io/, https://changedetection.io/, and https://changetower.com/.

You could potentially do this yourself with one of the above tools. Basically you record yourself searching a database, and then clean up the recorded steps (e.g. by replacing the UBI you typed in with a variable so it can apply to different UBIs). That will often work for simpler sites.

If you do need something custom-coded for you, broadly speaking, you just need to find a developer with experience in web scraping. Specific tools might include Scrapy, Selenium, Puppeteer, Playwright, etc., but you don’t need to use those yourself (they’re developer tools).
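
Just to make the idea concrete, here’s roughly the kind of thing a developer would write under the hood. This is only a sketch with invented details: the lookup URL, the “ubi” query parameter, and the page selectors below are placeholders, and a real state or county site might need a POST form or a headless browser (Playwright/Selenium) instead of a plain HTTP request.

```python
# Hypothetical sketch: the URL, "ubi" query parameter, and CSS selectors are
# invented placeholders; the real lookup pages will differ.
import requests
from bs4 import BeautifulSoup

LOOKUP_URL = "https://example.wa.gov/license-lookup"  # placeholder, not a real endpoint

def check_license(ubi: str) -> dict:
    """Search a (hypothetical) license lookup page for one UBI and pull out
    the fields the alert file needs."""
    resp = requests.get(LOOKUP_URL, params={"ubi": ubi}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "ubi": ubi,
        "business_name": soup.select_one(".business-name").get_text(strip=True),
        "status": soup.select_one(".license-status").get_text(strip=True),
    }

if __name__ == "__main__":
    print(check_license("601 123 456"))  # made-up UBI for illustration
```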

For reliability, I’d probably have the dev write this as a cloud service (“serverless” would be cheap for something like this) rather than a desktop app, and have it automatically check a few times a day (for redundancy, both because agencies can update their databases at different times and because sometimes there are network glitches). Be aware that this is very likely something that will require ongoing maintenance and support, not a write-once-and-forget thing, because databases and websites change over time, and some agencies may also deliberately block automated traffic or institute protections like squiggly CAPTCHAs (which will require workarounds).
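
To sketch the scheduling/redundancy part (again with made-up file names and column headings, and assuming a check_license() lookup like the one above): a cron or serverless timer would just call run_batch() a few times a day, and the retry wrapper rides out the occasional network glitch.

```python
# Rough sketch only: file names, column names, and "good" status values are
# placeholders; check_license is assumed to be a lookup like the earlier sketch.
import csv
import time

def check_with_retries(ubi, check_license, attempts=3, delay=60):
    """Retry a lookup a few times before giving up, to ride out network glitches."""
    for attempt in range(attempts):
        try:
            return check_license(ubi)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

def run_batch(check_license, infile="client_accounts.csv", outfile="invalid_licenses.csv"):
    """Read the client's UBI list, look each one up, and write any non-active
    licenses to the output file that gets emailed to the client."""
    with open(infile, newline="") as f, open(outfile, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ubi", "business_name", "status"])
        for row in csv.DictReader(f):                    # assumes a "ubi" column
            result = check_with_retries(row["ubi"], check_license)
            if result["status"].lower() not in ("active", "open"):
                writer.writerow([result["ubi"], result["business_name"], result["status"]])
```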

If the dev writes a good program that can do these things, they can probably either make some money selling it on the side, or maybe open-source it for community use. You might also ask them to check for potential existing open-source scrapers for any given site.

Edit: The difference between a “good” and a “bad” scraper in a situation like this is its resilience and reliability: whether it can detect on its own when a lookup fails and alert its developer that something needs to change. Of course the developer also has to be responsive and able to fix those situations quickly as they arise. If you write one yourself and it breaks, most of the monitoring services will alert you somehow, but then you have to go in there and fix it yourself. How often that’s needed depends on how often the site itself changes.
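
As a tiny illustration of what that self-detection might look like (the addresses and mail server below are placeholders): the scraper sanity-checks what it pulled off the page and emails the maintainer when the page no longer matches expectations.

```python
# Sketch of the self-check idea; the email addresses and SMTP server are placeholders.
import smtplib
from email.message import EmailMessage

def looks_valid(result: dict) -> bool:
    """Basic sanity check: did the scrape actually return the fields we expect?"""
    return bool(result.get("business_name")) and bool(result.get("status"))

def alert_maintainer(ubi: str, detail: str) -> None:
    """Email the developer so a broken lookup/selector gets fixed quickly."""
    msg = EmailMessage()
    msg["Subject"] = f"License scraper failed for UBI {ubi}"
    msg["From"] = "scraper@example.com"       # placeholder address
    msg["To"] = "developer@example.com"       # placeholder address
    msg.set_content(detail)
    with smtplib.SMTP("localhost") as smtp:   # placeholder mail server
        smtp.send_message(msg)
```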

There are many AI-assisted scraping tools these days, but because your needs are specific to certain databases and precision/correctness are important for business continuity, I would stay away from them; the older, manual scrapers are fine for something like this and would be more consistently reliable.

UBI numbers? Universal Business Identifier? Am I close?

If I wanted to check a business license, I would go to the SoS website and enter the company name. I actually do this quite a lot; not to check the status of a license, but to download a copy of the filing as supporting evidence when I want a Business Credit Report corrected. Typically, there is a list of companies to choose from. I look to see which company looks like the one I’m looking for (much easier if there’s only one return :wink: ) and look at the filing status. Then I can pull the filing and save it.

The tool we need would have to find the correct company, retrieve the filing status, dates, and reasons from the filing, and write the results.

Some states do that. Here in Washington the Secretary of State site doesn’t, but the Dept of Revenue site does (‘Click all boxes with motorcycles’ or whatever).

I don’ wanna. But I would if I need to. Just have someone else set it up.

You’ve posted a lot of good information. I’ll pass it on to the former president. As I said, he might have a couple of leads of his own.

Sure, that’s entirely possible. If it’s algorithmic (i.e. a dumb computer program could say “of these 5 choices, find the one closest to this address” or “matching this zip code” or whatever), it’s easy.
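
For instance (toy example, the candidate records and field names are made up): a simple rule like “same ZIP first, then closest name” is only a few lines.

```python
# Toy illustration of the "pick the closest match" rule; candidates and field
# names are invented for the example.
from difflib import SequenceMatcher

def pick_best_match(target_name, target_zip, candidates):
    """Prefer candidates in the same ZIP, then the most similar business name."""
    def score(c):
        zip_match = 1 if c["zip"] == target_zip else 0
        name_sim = SequenceMatcher(None, target_name.lower(), c["name"].lower()).ratio()
        return (zip_match, name_sim)
    return max(candidates, key=score)

candidates = [
    {"name": "Acme Plumbing LLC", "zip": "98101"},
    {"name": "Acme Plumbing & Heating", "zip": "98052"},
]
print(pick_best_match("Acme Plumbing", "98101", candidates))  # picks the 98101 record
```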

Even if it’s not that straightforward, if you can guide it with a set of rules, a modern LLM (the AI stuff, which any web scraper developer can also utilize via standard APIs or frameworks) can probably look up the corresponding businesses for you and take a guess at which business is likely to be correct. It won’t always be right, so it’s probably a good idea to have your software also return the other likely possibilities just in case.

Long story short, just give the developer the verification steps you would normally do as a human, and ask them to do something similar. If you can look up by UBI or some other unique identifier, of course that’s even better than a generic name.

That’s not necessarily a deal-breaker. There are relatively cheap paid services that send a screenshot/recording of the CAPTCHA to someone in a poorer country and they’ll solve it in a few seconds for a few pennies. Just talk to the developer about this and have them factor it into your monthly spending budget.

There might also be, eh, “grey-market” AIs that can solve them (sometimes better than real people!). It’s an ongoing arms race.

[Side note, not really relevant to the OP, but just interesting to bring up…]

The feds, at least, are trying to make public data more easily accessible to the (coding) public: e.g. https://data.gov/ and https://resources.data.gov/ (which is itself open-source: https://github.com/GSA/resources.data.gov).

It’s a slow process though, and I think one party cares about it more than the other (and even then, not by a whole lot). Not a lot of devs go into government work to make these things easier to use, because government has a lot more bureaucracy and usually pays less than the private sector.

There are also efforts like https://codeforamerica.org/ and various “civic hacking” groups that are trying to drag governments and communities into the modern world.

Usually state/local governments are resource-starved on the IT side and don’t have the capacity to build good APIs of their own.

It’s a situation I’d love to see change in the near future, either with more millennials & gen-Z/As going into government work, or maybe with AI assistance.

We made a lot of our GIS data public because it was a PITA to provide data sets to those who requested them. THAT took time. Now admittedly the site took a little time to set up, but for the most part we don’t have to fiddle with it at all. It saves us time.