I could run Max’s WindowsOS shell command in a virtual machine running Windows 10 easily enough, although a MacOS command line equivalent would be more convenient for me as a Mac-centric person… still, one cannot have everything!
I don’t have a device to test this on, but you can try this (note the backslashes before the ampersands):
open -a "Google Chrome" --args --enable-logging --headless=new --dump-dom --virtual-time-budget=10000 https://a030-goat.nyc.gov/goat/Function1B?borough=1\&street=W%20180th%20Street\&address=500 > dom.txt
~Max
Thanks Max! It runs, it deposits dom.txt at the root of my user folder, but dom.txt is an empty text file.
Max_S’s Windows command worked when I tried it on a Windows 10 virtual machine, so if I encounter this kind of situation in the future I could write a FileMaker script that executes an instruction to cmd, calculating the string akin to Max_S’s example and then imports from dom.txt and parses the result.
Still would like a MacOS equivalent…
I’ve been reading up on curl and it seems like if I understood the syntax better, it should be able to spit out a text file that contains the text that would be in a web browser’s window for the specified URL.
based on
I’d say
alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"
then
chrome --enable-logging --headless (rest of line)
But that is a guess
Brian
Yes, but as others have said, that is not the information you want. The information simply isn’t there until the dynamic elements (JavaScript, in particular) have done their thing.
Max_S’s solution is a neat one, as it dumps the document after it has gone through this step. Only a proper browser is capable of this. Curl does not contain a JavaScript engine or any other HTML processing and so is incapable of it.
Aah, thanks for the clarification, Dr.Strangelove!
I will continue to play with headless-chrome command-line variants. I’ve got the Windows version working.
So…
alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"
then
chrome --enable-logging --headless=new --dump-dom --virtual-time-budget=10000 https://a030-goat.nyc.gov/goat/Function1B?borough=1\&street=W%20180th%20Street\&address=500 > dom.txt
??
or then
chrome --enable-logging --headless ==
new --dump-dom --virtual-time-budget=10000 https://a030-goat.nyc.gov/goat/Function1B?borough=1\&street=W%20180th%20Street\&address=500 > dom.txt
??
Have you considered asking ChatGPT for help? It will even write code for you, but it’s really good at explaining how to do coding tasks, or looking over your code and telling you what’s wrong with it. It will definitely write web page scrapers for you if you can provide enough detail. My son had ChatGPT build a React website today without having to do any coding.
Sure thing. At a very high level, the steps the browser performs are:
- Fetches the text representation of a page from the server.
- Converts the text representation to the “DOM”, or document object model, which allows more convenient internal processing.
- Runs JavaScript elements, which may update the DOM, including with information pulled from additional fetches from the server.
- Renders the DOM into an image for display.
Curl only performs step 1, and so is useless here. Chrome and other browsers perform all steps, but generally just produce an image. Max_S’s solution replaces step 4 with:
4b) Convert the DOM back to a text representation and save to disk
That text representation usually does not even exist, since there is no need for the browser to have it. But it is useful for debugging and so Chrome supports conversion.
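Step 4b can be sketched as a small bash function that calls the Chrome binary directly instead of going through `open`. As far as I know, `open` hands the launch off to the system and does not pass the app's standard output back to the shell, which would explain an empty dom.txt. The function name `dump_dom` and the `CHROME_BIN` override variable are made up for illustration; the flags are the ones already used in this thread. I have not tested this against the site.

```bash
#!/bin/bash
# Path to Chrome's executable inside the app bundle. CHROME_BIN is a
# hypothetical override so the function can be pointed at a different build.
CHROME_BIN="${CHROME_BIN:-/Applications/Google Chrome.app/Contents/MacOS/Google Chrome}"

# dump_dom URL OUTFILE
# Loads URL headlessly with the flags from this thread and redirects
# whatever the browser prints on standard output into OUTFILE.
dump_dom() {
  local url="$1" outfile="$2"
  "$CHROME_BIN" --headless=new --dump-dom --virtual-time-budget=10000 "$url" > "$outfile"
}

# Example (single-quoting the URL means the ampersands need no backslashes):
# dump_dom 'https://a030-goat.nyc.gov/goat/Function1B?borough=1&street=W%20180th%20Street&address=500' dom.txt
```

Quoting `"$CHROME_BIN"` keeps the path with its spaces as one word, and quoting `"$url"` stops the shell from treating `&` as a command separator.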
If what Max says doesn’t work, I’d try it without the backslashes for the alias. I am unsure if both quotes and backslashes are necessary.
Quotes or backslashes, but not both (unless you want the actual character \, but you don’t).
All alias is doing is telling the shell (probably zsh on a Mac) that when it sees the command chrome it should really run the program at
/Applications/Google Chrome.app/Contents/MacOS/Google Chrome
The problem is that the shell sees a space as a separator, so you need either quotes so it sees the entire string as a single unit, or the backslash to escape the space, so it is not seen as a separator.
So alias is just a timesaver; you can always type the entire /Applications/... path to the command to run it.
(Caveat on all of this: I have far more experience with bash than zsh, so if zsh is so messed up as to require quotes and an escape to handle a space, then give up and switch shells.)
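To see the quoting rules in action without touching Chrome, here is a small bash sketch. The directory `Fake App` and the script `fake tool` are made-up stand-ins for a path with spaces, not anything from this thread. It shows quotes alone and backslashes alone both working when the path is typed directly. One thing worth verifying on your own machine: an alias body is pasted in as text and then parsed a second time, so a space inside it still has to be protected after the outer quotes are gone, which may be why the alias earlier in the thread was written with both. A shell function sidesteps that question.

```bash
#!/bin/bash
# Hypothetical stand-in for the Chrome binary: a tiny executable whose
# path contains spaces, created in a temporary directory.
demo_root="$(mktemp -d)"
demo_dir="$demo_root/Fake App"
mkdir -p "$demo_dir"
printf '#!/bin/sh\necho "ran with $# args: $*"\n' > "$demo_dir/fake tool"
chmod +x "$demo_dir/fake tool"

# Quotes alone: the whole path is one word.
quoted_out="$("$demo_dir/fake tool" one two)"

# Backslashes alone: each space is escaped individually.
backslash_out="$(cd "$demo_root" && ./Fake\ App/fake\ tool one two)"

# A function is parsed once, so one layer of quoting is enough,
# and "$@" forwards the caller's arguments unchanged.
faketool() { "$demo_dir/fake tool" "$@"; }
func_out="$(faketool one two)"

# An unquoted variable is split at the space in "Fake App", so the
# shell looks for a program called ".../Fake" and fails.
if $demo_dir/fake\ tool one two 2>/dev/null; then split_ok=yes; else split_ok=no; fi

echo "$quoted_out"
echo "$backslash_out"
echo "$func_out"
echo "unquoted variable worked: $split_ok"
```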
If I were doing this, I would probably have a shell script something like
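#!/bin/bash
# Path to the Chrome binary inside the app bundle (it contains spaces, so quote it when used)
CHROMEEXEC="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
# An array keeps each flag a separate word; --args is an option of `open` and is not needed when calling the binary directly
CHROMEARGS=(--enable-logging --headless=new --dump-dom --virtual-time-budget=10000)
URL="https://a030-goat.nyc.gov/goat/Function1B?"
ADDRESS="borough=1&street=W%20180th%20Street&address=500"
# Quote everything so spaces and ampersands are not interpreted by the shell
"$CHROMEEXEC" "${CHROMEARGS[@]}" "${URL}${ADDRESS}"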
Then once it was working for manually entered addresses in the URL and ADDRESS lines, make the script a bit more complicated to construct formatted addresses into the URL. Of course, switch to whatever other URL and appropriate input munging for the next problem.
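The address munging could start from something like this bash sketch. The function names `urlencode` and `build_goat_url` are made up for illustration, and the parameter names `borough`, `street` and `address` are simply copied from the example URL in this thread; I have not checked them against any documentation for the site. The encoder is only meant for plain ASCII input.

```bash
#!/bin/bash
# Percent-encode a string for use as a URL query value.
# Letters, digits and - _ . ~ pass through; everything else becomes %XX.
# Sketch for ASCII input only.
urlencode() {
  local LC_ALL=C   # make the a-z / A-Z ranges mean plain ASCII
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9._~-]) out+="$c" ;;
      *) out+="$(printf '%%%02X' "'$c")" ;;   # "'c" gives printf the character code
    esac
  done
  printf '%s' "$out"
}

# build_goat_url BOROUGH STREET HOUSE_NUMBER
# Assembles the lookup URL in the same shape as the example in this thread.
build_goat_url() {
  local base="https://a030-goat.nyc.gov/goat/Function1B"
  printf '%s?borough=%s&street=%s&address=%s\n' \
    "$base" "$(urlencode "$1")" "$(urlencode "$2")" "$(urlencode "$3")"
}

build_goat_url 1 "W 180th Street" 500
# prints https://a030-goat.nyc.gov/goat/Function1B?borough=1&street=W%20180th%20Street&address=500
```

The resulting string can then be passed, in quotes, as the last argument to the Chrome command.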
(I couldn’t find my code in a brief search, but last time I needed to convert addresses to a lat/long I used Google’s map API from R, and I think it was trivial, because there were adequate coding examples online. I think it took longer to get signed up for Google map API access than it did to actually write and run the code. The first few thousand lookups are free, but there is a cost after that, so it is fine for limited-scope jobs.)