Text strings ON a web page but NOT in its source code. Getting the former

I occasionally have a need to parse data out of a web page.

Usually, when this has occurred, the same text that one sees in one’s web browser window exists somewhere in the source code, and I’m quite skilled at text parsing, so the fact that the snippet I need may be to the right of the 183rd occurrence of angle bracket + “p” + angle bracket, or whatever, isn’t a problem.

But I lately found myself confronted with a web page that obligingly conjures the answers I want and need for a database, but those answers are only in the human-discernible web browser window. They aren’t in the source code.

Here’s the web site:

… and here’s a variant on that URL that conveniently encodes the street address data I’ve already got, so that opening this URL has the answers right there on the screen (the x,y coordinates and the longitude and latitude coordinates):

My usual mechanism for such things is FileMaker Pro: it lets one define a “web viewer” which fetches a calculated URL that can reference FileMaker field values, and then you can scrape the results using a FileMaker function GetLayoutObjectAttribute ( “Web” ; “content” ) where “Web” is the object name of the web viewer. It returns source code which can then be parsed using standard text functions (those of you unfamiliar with FileMaker will probably know of similar text functions from Excel or Access or whatever).

I only had a few hundred addresses I needed it for, so I managed with a mildly klunky workaround that isn’t terribly relevant to this post:

I ran a script that opened the calculated URL in a regular web browser, then assigned a QuicKeys macro to the consecutive keystrokes Command-A for select all and Command-C for copy, then returned focus to FileMaker to paste from the clipboard into a global text field I could then parse

… but if I’d had a much larger array I would really really want a routine that could run unattended.

How does one trap for the contents (as opposed to the source code) of a web page? Anything more elegant than copy and paste?

I’m not sure I understand the problem you’re running into. The x and y coordinates are in the source code for the rendered page:

<p id="out_xy_coord_graphic">1003169, 0248057</p>

~Max

How are you viewing the HTML code? And in which browser?

“View source” will give you the HTML originally sent to the browser, but that is often altered on the fly by javascript.

The keyboard shortcuts Option + ⌘ + J (on macOS), or Shift + CTRL + J (on Windows/Linux) should bring up the developer toolbox, which can show you the source code (really, the DOM, but let’s not go there) in real time as it is altered by javascript.

The actual x y coordinate data is added to the DOM by javascript functions, so if you can’t use something that first renders javascript like Selenium, for such a small set you could try automating the API call to https://a030-goat.nyc.gov/Goat/Function1B/GetResponse with something like curl.

~Max

Using Brave, View Menu, Developer, View Source.

Here is what I get. The coordinates aren’t in it.

That’s also what I get if I do a plain old “Save As” from the file menu and save it as “Webpage, HTML only”

That’s also what I get in FileMaker when I grab GetLayoutObjectAttribute ( “Web” ; “content” ) and dump the outcome of that into a global text field.

I need a non-manual process of getting at the coordinates.

This would be nice if I needed the source code for a single rendered page. Doesn’t seem terribly helpful if I needed the source code for 50,000 consecutive addresses’ worth of pages.

I’m willing to see if I can learn what that sentence means.

If you need a vast amount of data, then it is time to automate the process.

Personally I’d write a python script, but that’s just my preference. You can do this in a variety of languages, or spin up a basic app in any language.

Perhaps the answer might be helped by these questions:

What do you want to achieve?
What measurements mean the procedure was successful?

ETA, also @Max_S reply below.

A normal web browser which executes javascript will send an HTTP POST to that endpoint; the response is a JSON-formatted text string. In this case,

"{\"display\":{\"AddressRangeKeys\":[{\"Key\":\"\",\"Value\":\"Ordinary Address Range\"}],\"AddressRangeList\":[{\"b7sc\":\"13735001\",\"bin\":\"1063602\",\"high_address_number\":\"             500\",\"low_address_number\":\"             500\",\"street_name\":\"WEST  180 STREET                \",\"tpad_bin_status\":\"No activity\",\"type\":\" \",\"type_meaning\":\"Ordinary Address Range\"},{\"b7sc\":\"11171001\",\"bin\":\"1063602\",\"high_address_number\":\"            2416\",\"low_address_number\":\"            2416\",\"street_name\":\"AMSTERDAM AVENUE                \",\"tpad_bin_status\":\"No activity\",\"type\":\" \",\"type_meaning\":\"Ordinary Address Range\"}],\"CompleteBINList\":[{\"bin\":\"1063602\",\"tpad\":\"No activity\"}],\"HighB7SCList\":[{\"b7sc\":\"11201001\",\"streetName\":\"AUDUBON AVENUE                  \"}],\"LowB7SCList\":[{\"b7sc\":\"11171001\",\"streetName\":\"AMSTERDAM AVENUE                \"}],\"SimilarNamesList\":[],\"in_boro\":\"1\",\"in_browse_flag\":\" \",\"in_func_code\":\"1B\",\"in_hnd\":\"500             \",\"in_hns\":\"           \",\"in_roadbed_request_switch\":\" \",\"in_stname1\":\"W 180TH STREET                  \",\"in_tpad_switch\":\"Y\",\"in_unit\":\"\",\"in_zip_code\":\"     \",\"out_No_Parking_lanes\":\" 2\",\"out_No_Total_Lanes\":\" 3\",\"out_No_Traveling_lanes\":\" 1\",\"out_TPAD_bin\":\"       \",\"out_TPAD_bin_status\":\"No activity\",\"out_TPAD_conflict_flag\":\"1\",\"out_ad\":\"72\",\"out_alx\":\"No Split\\/Change\",\"out_atomic_polygon\":\"101\",\"out_b10sc1\":\"13735001010\",\"out_bbl\":\"1021520046\",\"out_bbl_block\":\"2152\",\"out_bbl_lot\":\"46\",\"out_bid\":\"\",\"out_bike_lane\":\" 2\",\"out_bike_traffic_direction\":\"One-way against\",\"out_bin\":\"1063602\",\"out_bin_status\":\"No activity\",\"out_blockface_id\":\"1322603485\",\"out_boe_lgc_pointer\":\"1\",\"out_boe_preferred_b7sc\":\"13735001 \\/ WEST  180 STREET                \",\"out_boro_name1\":\"MANHATTAN\",\"out_cd\":\"13\",\"out_cd_eligible\":\"CD 
Eligible\",\"out_cdta_2020\":\"MN12\",\"out_census_block_2000\":\"1000\",\"out_census_block_2010\":\"4000\",\"out_census_block_2020\":\"4000\",\"out_census_block_suffix_2000\":\" \",\"out_census_block_suffix_2010\":\" \",\"out_census_block_suffix_2020\":\" \",\"out_census_tract_1990\":\" 261  \",\"out_census_tract_2000\":\" 261  \",\"out_census_tract_2010\":\" 261  \",\"out_census_tract_2020\":\" 261  \",\"out_co\":\"10\",\"out_coincident_seg_cnt\":\"1\",\"out_com_dist\":\"112\",\"out_condo_base_bbl\":\"N\\/A\",\"out_condo_bill_scc\":\" \",\"out_condo_billing_bbl\":\"N\\/A\",\"out_condo_flag\":\"Non-Condo\",\"out_condo_num\":\"N\\/A\",\"out_coop_num\":\"N\\/A\",\"out_corner_code\":\"SW\",\"out_curve_flag\":\"None\",\"out_dcp_zoning_map\":\"3B \",\"out_dot_st_light_contract_area\":\"1\",\"out_dsny_snow_priority\":\"C\",\"out_dsny_snow_priority_str\":null,\"out_ed\":\"045\",\"out_error_message\":\"                                                                                \",\"out_error_message2\":\"                                                                                \",\"out_fdny_id\":\"       \",\"out_feature_type\":\"Street\",\"out_fire_bat\":\"13\",\"out_fire_co\":\"Ladder 45\",\"out_fire_co_str\":\"Ladder 45\",\"out_fire_div\":\"7\",\"out_from_additional_lgcs1\":\"  \",\"out_from_additional_lgcs2\":\"  \",\"out_from_additional_lgcs3\":\"  \",\"out_from_additional_lgcs4\":\"  \",\"out_from_additional_lgcs5\":\"  \",\"out_from_dcp_preferred_lgcs1\":\"01\",\"out_from_dcp_preferred_lgcs2\":\"  \",\"out_from_dcp_preferred_lgcs3\":\"  \",\"out_from_dcp_preferred_lgcs4\":\"  \",\"out_from_dcp_preferred_lgcs5\":\"  \",\"out_from_node\":\"0043728\",\"out_generic_id\":\"0023098\",\"out_grc\":\"00\",\"out_grc2\":\"00\",\"out_health_area\":\"04.00\",\"out_health_center_dist\":\"17\",\"out_hi_hns\":\"528             \",\"out_hi_x_coord\":\"1002828\",\"out_hi_y_coord\":\"0248248\",\"out_high_bbl_condo\":\"N\\/A\",\"out_hnd\":\"500             
\",\"out_hurricane_zone\":\" X\",\"out_individual_segment_length\":\"00460\",\"out_interior_flag\":\"Not Interior Lot\",\"out_irreg_flag\":\"Not Irregular Lot\",\"out_lat_property\":\"40.847457\",\"out_latitude\":\"40.847514\",\"out_lion_key_face_code\":\"5900\",\"out_lion_key_sequence_number\":\"00010\",\"out_lo_hns\":\"500             \",\"out_lo_x_coord\":\"1003231\",\"out_lo_y_coord\":\"0248026\",\"out_lon_property\":\"-73.931781 \",\"out_longitude\":\"-73.931618 \",\"out_low_bbl_condo\":\"N\\/A\",\"out_mc\":\"7\",\"out_no_cross_street_calculation_flag\":\" \",\"out_nta\":\"MN36 \\/                                                                            \",\"out_nta_2020\":\"MN1201\",\"out_num_of_bldgs\":\"1\",\"out_num_of_blockfaces\":\"2\",\"out_nypd_id\":\"       \",\"out_physical_id\":\"0026975\",\"out_police_area\":\" \",\"out_police_patrol_boro\":\"Manhattan North\",\"out_police_pct\":\"34\",\"out_police_sector\":\" 34A\",\"out_preferred_lgc\":\"13735001\",\"out_preferred_street_name\":\"WEST  180 STREET                \",\"out_puma_2020\":\"04112\",\"out_puma_code\":\"03801\",\"out_reason_code\":\" \",\"out_reason_code2\":\" \",\"out_right_of_way_type\":\" \",\"out_roadway_type\":\"Street\",\"out_rpad_bldg_class\":\"O5\",\"out_rpad_scc\":\"0\",\"out_san_bulk\":\"EMWF \",\"out_san_commercial_waste_zone\":\" MN7\",\"out_san_dist_section\":\"112 \\/ 122\",\"out_san_org_pick_up\":\"     \",\"out_san_recycle\":\"EW \",\"out_san_reg\":\"MWF  \",\"out_san_sched\":\"2A\",\"out_sanborn_boro\":\"1\",\"out_sanborn_page\":\"019 \",\"out_sanborn_volume\":\"12 \",\"out_school_dist\":\"6\",\"out_sd\":\"31\",\"out_segment_azm\":\"151\",\"out_segment_id\":\"0071345\",\"out_segment_len\":\"460\",\"out_segment_orientation\":\"W\",\"out_segment_type\":\"Undivided\",\"out_sos_ind\":\"Address is on the left when facing from AMSTERDAM AVENUE to AUDUBON AVENUE\",\"out_spec_addr_flag\":\" \",\"out_speed_limit\":\"25\",\"out_stname1\":\"WEST  180 STREET                
\",\"out_street_width_irregular\":\" \",\"out_street_width_max\":\" 30\",\"out_street_width_min\":\" 30\",\"out_stroll_key\":\"                   \",\"out_tax_map\":\"1\",\"out_tax_section\":\"08\",\"out_tax_volume\":\"03\",\"out_to_additional_lgcs1\":\"  \",\"out_to_additional_lgcs2\":\"  \",\"out_to_additional_lgcs3\":\"  \",\"out_to_additional_lgcs4\":\"  \",\"out_to_additional_lgcs5\":\"  \",\"out_to_dcp_preferred_lgcs1\":\"01\",\"out_to_dcp_preferred_lgcs2\":\"  \",\"out_to_dcp_preferred_lgcs3\":\"  \",\"out_to_dcp_preferred_lgcs4\":\"  \",\"out_to_dcp_preferred_lgcs5\":\"  \",\"out_to_node\":\"0043602\",\"out_traffic_dir\":\"A\",\"out_truck_route_type\":\" \",\"out_unit\":\"              \",\"out_usps_city_name\":\"NEW YORK                 \",\"out_vacant_flag\":\"Not Vacant\",\"out_valid_lgc_1\":\"01\",\"out_valid_lgc_2\":\"  \",\"out_valid_lgc_3\":\"  \",\"out_valid_lgc_4\":\"  \",\"out_vanity_sos\":\"L\",\"out_wa1_message\":\"                                                                                \",\"out_x_coord\":\"1003169\",\"out_x_coord_property\":\"1003124\",\"out_y_coord\":\"0248057\",\"out_y_coord_property\":\"0248036\",\"out_zip_code\":\"10033\"},\"root\":null}"

The browser then executes more javascript which adds this information to the DOM.
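To make the "parse the JSON" step concrete, here is a minimal Python sketch. It assumes the response body looks the way it is quoted above: a JSON string whose content is itself JSON, so it may need decoding twice. The sample body is a heavily shortened stand-in for the real response, and the function name is made up for illustration.

```python
import json

# Shortened stand-in for the response quoted above (assumption: the real body
# has the same shape, i.e. a quoted JSON string containing escaped JSON).
raw_body = r'"{\"display\":{\"out_x_coord\":\"1003169\",\"out_y_coord\":\"0248057\",\"out_latitude\":\"40.847514\",\"out_longitude\":\"-73.931618 \"},\"root\":null}"'


def extract_coordinates(body: str) -> dict:
    """Return the x/y and latitude/longitude fields from a response body."""
    decoded = json.loads(body)
    # If the first pass only unwrapped a string, decode that string as well.
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    display = decoded["display"]
    # Values are kept as text (the y coordinate has a leading zero) and the
    # padding spaces some fields carry are stripped.
    return {
        "x": display["out_x_coord"].strip(),
        "y": display["out_y_coord"].strip(),
        "latitude": display["out_latitude"].strip(),
        "longitude": display["out_longitude"].strip(),
    }


if __name__ == "__main__":
    print(extract_coordinates(raw_body))
```

The field names (out_x_coord, out_y_coord, out_latitude, out_longitude) are taken from the response above; everything else is illustrative.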

curl is a simple program that sends and receives HTTP messages (an alternative is wget). Most browser developer tools have a network tab where you can examine network calls and copy the relevant information for usage with programs like curl or wget. These are the arguments I used to obtain the above JSON response,

curl "https://a030-goat.nyc.gov/Goat/Function1B/GetResponse" -X POST -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "X-Requested-With: XMLHttpRequest" --data-raw "ButtonType=Submit&Borough=1&AddressNo=500&StreetName=w+180th+street&Unit=&RoadBedBool=false&TPADBool=true&TPADBool=false&BrowseFlag=&X-Requested-With=XMLHttpRequest"

Always be considerate when hijacking a website’s API - they are meant for browser use, not scripts. You don’t want to look like you’re attacking the server. So put a delay between each call if you go that way.
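For anyone who would rather script this than type curl commands, the same request can be sketched in Python using only the standard library. The form fields and headers are copied from the curl command above; the function names and the two-second default delay are my own assumptions, and I have not checked what request rate the server tolerates.

```python
import time
import urllib.parse
import urllib.request

ENDPOINT = "https://a030-goat.nyc.gov/Goat/Function1B/GetResponse"


def build_form_body(borough: str, address_no: str, street_name: str) -> bytes:
    """Encode the same form fields the curl command sends."""
    fields = [
        ("ButtonType", "Submit"),
        ("Borough", borough),
        ("AddressNo", address_no),
        ("StreetName", street_name),
        ("Unit", ""),
        ("RoadBedBool", "false"),
        ("TPADBool", "true"),   # the browser sends this key twice,
        ("TPADBool", "false"),  # so the sketch does the same
        ("BrowseFlag", ""),
        ("X-Requested-With", "XMLHttpRequest"),
    ]
    return urllib.parse.urlencode(fields).encode("ascii")


def fetch_response(borough: str, address_no: str, street_name: str) -> str:
    """POST one address to the endpoint and return the raw response text."""
    request = urllib.request.Request(
        ENDPOINT,
        data=build_form_body(borough, address_no, street_name),
        headers={
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "X-Requested-With": "XMLHttpRequest",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def fetch_all(addresses, delay_seconds: float = 2.0) -> list:
    """Fetch several (borough, number, street) tuples, pausing between calls."""
    results = []
    for index, (borough, address_no, street_name) in enumerate(addresses):
        if index > 0:
            time.sleep(delay_seconds)  # be polite to the server
        results.append(fetch_response(borough, address_no, street_name))
    return results
```

A call such as fetch_all([("1", "500", "w 180th street")]) would send the same request as the curl command; each returned string can then be parsed as JSON.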

~Max

At this point it’s an academic exercise. I’ve already fetched the data I actually need for the 400-some-odd records that didn’t have coordinates info.

I’m trying to add to my tool belt the ability to snag the internal contents of a web page, as distinguished from its source code.

I probably don’t want to learn an entirely new-to-me coding language in order to do so. A FileMaker plug-in would be entirely appropriate, for example.

I was probably a little aggressive there. I am a software dev, and these are the kind of questions I am constantly facing, both from our QA teams and our management.

Web pages are notoriously difficult to parse as text. HTML has been through numerous variations, many of which are insanely difficult to parse. Don’t do this.

As @Max_S suggests, call the API. You should get a nicely formatted JSON response that you can then parse.

I have no experience with FileMaker, but it seems likely that instead of the web page, you can ask FileMaker to request and retrieve data from the API URL, and then parse the JSON to get your expected result.

Parsing JSON is way, way easier than parsing HTML.

nm

~Max

I’m willing to learn what those sentences mean, but I currently don’t. I know what an API is, but the phrase “call the API” in this context is less than clear. I’ve heard of JSON but never met him. What is it you’re suggesting that I do?

ETA: by “the API” are you referring to this, which is in the source code?

https://api.tiles.mapbox.com/mapbox-gl-js/v0.42.2/mapbox-gl.js

API stands for Application Programming Interface, and it’s a mechanism for one program to access another through a web interface. You make an API ‘call’ to a specified address (endpoint), passing some data if necessary, and the other side typically responds with a formatted result (usually JSON).

So for example I could have a website called “Multiply”. The public facing url might have a page where you enter two numbers into text boxes, and the answer appears in a third. But I might also expose a web service endpoint, or API. Then if I have a program that needs to multiply two numbers, I could call http://multiply.com/myAPI, passing two numbers, and the response will be a string containing the answer formatted for easy parsing by computer rather than for viewing by a person.
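Here is a toy version of that "Multiply" service in Python, in case seeing both halves helps. Everything in it is hypothetical: the /myAPI path, the a and b parameter names, and the "product" field are invented for illustration. The server runs on your own machine on a free port, and the client "calls the API" and reads the answer out of the JSON reply.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


class MultiplyHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint: GET /myAPI?a=6&b=7 returns {"product": 42}."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        a = int(query["a"][0])
        b = int(query["b"][0])
        body = json.dumps({"a": a, "b": b, "product": a * b}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def call_multiply_api(host: str, port: int, a: int, b: int) -> int:
    """The 'API call': send two numbers, parse the JSON answer."""
    connection = http.client.HTTPConnection(host, port, timeout=10)
    try:
        connection.request("GET", f"/myAPI?a={a}&b={b}")
        payload = json.loads(connection.getresponse().read().decode("utf-8"))
    finally:
        connection.close()
    return payload["product"]


def demo() -> int:
    """Start the toy server locally, call it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MultiplyHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return call_multiply_api("127.0.0.1", server.server_address[1], 6, 7)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The point is only that the client never looks at a web page: it asks a question at an address and gets back data meant for a program.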

This is obviously a very simple example. But I think all of this might be beyond your level if you haven’t done any programming at all.

It’s very common for web sites to not have all the data on them hardcoded in source code. Most modern sites are a mix of markup source and a number of service calls to retrieve actual data to display.

I looked at their FAQ and I don’t believe there is an API.

And, googling, I found the Geosupport Desktop Edition here.

Maybe that will help.

You’re trying too hard to solve a specific problem that I don’t have any more. Forget the Goat web site for now. Consider them no more than Exhibit A, an example. What I’m asking for is a way to trap for the text that the human end user can see (and copy) from their web browser window. Not a solution to getting the geocoordinates from the Goat web site.

This is the crux of the problem: There are dozens of idiosyncratic ways in which modern dynamic websites work. Finding the human-displayed text embedded in the HTML was how it was done 20+ years ago. Finding the human-displayed text embedded in the javascript was how it was done 10+ years ago.

Neither of those are used much anymore for the kinds of websites that have significant user interaction on the page. If you want to scrape output from modern dynamic websites you will either need to learn how to use modern tools like web APIs and JSON queries or you will fail. There really isn’t a third answer.

What the folks upthread have tried to show you is how they used some standard tools to suss out how the GOAT website API worked, a website they had never seen before. Then once they’d figured out how to ask it questions in the manner it needed, they then used other standard tools to ask it specific questions and get back specific answers. Which process could then be automated to process a large scale of data using yet more standard scripting tools.

If you had and could operate those same tools, you could in principle use the exact same approach on almost any dynamic website and with enough perseverance and cleverness, get it to spit out the answers you want in a format you can use. For ten, ten thousand, or ten million queries.

Unfortunately, even those baby steps require some knowledge of programming and modern tool use. As well as knowledge about how websites are commonly built, since you’re trying to sneak under its skirts to take pix of its undies metaphorically speaking.

If all of the earlier posts went over your head you have a bunch of learning to do.

Or if all you can use or learn to use is an upgraded version of a 1980s design all-in-one database / reporting engine, you’re probably left behind. You’re not going to find a “plug-in” that is a point-and-click master key to access any, much less every, website’s idiosyncratic dynamic API.

I wish you well. I truly do. I left IT 13 years ago now and I am appalled at what an utter dinosaur I have become, despite being pretty much on the bleeding edge at that time. Constant learning and constant churn is the bare price of admission to the IT biz. Folks can noodle around the edges for quite awhile using old tools and old paradigms. But soon enough they turn to dust.

Well, the simplistic workaround was to automate the process of selecting the browser application after telling the OS to open the URL, doing a Select All, doing a Copy, then returning focus to the database to paste into a global text field.

That really doesn’t sound like something that a clever plug-in designer couldn’t turn into an extension of what FileMaker does natively, does it?

It would be non-trivial as the plugin you are imagining would require code to match all the core functions of a web browser. A lot of steps could potentially go between loading the initial HTML source code and the final text rendered on the user’s screen. There could be one, ten, or a hundred network communications to fetch and process additional files (fonts, stylesheets, javascript source files, bitmaps, JSON formatted text files, etc.)

JSON stands for JavaScript Object Notation and it is a file format, like .html. DOM stands for Document Object Model and is probably what you are thinking of as the “source code” of a rendered webpage - it’s not necessarily the same as the source code in the HTML file.

What you are asking for is beyond the traditional scope of a database application. I don’t envy you if you are forced to use some database client for text parsing. If I understand you correctly this is your existing workflow:

FileMaker database: GetLayoutObjectAttribute(“Web”;“content”) → FileMaker string functions

You might want to find a way to change to this kind of workflow

FileMaker database: external command, then import text file as string → FileMaker string functions

where the external command is a system shell command which:

  • runs a browser in headless mode
  • instructs the browser to fetch and render a given webpage
  • instructs the browser to export the serialized DOM structure to a text file

Yes, that’s kind of complicated and you have to write a script, but it’s also one-and-done, and one-size-fits-most websites, and it would do what you ask.

The reason there aren’t plugins for this is that it makes more sense to do what I did upthread, that is, copy the HTTP request message used by the browser to obtain raw data.

~Max

Here’s an example of the shell command one might use, from a Windows environment:

"C:\Program Files\Google\Chrome\Application\chrome.exe" --enable-logging --headless=new --dump-dom --virtual-time-budget=10000 https://a030-goat.nyc.gov/goat/Function1B?borough=1^&street=W%20180th%20Street^&address=500 > dom.txt

The Windows command line environment requires that all ampersands (‘&’) in the URL be preceded by a caret (‘^’).
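If the external command is launched from a script rather than typed at a prompt, the caret escaping can be avoided altogether. Below is a rough Python sketch of the same headless Chrome call. The flags are the ones from the command above (minus --enable-logging); the Chrome path is an assumption about a default Windows install, and the function names are made up.

```python
import subprocess
from pathlib import Path

# Assumption: default Chrome install location on Windows; adjust as needed.
CHROME_PATH = r"C:\Program Files\Google\Chrome\Application\chrome.exe"


def build_dump_dom_command(url: str, chrome_path: str = CHROME_PATH,
                           budget_ms: int = 10000) -> list[str]:
    """Build the argument list for a headless render-and-dump of one URL."""
    return [
        chrome_path,
        "--headless=new",
        "--dump-dom",
        f"--virtual-time-budget={budget_ms}",
        url,
    ]


def dump_rendered_dom(url: str, output_file: str,
                      chrome_path: str = CHROME_PATH) -> None:
    """Render the page in headless Chrome and save the serialized DOM as text."""
    completed = subprocess.run(
        build_dump_dom_command(url, chrome_path),
        capture_output=True,  # the DOM is written to standard output
        text=True,
        encoding="utf-8",
        timeout=120,
        check=True,           # raise if Chrome exits with an error
    )
    Path(output_file).write_text(completed.stdout, encoding="utf-8")
```

Because the arguments are passed as a list and no shell is involved, the ampersands in the URL need no carets. The resulting text file can then be imported and parsed like any other.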

ETA: The resulting text file contains the information, in this case x and y coordinates on line 246:

<p id="out_xy_coord_graphic">1003169, 0248057</p>
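Rather than counting lines, the coordinates can be pulled out of the dumped DOM by element id. This is a small sketch using Python's built-in HTML parser; the id out_xy_coord_graphic comes from the line above, while the class and function names are invented. It is deliberately simple and assumes the target element does not contain a nested element of the same tag.

```python
from html.parser import HTMLParser


class ElementTextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, wanted_id: str):
        super().__init__()
        self.wanted_id = wanted_id
        self.text_parts = []
        self._capturing = False
        self._done = False
        self._tag = None

    def handle_starttag(self, tag, attrs):
        if not self._done and not self._capturing \
                and dict(attrs).get("id") == self.wanted_id:
            self._capturing = True
            self._tag = tag

    def handle_endtag(self, tag):
        if self._capturing and tag == self._tag:
            self._capturing = False
            self._done = True

    def handle_data(self, data):
        if self._capturing:
            self.text_parts.append(data)


def text_of_element(html: str, element_id: str) -> str:
    """Return the stripped text of the element with the given id ('' if absent)."""
    parser = ElementTextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.text_parts).strip()


def parse_xy(html: str) -> tuple:
    """Split the 'x, y' text into two strings (kept as text to preserve zeros)."""
    x, y = (part.strip() for part in
            text_of_element(html, "out_xy_coord_graphic").split(","))
    return x, y


if __name__ == "__main__":
    sample = '<html><body><p id="out_xy_coord_graphic">1003169, 0248057</p></body></html>'
    print(parse_xy(sample))
```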

~Max

I had a similar task a few years ago, and solved it in the same way. Relatively painless in the wash up.