Programmers: how do I parse the "screen"?

I’m interested in writing a program that automates the play of a game - it’s nothing complicated, like UO or EQ, and in fact it doesn’t really even matter what the game is - I just want to make a proof of concept, so let’s suppose it’s solitaire.

Now, my primary problem is how to receive information from an application by reading a portion of the screen area. Let’s say I want to check if a card at a particular location on the screen is the 9 of clubs or not. Somehow I would need to:

  1. determine the x,y screen coordinates of the bounding box containing the card I want to ‘read’
  2. capture the image at those coordinates
  3. compare it to a stored bitmap to see if it matches up

An alternate, simpler approach might be:

  1. identify a unique x,y + color combination of a pixel that uniquely identifies the card (i.e. a 9 of clubs would have a black pixel in a specific location)

The problem with the first approach is that I have no idea what libraries and functions I could use to implement this parsing and comparison. The problem with the second approach is that I don’t know how to capture a specific pixel from a window that could be anywhere on the screen, which would mean I’d never know the precise x, y to check.

Has anyone thought about or dealt with these issues before? I’m passable in Java, VB, and C++, so any solutions or code examples involving those would be incredibly useful.

A much, much, much easier solution would be to find the source code of a solitaire game and write an API which would simulate a human player by calling the same functions as the graphical interface.

I think you have it backwards. How did the card get on to the screen? It certainly wasn’t the user scribbling it there with a marker, right? So you have the program establish a data structure that tells it what cards are where. Then you query that to display the cards on the screen, but you also query that to figure out where a given card is right now, or what card is in a given location.

The idea is that I don’t have access to the API - I am trying to do “screen scraping” and read the info I need directly off the display. So the issue isn’t “can I write a program that plays solitaire”, but rather “can I trick Microsoft’s solitaire program into thinking someone is sitting at a physical terminal and playing it by reading the display and clicking on things?”

If you can capture a pixel, I presume you could capture an entire area. I think you’ve got the same problem in either case.

I was gonna say I don’t know anything about screenscraping, but I do. Had to write a script for a terminal emulator for a government database… ick. But that’s nothing like what you want.

Can I suggest another (possibly harder) method? There are plenty of programs/trainers/mo-money hacks/etc that monitor the memory another program allocates, and do stuff based on what it sees. It’s a sure bet that some variable will be set to the card you want at some point, and you can probably extract an array indicating current hands displayed. I unfortunately don’t know anything about implementing this either, but it may be easier.

To do this you’ll need to get busy with the Windows GDI API and a couple of other functions. I’m just going to throw the function names at you here - you’ll need to look up the details in the Platform SDK. Explaining in more detail would take much too long.

First, you need to identify the window of the application you’re interested in. The EnumWindows function is a callback type enumeration - you provide a pointer to a routine that accepts specific parameters and it’ll get called once for every open top-level window (if you’re on NT then the list will be limited to the ones on the desktop to which the UI of your application is bound).

This callback sends your application the window handle (hWnd) of each window. You can then call GetWindowText to match on the window’s title, or GetClassName if you know the class name. This will allow you to identify and store the hWnd of the window you’re interested in.

Now, to get the window’s display you need to start with the GDI. Windows draws graphics on things called Device Contexts (DC). Inside each window’s DC will be various GDI objects, among them a Bitmap which should contain the graphical display of the user interface. I’ve had problems accessing the DC of individual windows directly, so as an alternate method I usually get the DC of the desktop window (hWnd=0) which allows me to access the entire screen display.

Once you have a handle to the desktop DC (using GetDC) you can use the GetPixel function to return the colour of pixels at given coordinates. In order to know which coordinates to examine you should call GetWindowRect or, probably more usefully, GetWindowPlacement. Both of these APIs give you access to a RECT structure that describes the bounds of the window’s rectangle.

With this information you can start examining pixels within the window’s display and make decisions based on this. If you need to do more processing of the image, particularly processing that’ll change the image (edge-detect filters using boolean operations etc) then you’ll need to go a bit further and create your own copy of the window’s UI.

To do this you’ll need to build your own DC and Bitmap object. Use CreateCompatibleDC, CreateCompatibleBitmap and SelectObject. Remember to retain references to the objects selected out of the DC you create and destroy them properly when you’re done or you’ll leak memory. When you’ve built your private DC you can use BitBlt to copy rectangles of graphics around from DC to DC. Once you’ve copied the display to your private DC then you can do what you like to it without affecting the on-screen display.

All this is certainly possible in C++ and VB (v6 and, I think, v5). .NET languages provide a different way to do this via the System framework. Java, I’m not sure about. It probably can but I’ve not done any for seven or eight years, so I’m a bit rusty.

Hope this helps point you in the right direction.

Solitaire would actually be fairly easy because it has a fixed layout. Depending on the size of the window (which you can easily get), you can calculate exactly where all of the cards will be located. It’s then a simple matter of capturing a bitmap of each card location and comparing it to known “control” bitmaps. You can get your control bitmaps (the faces of the cards) by capturing them with a screen capture utility.
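To make the fixed-layout idea concrete, here is a rough sketch in Java. It assumes you already know where the window's client area starts on screen, and that the top row of card slots is evenly spaced. The class name CardLayout and every number in it (card size, margins, gap) are placeholders I made up for illustration; measure the real values from a screenshot.

```java
import java.awt.Rectangle;

class CardLayout {
    // Hypothetical layout numbers, not measured from the real game.
    static final int CARD_W = 71;
    static final int CARD_H = 96;
    static final int LEFT_MARGIN = 11;
    static final int TOP_ROW_Y = 6;
    static final int GAP = 11;

    /**
     * Screen rectangle of top-row slot number `slot` (0-based), given the
     * screen position of the window's client-area origin.
     */
    static Rectangle slotOnScreen(int windowX, int windowY, int slot) {
        int x = windowX + LEFT_MARGIN + slot * (CARD_W + GAP);
        int y = windowY + TOP_ROW_Y;
        return new Rectangle(x, y, CARD_W, CARD_H);
    }
}
```

Because everything is an offset from the window origin, nothing needs to change when the window moves; only windowX and windowY do.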

Controlling the mouse can be done with the SetCursorPos() API. You can simulate mouse clicks by sending WM_LBUTTON* messages to Solitaire’s window. Dragging a card may be a challenge, but it can certainly be done.
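For comparison, a drag can be sketched in Java with java.awt.Robot, which I am swapping in here for SetCursorPos and the WM_LBUTTON* messages. This is only a sketch: the class name MouseDrag, the step count and the 10 ms delay are my own assumptions, and the drag method needs a real display to run.

```java
import java.awt.Point;
import java.awt.Robot;
import java.awt.event.InputEvent;
import java.util.ArrayList;
import java.util.List;

class MouseDrag {
    /** Evenly spaced points from `from` to `to`, both ends included (steps >= 1). */
    static List<Point> path(Point from, Point to, int steps) {
        List<Point> points = new ArrayList<>();
        for (int i = 0; i <= steps; i++) {
            int x = from.x + (to.x - from.x) * i / steps;
            int y = from.y + (to.y - from.y) * i / steps;
            points.add(new Point(x, y));
        }
        return points;
    }

    /** Press the left button at `from`, move along the path, release at `to`. */
    static void drag(Robot robot, Point from, Point to, int steps) {
        robot.mouseMove(from.x, from.y);
        robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
        for (Point p : path(from, to, steps)) {
            robot.mouseMove(p.x, p.y);
            robot.delay(10); // assumed pause so the target can process each move
        }
        robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
    }
}
```

Moving in several small steps rather than one jump is a guess at what the game needs in order to register a drag; a single move may turn out to be enough.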

If you haven’t done much GDI programming, expect this to take several weeks or months. GDI is not too complicated, but there are lots of things to know (some of which Armilla mentioned). When you do something wrong, the usual result is a crash (UAE, GPF, Access Violation, whatever Microsoft is calling them these days).

If you’re doing this for fun, find something that is actually fun instead.

Thanks for the info on the GDI API. I just came across the Java Robot class (java.awt.Robot), which seems like it can do everything that I would want to do: it provides a mechanism both to test pixels and to capture entire regions of the screen, and it saves me the trouble of having to learn how to work with the GDI.
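Roughly what I have in mind, as an untested sketch: createScreenCapture and getPixelColor are the actual Robot methods, while the class name ScreenReader and the closeTo tolerance helper are my own additions.

```java
import java.awt.AWTException;
import java.awt.Color;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.image.BufferedImage;

class ScreenReader {
    /** Capture a rectangle of the screen (absolute coordinates). Needs a display. */
    static BufferedImage capture(Rectangle area) throws AWTException {
        return new Robot().createScreenCapture(area);
    }

    /** Read the colour of one screen pixel. Needs a display. */
    static Color pixelAt(int x, int y) throws AWTException {
        return new Robot().getPixelColor(x, y);
    }

    /** True if each of the red, green and blue channels differs by at most `tolerance`. */
    static boolean closeTo(Color a, Color b, int tolerance) {
        return Math.abs(a.getRed() - b.getRed()) <= tolerance
            && Math.abs(a.getGreen() - b.getGreen()) <= tolerance
            && Math.abs(a.getBlue() - b.getBlue()) <= tolerance;
    }
}
```

The tolerance is there in case the display settings shift colours slightly; with an exact match it can simply be 0.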

I’m still perplexed by an algorithmic question, though - if all you can do is work with absolute screen pixels, what happens if you can’t tell where the window is? Let’s say I know that the first collection pile in solitaire has its top left corner at 40, 50. But I can’t hard code this in, since the next time I run it, the solitaire window might be placed in a different region of the screen, thus invalidating my coordinates.

Also, suppose I make a big library of “reference” bitmaps as someone suggested, and compare captured regions to them. One idea that comes to mind is reading the images in as byte arrays, XORing them together, and counting the 1s in the resulting array to compute a rough similarity score. Does that sound reasonable?
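Here is a minimal sketch of that XOR idea, assuming both images are the same size and that I compare packed RGB values from BufferedImage rather than raw file bytes. The class name XorScore is made up.

```java
import java.awt.image.BufferedImage;

class XorScore {
    /**
     * Number of differing bits across the RGB channels of two same-sized
     * images. 0 means the images are identical; larger means less similar.
     */
    static long differingBits(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("images must have the same dimensions");
        }
        long bits = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // XOR the packed pixels and mask off the alpha byte.
                int diff = (a.getRGB(x, y) ^ b.getRGB(x, y)) & 0xFFFFFF;
                bits += Integer.bitCount(diff);
            }
        }
        return bits;
    }
}
```

One thing I noticed while writing it: a bit count weights differences unevenly, since channel values 127 and 128 differ in all eight bits despite being nearly the same colour. Counting mismatched pixels, or just requiring a score of 0, might be more robust.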

I am sure there are better ways to do this, but you could get the program to search the screen for a specific image (a fixed part of the solitaire setup, for example). Once it knows the location of this part, it can work out the rest from there.
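A brute-force version of that search might look like the following in Java. It is a sketch only: it assumes an exact pixel-for-pixel match, and the names ImageSearch, haystack and needle are mine. The haystack would be a full-screen capture and the needle a small stored image of the fixed part.

```java
import java.awt.Point;
import java.awt.image.BufferedImage;

class ImageSearch {
    /** Top-left corner of the first exact match of needle in haystack, or null. */
    static Point find(BufferedImage haystack, BufferedImage needle) {
        int maxX = haystack.getWidth() - needle.getWidth();
        int maxY = haystack.getHeight() - needle.getHeight();
        for (int y = 0; y <= maxY; y++) {
            for (int x = 0; x <= maxX; x++) {
                if (matchesAt(haystack, needle, x, y)) {
                    return new Point(x, y);
                }
            }
        }
        return null;
    }

    /** True if every needle pixel equals the haystack pixel at the given offset. */
    private static boolean matchesAt(BufferedImage haystack, BufferedImage needle, int ox, int oy) {
        for (int y = 0; y < needle.getHeight(); y++) {
            for (int x = 0; x < needle.getWidth(); x++) {
                if (haystack.getRGB(ox + x, oy + y) != needle.getRGB(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Most positions fail on the first pixel or two, so this is less slow than it looks, but the needle should contain something distinctive or it will match in the wrong place.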

Here’s another problem to think about: what to do if part of the window is off the side of the screen.
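One way to at least detect that case, sketched in Java with java.awt.Rectangle: intersect the window's rectangle with the screen's, and check each region you want to read before capturing it. The class name ScreenClip is made up, and the screen rectangle is passed in as a parameter here; in a real program it could come from Toolkit.getDefaultToolkit().getScreenSize().

```java
import java.awt.Rectangle;

class ScreenClip {
    /** The part of `window` that lies on `screen`; empty if there is no overlap. */
    static Rectangle visiblePart(Rectangle window, Rectangle screen) {
        return window.intersection(screen);
    }

    /** True if `region` lies entirely on `screen`, so it is safe to capture. */
    static boolean fullyVisible(Rectangle region, Rectangle screen) {
        return screen.contains(region);
    }
}
```

If a card's region is not fully visible, the program would have to either move the window or give up on that card. Note that this does nothing about other windows covering the game.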