(Forgive me if I include details you already know; I just want to be complete.)
There are different ways to run networked games, but the common model used in MMORPGs is the traditional client-server model. Each person playing at home runs a client that connects to a game server hosted by the game publisher.
The server runs the ‘official’ game: it holds the official copy of the game’s state. Each client has a copy of the server’s game state, although with much less information. A client won’t know the details of other players or of the hidden parts of the game. If you think of a card game being played as client-server, each client would know the cards it is holding and the cards face up on the table, but only the server would know all of the cards.
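To make the card-game analogy concrete, here is a minimal sketch of how a server might cut its full state down to what one client is allowed to see. The names `CardGameState` and `view_for` are hypothetical, made up for illustration, and not taken from any real game.

```python
from dataclasses import dataclass


@dataclass
class CardGameState:
    """Authoritative state, held only by the server (hypothetical structure)."""
    hands: dict[str, list[str]]  # player name -> cards that player holds
    table: list[str]             # face-up cards, visible to everyone
    deck: list[str]              # face-down cards, hidden from every client


def view_for(state: CardGameState, player: str) -> dict:
    """Build the reduced copy of the state that one client may see."""
    return {
        "my_hand": list(state.hands[player]),
        "table": list(state.table),
        # Other players' cards are reduced to a count, never the cards themselves.
        "opponent_card_counts": {
            name: len(cards)
            for name, cards in state.hands.items()
            if name != player
        },
        "deck_size": len(state.deck),
    }


state = CardGameState(
    hands={"alice": ["AS", "KD"], "bob": ["7H", "2C"]},
    table=["QC"],
    deck=["3D", "9S", "JH"],
)
alice_view = view_for(state, "alice")
```

Here `alice_view` contains Alice's own hand and the table, but nothing that identifies Bob's cards or the order of the deck.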
The server also has the ability to push the game state to any of the clients. If at any time the game server detects that the client is confused or out-of-sync, it will push a fresh game state out to the client. This might make a character jump to a new position instantly (the lagwarp that Chronos mentions).
During a game, each client sends a stream of actions to the server. These actions can be low-level or high-level depending on the type of game. An arcade game like 4-player Gauntlet might just send raw controls (joystick movement and button presses). A more complex game like WoW might send higher-level actions like swing-main-weapon or cast-spell-x. The server receives all of these actions from all of the clients and processes them serially, in the order they arrive (although that does not necessarily mean in a single thread).
Take a case similar to the one you mentioned in the OP: two clients swing their axes at a monster at exactly the same time. One of these actions is going to arrive at the server first and be processed first, and that client would be the one to slay the monster. The key is that in practice it doesn’t really matter exactly which client gets its action processed first. As a player, you would never know that you both swung at the exact same instant and so should both have hit the monster. Instead, you assume the other player swung slightly sooner. In fact, packet delays over the internet introduce much larger variations in the order in which actions arrive. In this case, it is not necessary to be perfect, just good enough. Someone swung and the monster was killed. No noticeable artifacts.
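The axe example can be sketched as a simple arrival-order queue. This is only an illustration under my own assumptions: the `ActionServer` class, the single monster with 10 hit points, and the fixed 10 damage per swing are all invented for the example.

```python
from collections import deque


class ActionServer:
    """Toy server that processes client actions strictly in arrival order."""

    def __init__(self):
        self.monster_hp = 10   # hypothetical monster that dies in one hit
        self.killer = None
        self.queue = deque()

    def receive(self, client_id, action):
        # Arrival order at the server decides processing order.
        self.queue.append((client_id, action))

    def process_all(self):
        results = []
        while self.queue:
            client_id, action = self.queue.popleft()
            if action != "swing-main-weapon":
                continue  # other action types are ignored in this sketch
            if self.monster_hp <= 0:
                # The earlier swing already killed the target.
                results.append((client_id, "target already dead"))
                continue
            self.monster_hp -= 10
            if self.monster_hp <= 0:
                self.killer = client_id
                results.append((client_id, "killed monster"))
            else:
                results.append((client_id, "hit"))
        return results


server = ActionServer()
# Both players swung "at the same time", but one packet arrived first.
server.receive("client_a", "swing-main-weapon")
server.receive("client_b", "swing-main-weapon")
results = server.process_all()
```

Whichever packet happens to arrive first gets the kill, and the second client is simply told the target is already dead.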
Now in the case that Chronos mentioned, where two players are moving at the same time, this is a constructive action and not a destructive one. Both players are sending a stream of move actions (move up, move up, move left, move up…). At the same time the server is telling the clients that the other player is moving with a stream of events. The clients assume that their own move actions will be accepted and they display their character moving. However, as the server is processing these streams of movement actions, it might detect that both players tried to move into the same location and it will have to prevent the second player from entering that space. In this case, the server identifies a conflict and sends a new game state to the second client so that his character can be ‘pushed’ back into an unoccupied space. Sometimes, when the network latency is very poor or the server is overloaded, you will see the characters jump greater distances as the server tries to keep everyone in-sync.
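The movement case might look roughly like the sketch below: each client moves its own character optimistically, and the server sends a correction when a move is rejected. The grid coordinates and the names `MoveServer` and `PredictingClient` are assumptions for the example, and a real game would send a richer state update than a single position.

```python
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(pos, direction):
    dx, dy = MOVES[direction]
    return (pos[0] + dx, pos[1] + dy)


class MoveServer:
    """Authoritative positions; rejects moves into an occupied square."""

    def __init__(self, positions):
        self.positions = dict(positions)

    def handle_move(self, player, direction):
        target = step(self.positions[player], direction)
        occupied = {p for name, p in self.positions.items() if name != player}
        if target in occupied:
            # Conflict: tell the client where its character really is.
            return {"player": player, "pos": self.positions[player]}
        self.positions[player] = target
        return None  # move accepted, no correction needed


class PredictingClient:
    """Shows its own moves immediately, assuming the server will accept them."""

    def __init__(self, player, pos):
        self.player = player
        self.pos = pos

    def move(self, direction):
        self.pos = step(self.pos, direction)  # optimistic local update
        return direction

    def apply_correction(self, msg):
        self.pos = msg["pos"]  # snap back to the server's position


server = MoveServer({"alice": (0, 0), "bob": (2, 0)})
alice = PredictingClient("alice", (0, 0))
bob = PredictingClient("bob", (2, 0))

# Both players step into (1, 0) at about the same time.
alice_move = alice.move("right")
bob_move = bob.move("left")

# Alice's packet happens to arrive first, so her move is accepted.
alice_correction = server.handle_move("alice", alice_move)
bob_correction = server.handle_move("bob", bob_move)
if bob_correction is not None:
    bob.apply_correction(bob_correction)
```

Bob briefly saw himself standing at (1, 0) before the correction put him back at (2, 0), which is the small ‘push’ described above. With worse latency, more predicted moves pile up before the correction arrives, so the jump is larger.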
As for “are there generic libraries that help with these activities”, I think some game libraries (e.g. DirectX) have support for syncing state and sending packets over TCP/IP, but beyond that it is really unique to the game rules. Only the specific game code could know how an axe action and a move action should be handled differently, and which of them could cause conflicts. I suppose you could create a very generic framework, but it might add too much overhead.
MMORPGs handle so many clients that all of this logic becomes a herculean effort. As a result, it needs to be scaled and optimized in amazing ways, and a generic framework wouldn’t be efficient. In these cases, the code is designed specifically for the game and optimized around the hardware and network architectures.
For less rigorous games, like a 2-person arcade game, you don’t need a client-server model. Instead, each instance of the game talks directly to the other instance. Each one sends its actions (joystick moves, button presses) to the other, and the receiving instance treats these actions as if they happened on a local joystick or button. Then, each instance can send a game state sync to the other just to make sure they are both aligned. This is a nice solution because it does not require a game server. However, it opens the clients up to hacking, since each client has to have a complete copy of the game state.
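A rough sketch of that peer-to-peer arrangement follows, with the network replaced by direct method calls. The `PeerGame` class and its one-dimensional positions are invented for the example. Comparing a checksum before sending the full state is my own simplification, not something the approach requires.

```python
import hashlib
import json


class PeerGame:
    """One instance of a two-player game; each peer holds the full state."""

    def __init__(self):
        self.state = {"p1": 0, "p2": 0}  # hypothetical 1-D positions

    def apply_input(self, player, button):
        # Remote inputs are handled exactly like local joystick input.
        if button == "right":
            self.state[player] += 1
        elif button == "left":
            self.state[player] -= 1

    def checksum(self):
        blob = json.dumps(self.state, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def load_state(self, other_state):
        self.state = dict(other_state)  # full state sync from the other peer


peer_a, peer_b = PeerGame(), PeerGame()


def broadcast(player, button):
    """Stand-in for sending an input to both instances over the network."""
    for peer in (peer_a, peer_b):
        peer.apply_input(player, button)


broadcast("p1", "right")
broadcast("p2", "left")
broadcast("p1", "right")
in_sync_before = peer_a.checksum() == peer_b.checksum()

# Simulate a lost packet: peer_b never sees this input from p1.
peer_a.apply_input("p1", "right")
diverged = peer_a.checksum() != peer_b.checksum()
if diverged:
    peer_b.load_state(peer_a.state)  # periodic sync brings them back in line
```

Note that `peer_b` accepts whatever state `peer_a` sends, which is exactly the hacking exposure mentioned above: no neutral server validates the state.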