Help me to troubleshoot my computer (specific questions about RAM and PSU inside)

Heat doesn’t affect just the chips. All the metal components in the machine expand and contract with temperature, causing small cracks to form and weak solder joints to separate. This in itself can create a snowball effect producing even more heat. This increases the load on the power supply, producing even more heat. Diagnostics don’t always reproduce the heating conditions of real situations either. Often they stress particular components, when the worst-case scenario is having all components heavily used, creating greater temperatures than may be detected by diagnostics.

engineer_comp_geek stated the part people miss all the time. No matter what the failure temperature is for a particular component, or the whole machine, you don’t want to get anywhere near that. The simple solution is fans. The faster you pass room temperature air through the case, the lower the temperature will be. Even a household cooling fan pointed at the computer will help cool it by cooling the case. Just make sure it’s not working against the exhaust fan(s) in the computer. Also make sure internal wiring isn’t blocking circulation, and make sure all connectors are clean and tight. You can also get longer cables for disk drives and externalize them.

I didn’t see anybody else mention another factor. A dirty computer will run hot. Dust and grime form an insulating layer over components, and clog air holes. Get compressed air and blow out the dust and dirt. Use a lint free cloth and alcohol to clean obviously dirty components.

Yep, exactly.

There are two “worst case scenarios”. One is to run a chip at a very high temperature all the time. That will dramatically shorten the expected lifespan of the chip.

The second one is called “thermal cycling” and that is where something heats up and cools down a lot. Different materials expand and contract at different rates, so as things heat up and cool down it puts mechanical stresses on all of the joints. “Cold solder joints” (where the solder wasn’t hot enough when it made the connection, and is therefore a weak joint) are particularly affected by thermal cycling. The most likely place of failure though is inside the chip, where the itty bitty wires that connect the silicon to the outside world tend to lift up off of their pads and break their connection.

Running a computer all the time reduces thermal cycling, but makes mechanical parts like disk drives and fans more likely to fail. Turning the computer off when you don’t use it reduces the mechanical stresses in fan bearings and the like, but introduces more thermal stresses.

By the way, in case you think you can beat the system and make sure your chips never fail by simply never turning them on, there’s also an aging effect in silicon that will kill the chips no matter what you do to them.

You can’t win. :wink:

The two things you most want to avoid though are running the computer hot and turning the computer on and off a lot.

I think if you *never* turn on a chip, performance will remain constant for a very long time :D

Here’s an update. Askance suggested that I run the CPU tests again. Here are the temperatures.

Core 1: Peak 67C
Core 2: Peak 64C
Core 3: Peak 60C
Core 4: Peak 63C

The temperature has gone down.

However, the last time I ran the test, the computer had been on for hours doing a RAM test, and this round I started the test the moment I got home.

That was with the new RAM installed but no other changes, such as directing a fan onto the MB? If so I would say that is still too high and the mild drop is likely due to the PC just not doing as much work at the time.

Does it still hang in the graphics-intensive games? What do the temps show after playing one for, say 20-30 minutes?

If you play around with any of the stress test utilities and simultaneously monitor the temps, you will see that the core temp readings respond very quickly to changes in activity.

The best one to use to see this is OCCT since it cycles through periods of high and low intensity. It also shows the temps for the first 4 cores. You will see that when cpu usage goes from 16% to 100% that the temps respond immediately. It might take several seconds for them to max out, but you can see the response almost immediately.

edit - ah yes, the point. It shouldn’t matter how long the computer has been on when looking at the temps since they bounce back and forth in response to activity levels so quickly. The only time this might not be true is if you don’t have a good seat for your heatsink or the thermal paste has broken down and isn’t doing its job, but even then, IDK.

A quick newsflash: The computer still hangs in Lord of the Rings Online, but does fine with Mass Effect 2 (I played through the entire last mission and Lair of the Shadow Broker). It could be just luck though; on my first try I could do LOTRO fine for an hour; now it just hangs after ten minutes.

The temperature above is with no change, besides the new RAM. I have been wanting to look into the temperature at any rate, and will discuss with a friend. I’ll do another RAM test in the meanwhile.

Anyway, there’s something about the hang which I forgot to mention, and I remembered it when it hung again. The fan began to spin very loudly, and the motherboard’s lights (I believe it’s a set of green, orange and red lights, NOT the LAN light; it’s just between the RAM slots) began to light up. I could see those lights because I am using a CoolerMaster case where one side is transparent. Edit: On checking the manual, those three lights are the Phase LED, if I am not wrong.

I believe you get the same effect whenever you do a cold reset of the computer. The fan turns very fast, the three lights light up for a while, then die down.

Okay, if I may need to install a new fan and change the PSU, and maybe the card (that’s still a suspect), is it better that I try to troubleshoot, or get a new one? Or is there something I should swap? Here are the specs of the current machine, mentioned upthread.

Quad Core2 QT9400 2.66ghz
Xtreem Dark 2 X 2GB RAM <— just exchanged
XFX Nvidia GTS 250 (256MB)
EP45-UD3l Gigabyte motherboard
Gigabyte Power Supply GE-P450P-C2 ← suspect

Did you buy this machine from an established vendor like Dell, HP or another large company that essentially puts them together on an assembly line, or did you either build it yourself or have it built by a local shop?

I’m starting to think that you may need to detach the fan/heatsink combo that sits on the chip, remove the TIM (thermal interface material), get something good like Arctic Silver, and reapply.

Those temps are way too high for idle speeds. They may be acceptable, but they’re still too high.

Have you tried running a game with say OCCT or speedfan running at the same time? You might need a second monitor for this. If that’s not an option, then maybe you can get OCCT or some other utility to log the temps and cpu utilization level and then compare the readings to the period of time the game was running.
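If you go the logging route, the comparison could look something like the Python sketch below. Note that the CSV layout here (columns `time_s`, `cpu_pct`, `core1_c`, `core2_c`) is made up for illustration; it is not the actual export format of OCCT or speedfan, so you would have to adapt the column names to whatever your tool writes out.

```python
import csv
import io

# Hypothetical log layout: seconds since start, CPU load in percent,
# and one temperature column per core. Real tools export differently.
SAMPLE_LOG = """time_s,cpu_pct,core1_c,core2_c
0,12,48,46
60,95,66,63
120,98,69,65
180,15,52,50
"""

def peak_temps_in_window(log_text, start_s, end_s):
    """Return the highest reading per temperature column between start_s and end_s."""
    peaks = {}
    for row in csv.DictReader(io.StringIO(log_text)):
        t = float(row["time_s"])
        if not (start_s <= t <= end_s):
            continue
        for name, value in row.items():
            if name.endswith("_c"):  # temperature columns only
                peaks[name] = max(peaks.get(name, float("-inf")), float(value))
    return peaks

# Suppose the game ran from 60 s to 120 s into the log.
game_peaks = peak_temps_in_window(SAMPLE_LOG, 60, 120)
print(game_peaks)
```

Note when you start and quit the game, then feed those two timestamps in as the window.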

With the idle temps you have, I wouldn’t be surprised if the actual load temps are getting close to the point where the chip’s thermal cut-out circuitry kicks in. That, for example, would explain why the cpu fan winds up and sounds like a jet on approach.

If the fan on the heatsink is working, then the most likely explanation is that the thermal paste has broken down and is no longer doing its job. The other possibility is that one of the pins that secures the fan/hs assembly to the mobo is loose and not forcing solid contact between the hs and the cpu. Although looking at the OP temps, that doesn’t seem likely since they are in a pretty tight group.

I also peeked into the PC Health section in my bios. Here are the readings of the power:

VCore: 1.140V
DDR18V: 1.888V
+3.3V: 3.360V
+12V: 12.302V
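
For reference, the ATX spec allows the +3.3V and +12V rails to sit within 5% of nominal. A rough Python sketch of that check against the two PSU rail readings above (the function name is just for illustration; VCore and the DDR voltage are left out because the board sets those itself):

```python
# ATX allows the +3.3V and +12V rails to deviate by up to 5% from nominal.
ATX_TOLERANCE = 0.05

def rail_deviation(nominal_v, measured_v):
    """Signed deviation from nominal as a fraction (0.02 means 2% high)."""
    return (measured_v - nominal_v) / nominal_v

# BIOS readings quoted above, keyed by nominal rail voltage.
readings = {3.3: 3.360, 12.0: 12.302}

for nominal, measured in readings.items():
    dev = rail_deviation(nominal, measured)
    status = "ok" if abs(dev) <= ATX_TOLERANCE else "OUT OF SPEC"
    print(f"+{nominal}V rail: {measured:.3f}V ({dev:+.1%}) {status}")
```

Both rails come out a couple of percent high, which is inside the tolerance. These are readings taken while sitting in the BIOS, though, so they say little about how the rails hold up under gaming load.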

Another oddity: The power fan is listed as 0 RPM, which strikes me as strange, because I can see that the power fan is spinning inside the casing.

I don’t think that’s unusual since the only connections between the PSU and mobo are the power cables, and I don’t think they provide any feedback to the mobo. Anyway, that wouldn’t affect CPU temps.

edit - ok, it could theoretically have an effect in a closed case by decreasing air flow, but if you can see the fan works, I wouldn’t worry about it.

It is built by a local shop.

Noted, I will look into the heat and the re-seating of the paste. That looks like the next likely candidate. Will do some monitoring of temperature while playing games.

Another issue: the XFX site for my card lists 500W as the minimum requirement. My PSU is giving “450W, peak 550W” (which I understand to be unreliable); do I have a power problem?

Once again, thanks to all who pitched in with advice.

I don’t do any serious gaming so I never run into those problems, but if the XFX site is taking into account your memory, CPU and video in arriving at that number, then I would say probably yes.

If they are basing it just on a “normal” system configuration with that card and saying min. of 500w, then probably still a ‘yes’ but the problem might be even worse than we suspect depending on how far above “normal” your system happens to be.
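To put a number on the gap, here is a tiny Python sketch using the 450W continuous rating rather than the 550W peak figure. The function name is made up for illustration, and this only compares the two label numbers; it says nothing about what the system actually draws.

```python
def psu_shortfall(rated_w, recommended_w):
    """Watts (and fraction of the recommendation) by which the PSU falls short.

    Negative values would mean the PSU has headroom over the recommendation.
    """
    gap = recommended_w - rated_w
    return gap, gap / recommended_w

gap_w, gap_frac = psu_shortfall(rated_w=450, recommended_w=500)
print(f"Short by {gap_w} W ({gap_frac:.0%} of the recommendation)")
```

So on paper the PSU is 50W, or 10%, under what XFX asks for, before any allowance for heat or aging.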

Also don’t forget that maybe several months ago (if you are in the northern hemisphere), temps were cooler, and there would have been less stress on the system.

Now ambients are probably much higher and that could mean the difference between the PSU being able to run at near peak or not.

That would probably be the best thing to investigate next.

If you decide to try to reapply the thermal paste, make sure you post here first since you have to be careful. The heatsink tends to get glued to the cpu and if you try to pull it directly out you can damage the pins. You need to wiggle, tug, wiggle, tug to get the hs off without yanking the cpu along with it.

Also, if you are using a fan/hs combo that mounts with push pins, there is a trick to getting the pins back into their holes.

Another update. Memtest still reports one write error to the RAM. Since the RAM is new, I guess the cause of the problem is elsewhere.

Remember that peripheral cards like the video have memory on them too. But I’m betting on a motherboard problem at this point.

Will Memtest write to other memory besides RAM?

By motherboard, do you mean the Gigabyte board? Is there any way to verify?

I’ve never heard of memtest86 testing anything other than the main system memory. I haven’t used it in a few years though, but if this is something new, they would at least have to tell you the bus location if the bad memory were on a peripheral device like a pci card.

Unless you have indications to the contrary, I would assume that the errors are in main memory.

I have run into this problem when the memory did not have enough voltage - although in the one recent incident, I think it also had something to do with the integrated memory controller on my chip. Regardless, raising the voltage supplied to the DRAM fixed the problem (something you shouldn’t have to do).

In your case, it seems much more likely that the source of any power problems is your PSU - as you noted yourself in an earlier post. If you had this machine when the ambient temps were cooler and you didn’t have any problems, but you do now with the same configuration, then that would be the first thing to look at.

I can’t think of anything you can do to help test for that right now though besides running the OCCT PSU test - which I think you’ve already done. If you could temporarily drop your ambient temps by maybe 10-15 degrees F and see how the system responded then, that would be useful. If the problems go away or are much fewer, then I would buy a more powerful PSU.
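Since the temps posted in this thread are in Celsius, here is that ambient drop converted, as a quick Python sketch. A temperature *difference* converts with the 5/9 factor only, without the 32 offset; the function name is just for illustration.

```python
def delta_f_to_c(delta_f):
    """Convert a temperature difference from Fahrenheit degrees to Celsius degrees."""
    return delta_f * 5 / 9

for drop_f in (10, 15):
    print(f"{drop_f} F drop = {delta_f_to_c(drop_f):.1f} C drop")
```

That works out to roughly a 5.6 to 8.3 degree C drop in room temperature.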

Keep in mind that if you have been running over what the specs for the PSU are for a long time, you have probably degraded components in the PSU.

If you want a quick test of your power consumption, you can get a kill-a-watt meter to measure power use in real time for about $15-20.
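One caveat when reading a kill-a-watt: it shows AC power drawn at the wall, while the 450W on the PSU label is DC output. The PSU loses some power as heat, so the load on the PSU is roughly the wall reading times its efficiency. A minimal Python sketch, where the 80% efficiency and the meter readings are placeholder numbers I made up, not measurements of this PSU:

```python
def dc_load_w(wall_w, efficiency):
    """Estimate the DC load on the PSU from the AC draw measured at the wall."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return wall_w * efficiency

ASSUMED_EFFICIENCY = 0.80  # placeholder; check the PSU's own spec sheet
RATED_W = 450              # continuous rating from the PSU label

for wall_w in (300, 450, 560):  # made-up meter readings
    load = dc_load_w(wall_w, ASSUMED_EFFICIENCY)
    print(f"{wall_w} W at the wall ~ {load:.0f} W DC, {load / RATED_W:.0%} of rating")
```

Watch the meter while the problem game is running, not at idle, since that is when the draw peaks.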

I don’t know if Memtest writes to the video memory; it is addressed out of the main memory address range but depending on how much it is it would usually be paged, so I’m guessing the answer is No. Not sure if it might have some options that would test peripheral memory; this page seems to suggest you should go into the Memory Sizing options and select different ways of probing memory.

Yes, I mean the Gigabyte board. Was the memory error reported the same as the one you saw before replacing the RAM, or a different one or in a different place? Is it consistently in the same place and the same error in the same test? If the error is different since replacing the RAM, the new RAM may be faulty.

Let’s get the temps you see after a problematic game has been running for a while, say 20-30 minutes. Then open the case and point a desktop fan right into the cavity and try it again; if you get significantly lower temps then you have a cooling problem; also please report whether that seems to extend how long you can play the problem games or even fixes it altogether.