Job stopped, not sure why

Hi,

I am new to the cluster, and I received the following error while running a job: /var/opt/ud/torque-4.2.10/mom_priv/jobs/1690376.hpc05.hpc.SC: line 14: 32107 Killed.

Could someone explain why the job was killed? Thank you.

Hey,

Welcome to the cluster! Just checking: does the problem still occur? I see you are now running several other jobs.

Thank you! Yes, I am running other jobs, but this problem shows up only for one specific input. I tried running the job twice with that input, and both times it was “killed”. It did not hit the walltime limit, so I don't think walltime is the cause, but I still don't know why the job was stopped.

Do you also monitor memory usage? A bare “Killed” message with no further explanation often means the process exceeded its memory allocation and was terminated. Alternatively, it could be a problem with the code itself.

How can I monitor memory usage? I occasionally used “qstat -f” to get some information; is there another way to monitor it?

It largely depends on the programming language and platform you are using. From outside the job, qstat -f <jobid> is a reasonable option: its resources_used.mem field reports the job's current memory consumption while the job is running.
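If the job happens to be a Python script (an assumption; the thread never says what language is being run), one way to monitor memory from inside the job itself is the standard-library resource module, which reports the process's peak resident set size. A minimal sketch:

```python
import resource


def peak_memory_mb() -> float:
    """Return this process's peak resident set size (RSS) in MiB.

    Note: ru_maxrss is reported in kilobytes on Linux but in bytes
    on macOS; this helper assumes a Linux cluster node.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


if __name__ == "__main__":
    # Allocate and touch ~50 MB so the peak is visible, then report it.
    data = bytearray(b"x" * (50 * 1024 * 1024))
    print(f"peak RSS: {peak_memory_mb():.1f} MiB")
```

Printing this at a few checkpoints (e.g. just before the step that dies on the problematic input) can show whether memory grows toward the job's limit right before the kill.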
