Process on CPU inside syscall

Hello Experts,

If a Solaris process calls a syscall and execution is currently inside that syscall doing only CPU work, for example inside the simple times syscall,

-> app_func
  => times
    << we are here now: we have entered times but not yet exited
  <= times
<- app_func

then

  1. the process is considered blocked
  2. the process state (for example, in prstat) is CPUn

Are these points right?

Thanks

  1. In a word, no.
    Consider write, which invokes kernel-mode code through a syscall. It can block sometimes, but most of the time it succeeds (or fails) right away.

A process consists of kernel memory and process memory. The kernel memory can access lots of really dangerous things, so access to it is tightly restricted and goes through syscalls. While a syscall is active the process is still running, but in kernel mode rather than user mode.

These syscalls do one thing very carefully. Blocking only occurs when a resource is not available, for example no data is available on a socket, so a read (recv) call will normally block until data shows up. A read call will also block waiting for user input, like when you are working on the command line.

Most syscalls do not block; they either fail or return with success right away (assuming they can get CPU, which I don't think is what we are talking about).

You can block what seems like forever waiting in a realtime system to get CPU, when your process priority is really low. But this is true for both user-mode and kernel-mode operations, and is not particular to syscalls.

Thank you Jim for your answer.
However I am confused.

Could you please review this thread and give your comment?

  1. When a process (thread) enters a syscall (times, for example), is it considered "blocked", and is column "b" in vmstat incremented?
  2. If not, when is a process considered blocked?
  3. Or is a process blocked only when it is waiting for something, for example a response from disk?
  4. If a process has nothing to do, its state in prstat is "sleeping", right? And if a process is inside a syscall reading data from disk (pread, for example) and is waiting for the physical storage to return data, is its state also sleeping?

'right away' isn't the same as 'instantly'. There will still be a brief interval where the kernel is running and the process is not.

If they're just waiting on the kernel to finish and not waiting for data, I suppose they'd simply be sleeping, rather than blocked, but this is beginning to split hairs. Either way it has to wait.

I would expect the process state to be on CPU at that moment, rather than sleeping.

When you switch to kernel mode it is usually a hardware operation, which is simply a single op. There is no delay, the process is either in kernel or user, there is no in between. Correct?

Stuff happens on the way through to a syscall. But running the actual syscall itself involves (usually): enter kernel mode, vector to function call, leave kernel mode.

We may be talking past each other here. Or I don't get the idea of blocked the OP refers to. Are we talking about the interrupt stack? paging?

From vmstat man page on Solaris 10 - this is the only mention of "block"

This has nothing that is special to a syscall entering kernel mode. Lots of other kernel mode code can encounter this condition, too. For a lot of reasons, like a context switch, image loading during an exec call, etc.

On Solaris a lot of "blocking", as defined in the man page, does in fact come from starting a child process and executing the image. But that is peculiar to loading an image file into memory, with the attendant zero-paging and so on. The libc time call does almost nothing with its syscall in kernel mode; it reads one place in memory. So I am missing something here.

So, this is the output of the code below. Solaris does not fully support getprusage, so it uses /proc instead.
Red colors are related, green colors are related, and blue has to do with being out of context or being pushed onto the interrupt stack, one of the 44 times it got interrupted/context switched.
System trap CPU time (0.204900 in the output below): this is how traps (going to kernel mode, for example) are handled. This is the time it takes to enter kernel mode for 10+ million syscalls. Maybe this is what you mean. Traps also involve signal processing, when the process comes back from a context switch.

appworx> ./tmer
10085397
Resource usage for PID 25360:
  LWP ID: 0
  Number of LWPs: 1
  Timestamp: 5711964.772491400
  Creation time: 5711960.622692600
  Termination time: 0.0
  Real (elapsed) time: 4.149402400
  User CPU time: 3.618758600
  System CPU time: 0.530350100
  System trap CPU time: 0.204900
  Text page fault CPU time: 0.0
  Data page fault CPU time: 0.0
  Kernel page fault CPU time: 0.0
  User lock wait time: 0.0
  Other sleep time: 0.0
  CPU latency time: 0.59100
  Stopped time: 0.29700
  Minor faults: 0
  Major faults: 0
  Number of swaps: 0
  Input blocks: 0
  Output blocks: 0
  Messages sent: 0
  Messages received: 0
  Signals received: 0
  Voluntary context switches: 0
  Involuntary context switches: 44
  System calls: 10085457
  Characters read/written: 9

C code

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/resource.h> 
#include <time.h>
#ifdef NEED_SNPRINTF
#include <sys/procfs.h>
#else
#include <sys/old_procfs.h>
#endif
#include <limits.h>


int getprusage (pid_t, prusage_t *);
void print_rusage(pid_t, prusage_t *);
void tmer(int, char **);
char dest[64]={0x0};


int main (int argc, char **argv)
{
	prusage_t buf;

	tmer(argc, argv);
	if (getprusage (-1, &buf) == -1)
	{
		perror("getprusage failed");
		exit(1);
	}
	print_rusage (getpid (), &buf);
	return 0;
}

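/* Busy-loop on time() for argv[1] seconds (default 5) and print how many
   calls completed before the deadline. */
void tmer(int argc, char **argv)
{
    size_t add =(argc>1) ? atol(argv[1]): 5;
    time_t tm=time(NULL);
    double z=0;
    tm+=add;                      /* deadline, in whole seconds */
    while(time(NULL) < tm) z++;   /* one time() call per iteration */
    printf("%.0f\n", z);
    return;
}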


int getprusage (pid_t pid, prusage_t *pr_usage)
{
	int fd;
	char name [PATH_MAX];

	if (pid == -1)
		snprintf (name, PATH_MAX, "/proc/%ld", (long) getpid ());
	else
		snprintf (name, PATH_MAX, "/proc/%ld", (long) pid);

	if ((fd = open (name, O_RDONLY)) == -1)
		return (-1);

	if (ioctl (fd, PIOCUSAGE, pr_usage) == -1) {
		close (fd);
		return (-1);
	}
	else {
		close (fd);
		return (0);
	}
}

void print_rusage (pid_t pid, prusage_t *buf)
{
  printf ("Resource usage for PID %ld:\n", (long) pid);
  printf ("  LWP ID: %ld\n", (long) buf -> pr_lwpid);
  printf ("  Number of LWPs: %d\n", (int) buf -> pr_count);
  /* tv_nsec is zero-padded to 9 digits so each value reads as decimal seconds */
  printf ("  Timestamp: %ld.%09ld\n", buf -> pr_tstamp.tv_sec,
  	buf -> pr_tstamp.tv_nsec);
  printf ("  Creation time: %ld.%09ld\n", buf -> pr_create.tv_sec,
  	buf -> pr_create.tv_nsec);
  printf ("  Termination time: %ld.%09ld\n", buf -> pr_term.tv_sec,
  	buf -> pr_term.tv_nsec);
  printf ("  Real (elapsed) time: %ld.%09ld\n", buf -> pr_rtime.tv_sec,
  	buf -> pr_rtime.tv_nsec);
  printf ("  User CPU time: %ld.%09ld\n", buf -> pr_utime.tv_sec,
  	buf -> pr_utime.tv_nsec);
  printf ("  System CPU time: %ld.%09ld\n", buf -> pr_stime.tv_sec,
  	buf -> pr_stime.tv_nsec);
  printf ("  System trap CPU time: %ld.%09ld\n", buf -> pr_ttime.tv_sec,
  	buf -> pr_ttime.tv_nsec);
  printf ("  Text page fault CPU time: %ld.%09ld\n", buf -> pr_tftime.tv_sec,
  	buf -> pr_tftime.tv_nsec);
  printf ("  Data page fault CPU time: %ld.%09ld\n", buf -> pr_dftime.tv_sec,
  	buf -> pr_dftime.tv_nsec);
  printf ("  Kernel page fault CPU time: %ld.%09ld\n", buf -> pr_kftime.tv_sec,
  	buf -> pr_kftime.tv_nsec);
  printf ("  User lock wait time: %ld.%09ld\n", buf -> pr_ltime.tv_sec,
  	buf -> pr_ltime.tv_nsec);
  printf ("  Other sleep time: %ld.%09ld\n", buf -> pr_slptime.tv_sec,
  	buf -> pr_slptime.tv_nsec);
  printf ("  CPU latency time: %ld.%09ld\n", buf -> pr_wtime.tv_sec,
  	buf -> pr_wtime.tv_nsec);
  printf ("  Stopped time: %ld.%09ld\n", buf -> pr_stoptime.tv_sec,
  	buf -> pr_stoptime.tv_nsec);
  printf ("  Minor faults: %ld\n", buf -> pr_minf);
  printf ("  Major faults: %ld\n", buf -> pr_majf);
  printf ("  Number of swaps: %ld\n", buf -> pr_nswap);
  printf ("  Input blocks: %ld\n", buf -> pr_inblk);
  printf ("  Output blocks: %ld\n", buf -> pr_oublk);
  printf ("  Messages sent: %ld\n", buf -> pr_msnd);
  printf ("  Messages received: %ld\n", buf -> pr_mrcv);
  printf ("  Signals received: %ld\n", buf -> pr_sigs);
  printf ("  Voluntary context switches: %ld\n", buf -> pr_vctx);
  printf ("  Involuntary context switches: %ld\n", buf -> pr_ictx);
  printf ("  System calls: %ld\n", buf -> pr_sysc);
  printf ("  Characters read/written: %ld\n", buf -> pr_ioch);
  return;
}

Let's suppose that we do a physical read involving the OS cache.
Also suppose that our cache is huge and that the CPU and disk are very slow, so that searching the cache takes about 5 seconds and requesting the data from disk after that takes 10 seconds.

pread begin
  search_in_OS_cache(); << we are in searching from cache now
  
  if not_found_in_cache then
    request_from_disk();
  end if
pread end

What will we see in prstat for this thread during the 5 seconds of searching the OS cache? As I understand it, the process state will be CPUn, right? Will this thread be considered blocked? As I understand it now, no. Is that right?

After 5 seconds the data has not been found in the cache and we have to request it from disk.
So,

pread begin
  search_in_OS_cache(); 
  
  if not_found_in_cache then
    request_from_disk(); << we are here now
  end if
pread end

For 10 seconds we wait for the response from disk.

What will we see in prstat for this thread during these 10 seconds? As I understand it, the process state will be sleeping, right? Will this thread be considered blocked (I mean, is the thread a candidate for vmstat.b)? As I understand it, yes. Is that right?

You'd be incorrect. System calls don't run inside the process. That's the difference between a system call and a function call... functions are just different code running inside the process, system calls are a message passed to the kernel. The process stops until the kernel starts it again.

Think of it this way. System calls are a process asking something else to do work for it. It doesn't do it itself, it hands it off to the kernel, passes a message. To pass this message, the process is automatically stopped. That's how system calls work.

That is interesting.
What if I have 50 processes on a host with 100 CPU cores, and suppose that at some moment ALL of them enter some CPU-expensive syscall, for example a syscall searching for data in a cache. So at that moment all of them are inside a syscall.
Which state will they have in prstat? I mean exactly the STATE column in prstat.

Yes, I understand your point of view. I am just trying to understand which state a process has inside a syscall, and when it becomes blocked.

My understanding is that

  • when a process is inside a syscall on CPU, its state in prstat is "CPUn".
  • when a process is inside a syscall waiting for a response from a disk, its state in prstat is "sleeping". At the same time the process is considered blocked, and column vmstat.b is incremented.
    Is it right?

So, about your point.
All consumed CPU should be recorded against some process (or thread). Suppose we have an idle system with only one running thread. This thread enters a CPU-consuming syscall that will consume about 100 hours of CPU.
Which state will the thread have during these 100 hours?
Will the CPU time of the thread increase by 100 hours?