Linux Containers - /proc mounting and other queries

Hi guys, I am confused about how containers work in Linux, especially how chrooting works and how /proc filesystems are mounted.

So please feel free to migrate this question to another forum if this is not the right one.

Now, to business.

Okay, Docker can be confusing to the uninitiated, especially when everyone thinks containers are just lightweight VMs. A good talk on YouTube, "Build your own container from scratch", helped me get a clearer picture.

It showed a lot of useful things, like namespace creation, but where I really got confused was when the virtual filesystem /proc had to be mounted into a separate directory.

I am completely confused about why this works the way it does.

The part that confused me is this bit in the video.

Questions are as follows:

  • Can't ps be namespaced? As in, by default it would show only the processes in the namespace from which it is invoked?
  • When we mount /proc into a new rootfs, are we remapping the host's existing /proc or are we creating a new /proc for that namespace? (See the rough sketch after this list.)
  • I don't have much idea about Linux virtual filesystems, but I believe they are a way for the kernel to communicate information to user space. If that is correct, does that mean that when we have a new /proc mounted, the kernel is now writing out to two different /proc directories? I am really confused by this.
  • I have used chroot to get into a system for repair purposes, but I have not completely understood most of it. Take for instance when I mount the /proc from my LiveCD into a broken OS: that is just mapping my existing /proc into the broken OS; it does not create a new /proc, AFAIK. Does that have any similarity to what is shown in the video, or are we creating a new /proc there? That would not make sense to me, since container processes can also be viewed from the host.
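
To make the confusion concrete, here is roughly the sequence from the video as I understood it (my own sketch with unshare from util-linux, run as root; not the talk's exact commands):

unshare --pid --mount --fork /bin/bash   # new PID and mount namespaces
ps -ef                                   # still shows ALL host processes - why?
mount -t proc proc /proc                 # remount /proc inside the namespace
ps -ef                                   # now only the new shell and ps itself show up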

Please let me know if any further information is required from my side.

ps looks into /proc, so having a namespace in /proc is the correct way.
(Having a namespace in ps would be inconsistent.)
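
You can even watch ps doing exactly that (assuming strace is installed):

strace -e trace=open,openat ps -ef 2>&1 | grep '/proc' | head
# shows ps opening /proc, then /proc/<pid>/stat, /proc/<pid>/status, ... for every process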

Each mount of /proc is a new interface to the kernel. There is no "forwarding" of an existing mount. The only mount forwarding is the bind mount (it should work with all filesystem types, including /proc).

/proc works a bit like /dev, where each file is backed by a driver.
The kernel is not constantly writing these files out; it maps them. Only when you access a file is its content actually generated.
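
You can see this on any Linux box:

cat /proc/uptime          # content is generated at the moment of the read
sleep 2
cat /proc/uptime          # different numbers now
ls -l /proc/uptime        # size 0 - nothing is stored anywhere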

Is there any documentation that describes how /proc behaves when given a namespace? Also, can a system have /proc mounted in 2 different places? How is that even allowed?

Could you please explain what you mean by "forwarding" of an existing mount point? Also, what is the difference between a normal mount and a bind mount? No one seems to have a clear answer for that.

What do you mean by "mapping out"? Does that mean that whenever I query /proc, the kernel actually "puts" the information there for the program that is querying it?

You already got some excellent answers to your questions at hand but you might profit from a little "theory" behind all that, so here it goes:

Whenever we talk about virtualisation we need to keep in mind that there are two fundamentally different ways of doing so: "full virtualisation" and "para-virtualisation".

Full virtualisation is what e.g. VMware or the DOSBox emulator do: a program is started which emulates a certain hardware platform. On this emulated hardware an OS is installed and "runs" more or less independently from the host hardware. The advantage is that you can mix arbitrary platforms, because it only depends on the availability of the emulator programs. You can install a PC emulator onto Linux and run a Windows guest in it, start it a second time and install another Linux into it, then start a third instance and install DOS onto that. Fully virtualised systems are not "aware" that they are virtualised; for the virtualised system it is like running on non-virtualised hardware.

Para-virtualisation, on the other hand, does not work like this: hardware is only emulated up to a certain point. For instance, take the filesystem driver: if you work on a real disk you need to do all sorts of checks inside this driver, because disk blocks can fail, filesystems can get corrupted, etc. The driver makes up for that to some extent with these checks. Now, a fully virtualised system usually has a fully virtualised disk, which is in fact a file in the host system's filesystem. The driver of the virtualised machine wouldn't have to do all these checks, because "under" it the disk driver of the host system (which really does the writing) will do them anyway. A para-virtualised disk driver is "aware" that it works on virtual hardware, so it skips all these checks (and a lot of other unnecessary work), which makes the load the emulation places on the host system considerably lighter. The same goes for network drivers, etc.

The final development in this is not even to have a separate kernel for the guest OS, but to set aside some "space" in the host kernel where all the processes of the guest system go. At this point we usually do not call the guest systems "guest systems" any longer but call them "containers". The big advantage of para-virtualised systems is that the load produced by emulating the hardware itself is much lighter than in fully virtualised systems, so you get to run more guest systems from a given amount of host resources. On the downside, having only one kernel for all guests means that you can't have different OSes running but are limited to what the host system runs. Examples of para-virtualisation software are OpenVZ/Virtuozzo, but also Docker.

What is chroot and how does it enter the picture? UNIX, since its earliest stages, has had the chroot command, which creates a system environment limited to some separated part of the filesystem. Historically this was done to be able to safely operate FTP servers: in a certain directory a replica of (the important parts of) the main filesystem (like /usr/lib, /bin, etc.) was created and the absolute minimum of libraries, commands, etc. were placed there. Then the FTP server process was started in such a way that this directory was the "root" of its environment and it could not access any file outside of it. This was done with the chroot command. This way users could access the FTP server and transfer files to and from it - they might even mess up the FTP server itself, but only this "chrooted" part, not the "underlying" system. Para-virtualised guest systems - containers in particular - more or less resemble this, and para-virtualisation is therefore sometimes regarded as a "richly dressed-up chroot environment".
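
To illustrate, a minimal chroot can be built by hand (just a sketch; the library paths are what a typical glibc-based x86_64 Linux uses and may differ on your system):

mkdir -p /srv/jail/bin /srv/jail/lib64
cp /bin/sh /srv/jail/bin/
cp /lib64/ld-linux-x86-64.so.2 /lib64/libc.so.6 /srv/jail/lib64/   # list what the shell needs with: ldd /bin/sh
chroot /srv/jail /bin/sh
# inside this shell, / is now /srv/jail; no path outside of it is reachable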

I hope that connects a few loose ends.

bakunin

Try it yourself:

mount -t proc proc /mnt
mount | grep proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /mnt type proc (rw,relatime)
ls /mnt
ls /proc

After your tests, do not forget to umount the 2nd mount point:

umount /mnt

Hard to explain. An example is a disk mount (a filesystem like ext3, ext4, reiserfs, xfs, ...), which is only allowed once, because writes through two mount points would corrupt the filesystem on the disk. But: a bind mount of the primary disk mount to another mount point is allowed; all writes occur at the primary mount point.

Yes, at least the contents of the files are created by a little kernel routine when accessed. Some files are even handled in reverse: by writing a value into one, the kernel routine patches the corresponding location in kernel memory.
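
The sysctl files under /proc/sys are the classic example (run as root):

cat /proc/sys/net/ipv4/ip_forward        # the kernel routine produces "0" or "1" on the fly
echo 1 > /proc/sys/net/ipv4/ip_forward   # the write handler patches the setting in kernel memory
cat /proc/sys/net/ipv4/ip_forward        # reads back 1 now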

It does connect a few loose ends. But I really didn't think Docker was a para-virtualization product. I thought containers were just implemented as jailed processes.

AFAIK Docker mainly uses Linux namespaces and cgroups to create a virtualized environment for execution.
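
My rough mental model of that, as a sketch (unshare from util-linux plus the cgroup v1 memory controller; the cgroup name "demo" and the paths are mine, not Docker's, and cgroup v2 systems lay this out differently):

mkdir /sys/fs/cgroup/memory/demo
echo 100M > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes   # resource limit via cgroup
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs              # put this shell (and its children) in it
unshare --uts --pid --net --mount --fork /bin/bash             # isolation via namespaces
hostname container-test    # only visible inside the new UTS namespace
ps -ef                     # still the host's view until /proc is remounted, as discussed above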

One more thing that is bugging me: when you install software within a Docker container, does it get installed into the host OS as well? Intuition tells me that is how it should be, since Docker mainly shares the host kernel. If so, then shouldn't it also share the host package management system?

Where this logic falls apart is that on an Ubuntu system we can run a Debian or Alpine Docker image. How is that possible? How can Debian binaries even run on an Ubuntu system?

Lastly, why does Docker need the root filesystem of the OS that it is trying to emulate on disk? What does it mean to have a rootfs of an OS, and how can utilities within it run on another OS? This refers mostly to the video that I linked in my original question. If at all possible, please watch it (I have marked the actual place that generated this question) - you will see why I am getting confused.

--- Post updated at 04:47 PM ---

I didn't get what you explained. Let me tell you what idea I have of the Linux mount process; then maybe you will get a better sense of why I am failing to grasp the concept.

Any block device that the kernel identifies can be mounted at a location in the VFS. That location, which is just a directory, is called a mount point.

Now one question is: can an already mounted device be mounted twice? A partition that is mounted twice? If so, I don't see why it would cause the corruption you say it would. You are writing to the same block device; the kernel just presents it to the user through 2 mount points.

Next, what is a primary mount point? Is it the first mount of the block device? Can a mount point be mounted again? Why would you want to do that?

Where does bind mount fit into all of this?

The filesystem driver can deny a second mount. But in fact ext3, ext4, and xfs allow multiple primary mounts.

# ls -ldi /boot /mnt
     2 drwxr-xr-x  4 root root 3072 May 24  2014 /boot
260609 drwxr-xr-x  5 root root 4096 Sep 25 11:40 /mnt
# df /boot
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1               101146     37017     58907  39% /boot
# mount /dev/sda1 /mnt
# ls -ldi /boot /mnt
2 drwxr-xr-x  4 root root 3072 May 24  2014 /boot
2 drwxr-xr-x  4 root root 3072 May 24  2014 /mnt
# mount | grep /dev/sda1
/dev/sda1 on /boot type ext3 (rw)
/dev/sda1 on /mnt type ext3 (rw)
# umount /mnt

The same exercise with a bind mount:

# mount --bind /boot /mnt
# ls -ldi /boot /mnt
2 drwxr-xr-x  4 root root 3072 May 24  2014 /boot
2 drwxr-xr-x  4 root root 3072 May 24  2014 /mnt
# mount | grep /dev/sda1
/dev/sda1 on /boot type ext3 (rw)
# mount | grep /mnt
/boot on /mnt type none (rw,bind)
# umount /mnt

--- Post updated at 12:41 PM ---

Yes but how is a bind mount different from a normal mount?

According to this question on SO

Bind mounts reflect the directory structure from the source and do not allow modifications on the disk. They are supposed to be part of the live filesystem. But then my question is: what is the difference from a normal mount?

Does this question about bind mounts deserve its own personal thread?

Your example just shows the type as none and an extra bind attribute in the mount output. What does this imply?

In which cases would a bind mount be beneficial? I mean why would you want to use it?

When you want to put a filesystem in several different places in a manner which doesn't depend on symbolic links.

Imagine a chroot for example. A symbolic link to outside the chroot would be pretty useless. A bind mount would still work.

A bind mount works with any directory, not just with mount points.
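
For example, making a host directory visible inside a chroot (the directory names are just for illustration):

mkdir -p /srv/jail/data
mount --bind /home/user/data /srv/jail/data   # the same directory is now also reachable inside the jail
# a symlink /srv/jail/data -> /home/user/data would point nowhere once a process is chrooted to /srv/jail
umount /srv/jail/data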