Ceph Experience

joker · March 16, 2020, 11:47am

Hi,

I was asked whether I could share some additional information about running Ceph "in production" at my site.

Strictly speaking, the small hyperconverged Ceph cluster here is not yet in production use. It has been handed over to the customer, who is building the new setup on the hardware and is still running tests and minor applications, while we are responsible for monitoring, backup and updating the system. Once a week, in a non-critical time window, the virtual machines are automatically live-migrated between the cluster nodes so that the host systems can receive their updates.

Terminology

Node

A node is a physical server that hosts several storage units (OSDs).

OSD

An OSD is a storage unit in Ceph. The name is short for Object Storage Daemon, the process that manages one storage unit.

Hyperconverged

Hyperconverged means that the nodes serve the Ceph cluster and, at the same time, act as compute nodes for virtual machines.

The facts

Node count: 3
OSD count: 12
Locations: 1 (everything at the same location)
Intended usage: Java applications connected to external PostgreSQL database servers
Network connection: 2 × 10 Gbit/s per node (bonding: active/passive)
Switches: Huawei CE6810-32T16S4Q-LI (redundant setup, LACP-connected, active/passive)
OSD disks: Intel SSDSC2KG019T8, 2 TB
Virtualization environment: Proxmox VE (based on Debian Linux 10 "Buster")
Guests: various (Debian, Ubuntu, BSD)
Ceph redundancy factor: 3 (in a healthy system there are 3 copies of each data block)

Further Questions

What happens if an OSD fails?

When an OSD fails, requests continue to be served from the remaining copies, and nothing else happens until a configurable timeout expires. Once the timeout has passed, Ceph restores full redundancy by rebalancing the data onto the remaining OSDs. At any time you can replace the failed OSD (a SATA SSD) and reintegrate it with a simple Ceph command.
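
For illustration, the OSD failure timeline can be written down as a small Python model. It is a simplified sketch, not Ceph code, and it assumes that the configured time value is the monitor option `mon_osd_down_out_interval` with its default of 600 seconds; the phase labels are made up for readability.

```python
# Simplified model (not Ceph code) of what happens after an OSD fails.
# Assumption: the timeout is the monitor option mon_osd_down_out_interval,
# which defaults to 600 seconds.
DOWN_OUT_INTERVAL_S = 600.0


def osd_phase(seconds_since_failure: float,
              down_out_interval: float = DOWN_OUT_INTERVAL_S) -> str:
    """Return an illustrative label for the cluster state after an OSD failure."""
    if seconds_since_failure < 0:
        raise ValueError("seconds_since_failure must be non-negative")
    if seconds_since_failure < down_out_interval:
        # Marked 'down' but still 'in': placement groups run degraded and
        # client I/O is served from the surviving replicas.
        return "down+in (degraded)"
    # Marked 'out': Ceph re-replicates the data onto the remaining OSDs.
    return "down+out (recovering)"


for t in (30, 599, 600, 3600):
    print(t, osd_phase(t))
```

On Nautilus and later, the value in effect should be visible with `ceph config get mon mon_osd_down_out_interval`.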

What happens if a node fails?

When a node fails, its 4 OSDs go down with it. Since the redundancy factor is 3 and there are 3 nodes, every node holds one copy of the data, so the remaining redundancy in this setup is 2. With only three nodes, the third copy cannot be recreated elsewhere until the failed node returns. Overall throughput goes down, but the cluster remains fully functional and still redundant.
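
The redundancy arithmetic above can be made concrete with a few lines of Python. This is a back-of-the-envelope sketch, assuming 2 TB per OSD as listed in the facts, a pool size of 3 and a CRUSH failure domain of host (one copy per node); it ignores Ceph's nearfull/full thresholds and other overhead.

```python
# Capacity/redundancy arithmetic for the cluster described above.
# Assumptions: 2 TB per OSD (nominal), replication size 3, failure domain
# = host (one copy per node). Full ratios and metadata overhead are ignored.
NODES = 3
OSDS_PER_NODE = 4
OSD_SIZE_TB = 2.0
REPLICAS = 3


def usable_tb() -> float:
    """Logical capacity when every block is stored REPLICAS times."""
    return NODES * OSDS_PER_NODE * OSD_SIZE_TB / REPLICAS


def state_with(nodes_up: int) -> dict:
    """OSDs, raw capacity and live copies with `nodes_up` nodes online."""
    return {
        "osds_up": nodes_up * OSDS_PER_NODE,
        "raw_tb": nodes_up * OSDS_PER_NODE * OSD_SIZE_TB,
        # At most one copy per node, so live copies are capped by nodes online.
        "live_copies": min(REPLICAS, nodes_up),
    }


print(usable_tb())
print(state_with(NODES))
print(state_with(NODES - 1))
```

With all three nodes up this gives 12 OSDs, 24 TB raw and about 8 TB of usable space; with one node down, 8 OSDs and 2 live copies of every block remain.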

How is the system monitored?

The system is monitored by our in-house monitoring system (check_mk). Monitoring covers the SMART status of the SSDs, Ceph health and Ceph OSD health. Ceph has its own internal health check system, which can be queried on every node and which our monitoring system polls.
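
As an illustration of feeding Ceph's health status into check_mk, here is a hedged sketch of a check_mk local check written in Python. It is not the plugin used at this site; it assumes the JSON shape printed by `ceph health --format json` on recent releases (a top-level `status` field and a `checks` map whose entries carry `summary.message`), and the service name `Ceph_Health` is hypothetical.

```python
import json

# Map Ceph's overall status to check_mk local-check states
# (0 = OK, 1 = WARN, 2 = CRIT, 3 = UNKNOWN).
STATE = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


def local_check_line(health_json: str, service: str = "Ceph_Health") -> str:
    """Turn `ceph health --format json` output into one local-check line."""
    try:
        health = json.loads(health_json)
        status = health["status"]
    except (ValueError, KeyError, TypeError):
        return f"3 {service} - could not parse ceph health output"
    # Collect the short message of every active health check, if any.
    messages = [
        check.get("summary", {}).get("message", name)
        for name, check in sorted(health.get("checks", {}).items())
    ]
    detail = status if not messages else f"{status}: " + "; ".join(messages)
    # Format: <state> <service> <metrics> <text>; "-" means no metrics.
    return f"{STATE.get(status, 3)} {service} - {detail}"


sample = '{"status": "HEALTH_OK", "checks": {}}'
print(local_check_line(sample))
```

On a node, the input would come from running `ceph health --format json`; keeping the parsing in a pure function makes it easy to test without a cluster.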

How long has the system been running?

The system has been running for 12 months now without any issues so far.

Anything else to report?

Because of a firmware issue, new firmware had to be installed on all SSDs. The servers had to be powered off and on again for the new firmware to become active. Doing so was no problem while the cluster was running.

Like many HA setups, Ceph is sensitive to clock synchronization: the server clocks should be fully synchronized at all times.
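
To make "fully synchronized" a bit more concrete: Ceph's monitors raise a MON_CLOCK_SKEW health warning when their clocks drift apart by more than `mon_clock_drift_allowed`, which defaults to 0.05 seconds. The following is only an illustrative Python check of that idea, assuming per-node offsets in seconds relative to a reference clock; the node names and values are hypothetical.

```python
from typing import Dict, List

# Tolerance mirroring Ceph's mon_clock_drift_allowed default (0.05 s).
DRIFT_ALLOWED_S = 0.05


def skewed_nodes(offsets: Dict[str, float],
                 allowed: float = DRIFT_ALLOWED_S) -> List[str]:
    """Return node names whose absolute clock offset exceeds `allowed`."""
    return sorted(name for name, off in offsets.items() if abs(off) > allowed)


# Hypothetical offsets in seconds, e.g. as reported by an NTP client.
print(skewed_nodes({"node1": 0.0, "node2": -0.003, "node3": 0.12}))
```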

Other notes:

Ceph is not the ideal system for databases, since it adds network latency to storage access. If you need high-performance databases, Ceph might not be the best option. Every database provides its own replication method, and I would prefer that to any external replication method. That said, if you throw a lot of money at high-performance hardware, Ceph can handle that too.

Why and when Ceph is a good choice

Ceph advantages

Ceph is a scalable, thin-provisioned, self-healing, redundant network storage system. From what I've experienced so far, management is very easy, with no complex cluster configuration. (You can, however, define your own placement rules with failure domains of any complexity you like. For example, if your Ceph cluster spans several data centers (DCs), the replicas can be spread so that redundancy is preserved as well as possible when one data center goes offline, instead of the worst case of losing all replicas because they are all located in the offline DC. The algorithm Ceph uses to compute where an object's replicas are stored is called CRUSH, and the description of the topology and rules it works from is the CRUSH map.) If your redundancy allows it, just reboot a node when it needs maintenance and bring it back online; everything resynchronizes automatically.
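
The idea of spreading replicas across failure domains can be illustrated with a toy placement function. This is not CRUSH: it is plain rendezvous (highest-random-weight) hashing, chosen only because it is short, and the data center and host names are hypothetical. Like CRUSH, it is deterministic, so given the same topology every client computes the same location without a central lookup table, and it never puts two replicas in the same data center.

```python
import hashlib
from typing import Dict, List, Tuple


def _score(obj: str, item: str) -> int:
    """Deterministic pseudo-random score for (object, item); highest wins."""
    digest = hashlib.sha256(f"{obj}/{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj: str, topology: Dict[str, List[str]],
          replicas: int) -> List[Tuple[str, str]]:
    """Pick `replicas` hosts for `obj`, each in a different data center."""
    if replicas > len(topology):
        raise ValueError("not enough failure domains for the requested replicas")
    # Rank the data centers for this object and keep the top `replicas` ...
    dcs = sorted(topology, key=lambda dc: _score(obj, dc), reverse=True)[:replicas]
    # ... then pick the best-scoring host inside each chosen data center.
    return [(dc, max(topology[dc], key=lambda h: _score(obj, h))) for dc in dcs]


# Hypothetical topology: three data centers with two hosts each.
topology = {"dc1": ["a1", "a2"], "dc2": ["b1", "b2"], "dc3": ["c1", "c2"]}
print(place("vm-100-disk-0", topology, 3))
```

With three data centers and three replicas, taking any one data center offline still leaves two copies of every object.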

Ceph use cases

There are probably many. The one I use is the one described here: storage for virtualization. It allows thin provisioning, data sharing between cluster nodes and snapshots of virtual machines.