Documentation/md-cluster.txt - codeaurora/cp-linux
blob: 1b794369e03a4ef14099f4ce702fc0d7c65140c6
blame: Kyle Swenson, 8d8f654, 2021-03-15 11:02:55 -0600

The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

A separate write-intent bitmap is used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                     4k                    8k                  12k
-------------------------------------------------------------------
| idle                | md super            | bm super[0] + bits  |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits[2, contd]   | bm super[3] + bits  |
| bm bits[3, contd]   |                     |                     |

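As a rough illustration of the layout above, the sketch below computes where each node's bitmap superblock would start. It assumes, exactly as the figure is drawn and not as a general rule, that every node's bitmap occupies two 4k blocks. The names BLOCK, FIRST_BITMAP_BLOCK, BLOCKS_PER_BITMAP and bitmap_super_offset() are hypothetical and are not taken from the md code.

```python
BLOCK = 4096             # the figure is drawn in 4k units
FIRST_BITMAP_BLOCK = 2   # block 0 is idle, block 1 holds the md superblock
BLOCKS_PER_BITMAP = 2    # assumption: "bm super + bits" plus one "contd" block


def bitmap_super_offset(slot):
    """Byte offset of the bitmap superblock for a zero-based bitmap slot."""
    if slot < 0:
        raise ValueError("bitmap slots start at 0")
    return (FIRST_BITMAP_BLOCK + slot * BLOCKS_PER_BITMAP) * BLOCK
```

Under these assumptions bitmap_super_offset(1) is 16k, which matches the position of "bm super[1] + bits" in the second row of the figure.
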
During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will:
- set the appropriate bit (if not already set)
- commit the write to all mirrors
- schedule the bit to be cleared after a timeout.

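The three steps of a write request can be sketched as a toy model. This is only an illustration: the class NodeBitmap, its methods, the dict-based "mirrors" and the timeout value are all assumptions made for the example and do not reflect the md implementation.

```python
class NodeBitmap:
    """Toy write-intent bitmap for one node, one bit per chunk."""

    def __init__(self):
        self.dirty = set()      # chunks whose bit is currently set
        self.clear_at = {}      # chunk -> time after which the bit may be cleared

    def write(self, chunk, data, mirrors, now, timeout=5):
        self.dirty.add(chunk)                  # set the bit (no-op if already set)
        for mirror in mirrors:                 # commit the write to all mirrors
            mirror[chunk] = data
        self.clear_at[chunk] = now + timeout   # schedule the bit to be cleared

    def expire(self, now):
        """Clear every bit whose timeout has passed."""
        for chunk in [c for c, t in self.clear_at.items() if t <= now]:
            self.dirty.discard(chunk)
            del self.clear_at[chunk]
```

A second write to the same chunk before the timeout simply pushes the clear time back, so the bit stays set for as long as writes may still be in flight.
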
Reads are just handled normally. It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

The bm_lockres protects individual node bitmaps. They are named in the
form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
joins the cluster, it acquires the lock in PW mode and holds it in that
mode for as long as the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem. Since
DLM starts node count from one and bitmap slots start from zero, one is
subtracted from the DLM slot number to arrive at the bitmap slot number.

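The slot arithmetic and the naming pattern can be written out as a small sketch. The function names are hypothetical, and the three-digit name format is taken only from the bitmap001/bitmap002 examples in the text, not from the source code.

```python
def bitmap_slot(dlm_slot):
    """DLM slots count from one, bitmap slots count from zero."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start at 1")
    return dlm_slot - 1


def bitmap_lock_name(node):
    """Lock resource name in the form the text gives: bitmap001, bitmap002, ..."""
    return "bitmap%03d" % node
```
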
3. Communication

Each node has to communicate with other nodes when starting or ending
resync, and when updating the metadata superblock.

3.1 Message Types

There are 3 types of messages which are passed:

3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
updated, and the node must re-read the md superblock. This is performed
synchronously.

3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
so that each node may suspend or resume the region.

3.2 Communication mechanism

The DLM LVB is used to communicate between nodes of the cluster. There
are three resources used for the purpose:

3.2.1 Token: The resource which protects the entire communication
system. The node holding the token resource is allowed to
communicate.

3.2.2 Message: The lock resource which carries the data to
communicate.

3.2.3 Ack: The resource whose acquisition means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving node that a node wants to communicate.

The algorithm is:

1. receive status

   sender                    receiver                  receiver
   ACK:CR                    ACK:CR                    ACK:CR

2. sender get EX of TOKEN
   sender get EX of MESSAGE

   sender                    receiver                  receiver
   TOKEN:EX                  ACK:CR                    ACK:CR
   MESSAGE:EX
   ACK:CR

   Sender checks that it still needs to send a message. Messages received
   or other events that happened while waiting for the TOKEN may have made
   this message inappropriate or redundant.

3. sender write LVB.
   sender down-convert MESSAGE from EX to CW
   sender try to get EX of ACK
   [ wait until all receivers have processed the MESSAGE ]

                             [ triggered by bast of ACK ]
                             receiver get CR of MESSAGE
                             receiver read LVB
                             receiver processes the message
                             [ wait finish ]
                             receiver release ACK

   sender                    receiver                  receiver
   TOKEN:EX                  MESSAGE:CR                MESSAGE:CR
   MESSAGE:CW
   ACK:EX

4. triggered by grant of EX on ACK (indicating all receivers have processed
   the message)
   sender down-convert ACK from EX to CR
   sender release MESSAGE
   sender release TOKEN
                             receiver upconvert to PR of MESSAGE
                             receiver get CR of ACK
                             receiver release MESSAGE

   sender                    receiver                  receiver
   ACK:CR                    ACK:CR                    ACK:CR


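The four steps can be walked through in a single-threaded toy model that only tracks which lock mode each node holds. It is a sketch under heavy assumptions: there is no real DLM, no blocking and no BAST delivery, and the function send_message() and its data structures are invented for the illustration.

```python
def send_message(payload, receivers):
    """Walk one message through steps 1-4, recording the sender's lock modes."""
    sender = {"ACK": "CR"}                            # step 1: all hold ACK:CR
    nodes = {name: {"ACK": "CR"} for name in receivers}
    lvb = {}
    trace = [dict(sender)]

    sender["TOKEN"] = "EX"                            # step 2
    sender["MESSAGE"] = "EX"
    trace.append(dict(sender))

    lvb["data"] = payload                             # step 3: sender writes LVB
    sender["MESSAGE"] = "CW"
    received = {}
    for name, locks in nodes.items():                 # woken by the BAST of ACK
        locks["MESSAGE"] = "CR"
        received[name] = lvb["data"]                  # read LVB, process message
        del locks["ACK"]                              # release ACK
    # EX on ACK can only be granted once no receiver holds ACK any more.
    assert all("ACK" not in locks for locks in nodes.values())
    sender["ACK"] = "EX"
    trace.append(dict(sender))

    sender["ACK"] = "CR"                              # step 4
    del sender["MESSAGE"]
    del sender["TOKEN"]
    for locks in nodes.values():
        locks["MESSAGE"] = "PR"                       # upconvert MESSAGE
        locks["ACK"] = "CR"                           # get CR of ACK again
        del locks["MESSAGE"]                          # release MESSAGE
    trace.append(dict(sender))
    return received, trace, nodes
```

After the last step every node is back to holding only ACK in CR mode, which is the state shown in step 1, so the next sender can start the same sequence.
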
4. Handling Failures

4.1 Node Failure
When a node fails, the DLM informs the cluster of the failed node's slot.
The node that receives this notification starts a cluster recovery thread.
The cluster recovery thread:
- acquires the bitmap<number> lock of the failed node
- opens the bitmap
- reads the bitmap of the failed node
- copies the set bits to the local node's bitmap
- clears the bitmap of the failed node
- releases the bitmap<number> lock of the failed node
- initiates resync of the bitmap on the current node

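The bitmap handling in the recovery thread can be sketched with plain sets standing in for bitmaps. Lock acquisition and release and the resync itself are left out, and recover_failed_node() is a hypothetical name used only for this example.

```python
def recover_failed_node(bitmaps, failed_slot, local_slot):
    """Merge the failed node's set bits into the local bitmap, then clear them.

    bitmaps maps a bitmap slot to the set of dirty chunk numbers of that node.
    Returns the chunks the local node now has to resync on the failed node's
    behalf.
    """
    failed = bitmaps[failed_slot]        # read the bitmap of the failed node
    to_resync = sorted(failed)
    bitmaps[local_slot] |= failed        # copy the set bits to the local node
    failed.clear()                       # clear the bitmap of the failed node
    return to_resync
```
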
The resync process is the regular md resync. However, in a clustered
environment, the node performing a resync needs to tell the other nodes
which areas are suspended. Before a resync starts, the node
sends out RESYNC_START with the (lo,hi) range of the area which needs
to be suspended. Each node maintains a suspend_list, which contains
the list of ranges which are currently suspended. On receiving
RESYNC_START, the node adds the range to the suspend_list. Similarly,
when the node performing resync finishes, it sends RESYNC_FINISHED
to other nodes and other nodes remove the corresponding entry from
the suspend_list.

A helper function, should_suspend(), can be used to check if a particular
I/O range should be suspended or not.

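A minimal sketch of the suspend_list bookkeeping follows. Only the names suspend_list, should_suspend(), RESYNC_START and RESYNC_FINISHED come from the text; the class, the method signatures and the choice of half-open [lo, hi) ranges are assumptions made for the example.

```python
class SuspendList:
    """Ranges that other nodes are currently resyncing."""

    def __init__(self):
        self.ranges = []                    # list of (lo, hi) pairs

    def resync_start(self, lo, hi):
        """Handle RESYNC_START: remember the suspended range."""
        self.ranges.append((lo, hi))

    def resync_finished(self, lo, hi):
        """Handle RESYNC_FINISHED: drop the corresponding entry."""
        self.ranges.remove((lo, hi))

    def should_suspend(self, lo, hi):
        """True if the I/O range [lo, hi) overlaps any suspended range."""
        return any(lo < s_hi and s_lo < hi for s_lo, s_hi in self.ranges)
```

With a suspended range of (100, 200), an I/O covering (150, 160) would be held back while one covering (200, 300) would proceed, because the ranges are treated as half-open here.
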
4.2 Device Failure
Device failures are handled and communicated with the metadata update
routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:

1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
   (Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
   using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
   was found:
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
   disc.number set to slot number)
   ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
   as SpareLocal
9. If node 1 does not get the no-new-devs lock, it fails the operation and
   sends METADATA_UPDATED
10. Other nodes get the information whether a disk is added or not
    by the following METADATA_UPDATED.
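The outcome of steps 5 to 9 depends only on whether every other node found the disk, because an EX request conflicts with any CR lock that is still held. The sketch below models just that decision. The function add_new_disk() and its dict argument are hypothetical, and messaging, ioctls and udev handling are all omitted.

```python
def add_new_disk(found_by_node):
    """Decide whether node 1 can add the disk.

    found_by_node maps each other node's name to True if that node found
    the new disk in userspace, False if it answered with a NACK.
    """
    # Every other node starts out holding no-new-devs in CR mode.
    cr_holders = set(found_by_node)
    for node, found in found_by_node.items():
        if found:
            cr_holders.discard(node)    # step 6: drop CR once the disk is found
    got_ex = not cr_holders             # step 7: EX conflicts with any CR left
    return "added" if got_ex else "failed"
```

In both cases node 1 follows up with METADATA_UPDATED, which is how the other nodes learn the result.
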