Hi.
Here is the information for filesystem /movelv_test. It uses only 42% of its capacity, and each PV holds 20 physical partitions (PPs). As I understand it, that means only 8 PPs on each PV contain data, while the remaining 12 PPs on each PV are assigned to logical volume fslv01 but hold no data. Is this true?
If so, how can I find out exactly which PPs contain data?
Thanks for reading.
Not quite. The filesystem "/movelv_test" resides on a certain Logical Volume, "fslv01". This LV in turn is made from 20 Logical Partitions. Because the LV is mirrored, each LP consists of 2 Physical Partitions. Were the LV not mirrored, it would still be made from 20 LPs, but each LP would consist of only one PP.
You need to understand that there are several abstraction layers and you cannot change between them at will.
No. You seem to think that data fills the raw disk space like water fills a bucket - from the bottom up - but this is not the case. Filesystems are made for random read/write access, and that means that filling them contiguously would be a performance nightmare. Suppose you have two files, A and B, and they are put adjacently on the disk. When you now edit file A and add a single character, you would have no space to write it, so you would need to rewrite the whole file. If there is a way to fragment the file and put the additional character elsewhere, you can leave most of the file in place.
The relationship between disk space and filesystem space is more like the price you pay for something: suppose you purchase a house for some money. You can point to the whole sum and say it bought you the whole house, but the question of which bills bought the western wall just makes no sense.
Of course you can investigate where exactly on the disk a certain file is. But be aware that as long as you use the FS this can change over time; it is by no means fixed.
I don't really understand what you mean. As you said above, data is written randomly to the disk, and the LPs/PPs listed above have no relationship to the filesystem space used/free. The data may reside anywhere on these 40 PPs and there's no way to know where. Is that what you mean?
Yes and no. Data is not written entirely randomly. There is a system behind it, but it is complex. You seem to have assumed that a filesystem which is, say, 50% full and consists of LPs 1-10 will have LPs 1-5 completely full and LPs 6-10 empty. This is (typically) not the case. Typically (depending on the content of your filesystem - many small files, a few large files, or some mix of both - how often the data on disk changes, and some other specifics) your filesystem's space will be scattered, and every LP will be nearly 50% full and nearly 50% empty.
Furthermore, if a FS resides on a LV which consists of 10 LPs, these are ordered 1-10. But that does not have to mean that LP3 is "behind" LP2 on disk. It could well be that LP1 is at PP200, LP2 at PP100 and LP3 at PP150 (with maybe other LVs taking the space in between). For the filesystem the space would still appear to be linear: first the space coming from LP1, then from LP2, etc. The FS is unaware of the fact that going from LP1 to LP2 means skipping many PPs on disk.
Yes, it is possible to find out, but no, there is no 1:1 relation you could use. Say you have some file. If you want to know where the first byte of "/path/to/some/file" is located on the disk, you first find out in which (filesystem) block of your file it lies. Then, using LV methods, you find out on which LP this part of the filesystem really is. After that, you use the LP-PP relation you already posted to find out which PP represents this LP, and finally you use VG methods to locate that PP at a specific place on a specific disk and low-level methods ("dd" and the like) to really read the part of the file you started with.
But having found that out for byte X doesn't mean you could read the next byte on disk and expect to find the content of byte X+1 of the file. This may be so or not; to know for sure you would need to go through the whole procedure again.
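On AIX, most of this drill-down is automated by the "fileplace" command; a minimal sketch (the path is a placeholder, not a file from this thread):

```shell
# Show which filesystem blocks a file occupies and where they live.
# "/path/to/some/file" is a placeholder - use your own file.
fileplace -lv /path/to/some/file   # logical view: which LV fragments hold the file
fileplace -pv /path/to/some/file   # physical view: hdisk and PP ranges, incl. mirror copies
```

Run it twice a few days apart on a busy FS and you will likely see the placement change, just as described above.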
Are you wanting to shrink the filesystem, replace a disk, or something similar? I'm wondering why you are so interested.
Shrinking a filesystem on AIX is not as simple a task as it can be on some other operating systems. For instance, RHEL 6 has an option to resize the filesystem and logical volume up or down all in one command, although it requires the filesystem to be unmounted as part of the process.
On AIX you would need to either:-
Build a new filesystem of the required size, move the data and then remount the new filesystem at the correct location (making the change permanent for the next boot etc.)
Backup the data, destroy and re-create the filesystem and then restore the data.
If you are looking to free up or replace a disk, then migratepv is your friend here. It can take a while depending on how big your filesystem is, but it will (one at a time) copy the LPs to the target drive and then remove the old ones. Do not interrupt the process, as this can leave you with inconsistencies in the volume group and logical volume information.
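A minimal sketch of both variants (disk and LV names are taken from this thread as examples, not a recommendation for your exact layout):

```shell
# Move only the PPs belonging to one LV from hdisk1 to hdisk3:
migratepv -l fslv01 hdisk1 hdisk3

# Or empty hdisk1 completely, moving all of its PPs to hdisk3:
migratepv hdisk1 hdisk3
```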
An alternative would be to use mirrors to achieve this if placement is critical. You can use mklvcopy to add a third copy on the new disk (and synchronise them) and then use rmlvcopy, specifying the copy on the old disk for removal. When making the 3rd copy, you can specify a map file to force it to use the PPs on whichever disk in whichever order you want, if that is important.
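Sketched with the names from this thread (the map file path is hypothetical, and the map file itself is optional):

```shell
# Add a third copy of fslv01 on hdisk3; -m pins the new copy to the PPs
# listed in the (hypothetical) map file instead of letting the LVM choose.
mklvcopy -m /tmp/fslv01.map fslv01 3 hdisk3
syncvg -l fslv01          # synchronise the new copy

# When the copies are in sync, drop the copy residing on the old disk:
rmlvcopy fslv01 2 hdisk1
```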
Whichever technique you use, you can empty a disk and remove it from the volume group for replacement or use elsewhere.
If I've missed the point, then let us know what you really need to know and we will try to help.
Actually this is correct only historically. Since (IIRC) version 5.3 it is possible to shrink a filesystem. Initially it was possible only if the end of the underlying LV (from where the shrinking took place) was not already used by files, but since then AIX seems to do a "reorgfs" automatically. The only restriction (obviously) is that the resulting size must be big enough to hold all the existing data.
To shrink a FS (and the underlying LV as well) do, analogous to a growth:
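For example, to take 1 GB away from the filesystem in this thread (JFS2 only; the amount is just an illustration):

```shell
# The inverse of the usual growth command "chfs -a size=+1G ...":
# a negative size shrinks the FS and the underlying LV. The FS can stay mounted.
chfs -a size=-1G /movelv_test
```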
Hehe, how did you know? It's true that I'm having a problem with a filesystem, but it's a little different from what you think. In fact, I want to spread the data of a filesystem across multiple hard disks equally.
It would be quite a waste of lines to list all the LPs/PPs of the real filesystem I'm handling, so allow me to use /movelv_test as an example.
( Don't be surprised that it looks like this. At first, the VG of this FS had only 2 PVs, hdisk1 and hdisk2. The data grew day by day, and when there were no free PPs left on hdisk1 and hdisk2, we added two more PVs, hdisk3 and hdisk4, to the VG and then increased the /movelv_test size. Now this FS's Used% is ~55%: 180 GB assigned to the FS, 100 GB of data used. )
The problem is that when I checked the disk activity (with the "topas" command), I found that the utilization of hdisk1 and hdisk2 is very high, ~90%-100%, while that of hdisk3 and hdisk4 is only ~10% (most of the disk activity is writing - the application writes logs to the FS's mountpoint).
The result of the "topas" command and the LPs/PPs listed above made me think that data fills the raw disk space like water fills a bucket - as bakunin said - but now I know that's not true.
Now what I want to do is balance the disk activity across these 4 PVs (just for now; let's temporarily not worry about future data growth), and I intend to do it like this:
_ Decrease the size of the FS so that the FS's Used% is ~99% (to ensure that nearly every remaining PP is filled with data, so I no longer have to care which individual PPs contain data) - in this case, decrease the size of /movelv_test to 101 GB
_ Use the "migratelp" command to rearrange the PPs of fslv01 like this
_ Increase the size of the FS back to its original value (in this case 180 GB)
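For reference, the general shape of such a migratelp call (the LP number, copy number and target PP below are made-up values for illustration):

```shell
# Syntax: migratelp LVname/LPartNumber[/Copynumber] DestPV[/PPartNumber]
# Example: move copy 1 of logical partition 5 of fslv01 to PP 120 on hdisk3.
migratelp fslv01/5/1 hdisk3/120
```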
And I have two questions:
Is it OK to use the "migratelp" command for something like this? Are data availability and the synchronization between the 1st and 2nd mirror copies preserved? Could there be filesystem or data errors after using this command? If 1 of the 4 hard disks fails, is the data still OK?
Does this solution work? Are there any better solutions?
It is perfectly OK, and in fact the command was made for exactly this purpose. Still, I think you do not need it; see below.
This - the splitting you intend - makes sense only in a specific kind of situation, so please describe your hdisk devices a bit better. What are they (single SCSI disks, RAID sets, LUNs from a SAN, ...) and how do you access them?
First a little theory, so that you can understand the output better:
Here you see several of the "layers" I talked about in a previous post at work: the LV consists of LPs (leftmost column) numbered 0001, 0002, 0003, ..., and the space in this LV is continuous. That means that when byte X is the last byte in LP 0001, then byte X+1 is the first byte in LP 0002. Now, LP 0001 in fact consists of two PPs which hold identical copies: PP 0175 on hdisk1 and PP 0111 on hdisk2. Similarly for all the other LPs.
The first question is whether you really need the LV copies. Writing mirrored copies in parallel is a tad slower than writing to a single LV (without copies), and it might help performance if you do away with the mirroring. You will have to decide whether the loss of safety outweighs the gain in performance or the other way round.
Second, you can also place the LPs in this way (schematically):
The LVM of AIX has a special provision for doing that, called (somewhat counterintuitively) the "Inter-Policy"; look here:
# lslv -l mylv
mylv:/some/where
PV COPIES IN BAND DISTRIBUTION
hdiskpower2 1670:000:000 24% 270:410:409:409:172
# lslv -L mylv
LOGICAL VOLUME: mylv VOLUME GROUP: myvg
LV IDENTIFIER: 00534b7a00004c000000011cd5e0067e.1 PERMISSION: read/write
VG STATE: active/complete LV STATE: opened/syncd
TYPE: jfs2 WRITE VERIFY: off
MAX LPs: 4096 PP SIZE: 512 megabyte(s)
COPIES: 1 SCHED POLICY: parallel
LPs: 1670 PPs: 1670
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: middle UPPER BOUND: 32
MOUNT POINT: /some/where LABEL: /some/where
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
INFINITE RETRY: no
At "minimum" the LVM places the LPs on the PPs in such a way that the minimum possible number of hdisks is involved. At "maximum" it will try to spread them over as many hdisks as possible, thus arriving at a placement similar to what I have sketched out above.
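For an existing LV the inter-policy can be changed with "chlv" and the placement then reorganised; a sketch (the VG name is a placeholder):

```shell
# Set the inter-PV allocation policy of fslv01 to "maximum" (-e x),
# then ask the LVM to redistribute its partitions accordingly.
# Note: reorgvg needs free PPs in the VG and can run for a long time.
chlv -e x fslv01
reorgvg myvg fslv01    # "myvg" stands for your volume group
```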
Note, by the way, that the smaller the PP size you use (this is a property of the VG, so you might have to create it anew and start over from scratch), the better the effect. You are ultimately trying to use the internal cache of the hdisks for maximum effect, and the PPs should be small enough to fit into this cache. 512 MB, as in my example, would be too big for that.
If you really have to create the VG anew, it might also be a good idea to create a RAID set (faster than single disks but slower than a stripe set) or even a stripe set (faster than the RAID but lacking its safety). See this little tutorial for details about RAIDs, striping, etc.
You can fine-tune the creation process of the LV even more by using a so-called "map file" and the "-m" switch (see the man page of "mklv" for details). Basically, you can explicitly state which PP(s) should represent any given LP of the LV that way.
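A map file simply lists, one line per entry and in LP order, which PPs shall hold the LV; a hypothetical sketch (disk names and PP ranges are invented for illustration):

```shell
# Format of each line: PVname:PPnum  or  PVname:PPnum1-PPnum2
cat > /tmp/fslv01.map <<'EOF'
hdisk1:100-104
hdisk3:200-204
hdisk2:100-104
hdisk4:200-204
EOF

# Create a 20-LP LV whose partitions land exactly on the mapped PPs:
mklv -y fslv01 -m /tmp/fslv01.map myvg 20
```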
Further, please tell us which kind of data the FS holds. You already said "mostly writing log files", but a little more detail would help: many small files, or a few very large files? Do the files change often, or are they mostly appended to? How often are files deleted and recreated (as in log rotation)? How many processes typically write concurrently to the FS? It might be that you can gain a lot with different OS tuning parameters without even having to change the disk layout.
Finally, about "migratepv": if you really need it (which is, as shown above, not certain IMHO), you will want to remove any mirror copy from the LV prior to migrating its PPs around. It is simply half the work, because every PP has to be moved separately. Unmirror the LV you want to migrate, do the migration, and when that is done create a new mirror. You can use a map file (see above, the "-m" switch) for this too.
Can I also raise a concern about your plan to spread LPs from both copies over the same disks? What would happen if you lost hdisk3, for instance? You might end up with two copies, both with some missing LPs, and a lot of hard work to extricate yourself.
I've not found any mention of reducing FS size in the manual page for chfs on AIX 6. The details about the size= part are below:-
The process to reduce the FS does work, though:
# lslv robin_test_lv
LOGICAL VOLUME: robin_test_lv VOLUME GROUP: rebsvg
LV IDENTIFIER: 00cfe9f500004c000000012bcecff0de.18 PERMISSION: read/write
VG STATE: active/complete LV STATE: opened/syncd
TYPE: jfs2 WRITE VERIFY: off
MAX LPs: 512 PP SIZE: 32 megabyte(s)
COPIES: 2 SCHED POLICY: parallel
LPs: 32 PPs: 64
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: edge UPPER BOUND: 32
MOUNT POINT: /robin_test_fs LABEL: /robin_test_fs
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
INFINITE RETRY: no
# chfs -a size=-32M /robin_test_fs
Filesystem size changed to 2031616
# lslv robin_test_lv
LOGICAL VOLUME: robin_test_lv VOLUME GROUP: rebsvg
LV IDENTIFIER: 00cfe9f500004c000000012bcecff0de.18 PERMISSION: read/write
VG STATE: active/complete LV STATE: opened/syncd
TYPE: jfs2 WRITE VERIFY: off
MAX LPs: 512 PP SIZE: 32 megabyte(s)
COPIES: 2 SCHED POLICY: parallel
LPs: 31 PPs: 62
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: edge UPPER BOUND: 32
MOUNT POINT: /robin_test_fs LABEL: /robin_test_fs
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
INFINITE RETRY: no
How delighted am I? :):):):):)
Given that you can add a PP by using chfs -a size=+1, I tried chfs -a size=-1 /test_robin_fs and got an error, which I eventually deciphered: the reduction appears to need to be given in multiples of the PP size.
Sorry for replying so late. It took me a few days to check the data I/O before doing the migration, and a few days more to check the result afterwards.
They are SCSI disk drives.
Yes, many small files, and these small files are compressed into a tar file at the end of each day. Depending on the kind of log, these tar files are deleted after 1 week, 1 month or 1 year.
This is what I've done over the last few days. I used "lvmstat" to collect I/O statistics for each LP/PP of fslv00.
You can see that, although the "KB-Writ" of hdisk4 and hdisk5 is higher than that of hdisk0 and hdisk3, the "Busy%" of hdisk0 and hdisk3 is higher than that of hdisk4 and hdisk5.
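For anyone following along, per-partition statistics have to be switched on first; the typical sequence looks roughly like this (interval and count are examples):

```shell
lvmstat -l fslv00 -e      # enable statistics collection for the LV
lvmstat -l fslv00 5 10    # 10 per-LP reports at 5-second intervals
lvmstat -l fslv00 -d      # disable collection again when finished
```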
Of course it is. If systems administration were simple, we wouldn't be the heroes of the whole IT business, would we? So welcome to the job with the biggest demands and the greatest rewards our industry has to offer.
Well, this is something we can build on. I can help you better when I return to the office (and the documentation: there is a lot of detail I do not know off the top of my head).
A little general information about the disk statistics and what they mean:
Every disk has a "command queue": read and write requests are buffered and then worked on one after the other. If the queue is full, the disk will not accept more commands until there is room in the queue again. Keep this in mind for a moment.
The OS now asks every disk (in this regard, "disk" means everything with an "hdisk" device - it does not have to be a physical disk but can also be a LUN, a RAID set, ...) in turn whether its queue has length 0 at the moment, which the disk answers with "yes" (= length 0) or "no" (any length other than 0). From many of these answers the OS compiles a percentage, which is shown as "disk busy %".
This means that "disk busy" is not as important as you think, and it has no meaning by itself. If a queue is "not at length 0", it can have length 1 or length 15. The value is interesting because it gives you a measure of how many accesses a disk experiences, but you cannot derive the throughput of a device from that value alone. Disk operations come in different sizes, varying from 512 bytes (one disk block) up to several GB. The busy% will not tell you which of these is outstanding, just that something is outstanding at all.
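To see how deep the queue really gets (rather than just whether it is non-empty), the extended drive report of iostat helps; a sketch:

```shell
# -D prints the extended report per hdisk: service times (avgserv),
# average wait-queue size (avgwqsz) and how often the service queue
# filled up (sqfull), which is far more telling than busy% alone.
iostat -D hdisk0 5 3    # 3 reports, 5 seconds apart
```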
You might consider balancing for "I/O" rather than for "busy%", but you might even do something different. In the meantime, I suggest you read my little introduction to performance tuning, with emphasis on the I/O tuning. I will come back to this thread once I am back in the office (next Monday).