How to know exactly which physical partition contains data?

Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv01        5.00      2.90   42%      776     1% /movelv_test
fslv01:/movelv_test
LP    PP1  PV1               PP2  PV2               PP3  PV3
0001  0175 hdisk1            0111 hdisk0            
0002  0176 hdisk1            0112 hdisk0            
0003  0177 hdisk1            0113 hdisk0            
0004  0178 hdisk1            0114 hdisk0            
0005  0179 hdisk1            0115 hdisk0            
0006  0180 hdisk1            0116 hdisk0            
0007  0181 hdisk1            0117 hdisk0            
0008  0182 hdisk1            0118 hdisk0            
0009  0183 hdisk1            0119 hdisk0            
0010  0184 hdisk1            0120 hdisk0            
0011  0185 hdisk1            0121 hdisk0            
0012  0186 hdisk1            0122 hdisk0            
0013  0187 hdisk1            0123 hdisk0            
0014  0188 hdisk1            0124 hdisk0            
0015  0189 hdisk1            0125 hdisk0            
0016  0190 hdisk1            0126 hdisk0            
0017  0191 hdisk1            0127 hdisk0            
0018  0192 hdisk1            0128 hdisk0            
0019  0193 hdisk1            0129 hdisk0            
0020  0194 hdisk1            0130 hdisk0

Hi.
This is information about the filesystem /movelv_test. It only uses 42% of its capacity / 20 physical partitions (PP) on each PV. As I understand it, that means only about 8 PPs on each PV contain data, while the remaining 12 PPs on each PV are assigned to the logical volume fslv01 but hold no data. Is this true?
If yes, how can I know exactly which PPs contain data?
Thanks for reading.

Not quite. The filesystem "/movelv_test" resides on a certain Logical Volume, "fslv01". This LV in turn is made up of 20 Logical Partitions. Because the LV is mirrored, each LP comprises two Physical Partitions. Were the LV not mirrored, it would still be made of 20 LPs, but each LP would then consist of only one PP.

You need to understand that there are several abstraction layers and you cannot change between them at will.

No. You seem to think that data fills the raw disk space like water fills a bucket - from the bottom up - but this is not the case. Filesystems are made for random read/write access, and that means filling them contiguously would be a performance nightmare. Suppose you have two files, A and B, and they are put adjacently on the disk. When you now edit file A and add a single character, you have no space to write it, so you would need to rewrite the whole file. If there is a way to fragment that file and put the additional character elsewhere, you can leave most of the file in place.

The relationship between disk space and filesystem space is more like the price you pay for something: suppose you purchase a house for some amount of money. You can point to the whole sum and say it bought you the whole house, but the question of which bills bought the western wall just makes no sense.

Of course you can investigate exactly where on which part of the disk a certain file resides. But be aware that as you use the FS this can change over time; it is by no means fixed.
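
On AIX the fileplace command does exactly this kind of investigation: it shows which filesystem blocks a file occupies and, with -p, where they sit on the underlying physical volumes. A minimal sketch (the file name is hypothetical):

# logical and physical placement of one file's blocks
fileplace -pv /movelv_test/somefile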

I hope this helps.

bakunin

I don't really understand what you mean. As you said above, data is written randomly to the disk, and the LPs/PPs listed above have no relationship to the filesystem space used/free. The data may reside anywhere on these 40 PPs and there's no way to know where. Is that what you mean?

Yes and no. Data is not written entirely randomly. There is a system behind it, but it is complex. You seem to have assumed that a filesystem which is, say, 50% full and consists of LPs 1-10 will have LPs 1-5 completely full and LPs 6-10 empty. This is (typically) not the case. Typically (depending on the content of your filesystem - many small files or a few large files or some mix of both, how often the data on disk changes, and some other specifics) your filesystem's space will be scattered, and every LP will be roughly 50% full and roughly 50% empty.

Furthermore, if a FS resides on an LV which consists of 10 LPs, these are ordered 1-10. But that does not have to mean that LP3 is "behind" LP2 on disk. It could well be that LP1 is at PP200, LP2 is at PP100 and LP3 is at PP150 (with maybe other LVs taking the space in between). For the filesystem the space will still appear to be linear: first the space coming from LP1, then from LP2, and so on. The FS is unaware of the fact that going from LP1 to LP2 means skipping many PPs on disk.

Yes, it is possible to find out, but no, there is no 1:1 relation you could use. Say you have some file. If you want to know where the first byte of the file "/path/to/some/file" is located on the disk, you first find out in which (filesystem) block of the FS that byte lives. Then, using LV methods, find out on which LP this part of the filesystem really is. After this, you use the LP-PP relation you already posted to find out which PP represents this LP, and finally you use VG methods to locate that PP at a specific place on a specific disk and use low-level tools ("dd" and the like) to really read the part of the file you started with.
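
The LV and VG layers of that walk can be done with standard commands; a sketch (fileplace, shown earlier, covers the filesystem-block step; the LV and disk names are the ones from your listing):

# LP-to-PP map of the logical volume (the table you already posted)
lslv -m fslv01
# PP allocation as seen from one physical volume
lspv -M hdisk1 | grep fslv01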

But having found that out for byte X doesn't mean that you can read the next byte on disk and expect to find the content of byte X+1 of the file. That may or may not be the case, but to know it you would need to go through the whole procedure again.

I hope this helps.

bakunin

Are you wanting to shrink the filesystem, replace a disk or something? I'm wondering why you are so interested.

Shrinking a filesystem on AIX is not as simple a task as it can be on some other operating systems. For instance, RHEL 6 has an option to resize the filesystem and logical volume up or down in one command, although it requires the filesystem to be unmounted as part of the process.

On AIX you would need to either:-

  1. Build a new filesystem of the required size, move the data and then remount the new filesystem to the correct location (making the change permanent for next boot etc.)
  2. Backup the data, destroy and re-create the filesystem and then restore the data.

If you are looking to free up or replace a disk, then migratepv is your friend here. It can take a while depending on how big your filesystem is, but it will (one at a time) copy the LPs to the target drive and then remove the old copies. Do not interrupt the process, as this can leave you with inconsistencies in the volume group and logical volume information.

migratepv -l logical_volume  source_disk  target_disk

An alternative would be to use mirrors to achieve this if placement is critical. You can use mklvcopy to add a third copy on the new disk (and synchronise them) and then use rmlvcopy, specifying that the copy on the old disk be removed. When making the 3rd copy, you can specify a map file to force it to use the PPs on whichever disk in whichever order you want, if that is important.
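
A minimal sketch of that sequence, assuming the new disk is hdisk5 and the disk to be vacated is hdisk1 (both names illustrative):

# add a third copy on the new disk (a map file can be given with -m)
mklvcopy fslv01 3 hdisk5
# synchronise the new copy
syncvg -l fslv01
# drop back to two copies, removing the copy that lives on the old disk
rmlvcopy fslv01 2 hdisk1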

Whichever technique you use, you can empty a disk and remove it from the volume group for replacement or use elsewhere.

If I've missed the point, then let us know what you really need to know and we will try to help.

Regards,
Robin

Actually this is correct only historically. Since (IIRC) version 5.3 it has been possible to shrink a filesystem. Initially it was possible only if the end of the underlying LV (from where the shrinking took place) was not already occupied by files, but since then AIX seems to do a "reorgfs" automatically. The only restriction (obviously) is that the resulting size must be big enough to hold all the existing data.

To shrink a FS (and the underlying LV as well), do the following, analogous to growing it:

chfs -a size=-<some size> /path/to/mountpoint

I hope this helps.

bakunin

Hehe, how did you know? It's true that I'm having a problem with a filesystem, but it's a little different from what you think. In fact, I want to split the data of a filesystem across multiple hard disks equally.
It would be a waste of lines to list all the LPs/PPs of the real filesystem I'm handling, so allow me to use /movelv_test as an example.

This is how it looks:

fslv01:/movelv_test
LP    PP1  PV1               PP2  PV2             
0001  0175 hdisk1            0111 hdisk2            
0002  0176 hdisk1            0112 hdisk2           
0003  0177 hdisk1            0113 hdisk2            
0004  0178 hdisk1            0114 hdisk2           
0005  0179 hdisk1            0115 hdisk2            
0006  0180 hdisk1            0116 hdisk2            
0007  0181 hdisk1            0117 hdisk2           
0008  0182 hdisk1            0118 hdisk2           
0009  0183 hdisk1            0119 hdisk2           
0010  0184 hdisk1            0120 hdisk2          
0011  0185 hdisk3            0121 hdisk4          
0012  0186 hdisk3            0122 hdisk4            
0013  0187 hdisk3            0123 hdisk4           
0014  0188 hdisk3            0124 hdisk4           
0015  0189 hdisk3            0125 hdisk4           
0016  0190 hdisk3            0126 hdisk4            
0017  0191 hdisk3            0127 hdisk4            
0018  0192 hdisk3            0128 hdisk4            
0019  0193 hdisk3            0129 hdisk4           
0020  0194 hdisk3            0130 hdisk4

( Don't be surprised that it looks like that. At first, the VG of this FS had only 2 PVs, hdisk1 and hdisk2. The data grew day by day, and when there were no free PPs left on hdisk1 and hdisk2, we added two more PVs, hdisk3 and hdisk4, to the VG and then increased the /movelv_test size. Now this FS's Used% is ~55%: 180 GB assigned to the FS / 100 GB of data used. )

The problem is that when I checked the disk activity (with the "topas" command), I found that the activity of hdisk1 and hdisk2 is very high, ~90%-100%, while that of hdisk3 and hdisk4 is only ~10% (most of the disk activity is writing - the application writes logs to the FS's mount point).

The result of the "topas" command and the lp/pp listed above made me think that data fills the raw disk space like water fills a bucket - as bakunin said, but now I know it's not true.

Now, what I want to do is balance the disk activity of these 4 PVs (just at this time - temporarily I don't care about future data growth), and I intend to do it like this:
_ Decrease the size of the FS so that the FS's Used% is ~99% (to ensure that most PPs are filled with data, so I don't have to care which individual PP holds data and which doesn't) - in this case, decrease the size of /movelv_test to 101 GB
_ Use the "migratelp" command to arrange the PPs of fslv01 like this:

fslv01:/movelv_test
LP    PP1  PV1               PP2  PV2             
0001  0175 hdisk1            0111 hdisk2            
0002  0176 hdisk1            0112 hdisk2            
0003  0177 hdisk1            0113 hdisk2            
0004  0178 hdisk1            0114 hdisk2            
0005  0179 hdisk1            0115 hdisk2            
0006  0180 hdisk2            0116 hdisk3            
0007  0181 hdisk2            0117 hdisk3            
0008  0182 hdisk2            0118 hdisk3            
0009  0183 hdisk2            0119 hdisk3           
0010  0184 hdisk2            0120 hdisk3            
0011  0185 hdisk3            0121 hdisk4            
0012  0186 hdisk3            0122 hdisk4            
0013  0187 hdisk3            0123 hdisk4            
0014  0188 hdisk3            0124 hdisk4            
0015  0189 hdisk3            0125 hdisk4            
0016  0190 hdisk4            0126 hdisk1            
0017  0191 hdisk4            0127 hdisk1            
0018  0192 hdisk4            0128 hdisk1            
0019  0193 hdisk4            0129 hdisk1            
0020  0194 hdisk4            0130 hdisk1

_ Increase the size of the FS back again (in this case to 180 GB)

And I have two questions

  1. Is it OK to use the "migratelp" command to do something like this? Are data availability and the synchronization between the 1st mirror and the 2nd mirror still OK? Are there any filesystem errors, data errors, etc. after using this command? If 1 of the 4 hard disks fails, is the data still OK?
  2. Does this solution work? Are there any better solutions?

Sorry for my poor English. Thanks for reading.

Let me restructure your questions a bit:

It is perfectly OK and in fact the command was made for exactly this purpose. Still, i think you do not need it, see below.

This - the splitting you intend - makes sense only in a specific kind of situation, so please describe your hdisk-devices a bit better. What are they (single SCSI-disks, RAID-sets, LUNs from SAN, ...) and how do you access them?

First a little theory, so that you can understand the output better:

Here you see several of the "layers" i talked about in a previous post in action: the LV consists of LPs (leftmost column) numbered 0001, 0002, 0003, ... and the space in this LV is contiguous. That means that when byte nr. X is the last byte in LP 0001, then byte nr. X+1 is the first byte in LP 0002. Now, LP 0001 in fact consists of two PPs, which hold identical copies: PP 0175 on hdisk1 and PP 0111 on hdisk2. Similarly for all the other LPs.

The first question is whether you really need the LV copies. Writing the mirrored copies, even in parallel, is a tad slower than writing to a single LV (without copies), and it might help performance if you do away with the mirroring. You will have to decide whether the loss of security outweighs the gain in performance or the other way round.

Second, you could also place the LPs in this way (schematically):

LP    PP1  PV1
0001  0001 hdisk1
0002  0001 hdisk2
0003  0001 hdisk3
0004  0001 hdisk4
0005  0002 hdisk1
0006  0002 hdisk2
0007  0002 hdisk3
0008  0002 hdisk4
[...]

The LVM of AIX has a special provision for doing that, called (somewhat counterintuitively) "Inter-Policy", look here:

# lslv -l mylv
mylv:/some/where
PV                COPIES        IN BAND       DISTRIBUTION  
hdiskpower2       1670:000:000  24%           270:410:409:409:172 
# lslv -L mylv
LOGICAL VOLUME:     mylv                   VOLUME GROUP:   myvg
LV IDENTIFIER:      00534b7a00004c000000011cd5e0067e.1 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            4096                   PP SIZE:        512 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                1670                   PPs:            1670
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    32
MOUNT POINT:        /some/where            LABEL:          /some/where
MIRROR WRITE CONSISTENCY: on/ACTIVE                              
EACH LP COPY ON A SEPARATE PV ?: yes                                    
Serialize IO ?:     NO                                     
INFINITE RETRY:     no

At "minimum" the LVM places the LPs on the PPs in a way so that the minimum possible hdisks are involved. At "maximum" it will try to spread it over as many hdisks as possible, thus arriving at a placement similar to what i have sketched out above.

Note, btw, that the smaller the PP size you use (this is a property of the VG, so you might have to create the VG anew and start over from scratch), the better the effect. You are ultimately trying to use the internal cache of the hdisks to maximum effect, and the PPs should be small enough to fit into this cache. 512 MB, as in my example, would be too big for that.
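
If you do recreate the VG, the PP size is fixed at creation time with the -s flag of mkvg; a sketch (VG name, disks and the 64 MB value are purely illustrative):

# scalable VG with 64 MB physical partitions
mkvg -S -s 64 -y newvg hdisk1 hdisk2 hdisk3 hdisk4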

If you really have to create the VG anew, it might also be a good idea to create a RAID set (faster than single disks but slower than a stripe set) or even a stripe set (faster than the RAID but lacking the security of the RAID). See this little tutorial for details about RAIDs, striping, etc.

You can fine-tune the creation process of the LV even further by using a so-called "map file" with the "-m" switch (see the man page of "mklv" for details). Basically, you can explicitly state which PP(s) should represent any given LP of the LV that way.
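
A sketch of what such a map file looks like - one "PVname:PPnumber" line per partition, in LP order (all names and numbers illustrative):

cat /tmp/fslv01.map
hdisk1:175
hdisk2:111
hdisk3:185
hdisk4:121

# create a 4-LP logical volume placed exactly as the map dictates
mklv -m /tmp/fslv01.map -y fslv01 somevg 4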

Further, please tell us what kind of data the FS holds. You already said "mostly writing log files", but a little more detail would help: many small files, or a few very large files? Do the files change often, or are they mostly appended to? How often are files deleted and recreated (as in log rotation)? How many processes typically write concurrently to the FS? It might be that you can gain a lot with different OS tuning parameters without even having to change the disk layout.

At last, about "migratepv": if you really need it (which is, as shown above, not sure IMHO) you will want to remove any mirror copy from the LV prior to migrating its PPs around. It is simply only half the work because every PP has to moved separately. Unmirror the LV you want to migrate, do the migration and when that is done create a new mirror. You can use a map file (see above, the "-m" switch) for this too.

I hope this helps.

bakunin

Can I also raise a concern about your plan to spread LPs from both copies over the same disks? What would happen if you lost hdisk3, for instance? You might end up with two copies, both with some missing LPs, and a lot of hard work to extricate yourself. :eek:

I've not found a mention of reducing FS size in the manual page for chfs on AIX 6. The details about the size= part are below:-

The process to reduce the FS does work though :-)

# lslv robin_test_lv
LOGICAL VOLUME:     robin_test_lv          VOLUME GROUP:   rebsvg
LV IDENTIFIER:      00cfe9f500004c000000012bcecff0de.18 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                32                     PPs:            64
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       edge                   UPPER BOUND:    32
MOUNT POINT:        /robin_test_fs         LABEL:          /robin_test_fs
MIRROR WRITE CONSISTENCY: on/ACTIVE                              
EACH LP COPY ON A SEPARATE PV ?: yes                                    
Serialize IO ?:     NO                                     
INFINITE RETRY:     no            

# chfs -a size=-32M /robin_test_fs
Filesystem size changed to 2031616

# lslv robin_test_lv                   
LOGICAL VOLUME:     robin_test_lv          VOLUME GROUP:   rebsvg
LV IDENTIFIER:      00cfe9f500004c000000012bcecff0de.18 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                31                     PPs:            62
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       edge                   UPPER BOUND:    32
MOUNT POINT:        /robin_test_fs         LABEL:          /robin_test_fs
MIRROR WRITE CONSISTENCY: on/ACTIVE                              
EACH LP COPY ON A SEPARATE PV ?: yes                                    
Serialize IO ?:     NO                                     
INFINITE RETRY:     no  

How delighted am I? :) :) :) :) :) :)

Given that you can add a PP by using chfs -a size=+1, I tried chfs -a size=-1 /test_robin_fs and got an error, which I eventually deciphered: the reduction appears to need to be given in multiples of the PP size.

Robin

Sorry for replying so late. It took me a few days to check the data I/O to do the migrate, and a few days more to check the result after the migration.

They are SCSI disk drives.

Yes, many small files, and these small files are compressed into a tar file at the end of each day. Depending on the kind of log, the tar files are deleted after 1 week, 1 month or 1 year ...

This is what I've done for the last few days. I used "lvmstat" to collect I/O statistics for each LP/PP of fslv00.
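
For reference, per-LV statistics have to be switched on before lvmstat reports anything; the kind of invocation behind the listing below looks roughly like this (a sketch):

# enable statistics collection for the logical volume
lvmstat -l fslv00 -e
# per-logical-partition I/O counters
lvmstat -l fslv00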

Log_part  mirror#  iocnt   Kb_read   Kb_wrtn      Kbps
       1       1  993851   1208612   3869488      3.01
       1       2  892710    595072   3869440      2.65
       2       1  494569   1453116   2206252      2.17
       3       1  484349   1480348   2106752      2.13
      94       1  480105   1716412   2667692      2.60
       4       1  441866   1397696   2095044      2.07
       2       2  401524    746264   2206260      1.75
      94       2  395993    994520   2667688      2.17
      66       1  394574   1960732   2910004      2.89
       3       2  385884    783876   2106748      1.71
      66       2  378052   1193540   2909996      2.43
      93       1  363708   1538244   2186696      2.21
       4       2  360660    762820   2095048      1.69
      27       1  359079   1828708   2038160      2.29
      11       1  312996   1613992   2138716      2.22
      93       2  290093    866932   2186692      1.81
      27       2  283805    916232   2038156      1.75
      28       1  281631   1596708   1727532      1.97
       9       1  267060   1564664   1815164      2.00
      11       2  266783   1002556   2138700      1.86
       5       1  264425   1067524   1360380      1.44
      10       1  257942   1689472   1991040      2.18
      25       1  236913   1244748   1377272      1.55
      15       1  231909   1622516   1983764      2.14
      13       1  229836   1634964   2080424      2.20
      28       2  221740    777552   1727532      1.48
      12       1  219561   1570800   1824384      2.01

Then I used this information to split the LPs/PPs across the disks, trying to balance them.

And this is the result

Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp  40
hdisk5   51.5     1.4K   97.5     0.0     1.4K  PgspOut       0  % Client   40
hdisk4   54.0     1.4K   96.5     0.0     1.4K  PageIn        0
hdisk1   47.5   282.5    81.5     0.0   282.5   PageOut     378  PAGING SPACE
hdisk3   41.0   282.5    81.5     0.0   282.5   Sios        381  Size,MB   28672
cd0       0.0     0.0     0.0     0.0     0.0                    % Used      0
hdisk0    0.0     0.0     0.0     0.0     0.0   NFS (calls/sec)  % Free    100

It seems that the "Busy%" is balance , but the "KB-Writ" is not balance as expected.

I've made the change on 5 servers. The last remaining server is also the most abnormal one:

Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp  52
hdisk4   18.5     1.6K   40.5     0.0     1.6K  PgspOut       0  % Client   52
hdisk5   22.0     1.6K   40.5     0.0     1.6K  PageIn        0
hdisk0   84.5   573.4   122.5     2.0   571.4   PageOut     489  PAGING SPACE
hdisk3   82.0   571.4   122.0     0.0   571.4   Sios        505  Size,MB   28672
hdisk1    0.0     0.0     0.0     0.0     0.0                    % Used      0
hdisk2    0.0     0.0     0.0     0.0     0.0   NFS (calls/sec)  % Free    100

You can see that although the "KB-Writ" of hdisk4 and hdisk5 is higher than that of hdisk0 and hdisk3, the "Busy%" of hdisk0 and hdisk3 is higher than that of hdisk4 and hdisk5.

It's so complicated.

Of course it is. If systems administration were simple, we wouldn't be the heroes of the whole IT business, would we? So welcome to the job with the biggest demands and the greatest rewards our industry has to offer.

Well, this is something we can build on. I can help you better when i return to the office (and the documentation: there is a lot of detail i don't know off the top of my head).

A little general information about the disk statistics and what they mean:

Every disk has a "command queue": read- and write-requests are buffered and then worked on one after the other. If the queue is full the disk will not accept more commands until some room in the queue is free again. Keep this in mind for a moment.

The OS now asks every disk (in this regard, "disk" means everything with an "hdisk" device - it does not have to be a physical disk; it can also be a LUN, a RAID set, ...) in turn whether this queue has length 0 at the moment or not, which the disk answers with "yes" (= length 0) or "no" (any length other than 0). From many of these answers the OS compiles a percentage, which is shown as "disk busy %".

This means that "disk busy" is not as important as you think and that it has no meaning by itself. If a queue has "not length 0" it can have length 1 or length 15. The value is interesting because you get a measure of how many accesses a disk experiences. But you cannot measure the throughput of a device from that value alone. Disk operations come in different sizes, varying from 512 bytes (one disk block) up to several GB. Which of these is outstanding the busy% will not tell you, just that there is one outstanding at all.

You might consider balancing for I/O throughput rather than for "busy%", but you might even end up doing something different. I suggest you read my little introduction to performance tuning, with emphasis on I/O tuning, in the meantime. I will come back to this thread once i am back in the office (next Monday).

I hope this helps.

bakunin
