How to identify unique field values in a text file in UNIX?

Hi,

I have a huge unsorted text file. We want to identify the unique field values in each line and use those fields as a primary key for a table in an upstream system.

Basically, the process or script should fetch, from each line, the values that are unique compared to the rest of the lines in the file.

Each line is 150 bytes long and the file contains around 100,000 lines. How do I find out which of those 150 bytes can be used to form a primary key?

I know the file has to be sorted on the entire 150 bytes, but after that I am not sure how to identify the uniqueness between lines.

Please help.

Thanks,
Mani A

Can you supply some examples so we can better understand?
Not 150-byte lines and not 100k lines, but something to get an idea of your goal?

There are commands such as
sort -u
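
For example, a minimal sketch (the file name is hypothetical):

# Print each distinct line exactly once, in sorted order.
sort -u yourfile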

For Ex.

My file contains the following lines:

ETL01InventoryBalances			SUCCESS
ETL02EvavivsStagingSalesOrders		SUCCESS
ETL03StagevsODSSalesOrder		        SUCCESS
ETL04EvavivsSalesOrderHeader History	SUCCESS
ETL05EvavivsSalesOrderLine History	SUCCESS
ETL07EvavivsStageRAs			        SUCCESS
ETL08StagevsODSRAs			        SUCCESS
ETL09StagetoODSIdentifierAttachments	SUCCESS
ETL10EvavitoStageWTs			        SUCCESS
ETL11StagevsODSShippingOrder		SUCCESS
ETL12StagevsODSShippingOrder Line	SUCCESS
ETL13StagevsODSShipments		        SUCCESS
ETL14StagevsODSShipmentLines		SUCCESS
ETL15StagevsODSPurchaseOrder		SUCCESS
ETL16StagevsODSPurchaseOrder Lines	SUCCESS
ETL17StagevsODSInventoryTransactions	SUCCESS
ETL18StagevsODSOrders			SUCCESS
ETL19StagevsODSOrderLines		        SUCCESS
ETL20StagevsODSShippingOrder		SUCCESS
ETL21StagevsODSShippingOrder Lines	SUCCESS
ETL22ODS Duplicate Shipments		SUCCESS
ETL23Evavi vs Stage Sales Order Lines	SUCCESS
ETL24Evavi vs ODS Sales Order Lines	SUCCESS
ETL33Source to ODS Identifier AttachmentSUCCESS
SND01Serialized ODS Shipments vs SND	SUCCESS
SND02SND vs Serialized ODS Shipments	SUCCESS
SND03WMS DMR- ERR records in ViawareSUCCESS
SND04Evavi DMR - ERR records		SUCCESS
VIA01Viaware Cost Status		        SUCCESS

Here, I need to sort these lines, remove duplicate records (if any), and extract them to a separate file. That file will then be inserted into a table that has just two columns, and these values will populate the table's primary key column. Before going to the table, the extract file should contain only unique records.
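
In other words, something like this sketch (file names made up for illustration):

# Sort, drop duplicate records, and write the extract file
# that will load the table's primary key column.
sort -u report.txt > pk_extract.txt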

---------- Post updated at 03:22 PM ---------- Previous update was at 03:06 PM ----------

Let's assume that I need to compare the fields that are unique within the first 150 bytes of each line in the file.

sort -u will give the unique records considering the entire line. But I want this uniqueness to be checked on just the first 150 bytes of each line in the file.
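
Conceptually, I want the equivalent of this sketch (file name made up):

# Consider only the first 150 bytes of each line when
# deciding whether two lines are duplicates.
cut -c1-150 myfile | sort -u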

Given that the 1st five characters of every line in your sample input are unique, you aren't going to find any duplicates. And the file is already in sorted order. Are you trying to compare a substring of the lines in your file instead of whole lines?

In simple terms,

How will you identify the column that can act as a primary key (i.e., is unique) in a file?

---------- Post updated at 03:28 PM ---------- Previous update was at 03:26 PM ----------

Thanks Don. But the data I showed is only a small fraction of the entire content of the file. My file is a fixed-length file with 150 characters per line. I need to identify a unique pattern in each record and treat that pattern as the primary key.

Looks like the first COLUMN is the key?
The exception is the munged record with no field separator:

ETL33Source to ODS Identifier AttachmentSUCCESS

Can this be somehow pre-processed/fixed?
Is it safe to assume that the SECOND field is always 'SUCCESS'?

Thanks for the response, but not really: there is no field separator as such. I edited the file just for readability.

Could we see the original version (for completeness' sake)?

I have included only a few records from the file, as below.

Please assume the first 150 characters of each line may contain the primary key.

ETL01InventoryBalances            SUCCESS
ETL02EvavivsStagingSalesOrders        SUCCESS
ETL03StagevsODSSalesOrder        SUCCESS
ETL04EvavivsSalesOrderHeader History    SUCCESS
ETL05EvavivsSalesOrderLine History    SUCCESS
ETL07EvavivsStageRAs            SUCCESS
ETL08StagevsODSRAs            SUCCESS
ETL09StagetoODSIdentifierAttachments    SUCCESS
ETL10EvavitoStageWTs            SUCCESS
ETL11StagevsODSShippingOrder        SUCCESS
ETL12StagevsODSShippingOrder Line    SUCCESS
ETL13StagevsODSShipments        SUCCESS
ETL14StagevsODSShipmentLines        SUCCESS
ETL15StagevsODSPurchaseOrder        SUCCESS
ETL16StagevsODSPurchaseOrder Lines    SUCCESS
ETL17StagevsODSInventoryTransactions    SUCCESS
ETL18StagevsODSOrders            SUCCESS
ETL19StagevsODSOrderLines        SUCCESS
ETL20StagevsODSShippingOrder        SUCCESS
ETL21StagevsODSShippingOrder Lines    SUCCESS
ETL22ODS Duplicate Shipments        SUCCESS
ETL23Evavi vs Stage Sales Order Lines    SUCCESS
ETL24Evavi vs ODS Sales Order Lines    SUCCESS
ETL33Source to ODS Identifier AttachmentSUCCESS
SND01Serialized ODS Shipments vs SND    SUCCESS
SND02SND vs Serialized ODS Shipments    SUCCESS
SND03WMS DMR- ERR records in Viaware    SUCCESS
SND04Evavi DMR - ERR records        SUCCESS
VIA01Viaware Cost Status        SUCCESS
ETL01InventoryBalances            SUCCESS
ETL02EvavivsStagingSalesOrdersplan      SUCCESS
ETL03StagevsODSSalesOrder        UNKNOWN
ETL04EvavivsSalesOrderHeader History    UNKNOWN
ETL05EvavivsSalesOrderLine History    UNKNOWN
ETL07EvavivsStageRAs            UNKNOWN
ETL08StagevsODSRAs            UNKNOWN
ETL09StagetoODSIdentifierAttachments    UNKNOWN
ETL10EvavitoStageWTs12            UNKNOWN
ETL21StagevsODSShippingOrder        FAILURE
ETL212StagevsODSShippingOrder Line    FAILURE
ETL23StagevsODSShipments        FAILURE
ETL24StagevsODSShipmentLines        FAILURE
ETL25StagevsODSPurchaseOrder        FAILURE
ETL76StagevsODSPurchaseOrder Lines    FAILURE
ETL77StagevsODSInventoryTransactions    FAILURE
ETL78StagevsODSOrders            FAILURE
ETL59StagevsODSOrderLines        FAILURE
ETL60StagevsODSShippingOrder        FAILURE
ETL71StagevsODSShippingOrder Lines    CHECKIN
ETL82ODS Duplicate Shipments        CHECKIN
ETL93Evavi vs Stage Sales Order Lines    CHECKIN
ETL04Evavi vs ODS Sales Order Lines    CHECKIN
ETL33Source to ODS Identifier AttachmentCHECKIN
SN005Serialized ODS Shipments vs SND    CHECKIN
SN5D2SND vs Serialized ODS Shipments    CHECKIN
SND43WMS DMR- ERR records in Viaware    CHECKIN
SND44Evavi DMR - ERR records        UNKNOWN
EVIA01Viaware Cost Status        UNKNOWN

Hi manikandan23...

In post #1 you say that your line length is 150 bytes, but in post #5 it has changed to 150 characters.
Are some of these characters Unicode, or is it pure ASCII from whitespace to '~' (tilde), perhaps including tabs?
If Unicode, then lines of 150 characters will be longer than 150 bytes, because each non-ASCII character occupies more than one byte, effectively giving you binary lines.
We are assuming that your file(s) contain pure ASCII, but a snapshot of one of your files would help, placed inside CODE tags as this preserves pure text mode viewing.
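
A quick way to check, as a sketch (the file name is a placeholder):

# 'file' reports ASCII vs. UTF-8 or other encodings...
file yourfile

# ...and this flags any line containing a byte outside the
# ASCII printable/whitespace range (run in the C locale).
LC_ALL=C grep -n '[^[:print:][:space:]]' yourfile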


You have told us that the whole 150-character fixed-length line is a key. You have said you need to identify a unique pattern to act as a primary key. You have said that you need to identify the column which can act as a unique key in a file. ... ... ...

I am very confused.

None of the lines you showed us have fixed length records. None of the lines you have shown us are 150 characters long. None of the lines you have shown us are 150 print columns wide. Two of the lines you have shown us are identical if you ignore the 1st five characters on each line. (And the command: sort -u -k1.6 file will easily get rid of that duplicated line while resorting the lines you have shown us ignoring the 1st five characters on each line.) Do you not know the format of the data you are processing?
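
To spell that out (a sketch; file stands for your input):

# The sort key starts at character 6 of each line and runs to the
# end of the line, so the ETL11/ETL20 'StagevsODSShippingOrder'
# pair collapses into a single line.
sort -u -k1.6 file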


Making some assumptions here...
Will something like this be helpful?
awk -f mani.awk myFile, where mani.awk is:

BEGIN {
  tab = sprintf("\t")
}

# Strip leading and trailing blanks (spaces and tabs) from str.
function trim(str)
{
    sub("^[ " tab "]+", "", str);
    sub("[ " tab "]+$", "", str);
    return str;
}
{
  # Find the trailing run of two or more capital letters
  # (the status word: SUCCESS, FAILURE, UNKNOWN, CHECKIN).
  if (match($0, "[A-Z][A-Z]+$"))
    # Print everything before the status word, trimmed.
    print trim(substr($0, 1, RSTART - 1))
}
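
Run against the sample above, the first few output lines would be the keys with the status word and padding stripped:

ETL01InventoryBalances
ETL02EvavivsStagingSalesOrders
ETL03StagevsODSSalesOrder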

Thank you so much everyone. I am really sorry for the confusion.

The file contains only ASCII, and the first 150 characters (please assume this number for the sake of clarity) are meant to form the primary key for an upstream table.

So when I parse this file, let's say the input has around 500,000 lines, and the first 150 characters of those 500,000 lines could be repeating or entirely unique.

When I output my primary key file, it will be inserted into the table directly. This process should run without ever raising a unique constraint violation.

I hope it is clear now.
Again, I am very sorry for all the miscommunication.

Thanks,
Mani A

OK. So, sort -u (as suggested in post #2) should do exactly what you want. You said in post #3 that sort -u would not work, but your reasoning was not clear.

So, is there some reason why sort -u will not solve your problem?
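
If you want to both build the extract and prove there are no repeats before the load, here is a minimal sketch (file names hypothetical):

# Keep one copy of each distinct line for the extract file.
sort -u input.txt > pk_extract.txt

# Safety check: prints any line that occurred more than once.
sort input.txt | uniq -d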