I have a huge unsorted text file. We want to identify the field values in each line that are unique across the file and treat those fields as a primary key for a table in an upstream system.
Basically, the process or script should fetch, from each line, the values that are unique compared to the rest of the lines in the file.
If each line is 150 bytes long and the file contains around 100,000 lines, how do I find which of those 150 bytes can form a primary key?
I know the file has to be sorted based on the entire 150 bytes, but after that I am not sure how to identify the uniqueness between lines.
ETL01InventoryBalances SUCCESS
ETL02EvavivsStagingSalesOrders SUCCESS
ETL03StagevsODSSalesOrder SUCCESS
ETL04EvavivsSalesOrderHeader History SUCCESS
ETL05EvavivsSalesOrderLine History SUCCESS
ETL07EvavivsStageRAs SUCCESS
ETL08StagevsODSRAs SUCCESS
ETL09StagetoODSIdentifierAttachments SUCCESS
ETL10EvavitoStageWTs SUCCESS
ETL11StagevsODSShippingOrder SUCCESS
ETL12StagevsODSShippingOrder Line SUCCESS
ETL13StagevsODSShipments SUCCESS
ETL14StagevsODSShipmentLines SUCCESS
ETL15StagevsODSPurchaseOrder SUCCESS
ETL16StagevsODSPurchaseOrder Lines SUCCESS
ETL17StagevsODSInventoryTransactions SUCCESS
ETL18StagevsODSOrders SUCCESS
ETL19StagevsODSOrderLines SUCCESS
ETL20StagevsODSShippingOrder SUCCESS
ETL21StagevsODSShippingOrder Lines SUCCESS
ETL22ODS Duplicate Shipments SUCCESS
ETL23Evavi vs Stage Sales Order Lines SUCCESS
ETL24Evavi vs ODS Sales Order Lines SUCCESS
ETL33Source to ODS Identifier AttachmentSUCCESS
SND01Serialized ODS Shipments vs SND SUCCESS
SND02SND vs Serialized ODS Shipments SUCCESS
SND03WMS DMR- ERR records in ViawareSUCCESS
SND04Evavi DMR - ERR records SUCCESS
VIA01Viaware Cost Status SUCCESS
Here, I need to sort the records, remove any duplicates, and extract them to a separate file. That file will then be inserted into a table that has just two columns, and the file's values will populate the primary key column of that table. Before it goes to the table, the extract file should contain only unique records.
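Something like the sketch below is what I have in mind for the extract step so far, assuming the whole line is the key and the data is plain ASCII (input.txt and extract.txt are placeholder names):

    # Sort the file and drop exact duplicate lines; the result becomes the
    # extract that feeds the primary key column of the table.
    sort -u input.txt > extract.txt

    # Sanity check: this should print nothing if no duplicates remain.
    sort extract.txt | uniq -d

The open question is how to restrict that uniqueness check to only part of each line.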
---------- Post updated at 03:22 PM ---------- Previous update was at 03:06 PM ----------
Let's assume that I need to check uniqueness on the fields in the first 150 bytes of each line in the file.
sort -u will give the unique records based on the entire line. But I want this uniqueness to be checked against only the first 150 bytes of each line in the file.
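What I am picturing is something along these lines, though I am not sure it is the right approach (just a sketch, assuming fixed-length ASCII lines; input.txt and unique_by_key.txt are placeholder names):

    # Portable awk: keep the first line seen for each distinct 150-character prefix.
    awk '!seen[substr($0, 1, 150)]++' input.txt > unique_by_key.txt

    # GNU alternative: uniq -w compares only the first N characters, but it is a
    # GNU extension and needs sorted input so that duplicate keys are adjacent.
    sort input.txt | uniq -w 150 > unique_by_key.txt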
Given that the 1st five characters of every line in your sample input are unique, you aren't going to find any duplicates. And the file is already in sorted order. Are you trying to compare a substring of the lines in your file instead of whole lines?
How would you identify the column that can act as a primary key (or unique key) in a file?
---------- Post updated at 03:28 PM ---------- Previous update was at 03:26 PM ----------
Thanks Don. But the data I showed is only a small sample of the file's entire content. My file is a fixed-length file with 150 characters per line. I need to identify a unique pattern in the record and treat that pattern as the primary key.
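One rough way to test whether a candidate byte range really is unique would be to compare the number of distinct key values with the number of lines (a sketch only, assuming the candidate range is characters 1-150 and input.txt is a placeholder name):

    # The two counts must match for the range to work as a primary key.
    total=$(wc -l < input.txt)
    distinct=$(cut -c1-150 input.txt | sort -u | wc -l)
    echo "total lines:   $total"
    echo "distinct keys: $distinct"

    # List any key values that occur more than once (these break uniqueness).
    cut -c1-150 input.txt | sort | uniq -d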
In post #1 you state that your line length is 150 bytes, and in post #5 it has changed to 150 characters.
Are some of these characters Unicode, or are they pure ASCII from space through '~' (tilde), perhaps including tabs?
If they are Unicode, then a 150-character line could be longer than 150 bytes, because any non-ASCII character takes more than one byte, effectively making the lines binary.
We are assuming that your file(s) contain pure ASCII, but a snapshot of one of your files would help, posted inside CODE tags since that preserves plain-text viewing.
You have told us that the whole 150-character fixed-length line is a key. You have said you need to identify a unique pattern to act as a primary key. You have said that you need to identify the column which can act as a unique key in a file. ... ... ...
I am very confused.
None of the lines you showed us are fixed-length records. None of them are 150 characters long, and none of them are 150 print columns wide. Two of the lines you have shown us are identical if you ignore the 1st five characters on each line. (And the command sort -u -k1.6 file will easily get rid of that duplicated line while re-sorting the lines, ignoring the 1st five characters of each.) Do you not know the format of the data you are processing?
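For example, with the sample you posted you can first list the keys that repeat when the first five characters are ignored, and then drop them (deduped.txt is just a placeholder name):

    # Show the content (from character 6 onward) that appears more than once.
    awk '{ print substr($0, 6) }' file | sort | uniq -d

    # Remove those duplicates while sorting on everything after character 5.
    sort -u -k1.6 file > deduped.txt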
Thank you so much everyone. I am really sorry for the confusion.
The file contains only ASCII, and the first 150 characters (please assume this number for the sake of clarity) are meant to form the primary key for an upstream table.
So when I parse this file, let's say the input has around 500,000 lines; the first 150 characters of those lines could be repeating or entirely unique.
When I output my primary key file, it will be inserted into the table directly. This process should run without raising any exception such as a unique constraint violation.
I hope it is clear now.
Again, I am very sorry for all the miscommunication.
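If the requirement really is just to emit each distinct 150-character key once, a minimal sketch would be (assuming fixed-length ASCII lines; input.txt and primary_keys.txt are placeholder names):

    # Cut the key portion out of every line, then keep one copy of each key.
    cut -c1-150 input.txt | sort -u > primary_keys.txt

    # primary_keys.txt now holds each key exactly once, so loading it into the
    # primary key column should not hit a unique constraint violation.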
OK. So, sort -u (as suggested in post #2) should do exactly what you want. You said in post #3 that sort -u would not work, but your reasoning was not clear.
So, is there some reason why sort -u will not solve your problem?