Best way to get unique records only

Support for DataSlave

Moderators: Tom, ian

Best way to get unique records only

Postby phrenetic » Fri Jan 31, 2014 3:08 pm

Can you advise on the best way to get an output consisting of unique records only?

I have an input file (CSV file) that contains over 1 million records. I use a Validation function to filter out records that I am not interested in, then transform the data into a smaller subset. The transformation process results in duplicate records (which is fine as I do not have a primary key).

What I want to do is to extract only the unique records (i.e. de-duplicate the data) but I am not certain whether using the DeDuplicate object is the right thing to do, as I have only one input table and no reference table.

I set the DeDuplicate object to have the input table as both the reference and the input and it started running - but after 15 minutes I stopped it as there appeared to be zero progress. Actually it did seem to be doing something (although I am not sure what) as it reported 680 records had been appraised. As there are over 22000 records to process, I don't want to leave it running for X hours only to find that it hasn't done what I want.

So, is there a best way to extract only the unique records from the input (akin to SELECT UNIQUE when using a SQL database) ?

Alistair
phrenetic
 
Posts: 7
Joined: Wed Oct 12, 2011 9:56 pm

Re: Best way to get unique records only

Postby ian » Fri Jan 31, 2014 4:33 pm

The de-duplicate can be very slow with very large data sets. The time taken explodes as rows are added because every row is checked against every row.

If you can write the unduplicated data to a database you can then use SQL just as you suggest and you will get better performance from the database engine. This is just what I think you are suggesting. It is hard to comment much further without a better understanding of the data and database schema.

You may want to create a column designed to be used for the de-duplicate. Thick carefully about what you need to put in that column.
Make the most significant characters at the front of the de-duplicate column
Make sure the column is appropriately indexed.

If you do choose to use the de-dup function in DataSlave we can offer advise about how to tune performance.
ian
 
Posts: 364
Joined: Sat Dec 18, 2004 8:13 am
Location: UK

Re: Best way to get unique records only

Postby phrenetic » Fri Jan 31, 2014 10:09 pm

Thanks - I'll pump all the records to the database and dedupe them there.

Alistair
phrenetic
 
Posts: 7
Joined: Wed Oct 12, 2011 9:56 pm


Return to Support

Who is online

Users browsing this forum: No registered users and 1 guest

cron