De-duplicating a very large file against itself.

Post by ian » Thu Sep 09, 2010 4:27 pm

If you have a very large file that you need to de-duplicate against itself, there is a problem:

If you use Batch mode, as is recommended for large files, the De-Duplicate will not work directly. Instead, we suggest you handle the data in two maps.

First, import the data into DataSlave and pass it through a Transform, keeping only the column you want to use for De-Duplication. Write this single column to a temporary file. Remember to use Trial mode at first while you develop the map, then change to Batch mode to process the large data set.
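
DataSlave maps are built in the GUI, but the idea behind this first map can be sketched in code. Here is a minimal Python equivalent; the file names and the key column "email" are illustrative assumptions, not part of DataSlave:

import csv

# First pass: stream the large file and write out only the
# de-duplication key column to a small temporary file.
with open("large_data.csv", newline="") as src, \
        open("keys_only.csv", "w", newline="") as tmp:
    reader = csv.DictReader(src)
    writer = csv.writer(tmp)
    writer.writerow(["email"])            # assumed key column name
    for row in reader:
        writer.writerow([row["email"]])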

Now build a second map. Read in the temporary, reduced file as the reference for the De-Duplication. Then read in your full data set, again in Trial mode first and then in Batch mode, de-duplicating against the reduced reference.
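
Continuing the sketch, the second map's logic might look like the following in Python, assuming a keep-the-first-occurrence de-duplication rule (DataSlave's own De-Duplicate options may differ). The reduced key file is small enough to count in memory, and only keys that actually repeat need to be tracked during the main pass:

import csv
from collections import Counter

# Count keys from the small key-only file, so that only keys
# which actually repeat need to be tracked during the main pass.
with open("keys_only.csv", newline="") as tmp:
    reader = csv.reader(tmp)
    next(reader)                          # skip the header row
    counts = Counter(key for (key,) in reader)
duplicated = {k for k, n in counts.items() if n > 1}

# Stream the full file, keeping every row with a unique key and
# only the first occurrence of each duplicated key.
seen = set()
with open("large_data.csv", newline="") as src, \
        open("deduped.csv", "w", newline="") as out:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        key = row["email"]
        if key in duplicated:
            if key in seen:
                continue                  # drop later duplicates
            seen.add(key)
        writer.writerow(row)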

The De-Duplicate will now work in Batch mode. If you send us a small sample of data and some instructions on what you require, we will build you a sample map to illustrate.