Definition
Removing duplicates means eliminating repeated entries from your dataset. Duplicates can distort data analysis and produce inaccurate results, so removing them is often necessary to ensure data integrity.
Example of removing duplicates using JavaScript
Here is a simple JavaScript array of objects representing a dataset:
const data = [
  { firstName: "John", lastName: "Doe", age: "30" },
  { firstName: "Jane", lastName: "Smith", age: "25" },
  { firstName: "Bob", lastName: "Johnson", age: "40" },
  { firstName: "John", lastName: "Doe", age: "30" },
];
In the above array, the record for "John Doe" appears twice. To remove the duplicate, we can use the .filter() method in combination with the .map() and .indexOf() methods:
// Keep an element only if its combined string first appears at its own index.
const uniqueData = data.filter(
  (value, index, self) =>
    self
      .map((item) => item.firstName + item.lastName + item.age)
      .indexOf(value.firstName + value.lastName + value.age) === index
);
Here's what is happening:
- filter() function:
Array.prototype.filter() is a method that creates a new array with all elements that pass the test implemented by the provided function. In this case, the function checks whether each element in data is a duplicate. The element is included in uniqueData only if it is not a duplicate.
- Inside the filter function:
(value, index, self) => ... is the function that is passed into filter(). This function is applied to every element in data. Here, value is the current element being processed, index is the position of that element in the array, and self is a reference to the original array (data in this case).
- map() function:
Array.prototype.map() is a method that creates a new array populated with the results of calling a provided function on every element in the calling array. Here, the function item => item.firstName + item.lastName + item.age is applied to every element in self (which is data). The result is a new array of combined strings (firstName + lastName + age), one for each element.
- indexOf() function:
Array.prototype.indexOf() is a method that returns the first index at which a given element can be found in the array, or -1 if it is not present. Here, indexOf(value.firstName + value.lastName + value.age) looks for the first occurrence of the combined string (firstName + lastName + age) built from value in the array generated by map(). Duplicate records produce the same combined string.
- Equality Check:
... === index checks whether the index of the first occurrence of value's combined string (as found by indexOf()) is the same as the index of the current value in data. If they are the same, this is the first occurrence of the value in the array, so it is not a duplicate. If they are different, the value has appeared earlier in the array (so it is a duplicate), and it is not included in uniqueData.
Before
| firstName | lastName | age |
|---|---|---|
| John | Doe | 30 |
| Jane | Smith | 25 |
| Bob | Johnson | 40 |
| John | Doe | 30 |
After
| firstName | lastName | age |
|---|---|---|
| John | Doe | 30 |
| Jane | Smith | 25 |
| Bob | Johnson | 40 |
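The filter/map/indexOf approach rebuilds the array of combined strings for every element, which can become slow on larger datasets. As a rough alternative sketch (not the method used above), a Set can remember which records have already been seen in a single pass. The helper name recordKey is hypothetical, and using JSON.stringify for the key is an assumption made here so that field boundaries stay unambiguous:

```javascript
const data = [
  { firstName: "John", lastName: "Doe", age: "30" },
  { firstName: "Jane", lastName: "Smith", age: "25" },
  { firstName: "Bob", lastName: "Johnson", age: "40" },
  { firstName: "John", lastName: "Doe", age: "30" },
];

// Hypothetical helper: builds a comparison key from the three fields.
// Serializing an array keeps field boundaries distinct, so values such as
// "ab" + "c" and "a" + "bc" do not produce the same key.
const recordKey = (item) =>
  JSON.stringify([item.firstName, item.lastName, item.age]);

const seen = new Set();
const uniqueData = data.filter((item) => {
  const key = recordKey(item);
  if (seen.has(key)) return false; // an identical record was already kept
  seen.add(key);
  return true; // first occurrence of this record
});

console.log(uniqueData.length); // 3
```

As with the original approach, the first occurrence of each record is kept and the original order is preserved.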
Considerations
When removing duplicates, it's essential to define what constitutes a duplicate in the context of your data. For instance, in the above example, we considered a record a duplicate if all fields were identical. However, in some cases, you might want to remove duplicates based on a specific field.
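To illustrate removing duplicates based on a specific field, here is a minimal sketch. The helper uniqueBy and the sample array people are hypothetical names introduced for this example, and keeping the first record for each field value is an assumption; your data may call for keeping the last or the most complete record instead:

```javascript
// Hypothetical helper: keeps the first record seen for each value of `field`.
function uniqueBy(records, field) {
  const firstByValue = new Map();
  for (const record of records) {
    if (!firstByValue.has(record[field])) {
      firstByValue.set(record[field], record);
    }
  }
  // A Map iterates in insertion order, so the original order is preserved.
  return [...firstByValue.values()];
}

const people = [
  { firstName: "John", lastName: "Doe", age: "30" },
  { firstName: "John", lastName: "Smith", age: "41" },
  { firstName: "Jane", lastName: "Smith", age: "25" },
];

const byFirstName = uniqueBy(people, "firstName");
console.log(byFirstName.map((p) => p.lastName)); // [ 'Doe', 'Smith' ]
```

Note that the second "John" is dropped even though the other fields differ, which is exactly why the definition of a duplicate needs to be chosen deliberately.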
Related Operations
- Concatenating Fields: This operation combines two or more fields into a single field. This could inadvertently create duplicates if not handled correctly.
- Splitting Fields: This operation separates a single field into multiple fields. After splitting, check for duplicates that the split may have created.
- Normalizing Cases: Before removing duplicates, you may need to normalize cases to ensure you're not retaining duplicates due to case differences.
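As a small sketch of how case normalization can be combined with duplicate removal, the example below lowercases the fields only when building the comparison key, so the kept records retain their original casing. The sample array contacts and the helper normalize are hypothetical names used for illustration:

```javascript
const contacts = [
  { firstName: "John", lastName: "Doe" },
  { firstName: "john", lastName: "DOE" },
  { firstName: "Jane", lastName: "Smith" },
];

// Hypothetical normalizer: lowercases a value for comparison only.
const normalize = (text) => text.toLowerCase();

const seenKeys = new Set();
const uniqueContacts = contacts.filter((contact) => {
  const key = JSON.stringify([
    normalize(contact.firstName),
    normalize(contact.lastName),
  ]);
  if (seenKeys.has(key)) return false; // same person, different casing
  seenKeys.add(key);
  return true;
});

console.log(uniqueContacts.length); // 2
```

Without the normalization step, "John Doe" and "john DOE" would be treated as two distinct records.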