DALL·E 2 pre-training mitigations

By neub9

We noticed that an earlier version of DALL·E 2 would occasionally reproduce training images verbatim. This was a problem: our goal is for DALL·E 2 to generate original, unique images rather than simply stitch together pieces of existing ones. Verbatim reproduction also raises copyright and ownership concerns, as well as privacy concerns if people’s photos were present in the training data.

To understand the issue better, we built a dataset of prompts that frequently led to duplicated images. We used a trained model to sample images for 50,000 prompts from our training dataset, then ranked the samples by their similarity to the corresponding training image. Manual inspection of the top matches turned up only a few hundred true duplicate pairs among the 50,000 prompts. Although that puts the regurgitation rate below 1%, we felt it was crucial to bring the rate down to 0 for the reasons mentioned above.
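
The ranking step can be sketched as follows. This is only an illustration under stated assumptions: the article does not say which similarity measure was used, so cosine similarity over feature vectors is a stand-in, and `rank_candidates` and the toy data are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(pairs, top_n):
    """Rank (prompt_id, sample_features, training_features) triples by
    similarity, most similar first, and return the top_n for manual review."""
    scored = [(cosine_similarity(s, t), pid) for pid, s, t in pairs]
    scored.sort(reverse=True)
    return scored[:top_n]

# Toy data: each entry pairs a generated sample with its training image.
pairs = [
    ("p0", [1.0, 0.0], [1.0, 0.0]),  # identical features
    ("p1", [1.0, 0.0], [0.0, 1.0]),  # unrelated
    ("p2", [1.0, 1.0], [1.0, 0.9]),  # very similar
]
top = rank_candidates(pairs, top_n=2)
```

Only the highest-scoring pairs then need a human to decide whether they are true duplicates.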

Studying our dataset of regurgitated images revealed two recurring patterns. First, the images were predominantly simple vector graphics, which are likely easy to memorize because of their low information content. Second, the images had multiple near-duplicates in the training dataset. This suggested that data duplication is associated with memorization, as has been observed in other work on large language models.

To address this, we planned to use a neural network to identify groups of similar images and then remove the duplicates from each group. Done naively, however, this means checking whether each image is a duplicate of every other image in the dataset. The number of such pair checks grows quadratically with dataset size, which is impractical for a dataset as large as ours.
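
To make the cost concrete, the number of unordered pairs among n images is n(n-1)/2. The short sketch below evaluates that formula for a few dataset sizes; the sizes are hypothetical and chosen only for illustration, since this article does not state the actual dataset size.

```python
def naive_pair_checks(n_images):
    """Number of unordered image pairs an all-pairs comparison must check."""
    return n_images * (n_images - 1) // 2

# Hypothetical dataset sizes, for illustration only.
for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} images -> {naive_pair_checks(n):,} pair checks")
```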

Fortunately, we found a more efficient alternative: clustering the dataset before deduplication. With the data clustered, we can deduplicate the samples within each cluster without checking for duplicates outside it, which is significantly faster. The trade-off is that a duplicate pair whose two images land in different clusters goes undetected.
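
A minimal sketch of the idea, assuming each image is already represented by a feature vector (for example from a neural network) and that cluster centroids are given. The nearest-centroid assignment, the distance threshold, and the function names are all assumptions made for illustration, not the pipeline described in this article.

```python
import math
from collections import defaultdict

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_clusters(features, centroids):
    """Map each feature vector to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: distance(f, centroids[k]))
        for f in features
    ]

def dedupe_within_clusters(features, cluster_ids, threshold):
    """Return the indices to keep, comparing only images in the same cluster."""
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    keep = []
    for members in by_cluster.values():
        kept_here = []
        for idx in members:
            # Keep idx only if it is farther than `threshold` from every
            # image already kept in this cluster.
            if all(distance(features[idx], features[j]) > threshold
                   for j in kept_here):
                kept_here.append(idx)
        keep.extend(kept_here)
    return sorted(keep)

# Toy data: two near-duplicate pairs plus one distinct image.
features = [
    [0.00, 0.00], [0.01, 0.00],
    [5.00, 5.00], [5.00, 5.02],
    [5.00, 9.00],
]
centroids = [[0.0, 0.0], [5.0, 6.0]]  # hypothetical, assumed precomputed
cluster_ids = assign_clusters(features, centroids)
kept = dedupe_within_clusters(features, cluster_ids, threshold=0.1)
print(kept)
```

With K roughly equal-sized clusters, the pair checks drop by about a factor of K compared with the all-pairs approach, because comparisons never cross cluster boundaries.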

Empirical testing showed that this clustering-based approach successfully identified 85% of all duplicate pairs when using K=1024 clusters. By leveraging multiple clusterings, we were able to find 97% of all duplicate pairs in practice.
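
One way to see why several clusterings help is to measure, for a set of known duplicate pairs, what fraction end up in the same cluster under at least one clustering. The sketch below assumes the results of the clusterings are combined by taking a union, which this article does not spell out; the toy labels are hypothetical and do not reproduce the 85% and 97% figures.

```python
from itertools import combinations

def same_cluster_pairs(cluster_ids):
    """All index pairs (i, j) with i < j that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(cluster_ids)), 2)
        if cluster_ids[i] == cluster_ids[j]
    }

def recall(true_duplicates, clusterings):
    """Fraction of known duplicate pairs that share a cluster in at least
    one of the given clusterings."""
    covered = set()
    for cluster_ids in clusterings:
        covered |= same_cluster_pairs(cluster_ids)
    return len(true_duplicates & covered) / len(true_duplicates)

# Toy example: six images, four known duplicate pairs.
true_duplicates = {(0, 1), (2, 3), (4, 5), (1, 4)}
clustering_a = [0, 0, 1, 1, 2, 2]  # splits the pair (1, 4)
clustering_b = [0, 1, 1, 2, 1, 2]  # happens to group 1 and 4 together

print(recall(true_duplicates, [clustering_a]))                # 0.75
print(recall(true_duplicates, [clustering_a, clustering_b]))  # 1.0
```

A pair that one clustering splits can still be caught by another, so recall can only stay the same or rise as clusterings are added.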

Deduplication removed almost a quarter of our dataset, and many of the near-duplicate pairs it found contained meaningful variations, so it was not obvious that discarding them would help. Even so, human evaluators slightly preferred the model trained on deduplicated data, which suggests that the redundant images had been hurting performance.

After training a model on the deduplicated data, we found that the new model never regurgitated a training image when given that image's exact prompt from the training dataset. A more thorough check also confirmed that the model never regurgitated a different training image than the one associated with a given prompt.

In conclusion, deduplicating the dataset eliminated the image regurgitation we had been able to detect and slightly improved the model in human evaluations, which demonstrates the effectiveness of the clustering-based deduplication method.
