A user of Kaggle, a platform for machine learning and data science competitions which was recently acquired by Google, has uploaded a facial data set he says was created by exploiting Tinder’s API to scrape 40,000 profile photos from Bay Area users of the dating app, 20,000 apiece from profiles of each gender.
The data set, called People of Tinder, consists of six downloadable zip files, with four containing around 10,000 profile photos each and two files with sample sets of around 500 images per gender.
The question is how do you build up a sufficient database of single individuals to make the site viable? Dating sites approach this issue differently, but the temptation to create and use fake profiles is rife with risk, as JDI Dating learned in a recent FTC enforcement action.
“Why not leverage Tinder to build a better, larger facial dataset?” Why not? Except, perhaps, for the privacy of thousands of individuals whose facial biometrics you’re dumping online in a mass repository for public repurposing, entirely without their say-so.
“The datasets tend to be extremely strict in their structure, and are usually too small.”
But since Tinder makes its rights to your content transferable, it’s entirely possible even this large-scale repurposing of the data falls within the scope of its T&Cs, assuming it sanctioned Colianni’s use of its API.
Glancing through a few of the images from one of the downloadable files, they certainly look like the sort of quasi-intimate photos people use for profiles on Tinder (or indeed, for other online social apps), with a mix of selfies, friend group shots and random stuff like photos of cute animals or memes.
It’s by no means a flawless data set if it’s just faces you’re looking for.
Users have many motives for uploading their likeness to the dating app.
But contributing a facial biometric to a downloadable data set for training convolutional neural networks probably wasn’t top of their list when they signed up to swipe.
(I just hope he strips out all the pet shots first or he’ll find this task an uphill struggle.) The data set, which was uploaded to Kaggle three days ago (minus the sample files), has been downloaded more than 300 times at this point, and there’s obviously no way to know what additional uses it might be put to.