Moving Files in AI Datasets
Updated February 8, 2021
Some AI datasets are organized into directories, where each directory represents a class. A single directory can contain a large number of files. Because of time and memory constraints, you may want to train your model on a smaller number of files.
We detail how to move the first n files from a directory into a new directory.
This is done for each directory (class) in the dataset.
We also show how to get a random sample of n items from a directory.
Example — The keywords Dataset
The dataset “keywords” can be downloaded from:
https://docs.edgeimpulse.com/docs/keyword-spotting
I created a folder named “Dataset” on my desktop and put the downloaded folder “keywords2” into it.
The dataset contains:
- yes: 1,500 one-second samples containing only the word “yes”.
- no: 1,500 one-second samples containing only the word “no”.
- unknown: 1,504 one-second samples of other words.
- noise: 1,500 one-second samples of background or static noise.
The next step is to create destination directories for the smaller number of files you want.
Create Destination Directories
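As a minimal sketch of this step, assume you are inside the keywords2 folder and want one destination per class, named after the class plus the sample size (yes400, no400, and so on, the same names used in the random-sampling commands later in this article). The size of 400 is only an illustration; change it to suit your model.

```shell
# Run from inside the keywords2 folder.
# Creates yes400, no400, unknown400 and noise400 (names are an assumption).
for class in yes no unknown noise; do
  mkdir -p "${class}400"   # -p: no error if the directory already exists
done
```

Using mkdir -p makes the loop safe to run more than once.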
Moving the Files
The zsh mv command used for this step was found at https://unix.stackexchange.com/questions/12976/how-to-move-100-files-from-a-folder-containing-thousands
Thanks to: https://unix.stackexchange.com/users/885/gilles-so-stop-being-evil
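In zsh, a glob qualifier can select the first n matches directly. For shells without that feature, the same idea can be sketched as a plain loop. The function name move_first_n and its three arguments are hypothetical, made up for this illustration. "First" here means first in the shell's sorted glob order, and the files are assumed to end in .wav as in the keywords dataset.

```shell
# Move the first n .wav files (in sorted glob order) from src into dst.
# move_first_n is a hypothetical helper, not part of any tool.
move_first_n() {
  src=$1; dst=$2; n=$3
  mkdir -p "$dst"
  count=0
  for f in "$src"/*.wav; do
    [ -e "$f" ] || continue           # skip the literal pattern if nothing matched
    [ "$count" -ge "$n" ] && break    # stop once n files have been moved
    mv -- "$f" "$dst"/
    count=$((count + 1))
  done
}

# Example, run from inside keywords2 (the size 400 is an assumption):
#   for class in yes no unknown noise; do
#     move_first_n "$class" "${class}400" 400
#   done
```

Because mv removes the files from the source directory, you may prefer to try this on a copy of the dataset first.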
Random Sampling From a Directory
Start by creating the destination directories.
Get a Random Sample of Files — Copy to New Directory
ronald@Ronalds-MBP keywords2 % ls yes/*.wav | sort -R | tail -400 | xargs -J{} cp {} yes400
ronald@Ronalds-MBP keywords2 % ls no/*.wav | sort -R | tail -400 | xargs -J{} cp {} no400
ronald@Ronalds-MBP keywords2 % ls unknown/*.wav | sort -R | tail -400 | xargs -J{} cp {} unknown400
ronald@Ronalds-MBP keywords2 % ls noise/*.wav | sort -R | tail -400 | xargs -J{} cp {} noise400
ronald@Ronalds-MBP keywords2 %
Thanks to Naveen at Edge Impulse for the above code:
https://twitter.com/knaveen
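If sort -R or the -J option of xargs is not available on your system, a rough alternative is to shuffle the file list with awk. The sketch below is not the code credited above; the function name sample_n and its arguments are hypothetical, and it assumes .wav files whose names contain no newline characters.

```shell
# Copy a random sample of n .wav files from src into dst (src is left intact).
# sample_n is a hypothetical helper, not part of any tool.
sample_n() {
  src=$1; dst=$2; n=$3
  mkdir -p "$dst"
  for f in "$src"/*.wav; do
    [ -e "$f" ] && printf '%s\n' "$f"
  done |
    # Prefix each path with a random key, sort by the key, then drop the key.
    awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' |
    sort -n | cut -f2- | head -n "$n" |
    while IFS= read -r f; do
      cp -- "$f" "$dst"/
    done
}

# Example, run from inside keywords2:
#   sample_n yes yes400 400
```

Note that awk's srand() is seeded from the clock, so two runs within the same second can produce the same ordering.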
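The above code uses the xargs utility, which builds and executes commands from standard input. The -J option used here belongs to the BSD xargs that ships with macOS; GNU xargs on Linux does not provide it. For more information see: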
https://linuxize.com/post/linux-xargs-command/
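To see what xargs builds without touching any files, one small experiment is to put echo in front of the command. The file names a.wav and b.wav below are made up for illustration. The -I{} form is used because both BSD and GNU xargs support it; it runs the command once per input line, whereas -J passes the whole list to a single command.

```shell
# Each input line replaces {} in the command; echo only prints what would run.
preview=$(printf '%s\n' yes/a.wav yes/b.wav | xargs -I{} echo cp {} yes400)
printf '%s\n' "$preview"
# prints:
#   cp yes/a.wav yes400
#   cp yes/b.wav yes400
```

Removing echo would run the cp commands for real.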