Recursively find all duplicate files in a directory (Java)

Intro

Sure, this could be done with a few lines of basic Linux commands. Writing it in Java, however, requires an understanding of several topics: hash tables, recursion, lists, the file system, and more.

That is why I love this problem. Not only does it touch many of Java's fundamentals, it also demands real attention to efficiency, both in time complexity and in space complexity.

Duplicate detection using hash

The first question to consider is: how do I detect duplicate files? Should I only compare file names? What about file sizes? Maybe both? Even both together are not enough, and it is easy to come up with a counterexample. Take a file fileA.txt containing the word "hello" and a file fileB.txt containing the word "world". Both files have exactly the same size (five bytes), yet their contents are clearly different, and nothing stops two files with the same name from differing in content either. That is why my approach reads each file's bytes and stores a unique hash id for every file.

    private static MessageDigest messageDigest;
    static {
        try {
            messageDigest = MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("cannot initialize SHA-512 hash function", e);
        }
    }

In the code above we obtain an instance of SHA-512, a well-known secure hash function. We will use it to compute a practically unique id for each file in the file system.
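
To see this concretely, here is a small standalone sketch (the class name HashDemo and the sample strings are just for illustration, not part of the final program) showing that two inputs of identical length still produce different SHA-512 ids:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest sha512 = MessageDigest.getInstance("SHA-512");

        // "hello" and "world" are both 5 bytes long, yet their contents differ,
        // so their SHA-512 ids differ as well
        String hashA = new BigInteger(1, sha512.digest("hello".getBytes(StandardCharsets.UTF_8))).toString(16);
        String hashB = new BigInteger(1, sha512.digest("world".getBytes(StandardCharsets.UTF_8))).toString(16);

        System.out.println(hashA.equals(hashB)); // prints: false
    }
}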

Duplicate file retrieval using a hash table

Our second problem is how to store each file's hash id so that it can be retrieved efficiently later. One of the best structures for efficient retrieval is, of course, the hash table, which, when implemented properly, offers O(1) expected lookup time. We will store the hash ids as keys, and for every key the value will be a linked list holding the paths of all files that share that hash. A SHA-512 id is a 512-bit number, far larger than any primitive type can hold, so we also use java.math.BigInteger to turn it into a hexadecimal string.
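
As a quick illustration of this key/value layout, here is a small standalone sketch; the paths and contents are made up, and Java 8's computeIfAbsent is used here as a shorthand for the explicit null check you will see in the full implementation below:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class HashTableLayoutDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest sha512 = MessageDigest.getInstance("SHA-512");
        Map<String, List<String>> lists = new HashMap<String, List<String>>();

        // two pretend files with identical contents ("hello") produce identical keys
        String keyA = new BigInteger(1, sha512.digest("hello".getBytes(StandardCharsets.UTF_8))).toString(16);
        String keyB = new BigInteger(1, sha512.digest("hello".getBytes(StandardCharsets.UTF_8))).toString(16);

        // identical keys mean both paths accumulate in the same linked list
        lists.computeIfAbsent(keyA, k -> new LinkedList<String>()).add("/tmp/a.txt");
        lists.computeIfAbsent(keyB, k -> new LinkedList<String>()).add("/backup/a_copy.txt");

        System.out.println(lists.get(keyA)); // [/tmp/a.txt, /backup/a_copy.txt]
    }
}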

Finally, we traverse all subdirectories and files recursively, so that for each directory we process all of its files. The final implementation is as follows:

public static void findDuplicateFiles(Map<String, List<String>> lists, File directory) {
    File[] children = directory.listFiles();
    if (children == null) {
        return; // not a readable directory
    }
    for (File child : children) {
        if (child.isDirectory()) {
            // recurse into sub-directories
            findDuplicateFiles(lists, child);
        } else {
            try {
                // read the file's bytes and fingerprint them with SHA-512
                byte[] fileData = Files.readAllBytes(child.toPath());
                String uniqueFileHash = new BigInteger(1, messageDigest.digest(fileData)).toString(16);
                // group all paths that share the same hash id
                List<String> list = lists.get(uniqueFileHash);
                if (list == null) {
                    list = new LinkedList<String>();
                    lists.put(uniqueFileHash, list);
                }
                list.add(child.getAbsolutePath());
            } catch (IOException e) {
                throw new RuntimeException("cannot read file " + child.getAbsolutePath(), e);
            }
        }
    }
}

All that's left is to run the method above and print the hash table's values that actually hold duplicates, that is, the linked lists containing more than one path.

Map<String, List<String>> lists = new HashMap<String, List<String>>();
// dir is the root directory to scan
FindDuplicates.findDuplicateFiles(lists, dir);
for (List<String> list : lists.values()) {
    // only hash ids mapped to more than one path represent duplicates
    if (list.size() > 1) {
        System.out.println("\n");
        for (String file : list) {
            System.out.println(file);
        }
    }
}
System.out.println("\n");
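
Putting the pieces together, here is a sketch of how the complete class could look. The class name FindDuplicates matches the call above, but the main method and its command-line argument handling are my own assumption rather than part of the original listings:

import java.io.File;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class FindDuplicates {

    private static MessageDigest messageDigest;
    static {
        try {
            messageDigest = MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("cannot initialize SHA-512 hash function", e);
        }
    }

    public static void findDuplicateFiles(Map<String, List<String>> lists, File directory) {
        File[] children = directory.listFiles();
        if (children == null) {
            return; // not a readable directory
        }
        for (File child : children) {
            if (child.isDirectory()) {
                findDuplicateFiles(lists, child);
            } else {
                try {
                    byte[] fileData = Files.readAllBytes(child.toPath());
                    String uniqueFileHash = new BigInteger(1, messageDigest.digest(fileData)).toString(16);
                    List<String> list = lists.get(uniqueFileHash);
                    if (list == null) {
                        list = new LinkedList<String>();
                        lists.put(uniqueFileHash, list);
                    }
                    list.add(child.getAbsolutePath());
                } catch (IOException e) {
                    throw new RuntimeException("cannot read file " + child.getAbsolutePath(), e);
                }
            }
        }
    }

    public static void main(String[] args) {
        // assumption: the directory to scan is given as the first command-line argument
        File dir = new File(args.length > 0 ? args[0] : ".");

        Map<String, List<String>> lists = new HashMap<String, List<String>>();
        findDuplicateFiles(lists, dir);

        for (List<String> list : lists.values()) {
            if (list.size() > 1) {
                System.out.println("\n");
                for (String file : list) {
                    System.out.println(file);
                }
            }
        }
        System.out.println("\n");
    }
}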

The source code can be found in the download link below:

Feel free to ask any related questions in the comments below.