If we look at this from a wider perspective, discovery is still the most problematic part, because for any piece of data an infinite number of variations exists. And files themselves are really only an illusion. The hash turns out to be an illusion of a pointer as well, once we understand that if even a single bit of the file differs, the hash comes out completely different.
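To make the "single bit" point concrete, here is a small sketch (plain Python with hashlib, purely as an illustration): flipping one bit of the input produces a digest that shares nothing recognisable with the original.

```python
import hashlib

data = bytearray(b"hello world")
original_digest = hashlib.sha256(data).hexdigest()

data[0] ^= 0b00000001          # flip the lowest bit of the first byte
flipped_digest = hashlib.sha256(data).hexdigest()

# The two digests share no visible structure, even though the inputs
# differ by a single bit.
print(original_digest)
print(flipped_digest)
```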
So I ask you, the reader: what is a better way to point to a certain piece of data than a hash of a file?
Can Merkle trees help us? We can treat a file as a sequence of chunks, then try to find similar blocks and reconstruct what we already have.
Some file types are very bad for this: if a file uses a compression algorithm that does not split well, any change or corruption propagates through the rest of the stream and the chunks stop matching. We should prefer file containers that can be split, say into parts of 1 MB; then we can calculate a hash for each part and build Merkle trees where we also have hashes of larger collections of parts.
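As a rough sketch of that idea (plain Python, SHA-256, fixed 1 MB parts; the helper names are just for illustration, not from any particular library), we hash each part as a leaf and combine pairs upward until a single root remains:

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB parts, as suggested above


def chunk_hashes(path):
    """Hash each fixed-size part of the file (the leaves of the tree)."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hashes.append(hashlib.sha256(chunk).digest())
    return hashes


def merkle_root(hashes):
    """Pairwise-combine hashes level by level until a single root remains."""
    if not hashes:
        return hashlib.sha256(b"").digest()
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical usage:
# root = merkle_root(chunk_hashes("some_file.bin"))
```

The root then identifies the file as a whole, while the leaf hashes let us point at, deduplicate, or verify individual 1 MB parts on their own.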