Is there an advantage to parsing backup formats to deduplicate?
To remain application-independent and support the broad variety of Nearline applications, it is much more straightforward to work independently of application-specific formats. Some vendors go against this trend and are content-dependent. This means they are locked into supporting particular backup products and revisions: they parse those formats and build an internal file system, so that when a new version of a file comes in, they can compare it to the prior entry in their directory and store only the changes, not unlike a version control system for software development. This approach sounds promising (it could tailor compression tactics to particular data types, for example), but in practice it has more weaknesses than strengths. First, it is very capital-intensive to develop. Second, it always involves some amount of reverse engineering, and format originators are sometimes unsupportive, so it will never be universal. Third, it makes it hard to find redundancy in other parts of the