The use of unique identifiers in DIGGS and the case against them

In the AGS data interchange format, and in many other formats, particular fields, termed “key” fields, are used as unique identifiers for each record. These keys are fields with meaning to the data, e.g., hole identifier, depth, date/time, etc. The DIGGS format currently requires an arbitrary but unique identifier for every record in a transmission. These identifiers are to be unique within a file and replace key fields. We contend that these identifiers are unnecessary except, perhaps, in a few cases and are actually harmful to the acceptance and usage of the format.

The main purpose of the unique identifiers, as we understand it, is to handle the following scenario: A DIGGS file is transmitted from a site investigation (SI) contractor to a consultant. However, this is partial information. At a later date additional information is transmitted. In the original transmission, a borehole was given the name “BH51”. This was a mistake and it should have been named “BH15”. The corrected data is included with the later transmission. With an AGS transmission, based strictly on meaningful key values, a new borehole would be created and the incorrect borehole would remain. With a DIGGS transmission, as long as the SI contractor and consultant maintained the unique identifiers, the original borehole would be properly renamed and there would be no invalid data in the database.
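
The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration of the scenario above: the record layout, the field names "uid" and "name", and both merge functions are hypothetical and are not taken from the AGS or DIGGS specifications.

```python
# Hypothetical records: "uid" stands in for a DIGGS unique identifier,
# "name" for the AGS-style key field.
first = [{"uid": "a1", "name": "BH51", "depth": 12.0}]
later = [{"uid": "a1", "name": "BH15", "depth": 12.0}]


def merge_by_key(db, records):
    """Key-based merge: the borehole name is the record's identity."""
    for rec in records:
        db[rec["name"]] = rec
    return db


def merge_by_uid(db, records):
    """Identifier-based merge: the arbitrary uid is the record's identity."""
    for rec in records:
        db[rec["uid"]] = rec
    return db


key_db = merge_by_key(merge_by_key({}, first), later)
uid_db = merge_by_uid(merge_by_uid({}, first), later)

print(sorted(key_db))                        # the stale BH51 survives next to BH15
print([r["name"] for r in uid_db.values()])  # the single borehole has been renamed
```

The key-based database ends up with two boreholes, while the identifier-based one holds a single renamed borehole, provided both parties kept the same uid.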

A secondary, but just as important, reason for unique identifiers is for cross hierarchy relationships. For example, the HandVane table has a field called specimen that optionally links a hand vane result to a lab specimen, thus indicating that this result was run on a lab specimen and not on a hole or excavation face. The contents of this specimen field would be the unique identifier of the specimen on which the hand vane was run.

These two reasons are good ones. However, in the case of the erroneous key field, we feel the cost of this methodology is much higher than the return and, in the case of the cross hierarchy relationship, there exists another way.

First let’s look at the cost of unique identifiers. Although it is anticipated that they shall be unique on a file basis, in reality they must be unique in the world for all time. This is because DIGGS data could be originally transmitted with two particular projects in one file, but at a later time they may be retransmitted by the data receiver with any combination of projects in one file. This makes each identifier quite long and these must be assigned to every record: every borehole, every sample, every reading in a particle distribution test. The file bloat will be huge.

In the first draft of the AGS to DIGGS translator, file sizes went up by two to three orders of magnitude. Files that were 20KB went up to megabytes. This was before the unique identifiers. In some cases, e.g., particle size distribution data, the unique identifiers could easily take up more space than the data.
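
As a rough illustration of the last point, the sketch below compares the length of one particle size reading with the length of a globally unique identifier. It assumes the identifier is a UUID in its canonical 36-character text form, which is our assumption and not something the DIGGS draft prescribes; the reading itself is a made-up example.

```python
import uuid

# Hypothetical particle-size reading: sieve size (mm) and percent passing.
reading = "0.425,63.2"

# Assumption: the identifier is a UUID written in canonical text form.
identifier = str(uuid.uuid4())

overhead = len(identifier) / len(reading)
print(len(reading), len(identifier), round(overhead, 1))
```

Under these assumptions the identifier is several times longer than the reading it labels, before any markup is counted.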

Second, both the data generator and data receiver must store these identifiers in their databases to avoid the erroneous key field problem. Again, this carries a high cost in both database size and complexity. Further, it is an additional change they would need to make to support DIGGS, and a high price to pay if it is not necessary.

Third, this requirement of maintaining the unique identifier may severely limit the choices available to the database designer and force the database to mimic the data transmission structure. This is exactly the opposite of what a good transmission standard is meant to achieve.

To illustrate the problem, let’s assume that the data receiver is only interested in final laboratory results and has set up a database with one table holding moisture content, unit weights, and Atterberg Limits results. They are not interested in storing the roles, specifications, etc. for each test in their database, just the final results. The metadata can be accessed from the original DIGGS file, if necessary. This is a reasonable configuration and is very common today among consultants in the UK, since the AGS structure was set up this way and many mimicked this structure. The problem comes in with the unique identifier. Each of these tests is in a separate table in the DIGGS format and each result is associated with a different unique identifier, even within the same specimen. To maintain the unique identifiers, the data receiver would be forced to insert an additional field for each data field to store its identifier. This is awkward, to say the least. Further, relying on the unique identifier to indicate key changes with this configuration makes correction of erroneous keys much more difficult. The consultant would probably have to change the database structure to match the DIGGS format. While this may be desirable, it would insert many fields and tables in which the user has no interest. Besides the database reorganization, the consultant would be forced to change all their reports and queries to reference the new field locations. Organizations currently using an AGS-compliant database should be able to continue with their structure with little or no change.

Fourth, using unique identifiers to solve the erroneous key problem would require software to be modified to handle them. This is not a huge investment but it is a significant one. This is on top of the already extensive effort to put DIGGS support in place.

Finally, even with unique identifiers, the traditional key fields must be maintained. Otherwise it would be possible, for example, to have two boreholes with the same name (but different identifiers) in the same project and the DIGGS file would still be valid.
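
A minimal sketch of the key check that would still be needed is given below. The record layout and the function name duplicate_names are hypothetical and serve only to illustrate the point.

```python
from collections import Counter


def duplicate_names(boreholes):
    """Return (project, name) pairs that occur more than once."""
    counts = Counter((b["project"], b["name"]) for b in boreholes)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical data: every uid is distinct, yet two boreholes share a name.
boreholes = [
    {"uid": "x1", "project": "P1", "name": "BH1"},
    {"uid": "x2", "project": "P1", "name": "BH1"},
    {"uid": "x3", "project": "P1", "name": "BH2"},
]

uids_unique = len({b["uid"] for b in boreholes}) == len(boreholes)
dups = duplicate_names(boreholes)
print(uids_unique, dups)
```

The identifiers pass a uniqueness test, but the data is still ambiguous to a person reading borehole names, so the key-based check cannot be dropped.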

We consider the overhead of unique identifiers not worth the price to solve the one problem of erroneous keys. While not a perfect alternative, we believe the traditional reliance on a strictly key-based system to be the much lesser of the two evils. Erroneous keys would need to be handled via communications between data generators and data consumers.

Regarding the issue of cross hierarchy relations, a unique identifier could be used only in the places where it is necessary. However, these identifiers could be arbitrary and unique within a file. They can be changed on each transmission. They only need to identify the relationship properly in a transmission. The receiving program can then map the relationship using whatever system it wishes.
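
How a receiving program might map such file-local identifiers onto its own keys is sketched below. The field names "local_id", "key" and "specimen", and the function resolve_links, are hypothetical; the local identifiers are meaningful only within one transmission and are discarded once the links are resolved.

```python
def resolve_links(specimens, hand_vanes):
    """Replace file-local specimen ids with the receiver's own key tuples."""
    # Map each file-local id to the key tuple the receiver actually stores.
    local_to_key = {s["local_id"]: s["key"] for s in specimens}
    resolved = []
    for hv in hand_vanes:
        ref = hv.get("specimen")  # optional link; None means no specimen
        resolved.append({
            "hole": hv["hole"],
            "depth": hv["depth"],
            "specimen_key": local_to_key[ref] if ref is not None else None,
        })
    return resolved


specimens = [{"local_id": "s7", "key": ("BH1", 4, "U100", 1, 4.5, "A")}]
hand_vanes = [
    {"hole": "BH1", "depth": 5.0, "specimen": "s7"},
    {"hole": "BH2", "depth": 3.0},  # run in the hole, not on a specimen
]
links = resolve_links(specimens, hand_vanes)
print(links)
```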

An alternative to identifiers for cross hierarchy relations is again to use only key fields. A hand vane could indicate it was run on a specimen by explicitly referencing the key fields of the related specimen. For example, using the AGS key fields, suppose a hand vane in borehole BH1 at 5 meters references a specimen in borehole BH1 with sample top depth = 4, sample type = U100, sample reference = 1, specimen depth = 4.5, specimen reference = A. The specimen field in the HandVane table for this record could then read:
table = Specimen, id = BH1,4,U100,1,4.5,A

Obviously work would need to be done on the syntax of such a specification but this technique gives a mechanism to perform cross hierarchy relations between any tables without the introduction of additional identifier fields. All that would be required is a “link” field in the dependent table. No additional field is needed in the target table. The same mechanism could be set up linking SPT results to associated samples. Other relationships may be desired as well. An added minor advantage is that the link becomes human-readable.
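
To show that such a link field is straightforward to process, here is a small parser for the illustrative syntax used above. The syntax is only the example from this note, not an agreed specification, and the function name parse_link is hypothetical. The sketch assumes key values never contain commas; a real syntax would need an escaping rule.

```python
def parse_link(text):
    """Parse 'table = Specimen, id = BH1,4,U100,1,4.5,A' into (table, keys)."""
    table_part, sep, id_part = text.partition(", id =")
    if not sep or not table_part.strip().startswith("table ="):
        raise ValueError(f"not a link field: {text!r}")
    table = table_part.split("=", 1)[1].strip()
    # Key values are kept as strings; the receiver decides how to type them.
    keys = tuple(k.strip() for k in id_part.split(","))
    return table, keys


table, keys = parse_link("table = Specimen, id = BH1,4,U100,1,4.5,A")
print(table, keys)
```

The receiving program can then look up the target record by its ordinary key fields, with no extra identifier stored on the specimen.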

Another reason for abandoning this unique identifier concept is that DIGGS is complex enough as it is. Trying to solve all problems in the initial draft can make the format unworkable. Let’s stay as simple as we can while still delivering the functionality that most end users require.