[DAMD-94] Fix discrepancy between createHash and datamodel.txt Created: 09/Nov/20  Updated: 05/Jan/21  Resolved: 14/Nov/20

Status: Done
Project: Data Model
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Normal
Reporter: hassan Assignee: hassan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
blocks INSTRM-1096 Add {{pfs_visit.pfs_design_id}} column Done
Story Points: 1
Sprint: 2DDRP-2021 A

 Description   

Currently datamodel.txt defines a hash as a 63-bit unsigned integer, stored in a 64-bit signed integer:

https://github.com/Subaru-PFS/datamodel/blob/7af03e7b2adba3b5e190995b54296d70683e6d17/datamodel.txt#L28

In various places I refer to a SHA-1, which is a strong 160-bit hash, as used by e.g. git
(https://en.wikipedia.org/wiki/SHA-1). We truncate these hashes to 63bits (so as to fit
in standard 64-bit signed integers). Sixty-three bits would produce up to 2^63 ~ 9e18 values.
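
The truncation described above can be sketched as follows. This is a minimal illustration, not the project's createHash: the helper name truncated_sha1, the use of str() to serialise the inputs, and the choice to keep the lowest bits are all assumptions made for the example.

```python
import hashlib

INT64_MAX = 2**63 - 1  # largest value of a 64-bit signed integer


def truncated_sha1(*args, bits=63):
    """Return the lowest `bits` bits of a SHA-1 digest as a non-negative int.

    Hypothetical helper for illustration only; keeping the lowest bits and
    serialising inputs with str() are assumptions, not the project's method.
    """
    m = hashlib.sha1()
    for arg in args:
        m.update(str(arg).encode("utf-8"))
    # SHA-1 yields 160 bits; mask away everything above the requested width.
    return int(m.hexdigest(), 16) & ((1 << bits) - 1)


h = truncated_sha1(1, 2, 3)
# A 63-bit value is always within the non-negative range of a signed int64.
assert 0 <= h <= INT64_MAX
print(f"0x{h:016x}")
```

With bits=63 the top bit of the 64-bit word is always zero, so the value reads the same whether the word is interpreted as signed or unsigned.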

This is currently inconsistent with the datamodel.utils.createHash() function, where a 64-bit hash is generated:

https://github.com/Subaru-PFS/datamodel/blob/fa98c08c8ac839956f0d4e8489523e4898894a8b/python/pfs/datamodel/utils.py#L51

Fix this discrepancy following the proposal by Sogo Mineo and Craig Loomis in the datamodel channel on 2020-11-06: update the datamodel.txt text quoted above so that it specifies a 64-bit hash, in line with createHash, that can be stored in a standard 64-bit signed integer.

This will allow identifiers that use this hash, such as the pfsDesignId, to be stored in the opDB Postgres database using a standard bigint (int8) data type, without the need for additional conversion routines, as discussed in INSTRM-1096.


 Comments   
Comment by hassan [ 09/Nov/20 ]

Other identifiers such as the objId are also affected. Hassan is investigating whether the Gaia DR2 object identifier, the sourceId, can be stored in the positive range of a 64-bit signed int to avoid possible confusion. Yabe-san is checking the HSC ID.

Comment by sogo.mineo [ 10/Nov/20 ]

I think the hexadecimal notation of the 64-bit hash will stay unsigned, as it is now, e.g. 0xfedcba9876543210.
What will the signed value be for x = 0xfedcba9876543210?
I think fits_cast_unsigned_to_signed(x) is currently used.
What I meant in Slack was cplusplus_cast_unsigned_to_signed(x).
I don't have a preference, but we should decide which one to use.

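def fits_cast_unsigned_to_signed(x):
    # Offset mapping: shift the whole unsigned range down by 2**63.
    return x - 0x8000_0000_0000_0000

def cplusplus_cast_unsigned_to_signed(x):
    # Two's-complement reinterpretation: subtract 2**64 when the top bit is set.
    return x - ((x & 0x8000_0000_0000_0000) << 1)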
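
To make the difference between the two conversions above concrete, here is a small worked example for x = 0xfedcba9876543210. The function bodies repeat the ones above; the SIGN_BIT constant and the struct-based cross-check are additions for illustration, and reading the first function as an offset-by-2**63 convention is an interpretation of its name and body.

```python
import struct

SIGN_BIT = 0x8000_0000_0000_0000  # 2**63


def fits_cast_unsigned_to_signed(x):
    # Offset mapping: every unsigned value is shifted down by 2**63.
    return x - SIGN_BIT


def cplusplus_cast_unsigned_to_signed(x):
    # Two's-complement reinterpretation: subtract 2**64 if the top bit is set.
    return x - ((x & SIGN_BIT) << 1)


x = 0xfedcba9876543210
offset_value = fits_cast_unsigned_to_signed(x)
twos_complement_value = cplusplus_cast_unsigned_to_signed(x)

print(hex(offset_value))           # 0x7edcba9876543210
print(hex(twos_complement_value))  # -0x123456789abcdf0

# Cross-check: reinterpreting the same 8 bytes as a signed integer gives the
# two's-complement result, not the offset one.
assert twos_complement_value == struct.unpack("<q", struct.pack("<Q", x))[0]
```

The two conventions give different signed values for the same hash, so a writer and a reader that disagree on the convention would not recover the same identifier.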
Comment by hassan [ 13/Nov/20 ]

Following subsequent discussions with cloomis and sogo.mineo: truncating the original 160-bit SHA-1 to 64 bits, rather than to 63 bits, would cause more confusion and problems than it solves.

For example, if the resulting 64-bit hash is carried internally as a 64-bit signed integer (for example when read from the opDB, where the column type is 64-bit signed), then approximately 50% of all generated hashes would be read back as negative values. Care would then be needed when using them, for example when writing such hash values as hex representations in file names, as the LSST Gen2 Butler does (tests show that the Butler would in fact raise an error in such situations).
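
The effect described above can be reproduced with a short sketch. It assumes two's-complement reinterpretation when the value is read as signed; the helper names as_signed64 and as_unsigned64 are hypothetical, the sample inputs are arbitrary, and the sketch does not model the Butler's behaviour.

```python
import hashlib

MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def as_signed64(u):
    """Reinterpret an unsigned 64-bit value as a two's-complement int64."""
    return u - (1 << 64) if u & (1 << 63) else u


def as_unsigned64(s):
    """Map a signed 64-bit value back to its unsigned bit pattern."""
    return s & MASK64


# Truncate SHA-1 digests of some arbitrary inputs to 64 bits.
hashes = [
    int(hashlib.sha1(str(i).encode()).hexdigest(), 16) & MASK64
    for i in range(1000)
]
negative = sum(as_signed64(h) < 0 for h in hashes)
print(f"{negative} of {len(hashes)} hashes are negative when read as int64")

s = as_signed64(0xfedcba9876543210)
print(f"{s:x}")                      # -123456789abcdf0 (minus sign, not a bit pattern)
print(f"{as_unsigned64(s):016x}")    # fedcba9876543210
```

Roughly half of the sampled hashes have the top bit set. Formatting a negative Python integer as hex yields a leading minus sign rather than the original 16-digit pattern, so any code that builds names from the hex form has to convert back to unsigned first.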

It appears to be much simpler and safer to truncate the SHA-1 to 63 bits. This way, all hash values are non-negative when stored as 64-bit signed integers.

Comment by hassan [ 14/Nov/20 ]

merged to master (d472332)
