Over the past few months, my colleagues and I have been building secure personal genomics tools on cutting-edge cryptographic platforms. Having already investigated MPC and ZK, we are now focusing on fully homomorphic encryption (FHE) using Zama. Since Zama just posted about their experiments with genetic data, I thought I'd write a quick overview of our adventures as well.
The promise of FHE is that valuable data can be handled in a fully encrypted manner: the data never needs to be decrypted, even as computations are performed on it. Although the general technique has been around for roughly fifteen years and has been investigated for a variety of use cases, it has generally not been considered performant enough for production use. We wanted to see whether Zama's recent implementation, TFHE-rs, scales well enough to support analysis over a single consumer-grade human genome.
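To make the idea concrete, here is a toy example of that pattern using TFHE-rs's documented high-level API: two numbers are encrypted, added while still encrypted, and only the result is ever decrypted.

use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheUint8};

fn main() {
    // The client generates keys and keeps the client key private.
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key);

    let a = FheUint8::encrypt(3u8, &client_key);
    let b = FheUint8::encrypt(4u8, &client_key);
    let c = &a + &b; // computed on ciphertexts; neither input is decrypted
    let plain: u8 = c.decrypt(&client_key);
    assert_eq!(plain, 7);
}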
We loaded an example 23andMe dataset for a single individual, around 500,000 lines long. Here are some representative lines from the dataset:
# rsid chromosome position genotype
rs75333668 1 762320 CC
rs11240777 1 798959 GG
rs4475691 1 846808 CC
rs200686669 1 861349 CC
rs13302982 1 861808 GG
rs201186828 1 865545 GG
rs148711625 1 865584 GG
rs146327803 1 865625 GG
Each line above represents a SNP (single nucleotide polymorphism) along with its corresponding genotype; in essence, the content of the genetic code at a certain position of the genome. The "value" of these SNPs can reveal things about an individual's health traits, e.g. their muscle performance.

During the loading process we have to encode the information as numbers so that it can eventually be processed using FHE:
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn encode_rsid(rsid: &str) -> u64 {
    // Hash the numeric part of the rsid into a u64 key.
    let mut hasher = DefaultHasher::new();
    rsid.trim_start_matches("rs").hash(&mut hasher);
    hasher.finish()
}
fn encode_genotype(genotype: &str) -> u8 {
    match genotype {
        "AA" => 1,
        "AC" | "CA" => 2,
        "AG" | "GA" => 3,
        "AT" | "TA" => 4,
        "CC" => 5,
        "CG" | "GC" => 6,
        "CT" | "TC" => 7,
        "GG" => 8,
        "GT" | "TG" => 9,
        "TT" => 10,
        // Optionally handle degenerate cases or mixed calls, often seen in some genetic data
        "NN" => 11, // 'N' often represents an unknown or undetermined base
        // Handle cases where genotype is not provided or is erroneous
        "" | "???" => 12, // This can be expanded based on the data set's specifics
        _ => 0, // For any other unexpected or incomplete genotypes
    }
}
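Putting the two encoders together, a small parser can map each raw row to an (encoded_rsid, encoded_genotype) pair. This is an illustrative sketch; parse_line is a hypothetical helper, not code from the repository:

// Turn one raw 23andMe row into an (encoded_rsid, encoded_genotype) pair.
fn parse_line(line: &str) -> Option<(u64, u8)> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    if line.starts_with('#') || fields.len() < 4 {
        return None; // skip the header and malformed rows
    }
    Some((encode_rsid(fields[0]), encode_genotype(fields[3])))
}

For example, parse_line("rs4475691\t1\t846808\tCC") would yield the hashed rsid paired with 5, the code for CC.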
We then encrypt the data using Zama's encrypted integer types. Notice that we are using compressed types, i.e. CompressedFheUint8 instead of FheUint8, to greatly reduce memory consumption:
fn encrypt_genotypes_for_zama(
    processed_data: &HashMap<u64, u8>,
    client_key: ClientKey,
) -> result::Result<HashMap<&u64, CompressedFheUint8>, Error> {
    let mut enc_data = HashMap::new();
    for (encoded_rsid, encoded_genotype) in processed_data {
        let genotype_encrypted =
            CompressedFheUint8::try_encrypt(*encoded_genotype, &client_key).unwrap();
        enc_data.insert(encoded_rsid, genotype_encrypted);
    }
    Ok(enc_data)
}
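For context, here is a minimal key-setup sketch following TFHE-rs's standard pattern; processed_data stands for the HashMap<u64, u8> built by the encoders above, and the clone assumes ClientKey's Clone implementation:

use tfhe::{generate_keys, set_server_key, ConfigBuilder};

let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
set_server_key(server_key); // must be set before any homomorphic operations

// processed_data: HashMap<u64, u8> built during loading
let encrypted_genotypes = encrypt_genotypes_for_zama(&processed_data, client_key.clone())?;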
Now that our data is loaded and encrypted, we can write some functions to operate on it in a completely encrypted manner! This function checks the value of a single SNP to assess a health trait, which is the minimum we need to do something useful with the dataset:
fn check_genotype(
    encrypted_genotypes: &HashMap<&u64, CompressedFheUint8>,
    rsid: &str,
    genotype: &str,
    client_key: ClientKey,
) -> result::Result<bool, Error> {
    let encoded_rsid = encode_rsid(rsid);
    let encoded_genotype = encode_genotype(genotype);
    let encrypted_genotype = encrypted_genotypes.get(&encoded_rsid).unwrap();
    let decompressed_encrypted_genotype = encrypted_genotype.decompress();
    let encrypted_target = FheUint8::try_encrypt(encoded_genotype, &client_key).unwrap();
    Ok(decompressed_encrypted_genotype.eq(encrypted_target).decrypt(&client_key))
}
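For example, to test whether this individual carries the CC genotype at rs4475691 (one of the sample rows above), something like the following should work; the info! logging macro is assumed from the log output further down:

// Does the encrypted dataset say this individual is CC at rs4475691?
let is_cc = check_genotype(&encrypted_genotypes, "rs4475691", "CC", client_key.clone())?;
info!("Lookup result: {}", is_cc); // true for the sample rows shown earlier

Note that the equality test itself runs homomorphically via eq; only the final boolean is decrypted.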
Next is a function that goes through the entire dataset to do some arbitrary calculations, which assures us that we can run more complex analyses and come up with truly meaningful personal genomic results:
fn get_genotype_frequencies(
    encrypted_genotypes: &HashMap<&u64, CompressedFheUint8>,
    client_key: ClientKey,
) -> HashMap<u64, u64> {
    let mut genotype_frequencies = HashMap::new();
    for (_encoded_rsid, encrypted_genotype) in encrypted_genotypes {
        let decompressed_encrypted_genotype = encrypted_genotype.decompress();
        let decrypted_genotype = decompressed_encrypted_genotype.decrypt(&client_key);
        let count = genotype_frequencies.entry(decrypted_genotype).or_insert(0);
        *count += 1;
    }
    genotype_frequencies
}
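Note that this version decrypts each genotype before tallying. If we wanted the tally itself to stay encrypted, a variant along these lines should work; this is our own sketch rather than code from the repository, and it assumes TFHE-rs's FheUint32 and FheBool::if_then_else:

// Count how many SNPs match one encoded genotype, without decrypting anything.
fn count_genotype_encrypted(
    encrypted_genotypes: &HashMap<&u64, CompressedFheUint8>,
    target: u8,
    client_key: &ClientKey,
) -> FheUint32 {
    let zero = FheUint32::try_encrypt(0u32, client_key).unwrap();
    let one = FheUint32::try_encrypt(1u32, client_key).unwrap();
    let encrypted_target = FheUint8::try_encrypt(target, client_key).unwrap();
    let mut count = zero.clone();
    for encrypted_genotype in encrypted_genotypes.values() {
        let genotype = encrypted_genotype.decompress();
        let is_match = genotype.eq(&encrypted_target); // FheBool, still a ciphertext
        count = count + is_match.if_then_else(&one, &zero); // add an encrypted 0 or 1
    }
    count // only the holder of the client key can decrypt this tally
}

The trade-off is that every increment is now a homomorphic operation, so this runs considerably slower than the decrypt-then-count version above.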
The whole code can be viewed at https://github.com/Cryptonomic/MonadicDNA/blob/5f485db8d64bb3c2d68821172709ad589cc5b250/zama-poc/src/main.rs. The link references a specific git commit since the code is changing constantly; our ongoing objective is to build a working alpha with TFHE-rs. Here's some example output from our program, run on an M2 MacBook Air:
2024-07-15 23:49:24 [INFO] - Hello, Zama!
2024-07-15 23:49:24 [INFO] - Setting up Zama env
2024-07-15 23:49:24 [INFO] - Number of lines to process: 1000000
2024-07-15 23:49:25 [INFO] - Lines of processed data: 553190
2024-07-15 23:51:39 [INFO] - Lines of encrypted data: 553190
2024-07-15 23:51:39 [INFO] - Lookup result: true
2024-07-15 23:51:39 [INFO] - Lookup result: false
2024-07-15 23:53:53 [INFO] - Genotype frequencies: {9: 9021, 10: 81789, 7: 39357, 4: 341, 2: 9106, 5: 139432, 3: 39307, 0: 12358, 1: 81948, 6: 711, 8: 139820}
2024-07-15 23:53:53 [INFO] - run_iteration took: 269.620158125s
2024-07-15 23:53:53 [INFO] - Bye, Zama!
The high-level findings are:
– We can encrypt over half a million data points in about two minutes.
– Looking up a single SNP in the encrypted data is effectively instantaneous.
– Running an analysis over the entire encrypted dataset takes a bit over two minutes.
– All of this takes under 500 MB of RAM.
– Incidentally, uncompressed types bring the memory usage to 6.49 GB.
Our preliminary conclusion is that Zama's TFHE-rs can indeed be used to create a performant and scalable encrypted genomics platform! The compute and memory requirements are modest enough to support viable consumer as well as enterprise solutions.
Following on from this successful experiment, we are going to check network latencies by testing a client/server model, test more intensive scaling with more complex whole-genome analyses and, finally, run some basic machine learning algorithms to test aggregated operations across data from multiple individuals in one shot. If we can get all this together, we will build out a production-grade platform.
If you made it this far and are interested in our progress, follow me on X or LinkedIn, follow the repository on GitHub, and check the demo site at https://monadicdna.com/ regularly!