Forensic methods exist to predict the age of a donor from a blood sample. AI algorithms now build upon these methods to predict a donor's age with greater accuracy via complex pattern recognition.
You have probably seen crime dramas where a case is solved by matching a DNA sample from a crime scene to a suspect or to a subject in a database. This method is called DNA fingerprinting or STR analysis. Matches are established based on specific sites whose sequences vary widely in the human population (geneticists say that such sites of the genome are highly polymorphic). STR stands for short tandem repeats because each site consists of 5–100 repeats of a short sequence, like CTG for instance. In the United States, the FBI uses a standard set of 20 sites (the CODIS core loci) together with a national database called NDIS.
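To make the idea of a repeat count concrete, here is a toy Python sketch that counts how many back-to-back copies of a motif a sequence contains. The function name `longest_repeat_run` and both sequences are made up for illustration. Forensic labs determine repeat numbers with laboratory assays, not with a string search like this one.

```python
import re

def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)

# Two invented alleles of the same hypothetical site: they differ only
# in how many times the CTG motif is repeated.
allele_a = "GATTA" + "CTG" * 7 + "CCAGT"
allele_b = "GATTA" + "CTG" * 12 + "CCAGT"

print(longest_repeat_run(allele_a, "CTG"))
print(longest_repeat_run(allele_b, "CTG"))
```

Two people who carry different repeat counts at a site can be told apart at that site, and combining many sites makes a coincidental match very unlikely.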
This is an extremely powerful tool for investigation, especially when it comes to convicting a suspect. However, it is not very useful during the initial investigation because it requires a suspect to compare against. But DNA may reveal more than just identity. Aside from the A, T, C and G nucleotides that we typically sequence, our DNA also retains epigenetic patterns that are influenced by our environment. One of these is DNA methylation: a methyl group attaches to a nucleotide (which may inhibit the associated gene) and is conserved through all daughter cells. Methylation occurs almost exclusively at CpG sites (these are any cytosine followed by a guanine in the 5’ to 3’ direction in the genome, and they are very common). Due to small inaccuracies in how these marks are maintained, we gain and lose methyl groups at CpG sites over time.
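Because a CpG site is simply a C immediately followed by a G when reading 5’ to 3’, finding candidate sites in a sequence is easy to sketch in Python. The function name `cpg_positions` and the example sequence are hypothetical, and the code only locates the sites. It says nothing about whether they are methylated, which has to be measured in the lab.

```python
def cpg_positions(sequence: str) -> list[int]:
    """Return the 0-based index of every C that is directly followed by a G."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

# Invented sequence, read 5' to 3'
example = "ATCGGCGTACG"
print(cpg_positions(example))
```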
Scientists can measure DNA methylation levels in blood samples from subjects of various ages. Sixteen CpG sites are known to have methylation levels highly correlated with the individual’s age. These correlations, obtained from subjects of known age, provide a formula that lets labs work backwards: from measured methylation levels, one’s age can be estimated with some accuracy.
An issue with this approach is that the prediction accuracy depends on the statistical model. The accuracy may be skewed by some biases in the data used for calibration, but even if the training data is reliable and representative, the analysis method may affect the predictions.
Illustrating the concepts of underfitting (left) and overfitting (right). The model on the left misfits the data. The model on the right most closely matches the data points, but does not match the true function as well as the model in the middle does. Only the model in the middle would be able to accurately predict the y value associated with new x values.
A typical error in statistical analysis is overfitting. One may develop a model that matches the data, but not the population it is meant to represent. In our case, we are interested in how accurately current models of age prediction, which are built from known samples, can match the true age of any individual in the population (not just those in the sample). Current models are already somewhat accurate at predicting a person’s age based on a blood sample, but in 2017, Athina Vidaki and collaborators asked whether AI-generated algorithms based on Artificial Neural Networks (ANNs) produce a more accurate model. ANNs typically teach themselves which information to exclude and which to retain based on the data itself, so the analyst does not have to make such decisions (machines are usually better at this). This is also why ANNs can make better use of very large datasets than other models.
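Overfitting can be reproduced in a few lines of Python. The sketch below uses invented numbers, not methylation data: the true relationship is y = 2x, and the training sample carries alternating measurement noise. It compares a straight-line fit with a hypothetical "memorizer" that simply returns the y value of the closest training point. All function names are chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the points; returns a predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Return the y of the closest training x: a perfect fit to the sample."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)

# Training sample: true function y = 2x plus alternating noise of +3 / -3
train_x = list(range(10))
train_y = [2 * x + (3 if x % 2 == 0 else -3) for x in train_x]
# New, noise-free points that were not part of the training sample
test_x = [x + 0.4 for x in range(10)]
test_y = [2 * x for x in test_x]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)
for name, model in (("line", line), ("memorizer", memorizer)):
    print(name,
          "train error:", round(mean_abs_error(model, train_x, train_y), 2),
          "test error:", round(mean_abs_error(model, test_x, test_y), 2))
```

The memorizer reproduces the training sample exactly, yet it does worse than the line on the new points, because it has learned the noise along with the signal.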
DNA methylation levels across 16 CpG sites are predictive of age, but the association at each site is unique. Methylation levels at CpG site #1 may have a steep but weak positive correlation, those at site #2 may have a strong but nearly flat negative correlation, and site #3 may have a correlation with higher individual variance at higher ages. Standard statistical models are mathematical frameworks that can detect these patterns and predict age from these levels. Multiple linear regression is often used to combine and weight these correlations.
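As a rough sketch of what multiple linear regression does with such data, the Python below fits one weight per site plus an intercept by solving the least-squares normal equations. Everything in it is an assumption made for illustration: three sites instead of 16, invented methylation levels, and ages generated without noise from the made-up formula age = 10 + 90·m1 − 30·m2 + 20·m3, so that the fit can recover the weights exactly. It is not the model of any published study.

```python
def fit_multiple_regression(features, targets):
    """Least-squares weights [intercept, w1, w2, ...] via the normal equations."""
    rows = [[1.0, *row] for row in features]  # prepend the intercept column
    k = len(rows[0])
    # Augmented system [A^T A | A^T y]
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(system[r][col]))
        system[col], system[pivot_row] = system[pivot_row], system[col]
        pivot = system[col][col]
        system[col] = [v / pivot for v in system[col]]
        for r in range(k):
            if r != col:
                factor = system[r][col]
                system[r] = [a - factor * b
                             for a, b in zip(system[r], system[col])]
    return [system[i][k] for i in range(k)]

def predict_age(weights, levels):
    """Weighted sum of methylation levels plus the intercept."""
    return weights[0] + sum(w * m for w, m in zip(weights[1:], levels))

# Invented methylation fractions at three hypothetical CpG sites
levels = [(0.2, 0.8, 0.3), (0.4, 0.7, 0.5), (0.5, 0.5, 0.4),
          (0.7, 0.4, 0.8), (0.6, 0.6, 0.6), (0.3, 0.9, 0.7)]
# Ages generated from the made-up formula, with no noise
ages = [10, 35, 48, 77, 58, 24]

weights = fit_multiple_regression(levels, ages)
print([round(w, 3) for w in weights])
print(round(predict_age(weights, (0.5, 0.6, 0.5)), 3))
```

With real measurements the ages would not fall exactly on such a formula, and the fitted weights would only approximate the underlying trends.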
ANNs can do the same and more: they can also pick up patterns that are hard to spell out as explicit rules. This is similar to how our eyes can visually recognize objects according to their gestalt rather than analyzing the individual features first, or how we learn grammar in a language after seeing countless examples. That understanding is more profound and complex than can be captured through a set of specific rules that we can describe.
Comparison between traditional multiple regression analysis and the ANN model of Vidaki et al. (2017) for age prediction from blood samples. Both models use the same data. The two left figures depict predicted age vs. true age. The two right figures depict the magnitude of the prediction error. The two top figures show the predictions of traditional statistical analysis and the two bottom figures show those of the ANN model. The correlation coefficient between predicted age and true age in the left figures is expected to be higher in the more accurate model. The prediction error in the right figures is expected to be lower in the more accurate model.
The correlation between predicted age and true age is stronger with the ANN model (R² = 0.964) than with multiple regression (R² = 0.923). The figure above shows that the mean prediction error was relatively constant across age groups for multiple regression (top right). Meanwhile, the ANN model had lower total error but overestimated age in younger age groups and underestimated age in older age groups (bottom right). Counter-intuitively, correcting the biases in the predictions would make them less accurate because the ANN traded the increase in bias against a reduction in variance. This is the case with multiple regression for ages lower than 5 years, where the ANN is actually better calibrated. On this dataset, the flexibility of ANNs proved useful to better negotiate the bias–variance tradeoff.
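The tradeoff can be illustrated with a small identity: the mean squared error of a set of predictions equals the squared mean error (the bias) plus the spread of the errors around that mean (the variance). The Python sketch below applies it to two invented lists of prediction errors in years. The numbers and names are hypothetical and are not taken from Vidaki et al. (2017).

```python
def decompose(errors):
    """Split prediction errors into bias, variance and mean squared error."""
    n = len(errors)
    bias = sum(errors) / n
    variance = sum((e - bias) ** 2 for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return bias, variance, mse

# Invented errors (predicted age minus true age, in years)
unbiased_but_noisy = [-6, 6, -6, 6]   # no systematic error, large spread
biased_but_steady = [2, 4, 2, 4]      # overestimates on average, small spread

for name, errors in (("unbiased but noisy", unbiased_but_noisy),
                     ("biased but steady", biased_but_steady)):
    bias, variance, mse = decompose(errors)
    print(name, "bias:", bias, "variance:", variance, "MSE:", mse)
```

In this toy case the biased predictions have the lower total error, which is why a small bias can be worth accepting when it comes with a large reduction in variance.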
Most likely, the remaining biases of the ANN will wash away as the training data set becomes larger, provided it is representative. But the predictions will never become perfect. Every biological process has some intrinsic uncertainty that nothing can remove. All ANNs can do is reduce the extra uncertainty that is due to making modeling decisions and choosing individuals for the training sample.
Learn more about forensic analysis and machine learning
- Interview about the development of forensic predictions (YouTube video)
- Coding an ANN for age prediction via facial recognition (YouTube video)
About the Author
Ethan is a recent graduate of the University of Toronto with a double major in Biology and Psychology. In his spare time, he enjoys using statistics and spreadsheets to optimize his competitive video game play. He hopes to begin a career in DNA forensics in the near future.
Vidaki, A., Ballard, D., Aliferi, A., Miller, T. H., Barron, L. P., & Court, D. S. (2017). DNA methylation-based forensic age prediction using artificial neural networks and next generation sequencing. Forensic Science International: Genetics, 28, 225-236. https://doi.org/10.1016/j.fsigen.2017.02.009