Environmental forensic investigations routinely generate two parallel and difficult-to-manage evidence streams. The first is large collections of historical documents accumulated over decades of site operation, regulatory oversight, and remedial investigation. The second is high-dimensional analytical chemistry datasets comprising hundreds to thousands of samples across multiple contaminant classes. Extracting defensible conclusions from either stream is time-consuming. Integrating them is harder still, yet it is where the most valuable insight lies.
This presentation describes two methodological frameworks developed to address these long-standing practical challenges.
The first framework addresses document-intensive investigations. Cases involving legacy industrial sites and US Superfund sites routinely involve tens of thousands of pages of reports, deposition transcripts, regulatory submissions, and scientific literature. Finding relevant information by conventional means is slow, inconsistent, and prone to omission. You have likely tried general-purpose AI tools. Unfortunately, these lack familiarity with environmental chemistry nomenclature, regulatory frameworks, and the specialized jargon of contaminant investigations, and they carry no guarantee of scientific grounding. They also demand careful prompt engineering to get close to the right answer. The approach described here uses a retrieval-augmented generation (RAG) system, built on a large language model and trained within the environmental domain, giving it fluency in the language and concepts practitioners use daily. Knowledge nodes, which are curated domain-specific reference sets drawn from vetted peer-reviewed literature and systematic reviews, anchor AI outputs to citation-traceable scientific material. The result is a system that responds to natural language queries across an indexed document corpus and returns answers that are specific, cited, and defensible. This is demonstrated using publicly available records from the Centredale Manor Superfund site and the US EPA Superfund Site webpages.
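To make the retrieval half of a RAG workflow concrete, the sketch below ranks passages against a natural language query and returns each hit with its document and page, so any downstream answer can be cited. It is an illustration under stated assumptions, not the system described in this presentation: the three-passage corpus is invented, the function names are hypothetical, plain TF-IDF with cosine similarity stands in for learned embeddings, and the language-model generation step is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus of (document id, page, text). Not real site records.
CORPUS = [
    ("report_A", 12, "Sediment samples showed elevated dioxin and furan "
                     "concentrations near the former drum storage area."),
    ("deposition_B", 4, "The witness described solvent drums stored beside "
                        "the river in the 1960s."),
    ("letter_C", 1, "The agency requested a revised groundwater monitoring "
                    "schedule."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """Return (idf, vectors): smoothed IDF weights and one TF-IDF vector per passage."""
    counts_per_doc = [Counter(tokenize(text)) for _, _, text in corpus]
    n_docs = len(corpus)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in counts.items()} for counts in counts_per_doc]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, idf, vectors, k=2):
    """Return up to k passages with non-zero similarity, each with its citation."""
    q_counts = Counter(tokenize(query))
    # Query terms that never occur in the corpus carry no weight.
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [
        {"doc": corpus[i][0], "page": corpus[i][1], "text": corpus[i][2], "score": score}
        for score, i in scored[:k]
        if score > 0.0
    ]


IDF, VECTORS = build_index(CORPUS)

if __name__ == "__main__":
    for hit in retrieve("Where were dioxin concentrations elevated?", CORPUS, IDF, VECTORS):
        print(f"{hit['doc']} p.{hit['page']}: {hit['text']}")
```

In a full pipeline, the retrieved passages and their citations would be passed to the language model as context, which is what keeps each generated statement traceable to a source page.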
The second framework addresses pattern recognition and source attribution in high-dimensional chemical datasets. Conventional tools such as principal component analysis, basic clustering, and static cross-plots are widely used but limited when dozens of analytes and hundreds of samples are involved. The interactive data portal described here integrates the machine learning method Uniform Manifold Approximation and Projection (UMAP), hierarchical cluster analysis (HCA), chemical fingerprint visualization, and geospatial mapping in a unified environment. UMAP projects contaminant profiles into low-dimensional space, making source groupings visually apparent. The platform is fully bidirectional: a cluster selected in UMAP space immediately reveals the corresponding samples on a site map along with their fingerprints, while samples selected geographically are instantly shown in their multivariate chemical context. The approach is demonstrated on PAH datasets, with UMAP, HCA, and PAH histograms used together to differentiate source signatures. The same tools can also identify mismatched co-elutions and batch-level laboratory errors that routine QA/QC misses but that can distort source attribution conclusions. This interactive data discovery portal is a key tool for identifying contamination sources at complex sites.
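The linked, bidirectional selection described above can be sketched in a few lines. The example below is a minimal illustration and not the portal's implementation: the four samples, their coordinates, and their concentrations are invented, the function names are hypothetical, and only the HCA step is shown (average linkage on Euclidean distance between sum-normalized profiles, both of which are assumptions). UMAP is left out because it requires a third-party library.

```python
import math

ANALYTES = ["naphthalene", "phenanthrene", "pyrene", "benzo[a]pyrene"]

# Hypothetical samples: id -> (easting, northing, PAH concentrations).
# Values are illustrative only and do not come from any real site.
SAMPLES = {
    "S1": (100.0, 200.0, {"naphthalene": 50.0, "phenanthrene": 30.0,
                          "pyrene": 10.0, "benzo[a]pyrene": 10.0}),
    "S2": (110.0, 205.0, {"naphthalene": 24.0, "phenanthrene": 16.0,
                          "pyrene": 5.0, "benzo[a]pyrene": 5.0}),
    "S3": (400.0, 50.0, {"naphthalene": 2.0, "phenanthrene": 8.0,
                         "pyrene": 20.0, "benzo[a]pyrene": 10.0}),
    "S4": (410.0, 60.0, {"naphthalene": 5.0, "phenanthrene": 15.0,
                         "pyrene": 50.0, "benzo[a]pyrene": 30.0}),
}


def fingerprint(conc):
    """Normalize concentrations to fractions of the total, so profile shape
    rather than absolute magnitude drives the comparison."""
    total = sum(conc[a] for a in ANALYTES)
    return [conc[a] / total for a in ANALYTES]


def distance(p, q):
    """Euclidean distance between two fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until n_clusters remain."""
    clusters = [[sid] for sid in sorted(profiles)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return {sid: label for label, members in enumerate(clusters) for sid in members}


PROFILES = {sid: fingerprint(conc) for sid, (_, _, conc) in SAMPLES.items()}
LABELS = average_linkage(PROFILES, n_clusters=2)


def select_cluster(label):
    """Chemistry to map: return coordinates of every sample in a cluster."""
    return {sid: SAMPLES[sid][:2] for sid, lab in LABELS.items() if lab == label}


def select_area(xmin, xmax, ymin, ymax):
    """Map to chemistry: return cluster labels of samples inside a bounding box."""
    return {sid: LABELS[sid] for sid, (x, y, _) in SAMPLES.items()
            if xmin <= x <= xmax and ymin <= y <= ymax}


if __name__ == "__main__":
    print(select_cluster(LABELS["S3"]))
    print(select_area(90.0, 120.0, 190.0, 210.0))
```

Keeping a single sample identifier as the key across the chemistry, cluster, and location tables is what lets a selection made in one view resolve immediately in the others.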
Together, these frameworks offer practical solutions to problems most environmental practitioners face: too many documents to review thoroughly, too much data to interpret with conventional tools, and findings that are difficult to communicate to non-technical decision-makers. Both have been applied across PAH, PCB, PFAS, and PCDD/F investigations and are available through the Statvis Environmental Data Intelligence System (EDIS).
Court Sandau is the founder and principal of Chemistry Matters Inc. and Statvis Analytics Inc., with over 30 years of experience in environmental chemistry and more than 20 years applying chemical fingerprinting and forensic analysis to contaminated sites across a broad range of contaminant classes including PCDD/Fs, PCBs, PAHs, PFAS, and petroleum hydrocarbons. The forensic workflows and machine learning tools developed throughout his consulting practice form part of the scientific foundation of Statvis Analytics Inc., a cloud-based environmental data intelligence platform that makes these capabilities broadly accessible to environmental professionals.