📍 Oxford Science Park, UK (Hybrid – 1 day/week onsite)
📃 Permanent contract
📅 Start Date: As soon as possible
About Ochre Bio
Ochre Bio develops RNA therapies for chronic liver disease. Our driving vision is to end the need for liver transplants. Worldwide, over 1.5 million people die of chronic liver disease every year. For the vast majority, the only cure is a liver transplant. With little more than 40,000 transplants performed, this is a health lottery. Ochre exists to change this.
Our science is built on three pillars:
- Causal human discovery: The largest global collection of human liver data to uncover new therapeutic targets.
- Rigorous human validation: World-leading human models that far outclass traditional animal models.
- Better therapeutic translation: Optimised RNA chemistry and biology to bring effective therapies to patients faster.
We're ambitious, curious, and supportive, embracing failure as part of innovation and guided by our three operating values:
- Clarke's Law: Be bold. Think big.
- Murphy's Law: Fail fast. Learn faster.
- Wheaton's Law: Support each other, always.
The Role
Most biotech companies talk about being data-driven. At Ochre, we mean it literally.
Our mission to end the need for liver transplants rests on one of the largest collections of human liver data in the world, and we need someone exceptional to help us build, maintain, and scale the infrastructure that makes it useful.
As a Data Scientist in our Computational Biology team, you'll sit at the intersection of biology and engineering, designing production-grade pipelines, structuring complex omics datasets, and ensuring data is accessible and reproducible across the organisation. You'll collaborate closely with experimental and computational scientists, and contribute to analysis that directly drives our drug discovery pipeline forward.
Key Responsibilities
- Design, build, and maintain scalable, production-grade cloud-based data pipelines for biological and omics datasets (e.g., RNA-seq, NGS)
- Develop and manage data infrastructure (storage, compute, workflows) in AWS using Infrastructure-as-Code tools (e.g., Terraform)
- Define and enforce data models, schemas, and metadata standards for complex biological datasets
- Implement robust data validation, quality control, and monitoring processes
- Optimise data ingestion, transformation, and access patterns to support downstream analysis and modelling
- Develop and maintain reproducible, well-tested codebases using software engineering best practices (version control, CI/CD, documentation)
- Collaborate with experimental and computational scientists to ensure data is generated, structured, and captured appropriately
- Improve data accessibility, discoverability, and governance across teams
- Communicate technical solutions and results clearly to both technical and non-technical stakeholders
Must-haves
- MSc or PhD in Computational Biology, Bioinformatics, Data Science, Computer Science, or a related quantitative discipline (or equivalent industry experience)
- Strong software engineering skills, with proficiency in Python (R is a plus) and cloud-based architecture (preferably AWS)
- Proven experience building and maintaining data pipelines and data infrastructure in a research or production environment
- Experience working with large-scale biological datasets, ideally NGS or other omics data
- Solid understanding of data modelling, data architecture, and data management best practices
- Experience with workflow orchestration tools (e.g., Nextflow, Airflow, Snakemake, or similar)
- Experience with Infrastructure-as-Code tools (e.g., Terraform, CloudFormation)
- Ability to understand biological context and collaborate effectively with wet-lab and computational scientists
Nice-to-haves
- Experience with data lake or warehouse architectures
- Familiarity with databases and query languages (e.g., SQL)
- Experience implementing CI/CD for data pipelines or scientific software
- Experience contributing to cross-functional platform or infrastructure projects
- Knowledge of statistical modelling approaches applied to biological data