Data Scientist (ML, Speech, NLP & Multimodal Expertise) | London

TransPerfect
Full-time
On-site
London, Greater London, United Kingdom
Data Scientist

Job description

We are looking to hire a Data Scientist with strong expertise in machine learning, speech and language processing, and multimodal systems. This role is essential to driving our product roadmap forward, particularly in building out our core machine learning systems and developing next-generation speech technologies.

The ideal candidate will be capable of working independently while effectively collaborating with cross-functional teams. In addition to deep technical knowledge, we are looking for someone who is curious, experimental, and communicative.

Key Responsibilities:

· Create maintainable, elegant code and high-quality data products that are modeled, well-documented, and simple to use.

· Build, maintain, and improve the infrastructure to extract, transform, and load data from a variety of sources using SQL, Azure, GCP, and AWS technologies.

· Perform statistical analysis of training datasets to identify biases, quality issues, and coverage gaps.

· Implement automated evaluation pipelines that scale across multiple models and tasks.

· Create interactive dashboards and visualization tools for model performance analysis.

Additional Responsibilities:

· Design and implement robust data ingestion pipelines for massive-scale text and speech corpora, including automated data preprocessing and cleaning pipelines.

· Create data validation frameworks and monitoring systems for dataset quality.

· Develop sampling strategies for balanced and representative training data.

· Implement comprehensive experiment tracking and hyperparameter optimization frameworks.

· Conduct statistical analysis of training dynamics and convergence patterns.

· Design A/B testing frameworks for comparing different training approaches.

· Create automated model selection pipelines based on multiple evaluation criteria.

· Develop cost-benefit analyses for different training configurations.

· Design comprehensive benchmark suites with statistical significance testing.

· Develop fairness metrics and bias detection systems.

· Build real-time monitoring systems for model performance in production.

· Implement feature drift detection and data quality monitoring.

· Design feedback loops to capture user interactions and model effectiveness.

· Create automated retraining pipelines based on performance degradation signals.

· Develop business metrics and ROI analysis for model deployments.

Job requirements

Required Skills, Experience and Qualifications

Programming & Software Engineering

· Python (Expert Level): Advanced proficiency in the scientific computing stack (NumPy, Pandas, SciPy, Scikit-learn).

· Version Control: Git workflows, collaborative development, and code review processes.

· Software Engineering Practices: Testing frameworks, CI/CD pipelines, and production-quality code development.

Machine Learning and Language Model Expertise

· Traditional Machine Learning and Deep Learning Knowledge: Proficiency in classical ML algorithms (Naive Bayes, SVM, Random Forest, etc.) and Deep Learning architectures.

· Understanding of Transformer Architecture: Attention mechanisms, positional encoding, and scaling laws.

· Training Pipeline Knowledge: Data preprocessing for large corpora, tokenization strategies, and distributed training concepts.

· Evaluation Frameworks: Experience with standard NLP benchmarks (GLUE, SuperGLUE, etc.) and custom evaluation design.

· Fine-tuning Techniques: Understanding of PEFT methods, instruction tuning, and alignment techniques.

· Model Deployment: Knowledge of model optimization, quantization, and serving infrastructure for large models.

Collaboration & Adaptability

· Strong communication skills are a must

· Self-reliant but knows when to ask for help

· Comfortable working in an environment where conventional development practices may not always apply:

  o PBIs (Product Backlog Items) may not be highly detailed

  o Experimentation will be necessary

  o Ability to identify what’s important in completing a task or partial task and explain/justify their approach

  o Can effectively communicate ideas and strategies

· Proactive and takes initiative rather than waiting for PBIs to be assigned when circumstances call for it

· Strong interest in AI and its possibilities; a genuine passion for certain areas can provide that extra spark

· Curious and open to experimenting with technologies or languages outside their comfort zone

Mindset & Work Approach

· Takes ownership when things don’t go as planned

· Capable of working from high-level explanations and general guidance on implementations and final outcomes

· Continuous, clear communication is crucial; detailed step-by-step instructions won’t always be available

· Self-starter, self-motivated, and proactive in problem-solving

· Enjoys exploring and testing different approaches, even in unfamiliar programming languages

Additional Skills, Experience and Qualifications

Machine Learning & Deep Learning:

· Framework Proficiency: Scikit-learn, XGBoost, PyTorch (preferred) or TensorFlow for model implementation and experimentation.

· MLOps Expertise: Model versioning, experiment tracking, model monitoring (MLflow, Weights & Biases), data monitoring and validation (Great Expectations, Prometheus, Grafana), and automated ML pipelines (GitHub CI/CD, Jenkins, CircleCI, GitLab, etc.).

· Statistical Modeling: Hypothesis testing, experimental design, causal inference, and Bayesian statistics.

· Model Evaluation: Cross-validation strategies, bias-variance analysis, and performance metric design.

· Feature Engineering: Advanced techniques for text, time-series, and multimodal data.

Data Engineering & Infrastructure:

· Big Data Technologies: Spark (PySpark), Hadoop ecosystem, and distributed computing frameworks (DDP, TP, FSDP).

· Cloud Platforms: AWS (SageMaker, S3, EMR), GCP (Vertex AI, BigQuery), or Azure ML.

· Database Systems: NoSQL databases (MongoDB, Elasticsearch), graph databases (Neo4j), and vector databases (Pinecone, Milvus, ChromaDB, FAISS, etc.).

· Data Pipeline Tools: Airflow, Prefect, or similar orchestration frameworks.

· Containerization: Docker, Kubernetes for scalable model deployment.

Hybrid
  • London, Greater London, United Kingdom
Tech
Full-time, Permanent