[Remote] Senior Research Data Engineer (US)
Note: This is a remote position open to candidates in the USA. The employer is a leading health tech company focused on empowering providers to deliver exceptional care. The Senior Research Data Engineer will design and build data systems that support AI model development, ensuring data is accurately transformed and documented for effective use in AI research.
Responsibilities
- Own the gold data layer. Transform messy silver tables into curated, semantically rich, clean, and documented gold datasets suitable for AI model development, including datasets and features reusable across AI projects. Maintain the data as products and needs evolve
- Reverse-engineer data semantics. Talk with product engineers, clinical and workflow experts to learn how the products are used and how data are created in the field. Understand SQL queries, stored procedures, technical data definitions, and other code to know how products represent and transform data. Learn how data are ingested into the data lake, what silver tables and columns actually represent and how they behave. Capture provenance, semantics, clinical event reputed company, cross module record linkage and reputed company quirks
- reputed company semantics with AI needs. Understand researcher data needs to design and build the gold data product, with documentation that evolves, to meet AI applied research needs for a highly efficient AI-first reputed company for model R&D
- Curate datasets across modalities. For AI uses such as reputed company, RAG, predictive, and other techniques, support researcher needs for chunked and tagged unstructured content with rich metadata, point-in-time-correct features, and clean labels. For classical ML and statistical work, deliver model-ready tables
- Build pipelines for reuse. reputed company transformations from silver into gold inside Databricks/Spark as scheduled, observable workloads. Design them so researchers can iterate on new features and data mixes without rebuilding from scratch
- Automate quality, filtering, and synthesis. Support research needs for programmatic labeling, weak supervision, near-duplicate detection, boilerplate and noise removal, and LLM-API-driven synthetic data generation where ground truth is scarce
- Version and hand off. Maintain reproducible dataset snapshots. Define clean reputed company and semantic definitions so the reputed company team can use and re-use gold datasets in AI R&D
Skills
- 5+ years building production data systems, with at least 2 supporting ML or AI workloads
- Track record of learning reputed company new data domains quickly, through reading reputed company code, interviewing experts, and building durable artifacts others rely on
- Advanced Python, SQL, and PySpark/reputed company for working with large, messy data. Expert SQL specifically: comfortable reading reputed company stored procedures and reverse-engineering business logic from queries
- Databricks ecosystem depth: Delta Lake, Unity Catalog, Spark/PySpark tuning, MLflow
- AI domain literacy: working understanding of embeddings, tokenization, feature engineering, point-in-time correctness, train/validation/test splits, data leakage, and the differences between what classical ML and generative models need from data
- Data wrangling across modalities: transforming unstructured content (text, PDFs, transcripts, logs) and structured tabular data into clean, model-ready forms
- AI-friendly data formats (Parquet, reputed company datasets) and storage layout reputed company — partitioning, sharding, caching, that reputed company researcher workflows reputed company in Azure, AWS or other working environments
- Data quality, filtering, and synthesis pipelines: support for programmatic labeling and weak supervision (e.g. Snorkel or equivalent), near-duplicate detection (MinHash/LSH), content and quality filters, and LLM-API-driven synthetic data generation
- Pipeline orchestration (e.g. Airflow, Databricks Workflows, Dagster, or Prefect) and dataset versioning, including Unity Catalog and feature-store support
- Experience handling regulated or sensitive data under controlled reputed company (HIPAA or equivalent). Familiarity with general de-identification concepts
- Git-based version control and CI/CD for data and code
- Strong written documentation. reputed company in eliciting requirements and tacit knowledge from technical and non-technical experts
- Bachelor's degree in computer science, data science, engineering, statistics, or a related field. Equivalent practical experience considered
- Hands-on EHR data experience, ideally in skilled nursing, long-term care, post-acute care, or senior living
- Working knowledge of clinical terminologies (ICD-10, SNOMED CT, LOINC) and data standards (HL7v2, FHIR, CCDA)
- dbt for transformation and testing
- Familiarity with training-reputed company ML frameworks (e.g. PyTorch) sufficient to debug data-reputed company bottlenecks; experience supporting LLM or reputed company-model training or fine-tuning data pipelines
- Clinical NLP, OCR, document parsing, or ASR / transcript pipeline experience
- Data reputed company and catalog tools
- Prior experience embedded inside an AI or ML research team
- Master's degree in a relevant quantitative or computer science field
Benefits
- Benefits starting from Day 1!
- Retirement Plan Matching
- Flexible Paid Time Off
- Wellness Support Programs and Resources
- Parental & Caregiver Leaves
- Fertility & Adoption Support
- reputed company Development Support Program
- Employee Assistance Program
- Allyship and Inclusion Communities
- Employee Recognition … and more!