Job Opportunity at Tufts University - HPC Systems Engineer

Overview

Tufts Technology Services (TTS) is a university-wide service organization committed to delivering technology solutions in support of Tufts’ mission of teaching, learning, research, innovation, and sustainability. With staff across all of Tufts’ campuses, as well as a 24x7 IT Service Desk, we collaborate with schools and divisions to meet the demands of a global, mobile community. We promote a collaborative, flexible work environment, embrace diversity and inclusion, and encourage personal and professional development. Learn more about TTS on our website.

What You’ll Do

The High Performance Computing (HPC) Engineer works with directorates within TTS to support, refine, and advance the system administration and management of Tufts’ High Performance Computing (HPC) cluster. The HPC system is critical to researchers and users (faculty, students, staff) at all levels across the university and is maintained with enterprise-level expectations. The position will take ownership of projects that include identifying problems, developing testing protocols, and developing and implementing solutions. The work often requires coordination among Tufts researchers, TTS staff, and outside vendors. The role will also assist the larger team within Research Technology in assessing and evaluating ongoing innovation and cutting-edge solutions to meet research computing needs.

HPC Systems Support:

  • Maintain the HPC ecosystem across system specification, provisioning, OS installation, and maintenance, including login nodes, file transfer nodes, compute nodes, job schedulers (Slurm), and the virtualization layer (VMware); interface with the larger team on network, storage administration, data center load balancer, and firewall issues.
  • Maintain user-facing HPC web gateways (Open OnDemand, Jupyter Notebook, JupyterLab, JupyterHub, FastX, Open XDMoD, Starfish, Galaxy, RStudio, etc.).
  • Install, maintain, and test common open source and commercial software stacks for both CPU and GPU computing (AI/ML/DL stacks, Anaconda, PyTorch, TensorFlow, RAPIDS, etc.) and containers such as Singularity and Docker.
  • Use configuration management tools such as Ansible and security best practices to maintain systems. Document all work and provide regular progress updates.
  • Work closely with storage administrators to co-maintain HPC storage, features, and subsystems.
  • Respond to outages, emergencies, and urgent system issues.

Operational Improvements:

  • Develop, document, and automate continual operational improvements in the HPC system administration service.
  • Improve metrics, availability, and resource management of CPUs and GPUs.
  • Maintain scripts for user, group, and systems management.

HPC Systems Innovation and Support:

  • Provide system administration services and assist other team members in evaluating Proof of Concept (POC) systems to foster innovative architectures and solutions in new and emerging paradigms such as composable computing, GPU virtualization, cloud bursting, and multi-site federation.

Service Management, Education, Outreach:

  • Work with team members to provide full life cycle system administration service through specification, purchase, installation, and maintenance, as well as service marketing, community building, outreach, training, education, and support.
  • Maintain ties with the larger system administration and research computing community to better understand new management paradigms, methods, and opportunities.

What We’re Looking For

Basic Requirements:

  • Knowledge and experience typically acquired through a Bachelor’s Degree in a related field with two years of related experience, or a High School diploma plus 3 or more years of related experience, in a higher education, research, scientific, or technical computing environment.
  • Understanding of and experience with high performance computing and scientific gateways from an architecture, subsystems, and networking perspective, as well as daily usage, support, and application-level knowledge.
  • Experience maintaining specific technologies used in research and high-performance computing such as job schedulers (Slurm), containers (Singularity, Docker), RDMA over Ethernet, InfiniBand, GPUDirect, etc.
  • Experience with scripting basics (e.g., Shell, Batch, Perl, Python).
  • Experience with modern system administration, DevOps, and design patterns to automate Linux HPC clusters, operating system installation, and software installation via scripting as well as configuration management systems such as Ansible and Puppet.
  • Experience installing and maintaining open source and commercial research computing web gateway solutions such as Open OnDemand, Open XDMoD, FastX, Apache Airavata, HUBzero, nanoHUB, or Taverna.
  • Experience installing, configuring, maintaining, and troubleshooting common frameworks and software used in research and high-performance computing such as scikit-learn, TensorFlow/TensorBoard, Keras, Theano, Caffe, PyTorch, MXNet, and DGL, and GPU libraries such as the NVIDIA RAPIDS suite (cuDF, cuML, cuGraph, cuDNN), on both GPU and CPU architectures, as well as management/monitoring frameworks such as DCGM.
  • Experience and resourcefulness with all aspects of the system management and development cycle, from analysis through evaluation and documentation, when approaching system engineering challenges.
  • Willing and able to learn technologies and required domain knowledge at a rapid pace.
  • Background supporting academic researchers (e.g., faculty, staff, students).
  • Strong communication, presentation, customer service, and problem-solving skills in pursuit of system management and innovation.
  • Demonstrated ability to work effectively in a dynamic, collaborative environment with colleagues and build partnerships across technical disciplines, job functions, and departments.
  • Dedication to taking ownership of projects that include identifying problems, developing testing protocols, and developing and implementing solutions.

Preferred Qualifications:

  • Master’s Degree in a science or engineering field plus 2 or more years of related experience in a higher education, research, scientific, or technical computing environment.
  • Familiarity and experience with resources at private or public sector HPC research computing environments, national centers, or XSEDE (eXtreme Science and Engineering Discovery Environment).
  • Knowledge of the continuum of research computing and scalability from desktop to HPC to cloud and grid solutions.
  • Experience with relational databases such as MariaDB, MySQL, or PostgreSQL.

Pay Range
Minimum $84,400.00, Midpoint $105,550.00, Maximum $126,700.00
Salary is based on related experience, expertise, and internal equity; generally, new hires can expect pay between the minimum and midpoint of the range.