HPC Cluster / Linux System Engineer @ USC - Marina del Rey, CA

The University of Southern California/Information Sciences Institute (part of the Viterbi School of Engineering) is one of the nation’s largest, most successful university-affiliated computer research institutes. Our work ranges from theoretical basic research, such as core engineering and computer science discovery, to applied research and development, such as design and modeling of innovative prototypes and devices. ISI's 400 faculty, professional staff and graduate students carry out extraordinary information sciences research at three distinct locations - Marina del Rey, CA; Arlington, VA; and Waltham, MA.

* This position is located in Marina del Rey, CA. *

ISI is seeking an experienced HPC Cluster and Linux System Engineer to support core institute infrastructure and HPC services. As an Engineer at ISI, you will have the opportunity to work with other highly skilled technology experts (IT staff and researchers) on complex systems and interesting technical challenges.

REQUIRED

1. Applicants selected for this position will require access to ITAR materials. According to U.S. government regulations, ONLY U.S. citizens OR lawful permanent residents (green card) are eligible for ITAR access.

2. Six (6) years of experience in the following fields: information technology, system administration, and high-performance computing cluster support and management.

3. Two (2) years of experience in high-performance computing cluster and scheduler support and management.

4. Bachelor’s degree in a relevant field such as computer science, computer information systems, etc. OR equivalent combined education, training, and experience.

5. Multi-vendor management, security, and network/Internet protocols.

6. Administering, monitoring, and maintaining secure Linux/UNIX operating systems (CentOS/RHEL, Ubuntu).

7. Working knowledge of machine learning algorithms and software frameworks (TensorFlow, PyTorch, Keras, CUDA, cuDNN, Caffe, Theano, etc.).

8. HPC system software, cluster management tools, and job schedulers (Slurm).

9. Proficiency with low-latency/high-bandwidth interconnect infrastructure such as 10GigE.

10. Knowledge of HPC storage (FC, SAS) principles, file systems (ZFS, etc.), and compute node storage (NFS).

11. Proficiency in fundamental programming/scripting skills (Bash, Python, or similar languages).

12. Demonstrated expertise in configuration management tools (Salt, Ansible, Puppet, etc.).

13. Ability to identify, troubleshoot, and resolve problems and manage system performance.

14. Demonstrated expertise in HPC cluster and scheduler planning, design, and implementation, involving both CPU and GPU resources.

15. Ability to drive technical leadership and management of complex large-scale computing system projects.

16. Experience establishing processes for maintaining system performance and managing best-in-class standards.

PREFERRED

1. Master’s degree in a relevant field, such as computer science, computer information systems, etc.

2. Virtualization infrastructures (VMware).

3. Container technologies (Docker, Singularity).

4. Cloud computing (AWS, Azure).

WHAT YOU WILL DO

The HPC Cluster and Linux System Engineer collaborates with technical leadership in the design, development, installation, and maintenance of software for Linux and HPC cluster systems and ensures that their scalability and fault-tolerance needs are met. The Engineer is responsible for the planning, implementation, availability, performance, security, maintenance, and repair of cluster infrastructure.

SCOPE

1. Drives the day-to-day operations for the Linux and HPC cluster systems by monitoring computing resource performance, managing configurations, and addressing security administration.

2. Applies revisions to system firmware and software; engages and collaborates with vendors to assist support activities as required.

3. Leads the development of new HPC software deployment plans, custom scripts, and testing procedures to ensure operational reliability for university researchers; trains technical staff in the use of new software and hardware, either developed or acquired.

4. Oversees the maintenance and management of HPC researcher accounts for staff and ISI research groups; leads the installation, modification, and maintenance of various research software applications for access on HPC clusters; acts as a trusted technical advisor for researcher support and documentation on software applications and programs.

5. Designs, installs, configures, and maintains documentation for cluster infrastructure, including operating systems, job schedulers, resource managers, provisioning managers, configuration managers, SAN devices, network devices, and other components.

6. Investigates, debugs, and addresses researcher inquiries and requests through a customer issue ticketing system. Implements customer-focused resolutions efficiently; communicates complex technical concepts in a simple, straightforward manner to a broad range of stakeholders.

7. Creates opportunities to explore emerging technologies and technical developments to address expanding analytical requirements; identifies new services and develops corresponding implementation plans.

8. Advocates for best practices in the HPC field and champions collaborative relationships with peer HPC research organizations when necessary.

9. Contributes to an inclusive environment that values differences by building and maintaining collaborative relationships with team members, peers, and organizational leaders; actively embodies values and behaviors such as accountability, ethics, and best-in-class customer service; contributes to a culture of trust and transparency by sharing information broadly, openly, and deliberately.

10. Supports the vision for the IT department and the Institute; works closely with team members and management to implement and support effective solutions for HPC; maintains currency with technology, standards, and best practices; supports process improvement efforts within the team and across the organization.

The University of Southern California values diversity and is committed to equal opportunity in employment.


Minimum Education: Bachelor's degree; combined experience/education may substitute for minimum education.

Minimum Experience: 5 years

Minimum Field of Expertise: Expert understanding and strong technical knowledge of and experience with systems administration, backups, operating systems, programming languages, and associated hardware platforms. Possesses the required knowledge to build, configure, load, and restore a system from the ground up. Previous systems administration experience required.