PCAI And AI Factory

Posted on Nov. 26, 2025

  • Ka, India
  • Full Time

This role has been designed as ‘Hybrid’ with an expectation that you will work on average 2 days per week from an HPE office.

Who We Are:

Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today’s complex world. Our culture thrives on finding new and better ways to accelerate what’s next. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career our culture will embrace you. Open up opportunities with HPE.

Job Description:


HPE Operations is our innovative IT services organization. It provides the expertise to advise, integrate, and accelerate our customers’ outcomes from their digital transformation. Our teams collaborate to transform insight into innovation. In today’s fast paced, hybrid IT world, being at business speed means overcoming IT complexity to match the speed of actions to the speed of opportunities. Deploy the right technology to respond quickly to market possibilities. Join us and redefine what’s next for you.

What you’ll do:

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI & AI Factory Solutions) to manage and optimize HPE’s next-generation AI infrastructure platforms. The ideal candidate will have deep hands-on expertise in AI, HPC, and GPU-accelerated environments, with strong knowledge of HPE Ezmeral, NVIDIA AI Enterprise, containerized workloads, and automation frameworks. This role focuses on the operational stability, lifecycle management, and continuous improvement of large-scale Private Cloud for AI (PCAI) and AI Factory deployments.

Key Responsibilities:

1. Platform Administration

  • Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance.
  • Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks.
  • Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester.
  • Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers.

2. Operational Monitoring & Incident Management

  • Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards.
  • Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers.
  • Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues.
  • Maintain operational documentation, runbooks, and incident logs.

3. Lifecycle & Configuration Management

  • Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM.
  • Oversee automation for provisioning, scaling, and patch management of compute and containerized workloads.
  • Manage configuration changes, infrastructure templates, and version baselines in production and staging environments.

4. AI Platform & Software Operations

  • Operate HPE Ezmeral Unified Analytics, Data Fabric, and AI Essentials platforms.
  • Support NVIDIA AI Enterprise (NVAIE) components including NIMs, NeMo frameworks, and RAPIDS runtime.
  • Manage and monitor AI/ML workloads (LLM, NLP, Computer Vision, Chatbots) on containerized clusters.
  • Ensure smooth operation of development tools like Jupyter, Spark, Airflow, MLflow, Kubeflow, and Ray.

5. Storage & Data Operations

  • Administer VAST, WEKA, and Alletra MP storage solutions for file, object, and distributed storage.
  • Monitor storage performance, replication, and capacity utilization.
  • Coordinate with storage engineering teams for performance optimization and capacity planning.

6. Security, IAM & Compliance

  • Implement and maintain Keycloak for authentication and role-based access control.
  • Ensure adherence to compliance, audit, and governance standards for AI workloads.
  • Support user and service account provisioning, credential management, and access reviews.

7. Continuous Improvement & Knowledge Enablement

  • Optimize automation workflows to reduce manual intervention and improve service response time.
  • Drive service health reviews, operational dashboards, and SLA compliance reporting.
  • Conduct enablement sessions for L1/L2 teams and act as the final escalation point for operational issues.
  • Collaborate with HPE Engineering for patch validation, release readiness, and operational feedback.

Required Skills & Technical Expertise:

Core Infrastructure Skills

  • Administration of HPE DL380a, DL325, Cray XD670, and GPU-based compute environments.
  • Strong knowledge of the NVIDIA GPU stack, InfiniBand NDR, and Spectrum-X switches.
  • Experience in managing VAST, WEKA, or Alletra MP storage systems.

Software & Platform Operations

  • Virtualization: vSphere, RHEL, Ezmeral Runtime Enterprise
  • Containers: Kubernetes, Rancher Harvester, KubeSphere, Morpheus
  • Automation: Ansible, AWX, NetBox, HPCM, SLURM
  • Observability: Grafana, NetQ, Exivity, DCGM
  • Security: Keycloak, IAM integrations

AI/ML Platform Administration

  • Experience in HPE Ezmeral Unified Analytics and Data Fabric operations
  • Familiarity with NVIDIA AI Enterprise, NIMs, NeMo, and Triton Inference Server
  • Working knowledge of TensorFlow, PyTorch, Spark, Kubeflow, MLflow, and Jupyter

Preferred Certifications:

  • HPE ASE / Master ASE (Compute, Storage, or Ezmeral)
  • NVIDIA Certified Professional / NVAIE Certification
  • RHCE / Kubernetes Administrator (CKA) / VMware VCP

Soft Skills:

  • Strong analytical and troubleshooting capabilities.
  • Excellent communication and collaboration skills across global teams.
  • Ability to lead operations improvement initiatives and mentor support engineers.
  • Focused on reliability, scalability, and service excellence.

For Internal Job Movement:

  • Approval of the employee's current manager is required.
  • Employees are expected to notify their manager prior to an interview.
  • Employees in Performance Improvement Plan are not eligible to apply.
  • Minimum level should be EXP if applying as part of Internal Job Posting.

Why Join Us:

  • Work on next-generation AI infrastructure operations and automation.
  • Be part of a global team managing HPE’s AI Factory and PCAI platforms supporting large-scale AI workloads.
  • Opportunity to contribute to service innovation and continuous improvement initiatives in AI infrastructure management.

What you need to bring:

  • Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field.
  • 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments.
  • Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments.
  • Hands-on experience in automation and orchestration across bare metal and containerized infrastructure.

Additional Skills:

Accountability, Action Planning, Active Learning, Active Listening, Bias, Business Growth, Business Planning, Coaching, Commercial Acumen, Creativity, Critical Thinking, Cross-Functional Teamwork, Customer Experience Strategy, Customer Solutions, Data Analysis Management, Data Collection Management, Data Controls, Design Thinking, Empathy, Follow-Through, Growth Mindset, Intellectual Curiosity, Long Term Planning, Managing Ambiguity {+ 5 more}

What We Can Offer You:

Health & Wellbeing

We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.

Personal & Professional Development

We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have — whether you want to become a knowledge expert in your field or apply your skills to another division.

Unconditional Inclusion

We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.

Let's Stay Connected:

Follow @HPECareers on Instagram to see the latest on people, culture and tech at HPE.

#india #operations

Job:

Services

Job Level:

Expert


HPE is an Equal Employment Opportunity/Veterans/Disabled/LGBT employer. We do not discriminate on the basis of race, gender, or any other protected category, and all decisions we make are made on the basis of qualifications, merit, and business need. Our goal is to be one global team that is representative of our customers, in an inclusive environment where we can continue to innovate and grow together. Please click here: Equal Employment Opportunity.

Hewlett Packard Enterprise is EEO Protected Veteran/Individual with Disabilities.


HPE will comply with all applicable laws related to employer use of arrest and conviction records, including laws requiring employers to consider for employment qualified applicants with criminal histories.


No Fees Notice & Recruitment Fraud Disclaimer


It has come to HPE’s attention that there has been an increase in recruitment fraud whereby scammers impersonate HPE or HPE-authorized recruiting agencies and offer fake employment opportunities to candidates. These scammers often seek to obtain personal information or money from candidates.


Please note that Hewlett Packard Enterprise (HPE), its direct and indirect subsidiaries and affiliated companies, and its authorized recruitment agencies/vendors will never charge any candidate a registration fee, hiring fee, or any other fee in connection with its recruitment and hiring process. The credentials of any hiring agency that claims to be working with HPE for recruitment of talent should be verified by candidates and candidates shall be solely responsible to conduct such verification. Any candidate/individual who relies on the erroneous representations made by fraudulent employment agencies does so at their own risk, and HPE disclaims liability for any damages or claims that may result from any such communication.

