
NICE is looking for a Site Reliability Engineer.
Senior Cloud Site Reliability Engineer
Location: Salt Lake City, UT
The Senior Cloud SRE works to improve the reliability and availability of our solutions. This includes providing on-call support for Major Incidents and helping us reduce the duration and occurrence of outages.
A Typical Day Might Include the Following:
• Create a new dashboard that gives a development team observability into the health of their application. This can include SLI/SLO metrics.
• Consult with development workstreams on SRE services and how we can help them improve their reliability.
• Automate activities previously done manually to reduce toil.
• Participate in the design, definition, and scoping of new solutions that meet our internal customers' needs. Document the outcome thoroughly and ensure the participants agree on it.
• Document findings and share with other SREs.
• Work with teams to ensure proper monitoring is set up and enabled.
• Identify evolutionary improvements.
• Meet with Incident and Problem Management to discuss previous Major Incidents and help identify root causes and permanent fixes. Help identify which of these fixes SREs can assist with.
• Assist other teams in doing data/performance analysis to identify why an issue is occurring.
• Review work of other SREs and help train them.
• Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
• Practice sustainable incident response and blameless postmortems.
• Assist in creation of automated end-to-end diagnostics.
• Communicate effectively with technical and non-technical peers and customers.
• Coordinate and work on multiple cross-functional base work initiatives and projects.
• Participate in planning long- and short-term project efforts.
• Lead or provide technical direction for the planning, execution, and validation of testing work.
• Provide technical guidance and coaching/mentoring to team members.
• Follow established processes when performing work, or help document and create processes as necessary.
• Document troubleshooting steps and results in appropriate locations for historical access.
• Ensure compliance with policies, procedures, and standards.
• Implement or coordinate remediation required by audits/assessments, and document as necessary.
• Provide on-call support for high-priority incidents.
• Estimate time to complete activities/projects.
To Land This Gig You’ll Need:
• Bachelor’s degree in Computer Science, Business Information Systems, or related field (or equivalent work experience) is required.
• 4+ years programming/scripting experience
• 4+ years of experience working within public or private cloud environments
• 4+ years of SRE or related experience
• Experience with Agile, Jira, GitHub, monitoring, automation, dashboarding
• 6+ years communicating in English in a technical field.
• Can effectively troubleshoot supported applications.
• Can work on complex issues which may span multiple applications or environments.
• Proactively engages with peers to discuss issues and keep stakeholders updated.
• Uses own expertise to mentor co-workers
• Coordinates work with peers
• Shares discoveries and best practices
• Learns from others within the team
• Self-driven; proactively looks for ways to improve.
• Able to work with little supervision and complete tasks and projects as directed.
Bonus Experience:
• Experience working with Prometheus, Datadog, Grafana, Splunk, BMC
• Experience with Application Performance Monitoring solutions (Dynatrace, AppDynamics, New Relic)
• Experience working with Kubernetes, Docker, microservices, serverless compute
• Experience working with Ansible, Terraform
• Experience with one or more of the following: C#, C++, Java, Python, Perl, or Ruby.