Site Reliability Engineer

Company Details

Global leader in workforce solutions for machine learning

Description

CloudFactory is a global leader in combining people and technology to provide workforce solutions for machine learning and business process optimisation. As an Impact Sourcing Service Provider (ISSP), CloudFactory creates economic and leadership opportunities for talented people in developing nations, so we can earn, learn, and serve our way to become leaders worth following.

Our mission is to connect one million people in the developing world to digital-age work while raising them up as leaders to address poverty in their own communities. This is because we believe that talent is equally distributed around the world, but opportunity is not.

There has never been a more exciting time to join our Engineering team here at CloudFactory. Why? Because we are on a journey to connect talented people in the developing world to digital opportunities.

As a member of CloudFactory’s engineering team, you will be tasked with running a world-class distributed workforce management platform that will connect one million people in developing countries to basic computer work. SREs at CloudFactory are a blend of seasoned operators and software developers who fuse engineering principles, operational knowledge, security and automation. You will be responsible for designing, developing and running the “Operational elements” of the Platform to ensure an availability of 99.5%.

Responsibilities

  • Provide necessary operational support to multiple platforms. Participate in periodic 24×7 on-call duties
  • Design, build and maintain core (macro) infrastructure pieces that allow developers to focus on product features or that allow CF platforms to be resilient and scale as we grow
  • Build best practices on IaC and create templates
  • Perform initial troubleshooting for all services, including any rollback and restore needed to maintain high platform availability, and create the necessary dashboards
  • Create, maintain and enhance the necessary runbooks to ensure on-call engineers have access to the right information to respond to and resolve issues
  • Deploy, support and monitor new and existing services, platforms, and application stacks
  • Capacity and performance management of environments
  • Help define the necessary availability and SLAs for various Platform-based products and ensure the necessary processes, toolsets and people are in place to realise them
  • Create, maintain and enhance the production readiness standards (performance, availability, security, compliance, etc.) for all services, applications and APIs, and ensure the standards are adhered to before services go live in production
  • Create, maintain and enhance monitoring, alerting and debugging tools and capabilities
  • Maintain a good architectural understanding of deployed services to provide early feedback to the engineering teams
  • Collaborate with Engineering teams to implement performance improvements identified through tracking service latency figures, CPU utilization figures, etc.
  • Ensure effective communication is maintained with necessary stakeholders

Requirements

Behavioural Skills

  1. Ability to work across global teams and with different cultures across different time zones, with strong communication and collaboration skills
  2. Tendency to go above and beyond to make things work; manage your own and others' work to meet deadlines and assist other team members with their deliverables
  3. Ability to break down complex problems into simple solutions

Technical Skills

  1. Experience supporting microservice-based solutions and microservice implementation technologies (both serverless and containers)
  2. Experience with monitoring, alerting and incident management tools in the AWS and microservice world
  3. Experience in setting up processes and tools to provide 24 x 7 operational support
  4. Ability to understand and troubleshoot customer issues
  5. Experience with AWS cloud infrastructure (EC2, CloudFormation, ECS Fargate, Lambda, DynamoDB, Kinesis, AWS Elasticsearch, etc.)
  6. Some CI/CD experience with GitHub Actions
  7. Understanding of web security and DevSecOps principles
  8. Strong communication and collaboration skills
  9. Ability to work across global teams and with different cultures across different time zones

Technology Stack

  1. AWS (Lambda, SQS, SNS, DynamoDB, S3, ECR, ECS Fargate, EC2, RDS, Route 53, Kinesis, AWS Elasticsearch)
  2. Terraform, CloudFormation, Serverless Framework
  3. Grafana, ELK stack, CloudWatch, Prometheus
  4. Go, Node.js, React, Python
  5. Jenkins, Docker, Shell script

Benefits

  • Lunch & Snacks Provided Monday-Friday
  • Phone Allowance and/or Internet Allowance
  • Travel Allowance
  • Social Security Fund
  • Festival Bonus
  • Health Spending Account
  • Medical Insurance
  • Accidental Insurance
  • Amazing Company Mission and Culture
  • Growth Opportunities

