Site Reliability Engineer - Cloud Infrastructure

TikTok
Dublin
Full-time
18 hours ago
The Technical Infrastructure SRE team is responsible for managing the company's infrastructure and applications. Our mission is to ensure that all production systems can support our fast-growing worldwide user base, and to keep those systems stable, efficient, and cost-effective. We manage deployments, system capacity, traffic scheduling, fault tolerance, disaster recovery, emergency response, automation, and operations platform development.

Our team is diverse, with members in Singapore and China, and we are now expanding to Ireland. We look forward to welcoming new talent to the team and helping TikTok grow together.

What the team does:
1. Reliability: Ensuring the reliability and efficiency of our core infrastructure, with a focus on system capacity and stability; setting reliability standards and recovery SOPs.
2. Reliability: Troubleshooting and locating technical issues; analyzing bottlenecks; managing the transformation and upgrading of high-availability system architecture.
3. Efficiency: Building automated operation solutions for large-scale systems; partnering with system development teams on system iteration.
4. Efficiency: Designing and implementing software platforms and monitoring frameworks for efficient, automated, and intelligent service-oriented architecture (SOA) governance.
5. Cost: With millions of CPUs in operation, building delivery standards and monitoring and budgeting systems to optimize the company's costs.
6. Compliance: Designing and setting up new IDCs; designing and implementing data protection plans to meet standard requirements.

Responsibilities:
Be responsible for the foundational engineering of byte infrastructure products and components, focusing on optimizing the infrastructure O&M architecture, researching and developing automated O&M platforms, and data-driven, intelligent O&M. Applying software engineering and digital-intelligence methods to the O&M requirements of infrastructure products and components, build a layered, systematic O&M platform that solves the problem of managing ultra-large-scale clusters. The goal is to provide stable, efficient, and low-cost serverless infrastructure for Mid-Platform & Business. We aim to be the leading SRE team in the industry.

1. Reliability: Ensure the stability of the company's core infrastructure (system high availability and reliability), focus on system performance and capacity, and establish O&M (Operation & Maintenance) standards and SOPs.
2. Reliability: Troubleshoot and locate technical issues; collaborate with the technical team to develop and implement strategies for system capacity planning, performance testing, anomaly analysis, and fault diagnosis and resolution.
3. Efficiency: Research and evaluate large-scale system architectures and technologies; use new tools and technologies to improve existing systems and processes in support of business development.
4. Efficiency: Design and implement O&M platforms to achieve efficient, automated, and intelligent system maintenance.
5. Cost: Develop delivery standards for production systems at massive scale, covering budgeting, resource delivery, and online system capacity assessment, to help the company optimize IT costs.
6. Compliance: Design and establish new IDCs; design and implement data protection plans to meet standard requirements.