Telegram channel devopslibrary - DevOps&SRE Library: Unsorted

DevOps&SRE Library

16 May 2025 17:05

Tackling OOM: Strategies for Reliable ML Training on Kubernetes

Tackle OOMs => reliable training => win!

https://medium.com/better-ml/tackling-oom-strategies-for-reliable-ml-training-on-kubernetes-dcd49a2b83f9


DevOps&SRE Library

16 May 2025 09:02

From four to five 9s of uptime by migrating to Kubernetes

When we launched User Management along with a free tier of up to 1 million MAUs, we faced several challenges using Heroku: the lack of an SLA, limited rollout functionality, and inadequate data locality options. To address these, we migrated to Kubernetes on EKS, developing a custom platform called Terrace to streamline deployment, secret management, and automated load balancing.
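
For context on what the jump in the title means, here is a minimal sketch of the arithmetic behind "nines" of availability. It assumes a 365-day year and a hypothetical helper name, `allowed_downtime_minutes`; neither comes from the WorkOS post.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (4 means 99.99%)."""
    if nines < 1:
        raise ValueError("nines must be >= 1")
    # Each extra nine divides the permitted unavailability by ten.
    return MINUTES_PER_YEAR / 10 ** nines


for n in (4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} min/year")
```

Under these assumptions, four nines allows roughly 52.6 minutes of downtime per year and five nines roughly 5.3 minutes, so the move is a tenfold cut in the error budget.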

https://workos.com/blog/from-four-to-five-9s-of-uptime-by-migrating-to-kubernetes


DevOps&SRE Library

15 May 2025 17:03

Kubernetes Authentication - Comparing Solutions

This post is a deep dive into comparing different solutions for authenticating into a Kubernetes cluster. The goal of this post is to give you an idea of what the various solutions provide for a typical cluster deployment using production-capable configurations. We're also going to walk through deployments to get an idea as to how long it takes for each project and look at common operations tasks for each solution. This blog post is written from the perspective of an enterprise deployment. If you're looking to run a Kubernetes lab, or use Kubernetes for a service provider, I think you'll still find this useful. We're not going to do a deep dive into how either OpenID Connect or Kubernetes authentication actually works.

https://www.tremolo.io/post/kubernetes-authentication-comparing-solutions


DevOps&SRE Library

15 May 2025 09:01

Replacing StatefulSets With a Custom K8s Operator in Our Postgres Cloud Platform

Over the last year, the platform team here at Timescale has been working hard on improving the stability, reliability and cost efficiency of our infrastructure. Our entire cloud is run on Kubernetes, and we have spent a lot of engineering time working out how best to orchestrate its various parts. We have written many different Kubernetes operators for this purpose, but until this year, we always used StatefulSets to manage customer database pods and their volumes.

StatefulSets are a native Kubernetes workload resource used to manage stateful applications. Unlike Deployments, StatefulSets provide unique, stable network identities and persistent storage for each pod, ensuring ordered and consistent scaling, rolling updates, and maintaining state across restarts, which is essential for stateful applications like databases or distributed systems.
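
The "stable network identities" mentioned above follow a predictable naming scheme, which can be sketched as below. This is an illustration only: the function name and the example values (`pg`, `pg-headless`, `db`) are hypothetical, and it assumes the default starting ordinal of 0 and the default `cluster.local` cluster domain.

```python
def stateful_pod_identities(statefulset, service, namespace, replicas,
                            cluster_domain="cluster.local"):
    """Return (pod_name, dns_name) pairs for each replica of a StatefulSet.

    Pods are named <statefulset>-<ordinal>; with a headless governing
    Service, each pod gets the DNS name
    <pod>.<service>.<namespace>.svc.<cluster_domain>.
    """
    identities = []
    for ordinal in range(replicas):
        pod = f"{statefulset}-{ordinal}"
        dns = f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
        identities.append((pod, dns))
    return identities


# Hypothetical example: a three-replica database StatefulSet.
for pod, dns in stateful_pod_identities("pg", "pg-headless", "db", 3):
    print(pod, dns)
```

Because the name depends only on the ordinal, a rescheduled pod comes back with the same identity, which is the property a replacement controller would also need to preserve.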

However, working with StatefulSets was becoming increasingly painful and preventing us from innovating. In this blog post, we’re sharing how we replaced StatefulSets with our own Kubernetes custom resource and operator, which we called PatroniSets, without a single customer noticing the shift. This move has improved our stability considerably, minimized disruptions to the user, and helped us perform maintenance work that would have been impossible previously.

https://www.timescale.com/blog/replacing-statefulsets-with-a-custom-k8s-operator-in-our-postgres-cloud-platform


DevOps&SRE Library

14 May 2025 17:04

The Karpenter Effect: Redefining Our Kubernetes Operations

A reflection on our journey towards AWS Karpenter, improving our Upgrades, Flexibility, and Cost-Efficiency in a 2,000+ Nodes Fleet

https://medium.com/adevinta-tech-blog/the-karpenter-effect-redefining-our-kubernetes-operations-80c7ba90a599


DevOps&SRE Library

14 May 2025 09:01

Optimising Node.js Application Performance

In this post, I’d like to take you through the journey of optimising Aurora, our high-traffic GraphQL front end API built on Node.js. Running on Google Kubernetes Engine, we’ve managed to reduce our pod count by over 30% without compromising latency, thanks to improvements in resource utilisation and code efficiency.

I’ll share what worked, what didn’t, and why. So whether you’re facing similar challenges or simply curious about real-world Node.js optimisation, you should find practical insights here that you can apply to your own projects.

https://tech.loveholidays.com/optimising-node-js-application-performance-7ba998c15a46


DevOps&SRE Library

13 May 2025 11:01

🌐 The role and tasks of DevOps in modern IT

In this open lesson we will cover:
- what is changing in DevOps;
- current tools for the DevOps engineer;
- how DevOps compares with SRE and Platform Engineer.

After the lesson you will know:
- the differences and overlaps between the DevOps and SRE (Site Reliability Engineering) roles;
- current trends and changes in DevOps methodologies;
- current tools for the DevOps engineer.

👉 Registration and details about the DevOps Advanced course
https://vk.cc/cLRSxd

Advertisement. ООО «Отус онлайн-образование» (OTUS Online Education LLC), OGRN 1177746618576, erid: 2VtzqvTSm5E


DevOps&SRE Library

12 May 2025 17:05

The Lost Fourth Pillar of Observability - Config Data Monitoring

A lot has been written about logs, metrics, and traces as they are indeed key components in observability, application, and system monitoring. One thing that is often overlooked, however, is config data and its observability. In this blog, we'll explore what config data is, how it differs from logs, metrics, and traces, and discuss what architecture is needed to store this type of data and in which scenarios it provides value.

https://www.cloudquery.io/blog/fourth-lost-pillar-of-observability-config-data-monitoring


DevOps&SRE Library

11 May 2025 17:06

Anomaly Detection in Time Series Using Statistical Analysis

Setting up alerts for metrics isn’t always straightforward. In some cases, a simple threshold works just fine — for example, monitoring disk space on a device. You can just set an alert at 10% remaining, and you’re covered. The same goes for tracking available memory on a server.

But what if we need to monitor something like user behavior on a website? Imagine running a web store where you sell products. One approach might be to set a minimum threshold for daily sales and check it once a day. But what if something goes wrong, and you need to catch the issue much sooner — within hours or even minutes? In that case, a static threshold won’t cut it because user activity fluctuates throughout the day. This is where anomaly detection comes in.
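
To make the contrast with a static threshold concrete, here is a minimal sketch of one simple statistical check: a z-score of the current value against values seen at the same time of day on previous days. This is not necessarily the method the linked article uses; the function name, the 3-sigma default, and the sample numbers are assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations
    away from the mean of `history`.

    `history` holds values observed at the same time of day on earlier
    days, so the baseline follows the daily activity pattern.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical sales counts for the 14:00 hour over the last five days.
same_hour_history = [120, 130, 125, 118, 127]
print(is_anomalous(same_hour_history, 40))
print(is_anomalous(same_hour_history, 122))
```

The baseline moves with the hour being checked, so a quiet night and a busy afternoon are each compared with their own history rather than with a single fixed number.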

https://medium.com/booking-com-development/anomaly-detection-in-time-series-using-statistical-analysis-cc587b21d008


DevOps&SRE Library

10 May 2025 17:03

outpost

Outpost is a self-hosted and open-source infrastructure that enables event producers to add outbound webhooks and Event Destinations to their platform with support for destination types such as Webhooks, Hookdeck Event Gateway, Amazon EventBridge, AWS SQS, AWS SNS, GCP Pub/Sub, RabbitMQ, and Kafka.

https://github.com/hookdeck/outpost


DevOps&SRE Library

09 May 2025 17:05

arkflow

High-performance Rust stream processing engine, providing powerful data stream processing capabilities, supporting multiple input/output sources and processors.

https://github.com/arkflow-rs/arkflow


DevOps&SRE Library

08 May 2025 17:01

oomd

oomd is a userspace Out-Of-Memory (OOM) killer for Linux systems.

https://github.com/facebookincubator/oomd


DevOps&SRE Library

07 May 2025 17:01

kubectl-klock

A kubectl plugin to render the kubectl get pods --watch output in a much more readable fashion.

Think of it as running watch kubectl get pods, but instead of polling, it uses the regular watch feature to stream updates as soon as they occur.

https://github.com/applejag/kubectl-klock


DevOps&SRE Library

07 May 2025 09:02

silver-surfer

API version compatibility checker that also provides a migration path for K8s objects.

https://github.com/devtron-labs/silver-surfer


DevOps&SRE Library

06 May 2025 11:01

🌐 OSPF or IS-IS: inter-area routing. How do you design this functionality without making mistakes?

Understanding how inter-area routing works lets you look at the operation of OSPF and IS-IS on a whole new level. Both are routing protocols that rely on network topology information and are used inside autonomous systems (routing domains).

Comparing how the two protocols implement inter-area routing also reveals the limitations of each one.

In the lesson:
- We will look at how inter-area routing is implemented in OSPF
- We will learn how inter-area routing is implemented in IS-IS
- We will implement inter-area routing in practice in a network that uses one of these modern routing protocols

👉 Registration and details about the Network Engineer. Professional course: https://vk.cc/cLDnyO

Advertisement. ООО «Отус онлайн-образование» (OTUS Online Education LLC), OGRN 1177746618576, www.otus.ru, erid: 2VtzqwmHK6b


DevOps&SRE Library

16 May 2025 11:05

Go meetup with MWS engineers

On June 10, Yekaterinburg will host Go Up, a technical meetup from MWS for Go developers.

Speakers:
• Emil Ibragimov: generating a CLI from OpenAPI
• Valery Loktaev: automating Terraform
• Georgy Fateev: Go code security

The meetup is a great chance to learn how a cloud platform is built from the inside and to put questions to top engineers in an informal setting. Go Up to the Cloud!

Museum of the History of Yekaterinburg, 18:00. Registration


DevOps&SRE Library

15 May 2025 18:01

🌐 MPLS and enterprise networks: unused options or vital functionality?

Understanding the fundamentals of MPLS technologies lets you look at their use in enterprise networks on a whole new level.

Comparing the types of services that the MPLS technology suite provides also reveals the limitations of using any given service in enterprise networks.

In the lesson:
- We will cover the basics of MPLS
- We will learn how MPLS-based services are implemented
- We will implement one of the MPLS services in practice

👉 Registration and details about the Network Engineer. Professional course: https://vk.cc/cLUwNB

Advertisement. ООО «Отус онлайн-образование» (OTUS Online Education LLC), OGRN 1177746618576, www.otus.ru, erid: 2Vtzqwp5V7r


DevOps&SRE Library

15 May 2025 11:05

⚠️ Terraform is changing the rules of the game in DevOps. Want to master a tool that deploys infrastructure in a few clicks?

⏰ At the open webinar on May 20 at 20:00 MSK you will learn how Terraform makes infrastructure manageable, transparent, and scalable. We will go through the key concepts: providers, state, modules, and variables. You will learn why IaC has become the gold standard of DevOps.

💪 Learn to automate resource deployment, get rid of routine manual work, and free up time for the tasks that really matter.

👉 Register now and get a discount on the "DevOps Practices and Tools" training program: https://vk.cc/cLWecK

Advertisement. ООО «Отус онлайн-образование» (OTUS Online Education LLC), OGRN 1177746618576, www.otus.ru, erid: 2Vtzquo9Zat


DevOps&SRE Library

14 May 2025 18:05

❓ So what do you know about DevSecOps?

Test yourself: take the DevSecOps quiz and find out whether you could become a DevSecOps engineer!

🫵 Pass it and you can join the OTUS course "Implementing and Working in DevSecOps" at a special price.

Master the principles and popular tools of the DevSecOps engineer, which will help raise your market value and income, in the OTUS online course "Implementing and Working in DevSecOps".

The original program was prepared by an experienced engineer and validated by partner StartX.

➡️ TAKE THE TEST

💥 As a bonus for passing the test, you will get access on the course site to recordings of the best open lessons.

Advertisement. ООО «Отус онлайн-образование» (OTUS Online Education LLC), OGRN 1177746618576, erid: 2VtzqwdYVv4


DevOps&SRE Library

14 May 2025 11:05

Deleting S3 buckets: what should you keep in mind?
S3 object storage is a reliable way to work with large volumes of data. On May 27 we are holding a meetup for those who want to understand exactly how key S3 processes work, from configuring versioning to safely deleting buckets.

In a demo format we will cover:
🔹 configuring versioning, multipart uploads, and lifecycle policies
🔹 automating bucket cleanup (including delete markers and incomplete multipart uploads)
🔹 how to prepare a bucket for deletion
🔹 configuring access policies, temporary links, and server-side encryption (SSE)

Speaker
Evgenia Tarashkevich, engineer at K2 Cloud

Format
Online meetup

We welcome administrators, DevOps engineers, system architects, and everyone who works with S3.

Register>>


DevOps&SRE Library

13 May 2025 17:01

L4-L7 Performance: Comparing LoxiLB, MetalLB, NGINX, HAProxy

As Kubernetes continues to dominate the cloud-native ecosystem, the need for high-performance, scalable, and efficient networking solutions has become paramount. This blog compares LoxiLB with MetalLB as Kubernetes service load balancers and pits LoxiLB against NGINX and HAProxy for Kubernetes ingress. These comparisons mainly focus on performance for modern cloud-native workloads.

https://dev.to/nikhilmalik/l4-l7-performance-comparing-loxilb-metallb-nginx-haproxy-1eh0


DevOps&SRE Library

13 May 2025 09:02

Guardrails for Your Cloud: A Simple Guide to OPA and Terraform

https://devsecopsai.today/guardrails-for-your-cloud-a-simple-guide-to-opa-and-terraform-aada0d589dc5


DevOps&SRE Library

12 May 2025 09:02

Incident SEV scales are a waste of time

Ask an engineering leader about their incident response protocol and they’ll tell you about their severity scale. “The first thing we do is we assign a severity to the incident,” they’ll say, “so the right people will get notified.”

And this is sensible. In order to figure out whom to get involved, decision makers need to know how bad the problem is. If the problem is trivial, a small response will do, and most people can get on with their day. If it’s severe, it’s all hands on deck.

Severity correlates (or at least, it’s easy to imagine it correlating) to financial impact. This makes a SEV scale appealing to management: it takes production incidents, which are so complex as to defy tidy categorization on any dimension, and helps make them legible.

A typical SEV scale looks like this:

- SEV-3: Impact limited to internal systems.
- SEV-2: Non-customer-facing problem in production.
- SEV-1: Service degradation with limited impact in production.
- SEV-0: Widespread production outage. All hands on deck!

But when you’re organizing an incident response, is severity really what matters?

https://blog.danslimmon.com/2025/01/29/incident-sev-scales-are-a-waste-of-time/


DevOps&SRE Library

11 May 2025 09:02

tilt

Define your dev environment as code. For microservice apps on Kubernetes.

https://github.com/tilt-dev/tilt


DevOps&SRE Library

10 May 2025 09:01

brush

brush (Bo(u)rn(e) RUsty SHell) is a POSIX- and bash-compatible shell, implemented in Rust. It's built and tested on Linux and macOS, with experimental support on Windows. (Its Linux build is fully supported running on Windows via WSL.)

https://github.com/reubeno/brush


DevOps&SRE Library

09 May 2025 09:00

cloud-snitch

Map visualization and firewall for AWS activity, inspired by Little Snitch for macOS.

https://github.com/ccbrown/cloud-snitch


DevOps&SRE Library

08 May 2025 09:00

kubepfm

kubepfm is a simple wrapper to the kubectl port-forward command for multiple pods/deployments/services. It can start multiple kubectl port-forward processes based on the number of input targets. Terminating the tool (Ctrl-C) will also terminate all running kubectl sub-processes.
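
The idea described above can be sketched in a few lines. This is not kubepfm's implementation or its command-line interface; it is a hypothetical illustration that only assumes `kubectl` is on the PATH and uses the standard `kubectl port-forward TYPE/NAME LOCAL:REMOTE` form. The names `build_commands` and `run_all` are made up for the example.

```python
import subprocess


def build_commands(targets):
    """Build one `kubectl port-forward` argv per target.

    `targets` is an iterable of (resource, local_port, remote_port),
    for example ("svc/api", 8080, 80) or ("deployment/web", 9090, 9090).
    """
    return [
        ["kubectl", "port-forward", resource, f"{local}:{remote}"]
        for resource, local, remote in targets
    ]


def run_all(targets):
    """Start every port-forward and stop them all together on Ctrl-C."""
    procs = [subprocess.Popen(cmd) for cmd in build_commands(targets)]
    try:
        for proc in procs:
            proc.wait()
    except KeyboardInterrupt:
        pass  # Ctrl-C: fall through to cleanup
    finally:
        for proc in procs:
            if proc.poll() is None:  # still running
                proc.terminate()
        for proc in procs:
            proc.wait()
```

Calling `run_all([("svc/api", 8080, 80), ("deployment/web", 9090, 9090)])` would start two forwards and tear both down when interrupted. Keeping the command construction in a separate function makes it possible to check the generated arguments without a cluster.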

https://github.com/flowerinthenight/kubepfm


DevOps&SRE Library

07 May 2025 11:00

🐳❓ Want to become an expert in Docker and microservices? Master the key skills for developing, packaging, and deploying applications with Docker images!

⏰ At the open webinar on May 13 at 20:00 MSK we will look at how to use Docker effectively to containerize microservices and automate their deployment. You will get to know the principles of building and optimizing Docker images, as well as DevOps and CI/CD best practices.

Knowing how to use Docker to automate and manage microservices will make you more competitive in the job market. Gain knowledge that is in demand at large companies.

👉 Register for the open lesson and get a discount on the "DevOps Practices and Tools" training program: https://vk.cc/cLmRPj

Advertisement. ООО «Отус онлайн-образование» (OTUS Online Education LLC), OGRN 1177746618576, www.otus.ru, erid: 2VtzqvZdW9h


DevOps&SRE Library

06 May 2025 17:05

Connecting Kubernetes K3s cluster to external router using BGP with MetalLB and Nginx Ingress

https://medium.com/@nikoolayy1/connecting-kubernetes-k3s-cluster-to-external-router-using-bgp-with-metallb-bgp-nginx-as-ingress-9bb767dcecd2


DevOps&SRE Library

06 May 2025 09:02

Turing Pi 2 Home cluster

https://tomassirio.medium.com/turing-pi-2-home-cluster-e4a7446ef4ba
