Job Description
The mission of the OFR is to support the Financial Stability Oversight Council (FSOC) in promoting financial stability by: collecting data on behalf of FSOC; providing such data to FSOC and member agencies; standardizing the types and formats of data reported and collected; performing applied research and essential long-term research; developing tools for risk measurement and monitoring; performing other related services; making the results of the activities of the OFR available to financial regulatory agencies; and assisting such member agencies in determining the types and formats of data authorized to be collected by such member agencies.
The Senior Systems Engineer - Observability (SSE) will define and implement infrastructure and application observability and set up governance, optimization, monitoring, and control for a consolidated common operating picture for IT operations. The role will work with engineering, application, security operations, Service Desk, and enterprise/solution architects to develop and implement services, and to monitor, report, and automate where applicable. The SSE serves as a subject matter expert in a complex array of full stack solutions, performing research, analysis, design, creation, and implementation to meet current and future requirements. The SSE is responsible for migrating feeds from Splunk to Cribl, onboarding new feeds, and providing Tier 3 support, as well as working with vendors on open tickets and working in an Agile environment and with enterprise change control systems. The SSE is also responsible for building and implementing an enterprise observability strategy and operationalizing it.
Key Tasks and Responsibilities
- Design, implement, and maintain high-performance and scalable observability solutions in a cloud environment.
- Collaborate with cross-functional teams to gather requirements, architect solutions, and deploy logging and monitoring environments that align with business needs.
- Configuration and maintenance of Datadog integrations including Webhooks, Amazon, Cisco, CrowdStrike, Cribl Stream, Container, VMware, SNMP, journald, Okta, Python, Zscaler, Microsoft 365, and Palo Alto.
- Configuration of telemetry logs through Cribl Stream including syslog, SNMP traps, JSON, AWS CloudWatch, AWS S3.
- Development of custom data/telemetry pipelines including Grok parsing, GeoIP parsing, field remapping, and error tracking.
- Ingest telemetry logs directly from cloud SaaS providers such as Zscaler, Okta, CrowdStrike, ServiceNow, Microsoft 365.
- Installation and configuration of the Datadog Agent and Datadog Synthetics Agent on Windows servers, Linux servers, and Docker/Kubernetes containers.
- Configuration of the Datadog Agent to collect host logs, processes, custom metrics (including SNMP), and network performance monitoring (NPM).
- Configuration of Synthetic testing to monitor infrastructure uptime SLAs and SLOs using private locations.
- Configuration of service-related monitors based on metrics, logs, live processes, service checks, and anomalies/outliers, including monitoring of serverless workloads such as AWS Lambda functions.
- Development of custom dashboards with a focus on reliability and performance of services.
- Configuration and management of Service Catalog, including the definition of services and associated dashboards, monitors, SLOs, synthetic tests, metrics, and logs.
- Configuration of incident management and service-based analytics including integration with JIRA and/or ServiceNow.
- Maintain code repositories and versioning of any scripting or automation.
- Provide technical leadership, oversight, governance, and direction for integrating with, and reporting on, observability pipelines.
- Provide consultative services to support the application integrations required to be observed/monitored, such as Hadoop HDFS, Hadoop MapReduce, and Hive.
- Identify opportunities for monitoring improvement, including incorporating APM and RUM monitoring.
- Update documentation and user guides as needed.
- Collaborate with cross-functional teams.
- Configure monitors & alerts to integrate with Incident Management tools.
Education & Experience
- Undergraduate degree in an engineering or computer science discipline and/or equivalent experience/certification.
- 7+ years of experience in information technology with hands-on technical/engineering roles including:
- 2+ years of experience working with Datadog, including hands-on experience administering Datadog and supporting a Datadog migration or implementation.
- 3+ years of experience with AWS.
- 3+ years of experience with data onboarding within a large-scale enterprise environment.
- Experience in Datadog including building dashboards, reports, and alerts to meet customer requirements.
- Experience with Infrastructure & Monitoring as Code tools.
- Experience configuring and supporting additional Datadog modules.
- Solid understanding of networking and device configuration.
- Experience with migrating from other monitoring platforms to Datadog.
- Experience with Incident Response tools.
- Knowledge of Agile and continuous integration practices.
- Collaborative mindset that thrives in fast-paced environments.
- Excellent verbal and written communication skills including the ability to author and present materials ranging from detailed technical specifications to high-level concepts for senior audiences.
Certifications
- Preference given for Datadog, Cribl, and AWS certifications.
Security Clearance
- Public Trust
- Must be a U.S. citizen
Other (Travel, Work Environment, DoD 8570 Requirements, Administrative Notes, etc.)
- This is a remote/work-from-home role.
Computer World Services is an affirmative action and equal employment opportunity employer. Current employees and/or qualified applicants will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, protected veteran status, genetic information, or any other characteristic protected by local, state, or federal laws, rules, or regulations.
Computer World Services is committed to the full inclusion of all qualified individuals. As part of this commitment, Computer World Services will ensure that individuals with disabilities (IWD) are provided reasonable accommodations. If reasonable accommodation is needed to participate in the job application or interview process, to perform essential job functions, and/or to receive other benefits and privileges of employment, please contact Aaron McClellan in Human Resources at
314.952.5138
or