Data Scraper

The Data Scraper is the core data collection engine within the Mirox-Agent, actively retrieving real-time information from all monitored equipment at your plant. It connects to your loggers, inverters, meters and battery systems through a library of vendor-specific adapters, normalizes everything into one consistent metric vocabulary, and forwards the result to the rest of the platform — while running a growing set of edge analytics (performance ratio, curtailment tracking, clear-sky baselines, forecasting and network monitoring) directly at the plant.

Purpose and Role

The Data Scraper's primary purpose is to actively collect raw measurements from equipment and forward them for processing. It acts as the bridge between diverse manufacturer equipment and the unified Mirox platform, translating proprietary data formats into standardized metrics.

Core Responsibilities:

Connect to data loggers and monitoring devices through vendor-specific adapters
Retrieve raw measurements on configurable schedules
Transform manufacturer-specific data into standardized metric format
Automatically discover and track installation components
Monitor component activity and operational status
Run edge analytics (expected power, performance ratio, curtailment, clear-sky and forecasts)
Inspect the plant's local network and audit device access
Forward metrics to the Time-Series Database and Digital Twin service
Report component health and operational status to the IoT Cloud

This separation of concerns keeps the Data Scraper lightweight, focused, and independently deployable.

Architecture Overview

The Data Scraper operates as an asynchronous, event-driven service in which multiple data collection tasks run concurrently.

Key Architectural Principles:

Stateless: No persistent state between restarts
Adapter-Based: A dedicated, vendor-specific adapter speaks each device's protocol
Self-Healing: Automatic error recovery with exponential backoff
Concurrent: Each data source collected independently
Edge-Deployed: Runs on or near the plant, close to the equipment it reads
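
The self-healing and concurrency principles above can be illustrated with a minimal asyncio sketch. The names `backoff_delay`, `collect_with_retry`, and `collect_all`, as well as the base delay, cap, and attempt limit, are assumptions made for illustration; they are not the Data Scraper's actual API.

```python
import asyncio

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))

async def collect_with_retry(fetch, max_attempts: int = 5, base: float = 1.0):
    """Await the hypothetical async `fetch` until it succeeds, backing off between tries."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            await asyncio.sleep(backoff_delay(attempt, base))

async def collect_all(sources: dict, base: float = 1.0) -> dict:
    """Collect every source concurrently; a failing source does not block the others."""
    results = await asyncio.gather(
        *(collect_with_retry(fetch, base=base) for fetch in sources.values()),
        return_exceptions=True,  # failures are returned as exception objects
    )
    return dict(zip(sources.keys(), results))
```

Each source gets its own retry loop, so a slow or failing device only delays its own result.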

Network Access Requirements

The Data Scraper requires direct TCP/IP network access to its data sources. This typically means:

Direct Ethernet/WiFi connectivity to the device's IP address
Open network ports for the device's protocol (e.g., TCP 80/443 for HTTP and WebSocket APIs)
Proper network routing between the Data Scraper host and devices

If direct network access is not possible (e.g., isolated OT networks, air-gapped systems, serial-only devices), we may need to implement an intermediate data collector such as:

Third-party data logger with network connectivity
Protocol gateway (Serial-to-Ethernet, fieldbus bridge, etc.)
Custom hardware solution for specialized interfaces

Consult with our engineering team to evaluate connectivity options for your specific installation.

The Adapter System

Data flow: device-specific protocols → adapter → standardized data format

Concept

An adapter is a vendor-specific connector module that knows how to communicate with one particular family of device. Each adapter is hand-built and reverse-engineered against that device's own web, API, or database interface — there is no single "speaks-any-protocol" engine. The adapter system is the core extensibility mechanism: supporting a new device means adding a new adapter, which we do on request (see below).

Each adapter is a self-contained module responsible for:

Connection Management - Establishing and maintaining communication
Data Retrieval - Fetching measurements using the appropriate protocol
Data Transformation - Converting to standardized metric format

All adapters inherit from a base class that provides health monitoring, automatic retry logic, exponential backoff on failures, metric validation, and status reporting to the IoT platform.

Health Management

Each adapter implements an automatic state machine:

INITIALIZING: Starting up and establishing initial connections
HEALTHY: Operating normally with successful data collections
UNHEALTHY: Experiencing errors but attempting to continue
RECONNECTING: Performing recovery actions after repeated failures
FROZEN: The device is returning stale data — the same values repeat, or no new reading has arrived within the expected window
PAUSED: Temporarily paused by user command; resumes automatically when the pause expires

The system automatically transitions between states, reports status to the platform, and attempts recovery without manual intervention. The FROZEN state is what lets the platform tell a genuinely-down logger apart from one that is merely repeating a stuck value.
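
As a rough sketch of how such a state machine might choose its next state after each collection attempt, consider the following. The state names come from the list above; the function `next_state`, its parameters, and the threshold of three consecutive failures are assumptions, not the adapter base class's real interface.

```python
from enum import Enum

class AdapterState(Enum):
    INITIALIZING = "initializing"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    RECONNECTING = "reconnecting"
    FROZEN = "frozen"
    PAUSED = "paused"

def next_state(*, success: bool, stale: bool = False, paused: bool = False,
               failures: int = 0, reconnect_after: int = 3) -> AdapterState:
    """Pick the state after one collection attempt.

    `failures` counts consecutive failed attempts, including the current one.
    `reconnect_after` is a hypothetical threshold for starting recovery.
    """
    if paused:
        return AdapterState.PAUSED
    if not success:
        if failures >= reconnect_after:
            return AdapterState.RECONNECTING
        return AdapterState.UNHEALTHY
    if stale:
        # The read succeeded, but the values did not change or are too old.
        return AdapterState.FROZEN
    return AdapterState.HEALTHY
```

Note that FROZEN is only reachable from a successful read, which is what separates it from an unreachable logger.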

Supported Devices

The Data Scraper ships with around fifteen vendor-specific adapters, each built for a particular device family. The list below reflects what is supported today and grows whenever a new device is integrated.

Device family	What it is	How it is read
Bluelog data logger	Meteocontrol-style logger (sensors + strings)	HTTP login plus a live WebSocket feed
SMA Sunny Central	Central inverter controller	Vendor HTTP API (with shutdown detection)
SMA Power Manager	SMA plant controller	Vendor HTTP API
Sungrow logger	Sungrow inverter data logger	Live WebSocket connection
Huawei SmartLogger	SmartLogger 1000 / 3000 / 4000	HTTP web interface (zero-config onboarding)
Fronius inverters	Fronius Datamanager / Datalogger (multiple inverters + optional sensor card)	Fronius Solar API — open JSON REST, no credentials (zero-config onboarding)
Janitza meters	Power-quality meters	HTTP, no credentials (zero-config onboarding)
Phoenix Contact PLC	PLCnext / SPS controller	Vendor HTTPS REST API (zero-config onboarding)
Dexcon controller	Plant controller	Vendor HTTPS REST API
Zebotec	Inverters and sensors	Vendor HTTP API
FREQCON BESS	Battery storage system	Vendor HTTP plus a time-series query interface
PRTG	Network-monitoring server	PRTG HTTP API
Weather model	Open-Meteo weather + on-device PV power model	HTTP (clear-sky and irradiance modelling)

Real Transports in Use

Across these adapters, the actual communication methods are vendor HTTP/HTTPS APIs (most common), live WebSocket connections (Bluelog and Sungrow), S3 / file access, and a time-series query interface used only by the FREQCON battery adapter. There is no generic Modbus, MQTT, FTP/SFTP, or WebDAV ingestion — each integration is purpose-built for its device.

Creating New Adapters

New adapters can be developed to support additional devices or protocols. The modular design and base class functionality significantly reduce development time.

Compatibility with Legacy Equipment: We can create adapters for older devices that were never specifically designed for data export. As long as the device provides its data in any accessible way—whether through a REST API, web interface, database, file system, or any other mechanism—we can extract and integrate that data into the platform.

Unrestricted Data Collection: Our adapters are not limited to the pre-defined data-export formats that data loggers typically provide. We can collect any data that the device makes available, going beyond the standard set of metrics a manufacturer's logger might expose. If a device has additional diagnostic information, advanced parameters, or hidden data points accessible through its interface, we can retrieve and standardize them.

Custom Adapters on Demand

We can create new adapters for virtually any data source at any time upon customer request. The adapter system is designed for rapid extensibility—new protocol support can typically be implemented within days depending on complexity. If you have equipment from a manufacturer not yet supported, contact us to discuss custom adapter development.

No Vendor Documentation Required

Adapter development does not strictly require vendor API documentation. Through network traffic analysis, protocol reverse engineering (where legally permitted), and empirical testing, we can often create functional adapters even for devices with undocumented interfaces. This capability is particularly valuable for legacy equipment or systems with proprietary protocols.

Onboarding Through the Platform

For a subset of devices, you can bring a logger online from the platform without hand-writing any configuration. An onboarding wizard asks the agent to dry-run-connect to the device and streams the live probe results back to you, so you see immediately whether the connection works before committing it. There are two flavours:

Zero-config onboarding: the adapter already owns the device's full reading set, so the wizard just shows a read-only live preview and you save. Available today for the device families marked zero-config in the Supported Devices table: Huawei SmartLogger, Fronius inverters, Janitza meters, and Phoenix Contact controllers. Janitza and Fronius need no credentials at all.
Interactive mapping for generic loggers: some loggers expose arbitrary raw values the platform can't interpret on its own. For these the wizard asks the agent to enumerate every raw value the device exposes (group, name, unit, live sample), the operator maps each one to a known metric, and a mapping-driven dry run previews the exact metrics that would be produced before saving. QReader is the first generic logger supported this way.

Not Universal Plug-and-Play

Only the device families above are wizard-onboardable today (the zero-config families plus generic loggers like QReader). All other adapters still require a per-device configuration delivered with the agent, so treat onboarding automation as device-specific rather than universal.

Metric Standardization

All collected data is transformed into a standardized metric format defined by the platform's metric taxonomy. This ensures consistency across all data sources and enables unified processing downstream.

Metric Structure

Each metric follows a standardized structure compatible with modern time-series databases:

Components:

Name: Standardized metric identifier from predefined taxonomy
Value: Numeric measurement in the platform's standard unit for that quantity (see Unit Conventions)
Labels: Key-value pairs for component identification and grouping
Timestamp: Optional preservation of original device timestamp

Standard Labels:

Source adapter type and instance number
Human-readable names
Component identifiers (inverter ID, string number, etc.)
Physical location or grouping information
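
A metric with these components could be modelled roughly as follows. This is an illustrative sketch: the class `Metric`, the metric name `inverter_power`, and the label keys are hypothetical and are not taken from the platform's taxonomy.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Metric:
    name: str                          # identifier from the metric taxonomy
    value: float                       # numeric measurement in the standard unit
    labels: dict = field(default_factory=dict)
    timestamp: Optional[float] = None  # device time as Unix seconds, if preserved

    def series_key(self) -> tuple:
        """Name plus sorted labels identifies one time series."""
        return (self.name, tuple(sorted(self.labels.items())))

# Hypothetical example values.
m = Metric(
    "inverter_power",
    12500.0,
    {"adapter": "sungrow", "instance": "1", "inverter_id": "INV-03"},
)
```

Two metrics with the same name and labels belong to the same time series, whatever their values or timestamps.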

Unit Conventions

All metrics use one fixed, unprefixed unit per quantity, regardless of what the manufacturer's device reports:

Power: Watts (W)
Energy: Watt-hours (Wh)
Voltage: Volts (V)
Current: Amperes (A)
Temperature: Celsius (°C)
Irradiance: Watts per square meter (W/m²)

Adapters automatically convert from manufacturer-specific units (kW, MWh, etc.) to these standards during the transformation phase.
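
The conversion step can be sketched as a lookup of multipliers. The table `TO_STANDARD` and the function `to_standard` are assumptions for illustration; the set of source units a real adapter handles is not specified here.

```python
# Hypothetical table: reported unit -> (standard unit, multiplier).
TO_STANDARD = {
    "W": ("W", 1.0), "kW": ("W", 1e3), "MW": ("W", 1e6),
    "Wh": ("Wh", 1.0), "kWh": ("Wh", 1e3), "MWh": ("Wh", 1e6),
    "V": ("V", 1.0), "kV": ("V", 1e3),
    "A": ("A", 1.0), "mA": ("A", 1e-3),
}

def to_standard(value: float, unit: str) -> tuple:
    """Convert a reported value to the standard unit for its quantity."""
    try:
        standard_unit, factor = TO_STANDARD[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None
    return value * factor, standard_unit
```

Rejecting unknown units outright, rather than passing values through unchanged, avoids silently storing a kW reading as W.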

Metric Categories

The platform defines 376 standardized metric types organized into 11 families:

Family	Count	What it covers
Powerplant	156	Grid, AC output, inverters, combiner boxes, strings, irradiation
Battery	86	Battery box, storage, module and cell measurements
Weather	46	Weather inputs and measurements
Weather Model	16	Modelled PV production from weather
Network SNMP	16	SNMP readings from network devices
Agent	19	Agent self-telemetry and health
Network Monitor	11	Local-network monitoring measurements
AI Usage	11	AI feature usage at the edge
Operator	7	Operator-fleet telemetry
Network	4	Basic connectivity
Scraper	4	Data Scraper self-metrics

Label expansions (per-string, per-phase, per-inverter, and so on) multiply these into far more individual time series at a real plant. For the complete metric taxonomy and definitions, see Metric Collection.

Data Collection Flow

Polling Strategies

Adapters support two polling modes:

Interval-Based (default): Executes every N seconds after the previous collection completes. Simple and responsive to varying collection durations.

Static Time-Based: Executes at fixed intervals counted from midnight, with an optional offset (e.g., at 00:01, 00:06, 00:11 for 5-minute intervals with a 1-minute offset). Useful for alignment with external systems.
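
The difference between the two modes comes down to how the next run time is computed. The sketch below assumes hypothetical helper names (`next_interval_run`, `next_static_run`) and second-based parameters; the real scheduler may differ.

```python
from datetime import datetime, timedelta

def next_interval_run(finished: datetime, interval_s: int) -> datetime:
    """Interval-based: wait `interval_s` seconds after the previous run completed."""
    return finished + timedelta(seconds=interval_s)

def next_static_run(now: datetime, interval_s: int, offset_s: int = 0) -> datetime:
    """Static: next slot strictly after `now` on the grid midnight + offset + k * interval."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (now - midnight).total_seconds() - offset_s
    k = int(elapsed // interval_s) + 1  # index of the next slot on the grid
    return midnight + timedelta(seconds=offset_s + k * interval_s)
```

With a 5-minute interval and a 1-minute offset, a collection that is ready at 00:03:20 is scheduled for 00:06:00 in static mode, regardless of how long the previous run took.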

Processing Pipeline

After collection, metrics pass through several processing stages:

Metric Preparation: Source labels are added, timestamps applied, and structure validated.

Filtering: Configured filters can modify values, validate ranges, or skip metrics based on rules.

Calculations: Automated calculators derive additional metrics:

Solar radiation power integrated to irradiation energy
String voltage × current calculated to power
Power values integrated to energy over time
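
The calculations listed above might look roughly like this. The function names are hypothetical, and the trapezoidal rule is an assumption chosen for the sketch; the integration method the calculators actually use is not stated here.

```python
def string_power(voltage_v: float, current_a: float) -> float:
    """DC string power in watts from voltage and current."""
    return voltage_v * current_a

def integrate_wh(samples) -> float:
    """Trapezoidal integration of (unix_seconds, watts) samples into watt-hours.

    Passing irradiance samples in W/m² yields irradiation in Wh/m² instead.
    """
    total_ws = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total_ws += 0.5 * (p0 + p1) * (t1 - t0)  # watt-seconds for this step
    return total_ws / 3600.0
```

For example, a constant 1000 W held for one hour integrates to 1000 Wh.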

Component Discovery: As metrics flow through, the Data Scraper automatically discovers and identifies installation components. This is a crucial feature—since the Data Scraper is the layer that actively collects data, it inherently knows which components exist and are providing data. The system automatically discovers:

Inverters (from inverter power metrics)
String combiner boxes / GAKs (from GAK metrics)
Individual strings (from string voltage/current metrics)
Irradiation sensors (from radiation metrics)
Grid connection points (from grid energy metrics)

Discovered components are synchronized to the IoT platform for inventory management, creating a real-time, self-maintaining equipment registry without manual configuration.

Component Activity Tracking

Because the Data Scraper continuously polls data sources, it can detect which components are actively providing data and which have stopped responding. When a component stops sending metrics, the Data Scraper marks it as inactive in the IoT Cloud. When it resumes, it's automatically marked active again. This provides real-time awareness of equipment operational status—not just whether the Data Scraper can reach the data logger, but whether individual components within the installation are functioning and reporting data.
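
One simple way to implement this kind of tracking is a last-seen timestamp per component with a timeout. The class `ActivityTracker` and the timeout value below are assumptions for illustration only.

```python
class ActivityTracker:
    """Tracks when each component last produced a metric (hypothetical sketch)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # component id -> time of last metric
        self.active = {}     # component id -> last reported status

    def observe(self, component_id: str, now: float) -> None:
        """Record that a metric for this component arrived at time `now`."""
        self.last_seen[component_id] = now

    def sweep(self, now: float) -> list:
        """Return [(component_id, is_active)] for components whose status changed."""
        changes = []
        for cid, seen in self.last_seen.items():
            is_active = (now - seen) <= self.timeout_s
            if self.active.get(cid) != is_active:
                self.active[cid] = is_active
                changes.append((cid, is_active))
        return changes
```

Because `sweep` returns only changes, the registry would be updated when a component goes quiet or comes back, not on every polling cycle.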

Production Detection: The system monitors plant operational state:

Detects when production begins based on irradiance and power
Identifies unexpected shutdowns during production hours
Reports state transitions for alerting

Metric Grouping: Metrics are batched by time series to optimize database insertion performance.

Network Traffic Optimization

After metric grouping and batching, the Data Scraper applies additional compression before transmitting data to the Mirox-Cloud. This significantly reduces network traffic volume, which is particularly beneficial when internet bandwidth is limited or metered. For more details on bandwidth considerations, see On-Site Deployment.
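
To give a feel for why grouping plus compression helps, here is a minimal sketch using JSON and gzip from the Python standard library. Both the payload layout and the choice of gzip are assumptions; the actual wire format and compression scheme are not described here.

```python
import gzip
import json
from collections import defaultdict

def group_by_series(points) -> dict:
    """Group (name, labels, timestamp, value) points into one batch per time series."""
    groups = defaultdict(list)
    for name, labels, ts, value in points:
        key = (name, tuple(sorted(labels.items())))
        groups[key].append((ts, value))
    return groups

def encode_batch(groups) -> bytes:
    """Serialize grouped samples to JSON and gzip-compress the result."""
    payload = [
        {"name": name, "labels": dict(labels), "samples": samples}
        for (name, labels), samples in groups.items()
    ]
    return gzip.compress(json.dumps(payload).encode("utf-8"))
```

Grouping states the name and labels once per series instead of once per sample, and repetitive numeric data of this kind typically compresses well.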

Data Export

Processed metrics are forwarded to two destinations:

Time-Series Database: Metrics are pushed in batches with rate limiting and retry logic for long-term storage and historical querying.

Digital Twin Webhook: A separate background task continuously forwards the latest metric values to the Digital Twin service (a completely separate microservice) for real-time analysis. The Data Scraper has no knowledge of what the Digital Twin does with the data—it simply provides the metrics. For information about Digital Twin processing, see Digital Twin.

Stateless Operation

The Data Scraper is designed to be completely stateless:

No persistent storage—all state exists only in memory
Can be stopped and restarted without data loss
Multiple instances can run independently for different parks
Each polling cycle is independent of previous cycles
Crash-resilient with no risk of corrupting persistent state

The only persistent state exists externally:

Configuration files (version-controlled)
Time-Series Database (external system)
IoT Cloud component registry (external system)

This design ensures operational simplicity, reliability, and easy horizontal scaling.

Separation of Concerns

The Data Scraper has a narrow, focused responsibility that enables clear separation from other platform components:

Data Scraper:

Collects raw measurements from equipment
Transforms data to standard format
Discovers and tracks components
Monitors component activity status
Forwards metrics to other services

Digital Twin: Validates against physics models and detects anomalies and losses

Time-Series Database: Stores historical data, provides query interface

IoT Cloud: Maintains component registry, tracks device status, manages equipment inventory

This separation enables independent development, testing, deployment, and scaling of each component while ensuring each service focuses on its core competency.

Advanced Features

Automatic Health Monitoring

Each adapter implements a state machine that tracks its operational health. The current state is reported to the platform automatically and exposed via the metrics API for operational monitoring.

Automatic Component Discovery

The Data Scraper's position as the data collection layer gives it a unique advantage: it inherently knows which components exist at an installation because it directly interacts with the metrics they produce. As metrics flow through the system, components are automatically discovered from metric labels and registered with the IoT platform.

Discovery Process:

Metrics arrive with identifying labels (inverter ID, string number, sensor location, etc.)
The Data Scraper extracts component information from these labels
New components are automatically registered with the IoT Cloud
Component metadata (type, identifier, location) is synchronized
The platform maintains an up-to-date equipment inventory without manual entry

This self-discovery mechanism ensures the platform always knows what equipment exists at the installation, eliminating the need for manual configuration and reducing deployment time.
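
The discovery process above can be sketched as a scan over metric labels. The label keys in `LABEL_TO_TYPE`, the component type names, and the function `discover` are hypothetical; the real label vocabulary is defined by the metric taxonomy.

```python
# Hypothetical mapping from label key to component type.
LABEL_TO_TYPE = {
    "inverter_id": "inverter",
    "gak_id": "combiner_box",
    "string_id": "string",
    "sensor_id": "irradiation_sensor",
}

def discover(metrics, known: set) -> list:
    """Return components seen in `metrics` that are not yet in `known`.

    `metrics` is an iterable of (metric_name, labels). `known` is updated in place,
    so a component is reported as new only once.
    """
    new = []
    for _name, labels in metrics:
        for key, component_type in LABEL_TO_TYPE.items():
            if key in labels:
                component = (component_type, labels[key])
                if component not in known:
                    known.add(component)
                    new.append(component)
    return new
```

In this sketch, only the returned list of new components would need to be sent to the registry.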

Production State Detection

The service monitors plant operational state and detects production starts, unexpected shutdowns during production hours, and other state transitions for alerting and analysis, reporting only when the state actually changes. It also watches for overproduction, meaning output above the clear-sky model for a sustained period. Sustained overproduction can indicate a logger that is returning frozen values; in that case the service falls back to the clear-sky model so the data stream stays sensible. This provides real-time operational awareness beyond raw measurements.
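
Reporting only on change can be sketched with a small classifier and a reporter that remembers the last state. The state names, the irradiance and power thresholds, and both helper names are assumptions for illustration; the service's real detection rules are not specified here.

```python
def production_state(irradiance_wm2: float, power_w: float,
                     min_irradiance: float = 50.0, min_power: float = 0.0) -> str:
    """Classify the plant state from irradiance and power (hypothetical thresholds)."""
    if irradiance_wm2 < min_irradiance:
        return "idle"  # too little light to expect production
    return "producing" if power_w > min_power else "shutdown"

class StateReporter:
    """Emits a transition only when the state differs from the last one seen."""

    def __init__(self):
        self.last = None

    def update(self, state: str):
        """Return (previous, new) on a change, or None when nothing changed."""
        if state == self.last:
            return None
        previous, self.last = self.last, state
        return (previous, state)
```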

Calculated Metrics

Several calculators automatically derive metrics from raw measurements—solar radiation integrated to irradiation energy, string power calculated from voltage and current, and power values integrated to energy over time. These calculations happen transparently, enriching the data stream without requiring explicit configuration.

Edge Analytics

Beyond raw collection, the Data Scraper runs a set of analytics directly at the plant, computed from the live metric feed and exported as chartable time series alongside the raw data.

Expected Power and Performance Ratio

The agent computes the expected power for each plant from its irradiance sources (on-site pyranometer, satellite radiation, or weather model) and compares it against actual output to derive the performance ratio (PR), a normalized measure of how well the plant converts available sunlight into electricity. PR is calculated per day, with clipping, frost, and data-quality filters applied so that anomalous days do not distort the result.
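
As a simplified illustration of the idea, PR can be computed as actual energy divided by expected energy over the valid samples of a day. The linear expected-power model (rated power scaled by irradiance relative to 1000 W/m²), the `valid` flag standing in for the filters, and the function names are all assumptions; the agent's actual model is not described here.

```python
def expected_power_w(irradiance_wm2: float, rated_power_w: float) -> float:
    """Assumed linear model: rated power scaled by irradiance relative to 1000 W/m²."""
    return rated_power_w * irradiance_wm2 / 1000.0

def daily_pr(samples, rated_power_w: float):
    """Performance ratio over equally spaced samples of one day.

    `samples` is an iterable of (irradiance_wm2, actual_power_w, valid);
    samples flagged invalid (e.g. by a clipping or frost filter) are skipped.
    Returns None when there is no expected production to compare against.
    """
    actual = 0.0
    expected = 0.0
    for irradiance, power, valid in samples:
        if not valid:
            continue
        actual += power
        expected += expected_power_w(irradiance, rated_power_w)
    return actual / expected if expected > 0 else None
```

Because the samples are equally spaced, the time step cancels out of the ratio, so summing power values is sufficient.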

Live Curtailment Tracking

When a plant produces less than it could, the agent attributes the foregone production to its cause: curtailment by the marketer (a deliberate market-driven cap) versus curtailment by the grid operator. This distinction matters for loss accounting and contractual reporting. Curtailment is tracked per minute and emitted as both instantaneous power and cumulative energy. See Loss Detection for how curtailment fits into overall loss attribution.
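
The step from per-minute power to cumulative energy can be sketched as follows. This assumes the cause of each curtailed minute is already known and that a cap setpoint is available; the function names, the cause keys, and the rule for when a cap counts as binding are hypothetical.

```python
def curtailed_power_w(possible_w: float, actual_w: float, setpoint_w) -> float:
    """Power forgone in one minute; zero when no cap is set or the cap is not binding."""
    if setpoint_w is None or possible_w <= setpoint_w:
        return 0.0
    return max(0.0, possible_w - actual_w)

def accumulate(minutes) -> dict:
    """Sum curtailed energy in Wh per cause.

    `minutes` is an iterable of (cause, possible_w, actual_w, setpoint_w),
    one entry per minute, with cause "marketer" or "grid_operator".
    """
    energy_wh = {"marketer": 0.0, "grid_operator": 0.0}
    for cause, possible, actual, setpoint in minutes:
        # One minute at P watts contributes P / 60 watt-hours.
        energy_wh[cause] += curtailed_power_w(possible, actual, setpoint) / 60.0
    return energy_wh
```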

Clear-Sky Baseline and Forecasting

Clear-sky baseline: a theoretical ideal-conditions PV curve maintained as a long-term record, giving you a stable reference to compare real output against.
Day-ahead forecast: a short-horizon PV production forecast derived from weather data, so you can anticipate the next day's output. These are weather-physics based, not statistical guesses.

Historical Backfill

When a plant is first connected or after a gap, the agent can backfill historical data — replaying raw readings and re-deriving the analytics above for a requested window, then handing the result to the live pipelines so charts are complete from day one rather than starting empty.

Network Monitoring

The Data Scraper includes a built-in local-network inspector that runs alongside data collection and maps the plant's on-site network. It discovers devices on the configured network ranges by sweeping for reachable hosts, reading the address table, identifying vendors from hardware addresses, and probing devices for their identity. Discovered devices are classified against a large library of known network-equipment profiles, and an AI device-identification step helps recognize device families that simple rules miss.

Once devices are known, the inspector polls them for reachability and health (response time, interface and resource status) and reports the results to the platform. You can trigger or stop a discovery scan, re-check an individual device, and review the discovered network — see Local Network Inspector.

Proxy Auditing

For plants reachable through the Mirox Browser Proxy, the agent audits human access to local device interfaces. It groups each person's activity into sessions, redacts sensitive query data before anything is stored, and can produce an AI-generated summary of what a session did. This feeds the platform's access audit trail — see Access Audit Log.
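
Session grouping and query redaction could be sketched with the standard library alone. The 15-minute idle gap, the rule of masking every query value, and the function names are assumptions for illustration; the agent's real redaction policy is not described here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def redact_query(url: str) -> str:
    """Replace every query-parameter value with a placeholder, keeping the keys."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    redacted = urlencode([(key, "REDACTED") for key, _ in pairs])
    return urlunsplit((parts.scheme, parts.netloc, parts.path, redacted, parts.fragment))

def sessions(events, idle_gap_s: int = 900) -> dict:
    """Group (user, unix_seconds, url) events into per-user sessions.

    A new session starts when a user's next event is more than `idle_gap_s`
    seconds after their previous one. URLs are redacted before being kept.
    """
    out = {}
    for user, ts, url in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = out.setdefault(user, [])
        if not user_sessions or ts - user_sessions[-1][-1][0] > idle_gap_s:
            user_sessions.append([])
        user_sessions[-1].append((ts, redact_query(url)))
    return out
```

Redaction happens before anything is appended, so unredacted URLs never reach the stored session in this sketch.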

Performance Characteristics

Typical Performance:

Polling frequencies: 1-300 seconds per adapter (configurable)
Concurrent adapters: 20+ running simultaneously
Throughput: 10,000+ metrics per minute sustained
Latency: Sub-100ms from collection to database insertion
Resource usage: 5-15% CPU, 100-500 MB memory on edge hardware

The asynchronous architecture ensures high concurrency without blocking, enabling efficient collection from many sources simultaneously.

See Also

Digital Twin: the physics-based analysis engine that consumes the metrics the Data Scraper collects
Mirox-Agent Overview: how the Data Scraper fits within the wider edge agent
Deployment Options: on-site versus cloud deployment trade-offs for the agent
Metric Collection: the full standardized metric taxonomy
Local Network Inspector: the on-site network monitoring surface
Loss Detection: how curtailment and other losses are attributed