Dark Data Ingestion · FTP to AI Pipeline

Bridge Legacy FTP Data to Vector Databases & AI Pipelines

Industrial IoT sensors, cameras, and mainframes speak FTP. Your RAG pipelines, vector stores, and Airflow DAGs speak S3 and webhooks. Rilavek is the protocol bridge that connects them — zero code, zero intermediate storage, millisecond latency.

Rilavek writes to S3 — which plugs natively into

Cloudflare R2 · Backblaze B2 · Snowflake (external stage) · Databricks (external location) · Pinecone (via Lambda) · Weaviate (via webhook) · Apache Airflow · Prefect · AWS SageMaker · LangChain / LlamaIndex · AWS Lambda

From legacy hardware to AI-ready data in three steps

No new software on your hardware. No staging servers. No polling loops. The data flows the moment a transfer completes.

01

Legacy Source Uploads

Industrial cameras, PLCs, CNC machines, or banking mainframes upload via FTP, SFTP, or FTPS — exactly as they do today. No firmware changes, no SDK to install.

02

In-Memory Protocol Bridge

Rilavek receives the stream and translates it to an S3 upload in real time. Data is never written to our disk. No gap between arrival and routing — the upload and the S3 write are the same operation.

03

AI Ecosystem Activates

The file lands in your S3 bucket and we immediately fire a signed webhook. Your Airflow DAG, Lambda function, or LangChain agent wakes up with the S3 path and begins processing.
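Because the webhook is signed, the receiving endpoint can verify that the notification really came from the bridge before acting on it. A minimal verification sketch, assuming (hypothetically) the signature is a hex HMAC-SHA256 of the raw request body — the header name, signing scheme, and payload fields here are illustrative, not documented specifics:

```python
import hmac
import hashlib

def verify_webhook(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Recompute the HMAC over the raw body and compare in constant time."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

# Illustrative payload a receiver might see after a camera upload completes:
body = b'{"bucket": "inspection-images", "key": "cam-07/frame-000123.jpg"}'
secret = "whsec_example"
sig = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
assert verify_webhook(body, sig, secret)
```

Constant-time comparison (`hmac.compare_digest`) matters here: a naive `==` leaks timing information an attacker could use to forge signatures byte by byte.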

Your AI stack is modern. Your dark data sources are not.

A factory floor produces thousands of inspection images per hour — still uploaded via FTP. A hospital imaging system exports DICOM files over SFTP. A banking mainframe drops daily batch exports the same way it has since 1998. Meanwhile your RAG pipelines, vector databases, and inference APIs all expect files to appear in S3, with a webhook to act on them instantly.

The standard fix is an ETL stack: an SFTP poller, a staging bucket, a conversion job, and a notification queue. Every component is a failure point, every hop adds latency, and the whole thing needs to be maintained. Rilavek replaces that stack with a single in-memory protocol bridge — FTP in, S3 out, webhook fired, pipeline running.

Zero-Retention Architecture

Traditional pipelines write to an intermediate staging bucket, then copy to the destination. That copy is a compliance liability for healthcare, finance, and defense. Rilavek's in-memory streaming passes bytes directly from the FTP/SFTP source to your S3 destination. We never write your data to our disk. A pipe, not a bucket.
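The "pipe, not a bucket" idea can be sketched in a few lines: relay incoming chunks to the destination while holding at most one part in memory, so the full file never exists on the bridge. This is a conceptual illustration only — the names and the `put_part` callable (standing in for an S3 multipart UploadPart call, whose real minimum part size is 5 MiB) are assumptions, not Rilavek's actual internals:

```python
from typing import Callable, Iterable

def relay(chunks: Iterable[bytes], put_part: Callable[[int, bytes], None],
          part_size: int = 5 * 1024 * 1024) -> int:
    """Forward chunks to the destination, buffering at most one part in memory."""
    buf = bytearray()
    part_no = 1
    total = 0
    for chunk in chunks:
        buf += chunk
        total += len(chunk)
        while len(buf) >= part_size:
            put_part(part_no, bytes(buf[:part_size]))  # ship a full part
            del buf[:part_size]                        # and discard it immediately
            part_no += 1
    if buf:                                            # flush the final short part
        put_part(part_no, bytes(buf))
    return total

# Demo with an in-memory "destination" and a tiny part size:
parts = {}
n = relay((b"x" * 10 for _ in range(3)),
          lambda i, d: parts.__setitem__(i, d), part_size=16)
```

Whatever the chunk sizes, memory use is bounded by one part, and nothing touches disk.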

The Dark Data Problem

Gartner estimates that over 80% of enterprise data is "dark" — generated by machines, stored on local disks or legacy FTP servers, and never reaching analytics or AI systems. Sensor readings, camera frames, machine logs, document exports — the highest-signal input for industrial AI, locked behind legacy protocols.

Real-World Pipeline Patterns

Industrial Vision → Pinecone

Source: High-speed manufacturing cameras (FTP)

Bridge: Rilavek → S3 + webhook triggers embedding Lambda

Destination: Image embeddings indexed in Pinecone for defect similarity search

SFTP Upload → Airflow DAG

Source: Partner data feeds via SFTP

Bridge: Rilavek webhook hits Airflow's REST API trigger endpoint

Outcome: DAG runs immediately on new files — no polling, no schedule lag

Mainframe Exports → RAG Pipeline

Source: Banking / healthcare legacy batch exports (SFTP)

Bridge: Rilavek → S3 bucket mounted as Snowflake external stage

Outcome: LlamaIndex reads from S3, chunks documents, feeds Weaviate for RAG

IoT Telemetry → SageMaker Training

Source: PLC sensor logs and telemetry (FTP)

Bridge: Rilavek fan-out → S3 (primary) + S3 (Backblaze replica)

Outcome: SageMaker training jobs pull from S3 — dual-bucket redundancy at no extra upload cost

The ingestion layer your AI stack is missing

Object storage was not built for high-velocity inference pipelines. These capabilities close the gap between legacy data sources and modern AI systems.

Millisecond Webhook Delivery

Webhooks fire at the transport layer the moment a file transfer completes. Trigger Airflow DAGs, Lambda functions, or Prefect flows with zero polling lag.

FTP / SFTP / FTPS → S3

Legacy devices upload via the protocol they ship with. We translate to S3-compatible API calls at the ingestion layer — no SDK, no config change on hardware.

Zero-Retention Compliance

In-memory streaming only. No intermediate disk writes, no object copies. Designed for HIPAA, PCI DSS, and financial data regulatory requirements.

Vector Database Pipelines

File lands in S3, webhook triggers your embedding Lambda, embeddings go to Pinecone, Weaviate, or Qdrant. Rilavek is the reliable ingestion trigger in that chain.

Airflow & Prefect Integration

POST the webhook directly to Airflow's DAG trigger REST endpoint or Prefect's event API. No polling scheduler, no cron job — pure event-driven orchestration.

Multi-Destination Fan-out

Route the same upload to multiple S3 buckets simultaneously — one for archival, one as a Snowflake external stage, one as a Databricks external location.

Industrial IoT at Scale

Hundreds of PLCs, sensors, or cameras upload in parallel. Sender Groups let you onboard entire device fleets under one set of credentials.

Camera & Vision Pipelines

Camera uploads route to S3 for archival and simultaneously trigger a Lambda for real-time computer vision inference — defect detection, object recognition, OCR.

Data Lake & Warehouse Ready

Write to any S3-compatible bucket. Snowflake and Databricks query your data directly from S3 via external stages — no separate ingestion pipeline required.

Common questions

Does the hardware need to be updated to work with Rilavek?

No. If the device supports FTP, FTPS, or SFTP, you change only the host IP and credentials in the device's network settings. Everything else stays the same. Rilavek handles all protocol translation on the cloud side.

How does Rilavek connect to a vector database like Pinecone?

Rilavek writes the file to your S3 bucket, then fires a signed webhook to any HTTP endpoint you configure. That webhook can hit an AWS Lambda function that reads the file from S3, generates embeddings, and upserts them into Pinecone or Weaviate. Rilavek is the reliable trigger — your embedding logic runs in Lambda.
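The chain above — webhook in, file fetched, embeddings out — can be sketched as a Lambda-style handler. The three callables stand in for boto3's `get_object`, your embedding model, and a Pinecone or Weaviate upsert; the webhook field names are illustrative assumptions:

```python
import json

def handle_webhook(event, fetch_object, embed, upsert):
    """Webhook payload -> fetch file from S3 -> embed -> upsert into a vector DB."""
    payload = json.loads(event["body"])      # field names are illustrative
    bucket, key = payload["bucket"], payload["key"]
    data = fetch_object(bucket, key)         # read the file Rilavek just wrote
    vector = embed(data)                     # e.g. an image or document embedding
    upsert(id=key, values=vector, metadata={"bucket": bucket})
    return {"statusCode": 200}

# Demo with stubbed dependencies, capturing the upsert:
stored = {}
event = {"body": json.dumps({"bucket": "frames", "key": "cam-07/000123.jpg"})}
resp = handle_webhook(
    event,
    fetch_object=lambda b, k: b"raw image bytes",
    embed=lambda data: [0.0] * 8,
    upsert=lambda **kw: stored.update(kw),
)
```

Injecting the dependencies keeps the wiring testable; in production you would pass real boto3, model, and vector-client calls instead of the stubs.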

Can I use this to trigger an Airflow DAG when a file arrives?

Yes. Configure the webhook URL to point at Airflow's REST API trigger endpoint (POST /api/v1/dags/{dag_id}/dagRuns). Rilavek POSTs a signed JSON payload with the S3 path and file metadata the moment transfer completes, giving Airflow everything it needs to start the DAG immediately without polling.
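For reference, a request to that dagRuns endpoint looks like the sketch below. Airflow's stable REST API accepts a JSON body whose `conf` object is handed to the DAG run, so passing the webhook payload through as `conf` gives every task the S3 path. The host, DAG id, and payload fields are hypothetical, and a real call would add authentication before `urlopen(req)`:

```python
import json
import urllib.request

def build_dag_trigger(airflow_base: str, dag_id: str,
                      webhook_payload: dict) -> urllib.request.Request:
    """Build the POST Airflow's stable REST API expects for a new DAG run."""
    url = f"{airflow_base}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": webhook_payload}).encode()
    return urllib.request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"},
                                  method="POST")

# Illustrative values; the request is built but not sent here:
req = build_dag_trigger("http://airflow.internal:8080", "ingest_partner_feed",
                        {"bucket": "partner-feeds", "key": "acme/2024-06-01.csv"})
```

Inside the DAG, tasks read the payload back via `dag_run.conf`, so no task needs to list or poll the bucket to find the new file.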

Can I route the same file to multiple AI destinations simultaneously?

Yes. Fan-out is a core feature. Route the same upload to multiple S3 buckets simultaneously — one for long-term archival, one as a Snowflake external stage, one as a Databricks external location. The webhook fires once for each completed upload, so downstream systems are notified independently.

How do I connect to LangChain or LlamaIndex?

Both LangChain and LlamaIndex have native S3 loaders (S3DirectoryLoader, S3Reader). Once Rilavek writes the file to S3 and fires the webhook, your agent application reads directly from S3 using the provided path. No custom ingestion code needed.

Does Rilavek store my data?

No. Rilavek uses in-memory streaming — data passes through our infrastructure directly from the FTP/SFTP source to your S3 destination without any intermediate disk writes. We never retain a copy of your files.

Is this compliant with HIPAA or financial data regulations?

The zero-retention architecture (no intermediate disk writes) and per-sender identity isolation are specifically designed to support HIPAA and financial compliance requirements. We recommend verifying specific obligations with your compliance team.

What destinations are supported?

Any S3-compatible storage: AWS S3, Cloudflare R2, Backblaze B2, Wasabi, and MinIO. Snowflake and Databricks connect to S3 natively via external stages or external locations — Rilavek writes the file to S3 and they query it directly. Vector databases and workflow tools (Pinecone, Weaviate, Airflow, Prefect) connect via the webhook. Source protocols: FTP, FTPS, SFTP, and HTTP/TUS.

Stop leaving dark data out of your AI training set.

Connect legacy FTP and SFTP infrastructure to vector databases, Airflow DAGs, and LLM pipelines in minutes. No middleware, no polling, no data at rest.

Free plan includes 10GB of transfer. No credit card required.