NginxPulse LogoNginxPulse Docs

Log Parsing

Log Parsing

Flow

  1. Initial scan: parse recent window after startup.
  2. Incremental scan: periodic scan by system.taskInterval.
  3. Backfill: fill older logs in background.
  4. IP geo backfill: resolve IP locations asynchronously.

Incremental scan & state

  • State file: var/nginxpulse_data/nginx_scan_state.json
  • If current size < last size, the file is treated as rotated and re-parsed.
  • Site ID is derived from websites[].name. Renaming creates a new site.

Batch size

  • system.parseBatchSize controls batch size (default 100).
  • Can be overridden by LOG_PARSE_BATCH_SIZE.

Progress & ETA

Endpoint: GET /api/status

  • log_parsing_progress
  • log_parsing_estimated_remaining_seconds
  • ip_geo_progress
  • ip_geo_estimated_remaining_seconds

Poll this endpoint to update progress in UI.

10G+ log optimization

  • Parsing writes core fields first; IP geo is queued.
  • IP geo is resolved in batches after parsing.
  • For speed: increase parseBatchSize, use faster disk, or split logs by day.

IIS default rule (W3C Extended)

NginxPulse now supports logType=iis (alias: iis-w3c). The built-in parser follows the common IIS W3C default field order:

date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken

Notes:

  • Metadata lines starting with # (such as #Software, #Version, #Fields) are skipped automatically.
  • URL is built from cs-uri-stem; when cs-uri-query is not -, it is appended as path?query.
  • IIS W3C timestamps are typically UTC, and the default time layout is 2006-01-02 15:04:05.

Config example:

{
  "name": "iis-site",
  "logPath": "/var/log/iis/u_ex*.log",
  "logType": "iis"
}

Sample line:

2026-02-08 10:05:34 10.0.0.10 GET /index.html a=1&b=2 443 - 203.0.113.8 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64) https://example.com/ 200 0 0 36

Retention

  • system.logRetentionDays controls cleanup.
  • Cleanup runs at 02:00 (system timezone).

Mounting Multiple Log Files

WEBSITES is a JSON array, each item describes one site. logPath must be a container-accessible path.

Example:

environment:
  WEBSITES: '[{"name":"Site 1","logPath":"/share/logs/nginx/access-site1.log","domains":["www.kaisir.cn","kaisir.cn"]}, {"name":"Site 2","logPath":"/share/logs/nginx/access-site2.log","domains":["home.kaisir.cn"]}]'
volumes:
  - ./nginx_data/logs/site1/access.log:/share/logs/nginx/access-site1.log:ro
  - ./nginx_data/logs/site2/access.log:/share/logs/nginx/access-site2.log:ro

If you have many sites, consider mounting the entire log directory and specify exact files in WEBSITES:

environment:
  WEBSITES: '[{"name":"Site 1","logPath":"/share/logs/nginx/access-site1.log","domains":["www.kaisir.cn","kaisir.cn"]}, {"name":"Site 2","logPath":"/share/logs/nginx/access-site2.log","domains":["home.kaisir.cn"]}]'
volumes:
  - ./nginx_data/logs:/share/logs/nginx/

Tip: If logs are rotated daily, use * to replace the date, e.g. {"logPath":"/share/logs/nginx/site1.top-*.log"}.

Compressed logs (.gz)

.gz logs are supported. logPath can point to a single .gz file or a glob:

{"logPath": "/share/logs/nginx/access-*.log.gz"}

There is a gzip sample in var/log/gz-log-read-test/.

Remote Log Sources (sources)

When logs are not convenient to mount locally, you can use sources instead of logPath. Once sources is set, logPath is ignored.

sources is a JSON array. Each item defines a log source. This design allows:

  1. Multiple sources per site (multiple machines/directories/buckets).
  2. Different parsing/auth/polling strategies per source.
  3. Easy extension for rotation/archival without changing old sources.

Common fields:

  • id: unique source ID (recommend globally unique).
  • type: local / sftp / http / s3 / agent.
  • mode:
    • poll: periodic pulling (default).
    • stream: streaming input only (currently Push Agent only).
    • hybrid: stream + polling fallback (only Push Agent streams; others still use poll).
  • pollInterval: polling interval (e.g. 5s).
  • pattern: rotation glob (SFTP/Local/S3 use glob; HTTP uses index JSON).
  • compression: auto / gz / none.
  • parse: override parsing (see “Parsing Override”).

stream mode is mainly for Push Agent; other sources still run as poll.

Option 1: HTTP Exposed Logs

Best when you can provide HTTP access to log files (internal network or with auth).

Method A: Expose files via Nginx/Apache (lock it down to avoid leakage)

location /logs/ {
  alias /var/log/nginx/;
  autoindex on;
  # Add basic auth / IP allowlist
}

Then configure sources:

{
  "id": "http-main",
  "type": "http",
  "mode": "poll",
  "url": "https://logs.example.com/logs/access.log",
  "rangePolicy": "auto",
  "pollInterval": "10s"
}

rangePolicy:

  • auto: prefer Range; fallback to full download (skips already-read bytes).
  • range: force Range; error if not supported.
  • full: always download full file.

Method B: JSON index API
Good for rotated logs (daily/hourly) or .gz archives:

{
  "index": {
    "url": "https://logs.example.com/index.json",
    "jsonMap": {
      "items": "items",
      "path": "path",
      "size": "size",
      "mtime": "mtime",
      "etag": "etag",
      "compressed": "compressed"
    }
  }
}

Recommended index contract:

  1. Return a JSON with an array of log objects.
  2. Each item must include path (a fetchable URL).
  3. Provide size / mtime / etag to detect changes and avoid duplicates.
  4. mtime supports RFC3339 / RFC3339Nano / 2006-01-02 15:04:05 / Unix seconds.

Example response:

{
  "items": [
    {
      "path": "https://logs.example.com/access-2024-11-03.log.gz",
      "size": 123456,
      "mtime": "2024-11-03T13:00:00Z",
      "etag": "abc123",
      "compressed": true
    },
    {
      "path": "https://logs.example.com/access.log",
      "size": 98765,
      "mtime": 1730638800,
      "etag": "def456",
      "compressed": false
    }
  ]
}

If your fields differ, map them in jsonMap:

{
  "index": {
    "url": "https://logs.example.com/index.json",
    "jsonMap": {
      "items": "data",
      "path": "url",
      "size": "length",
      "mtime": "updated_at",
      "etag": "hash",
      "compressed": "gz"
    }
  }
}

Notes:

  • path must be a directly accessible log URL.
  • For .gz, provide stable etag / size / mtime to avoid duplicate parsing.
  • If HTTP Range is not supported, use auto or full.

Option 2: SFTP Pull

Ideal when SSH/SFTP access is available, no extra HTTP service needed.

{
  "id": "sftp-main",
  "type": "sftp",
  "mode": "poll",
  "host": "1.2.3.4",
  "port": 22,
  "user": "nginx",
  "auth": { "keyFile": "/secrets/id_rsa", "passphrase": "", "password": "" },
  "path": "/var/log/nginx/access.log",
  "pattern": "/var/log/nginx/access-*.log.gz",
  "pollInterval": "5s"
}

auth supports keyFile, passphrase (private key passphrase), and password.

SFTP key-based login walkthrough (local -> remote)

  1. Generate a dedicated key pair on your local machine (recommended: ed25519):
ssh-keygen -t ed25519 -a 100 -f ~/.ssh/nginxpulse_sftp -C "nginxpulse-sftp"
  1. Install the public key on the remote user:
ssh-copy-id -i ~/.ssh/nginxpulse_sftp.pub <user>@<host>

If ssh-copy-id is unavailable:

cat ~/.ssh/nginxpulse_sftp.pub | ssh <user>@<host> \
'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
  1. Ensure remote permissions are correct:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
  1. Verify SSH key login from local (public key only):
ssh -i ~/.ssh/nginxpulse_sftp -o PreferredAuthentications=publickey <user>@<host>
  1. Verify SFTP key login:
sftp -i ~/.ssh/nginxpulse_sftp <user>@<host>
  1. After verification, configure sources:
{
  "id": "sftp-main",
  "type": "sftp",
  "host": "<host>",
  "port": 22,
  "user": "<user>",
  "auth": {
    "keyFile": "/absolute/path/to/nginxpulse_sftp",
    "passphrase": ""
  },
  "path": "/var/log/nginx/access.log"
}

keyFile must be an absolute path accessible on the machine (or container) running NginxPulse.

  1. If login still fails, use verbose SSH output first:
ssh -vvv -i ~/.ssh/nginxpulse_sftp -o PreferredAuthentications=publickey <user>@<host>

On Alpine, a common SSH log check is:

grep sshd /var/log/messages | tail -n 80

Option 3: Object Storage (S3/OSS)

Best when logs are archived to OSS/S3 (Aliyun/Tencent/AWS compatible endpoints).

{
  "id": "s3-main",
  "type": "s3",
  "mode": "poll",
  "endpoint": "https://oss-cn-hangzhou.aliyuncs.com",
  "bucket": "nginx-logs",
  "prefix": "prod/access/",
  "pollInterval": "30s"
}

Parsing Override (sources[].parse)

If formats differ across sources, override parsing per source:

{
  "parse": {
    "logType": "nginx",
    "logRegex": "^(?P<ip>\\S+) - (?P<user>\\S+) \\[(?P<time>[^\\]]+)\\] \"(?P<request>[^\"]+)\" (?P<status>\\d+) (?P<bytes>\\d+) \"(?P<referer>[^\"]*)\" \"(?P<ua>[^\"]*)\"$",
    "timeLayout": "02/Jan/2006:15:04:05 -0700"
  }
}

Push Agent (Realtime)

Designed for internal networks or edge nodes. Logs are pushed in real time.

You need to set up two machines:

Parsing server (runs NginxPulse)

  1. Start nginxpulse (ensure backend :8089 is reachable).
  2. Recommend enabling access keys: ACCESS_KEYS (or system.accessKeys).
  3. Get websiteID: call GET /api/websites.
  4. If you need a custom format for the agent, add a type=agent source for parse override:
{
  "name": "Main Site",
  "sources": [
    {
      "id": "agent-main",
      "type": "agent",
      "parse": {
        "logFormat": "$remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\""
      }
    }
  ]
}

Log server (stores logs)

  1. Prepare the agent (build or use prebuilt).

Build:

go build -o bin/nginxpulse-agent ./cmd/nginxpulse-agent

Prebuilt binaries:

  • prebuilt/nginxpulse-agent-darwin-arm64
  • prebuilt/nginxpulse-agent-linux-amd64
  1. Create agent config on the log server (fill in parsing server and websiteID).
    • Fetch websiteID from the parsing server: curl http://<nginxpulse-server>:8089/api/websites The id field is the websiteID.
{
  "server": "http://<nginxpulse-server>:8089",
  "accessKey": "your-key",
  "websiteID": "abcd",
  "sourceID": "agent-main",
  "paths": ["/var/log/nginx/access.log"],
  "pollInterval": "1s",
  "batchSize": 200,
  "flushInterval": "2s"
}
  1. Run the agent:
./bin/nginxpulse-agent -config configs/nginxpulse_agent.json

Notes:

  • The log server must reach http://<nginxpulse-server>:8089/api/ingest/logs.
  • To override parsing, set a type=agent source with id=sourceID and fill parse.
  • The agent skips .gz files; if a log file shrinks (rotation), it restarts from the beginning.

Notes

  • If reparse happens on restart, make sure no stale process is running.
  • Globs may match more files than expected.
  • Gzip logs are parsed as full files based on metadata.