Log Parsing
Flow
- Initial scan: parses a recent window of logs after startup.
- Incremental scan: periodic scan driven by `system.taskInterval`.
- Backfill: fills older logs in the background.
- IP geo backfill: resolves IP locations asynchronously.
Incremental scan & state
- State file: `var/nginxpulse_data/nginx_scan_state.json`
- If the current file size is smaller than the last recorded size, the file is treated as rotated and re-parsed from the beginning.
- Site ID is derived from `websites[].name`; renaming a site therefore creates a new site.
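The size-based rotation check above can be sketched as follows. The state-file shape shown here is a simplified assumption for illustration, not the exact schema NginxPulse writes:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// scanState is a simplified stand-in for a per-file entry in
// nginx_scan_state.json; the real schema may differ.
type scanState struct {
	Offset   int64 `json:"offset"`    // bytes already parsed
	LastSize int64 `json:"last_size"` // file size at the last scan
}

// nextOffset decides where the next incremental scan should start:
// if the file shrank, it was rotated, so re-parse from the beginning.
func nextOffset(st scanState, currentSize int64) int64 {
	if currentSize < st.LastSize {
		return 0 // rotated: start over
	}
	return st.Offset // continue where we left off
}

func main() {
	st := scanState{Offset: 4096, LastSize: 8192}

	// Normal growth: continue from the saved offset.
	fmt.Println(nextOffset(st, 16384)) // 4096

	// File shrank below the recorded size: treat as rotated.
	fmt.Println(nextOffset(st, 100)) // 0

	// The state round-trips through JSON, like the on-disk state file.
	b, _ := json.Marshal(st)
	_ = os.WriteFile(filepath.Join(os.TempDir(), "nginx_scan_state.json"), b, 0o644)
}
```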
Batch size
- `system.parseBatchSize` controls the batch size (default 100).
- Can be overridden by `LOG_PARSE_BATCH_SIZE`.
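For example, in a compose file (assuming `LOG_PARSE_BATCH_SIZE` is read from the environment, as its name suggests):

```yaml
services:
  nginxpulse:
    environment:
      # Overrides system.parseBatchSize (default 100)
      LOG_PARSE_BATCH_SIZE: "500"
```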
Progress & ETA
Endpoint: `GET /api/status`

Fields:
- `log_parsing_progress`
- `log_parsing_estimated_remaining_seconds`
- `ip_geo_progress`
- `ip_geo_estimated_remaining_seconds`

Poll this endpoint to update the progress display in the UI.
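A minimal sketch of consuming these fields. The exact JSON shape of `/api/status` is an assumption here (numeric progress values plus remaining-seconds fields); the real response likely contains more:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// status models only the progress fields documented above.
type status struct {
	LogParsingProgress      float64 `json:"log_parsing_progress"`
	LogParsingRemainingSecs float64 `json:"log_parsing_estimated_remaining_seconds"`
	IPGeoProgress           float64 `json:"ip_geo_progress"`
	IPGeoRemainingSecs      float64 `json:"ip_geo_estimated_remaining_seconds"`
}

func decodeStatus(body []byte) (status, error) {
	var st status
	err := json.Unmarshal(body, &st)
	return st, err
}

func main() {
	// In a real UI you would GET /api/status periodically; here we
	// decode a hypothetical response body.
	body := []byte(`{"log_parsing_progress": 42.5,
		"log_parsing_estimated_remaining_seconds": 120,
		"ip_geo_progress": 10,
		"ip_geo_estimated_remaining_seconds": 900}`)
	st, err := decodeStatus(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("parsing %.1f%%, ~%.0fs left\n", st.LogParsingProgress, st.LogParsingRemainingSecs)
}
```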
10G+ log optimization
- Parsing writes core fields first; IP geo is queued.
- IP geo is resolved in batches after parsing.
- For speed: increase `parseBatchSize`, use a faster disk, or split logs by day.
IIS default rule (W3C Extended)
NginxPulse now supports `logType=iis` (alias: `iis-w3c`). The built-in parser follows the common IIS W3C default field order:

```text
date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
```
Notes:
- Metadata lines starting with `#` (such as `#Software`, `#Version`, `#Fields`) are skipped automatically.
- The URL is built from `cs-uri-stem`; when `cs-uri-query` is not `-`, it is appended as `path?query`.
- IIS W3C timestamps are typically UTC, and the default time layout is `2006-01-02 15:04:05`.
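The rules above can be sketched as a simple splitter (this is an illustration, not NginxPulse's actual parser; it assumes purely space-delimited fields, which holds for the default IIS configuration because IIS encodes spaces inside fields as `+`):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// iisEntry holds a few fields from the default W3C field order.
type iisEntry struct {
	Time   time.Time
	Method string
	URL    string
	IP     string
	Status string
}

// parseIISLine splits a default-order W3C line. Field positions follow:
// date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username
// c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
func parseIISLine(line string) (iisEntry, error) {
	f := strings.Fields(line)
	if len(f) < 15 {
		return iisEntry{}, fmt.Errorf("expected 15 fields, got %d", len(f))
	}
	// IIS W3C timestamps are typically UTC.
	t, err := time.Parse("2006-01-02 15:04:05", f[0]+" "+f[1])
	if err != nil {
		return iisEntry{}, err
	}
	url := f[4]
	if f[5] != "-" { // append the query as path?query when present
		url += "?" + f[5]
	}
	return iisEntry{Time: t, Method: f[3], URL: url, IP: f[8], Status: f[11]}, nil
}

func main() {
	line := "2026-02-08 10:05:34 10.0.0.10 GET /index.html a=1&b=2 443 - 203.0.113.8 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64) https://example.com/ 200 0 0 36"
	e, err := parseIISLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Method, e.URL, e.IP, e.Status) // GET /index.html?a=1&b=2 203.0.113.8 200
}
```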
Config example:
```json
{
  "name": "iis-site",
  "logPath": "/var/log/iis/u_ex*.log",
  "logType": "iis"
}
```

Sample line:
```text
2026-02-08 10:05:34 10.0.0.10 GET /index.html a=1&b=2 443 - 203.0.113.8 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64) https://example.com/ 200 0 0 36
```

Retention
- `system.logRetentionDays` controls cleanup.
- Cleanup runs at 02:00 (system timezone).
Mounting Multiple Log Files
`WEBSITES` is a JSON array; each item describes one site. `logPath` must be a path accessible inside the container.
Example:

```yaml
environment:
  WEBSITES: '[{"name":"Site 1","logPath":"/share/logs/nginx/access-site1.log","domains":["www.kaisir.cn","kaisir.cn"]}, {"name":"Site 2","logPath":"/share/logs/nginx/access-site2.log","domains":["home.kaisir.cn"]}]'
volumes:
  - ./nginx_data/logs/site1/access.log:/share/logs/nginx/access-site1.log:ro
  - ./nginx_data/logs/site2/access.log:/share/logs/nginx/access-site2.log:ro
```

If you have many sites, consider mounting the entire log directory and specifying the exact files in `WEBSITES`:
```yaml
environment:
  WEBSITES: '[{"name":"Site 1","logPath":"/share/logs/nginx/access-site1.log","domains":["www.kaisir.cn","kaisir.cn"]}, {"name":"Site 2","logPath":"/share/logs/nginx/access-site2.log","domains":["home.kaisir.cn"]}]'
volumes:
  - ./nginx_data/logs:/share/logs/nginx/
```

Tip: if logs are rotated daily, use `*` in place of the date, e.g. `{"logPath":"/share/logs/nginx/site1.top-*.log"}`.
Compressed logs (.gz)
`.gz` logs are supported. `logPath` can point to a single `.gz` file or a glob:

```json
{"logPath": "/share/logs/nginx/access-*.log.gz"}
```

There is a gzip sample in `var/log/gz-log-read-test/`.
Remote Log Sources (sources)
When logs are not convenient to mount locally, you can use `sources` instead of `logPath`. Once `sources` is set, `logPath` is ignored.
`sources` is a JSON array; each item defines a log source. This design allows:
- Multiple sources per site (multiple machines/directories/buckets).
- Different parsing/auth/polling strategies per source.
- Easy extension for rotation/archival without changing old sources.
Common fields:
- `id`: unique source ID (a globally unique value is recommended).
- `type`: `local` / `sftp` / `http` / `s3` / `agent`.
- `mode`:
  - `poll`: periodic pulling (default).
  - `stream`: streaming input only (currently Push Agent only).
  - `hybrid`: stream plus polling fallback (only the Push Agent streams; other types still use `poll`).
- `pollInterval`: polling interval (e.g. `5s`).
- `pattern`: rotation glob (SFTP/Local/S3 use a glob; HTTP uses the index JSON).
- `compression`: `auto` / `gz` / `none`.
- `parse`: per-source parsing override (see "Parsing Override").

`stream` mode is mainly for the Push Agent; other source types still run as `poll`.
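Putting the fields together, a site that pulls from two machines might look like this (illustrative host names and IDs only):

```json
{
  "name": "Main Site",
  "sources": [
    {
      "id": "web-1",
      "type": "sftp",
      "mode": "poll",
      "host": "10.0.0.11",
      "user": "nginx",
      "auth": { "keyFile": "/secrets/id_rsa" },
      "path": "/var/log/nginx/access.log",
      "pattern": "/var/log/nginx/access-*.log.gz",
      "pollInterval": "5s",
      "compression": "auto"
    },
    {
      "id": "web-2",
      "type": "http",
      "mode": "poll",
      "url": "https://logs.example.com/logs/access.log",
      "pollInterval": "10s"
    }
  ]
}
```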
Option 1: HTTP Exposed Logs
Best when you can provide HTTP access to log files (internal network or with auth).
Method A: Expose files via Nginx/Apache (lock it down to avoid leakage)
```nginx
location /logs/ {
    alias /var/log/nginx/;
    autoindex on;
    # Add basic auth / an IP allowlist here
}
```

Then configure `sources`:
```json
{
  "id": "http-main",
  "type": "http",
  "mode": "poll",
  "url": "https://logs.example.com/logs/access.log",
  "rangePolicy": "auto",
  "pollInterval": "10s"
}
```

`rangePolicy`:
- `auto`: prefer Range requests; fall back to a full download (skipping already-read bytes).
- `range`: force Range requests; fail with an error if the server does not support them.
- `full`: always download the full file.
Method B: JSON index API
Good for rotated logs (daily/hourly) or .gz archives:
```json
{
  "index": {
    "url": "https://logs.example.com/index.json",
    "jsonMap": {
      "items": "items",
      "path": "path",
      "size": "size",
      "mtime": "mtime",
      "etag": "etag",
      "compressed": "compressed"
    }
  }
}
```

Recommended index contract:
- Return a JSON document containing an array of log objects.
- Each item must include `path` (a fetchable URL).
- Provide `size` / `mtime` / `etag` so changes can be detected and duplicates avoided.
- `mtime` supports RFC3339, RFC3339Nano, `2006-01-02 15:04:05`, and Unix seconds.
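The listed `mtime` formats can be tried in order; a sketch of such a lenient parser (an illustration, not the exact implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// parseMtime accepts the formats the index contract allows:
// RFC3339, RFC3339Nano, "2006-01-02 15:04:05", or Unix seconds.
func parseMtime(v string) (time.Time, error) {
	for _, layout := range []string{
		time.RFC3339Nano, // Go's parser also accepts plain RFC3339 here
		"2006-01-02 15:04:05",
	} {
		if t, err := time.Parse(layout, v); err == nil {
			return t, nil
		}
	}
	if secs, err := strconv.ParseInt(v, 10, 64); err == nil {
		return time.Unix(secs, 0).UTC(), nil
	}
	return time.Time{}, fmt.Errorf("unrecognized mtime: %q", v)
}

func main() {
	for _, v := range []string{
		"2024-11-03T13:00:00Z",
		"2024-11-03 13:00:00",
		"1730638800",
	} {
		t, err := parseMtime(v)
		if err != nil {
			panic(err)
		}
		fmt.Println(t.UTC().Format(time.RFC3339))
	}
}
```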
Example response:
```json
{
  "items": [
    {
      "path": "https://logs.example.com/access-2024-11-03.log.gz",
      "size": 123456,
      "mtime": "2024-11-03T13:00:00Z",
      "etag": "abc123",
      "compressed": true
    },
    {
      "path": "https://logs.example.com/access.log",
      "size": 98765,
      "mtime": 1730638800,
      "etag": "def456",
      "compressed": false
    }
  ]
}
```

If your fields differ, map them in `jsonMap`:
```json
{
  "index": {
    "url": "https://logs.example.com/index.json",
    "jsonMap": {
      "items": "data",
      "path": "url",
      "size": "length",
      "mtime": "updated_at",
      "etag": "hash",
      "compressed": "gz"
    }
  }
}
```

Notes:
- `path` must be a directly accessible log URL.
- For `.gz` files, provide a stable `etag` / `size` / `mtime` to avoid duplicate parsing.
- If HTTP Range is not supported, use `auto` or `full`.
Option 2: SFTP Pull
Ideal when SSH/SFTP access is available, no extra HTTP service needed.
```json
{
  "id": "sftp-main",
  "type": "sftp",
  "mode": "poll",
  "host": "1.2.3.4",
  "port": 22,
  "user": "nginx",
  "auth": { "keyFile": "/secrets/id_rsa", "passphrase": "", "password": "" },
  "path": "/var/log/nginx/access.log",
  "pattern": "/var/log/nginx/access-*.log.gz",
  "pollInterval": "5s"
}
```

`auth` supports `keyFile`, `passphrase` (private key passphrase), and `password`.
SFTP key-based login walkthrough (local -> remote)
- Generate a dedicated key pair on your local machine (`ed25519` recommended):

```shell
ssh-keygen -t ed25519 -a 100 -f ~/.ssh/nginxpulse_sftp -C "nginxpulse-sftp"
```

- Install the public key on the remote user:

```shell
ssh-copy-id -i ~/.ssh/nginxpulse_sftp.pub <user>@<host>
```

If `ssh-copy-id` is unavailable:

```shell
cat ~/.ssh/nginxpulse_sftp.pub | ssh <user>@<host> \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
```

- Ensure remote permissions are correct:

```shell
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```

- Verify SSH key login from local (public key only):

```shell
ssh -i ~/.ssh/nginxpulse_sftp -o PreferredAuthentications=publickey <user>@<host>
```

- Verify SFTP key login:

```shell
sftp -i ~/.ssh/nginxpulse_sftp <user>@<host>
```

- After verification, configure `sources`:
```json
{
  "id": "sftp-main",
  "type": "sftp",
  "host": "<host>",
  "port": 22,
  "user": "<user>",
  "auth": {
    "keyFile": "/absolute/path/to/nginxpulse_sftp",
    "passphrase": ""
  },
  "path": "/var/log/nginx/access.log"
}
```

`keyFile` must be an absolute path accessible on the machine (or container) running NginxPulse.
- If login still fails, use verbose SSH output first:
```shell
ssh -vvv -i ~/.ssh/nginxpulse_sftp -o PreferredAuthentications=publickey <user>@<host>
```

On Alpine, a common SSH log check is:

```shell
grep sshd /var/log/messages | tail -n 80
```

Option 3: Object Storage (S3/OSS)
Best when logs are archived to OSS/S3 (Aliyun/Tencent/AWS compatible endpoints).
```json
{
  "id": "s3-main",
  "type": "s3",
  "mode": "poll",
  "endpoint": "https://oss-cn-hangzhou.aliyuncs.com",
  "bucket": "nginx-logs",
  "prefix": "prod/access/",
  "pollInterval": "30s"
}
```

Parsing Override (sources[].parse)
If formats differ across sources, override parsing per source:
```json
{
  "parse": {
    "logType": "nginx",
    "logRegex": "^(?P<ip>\\S+) - (?P<user>\\S+) \\[(?P<time>[^\\]]+)\\] \"(?P<request>[^\"]+)\" (?P<status>\\d+) (?P<bytes>\\d+) \"(?P<referer>[^\"]*)\" \"(?P<ua>[^\"]*)\"$",
    "timeLayout": "02/Jan/2006:15:04:05 -0700"
  }
}
```

Push Agent (Realtime)
Designed for internal networks or edge nodes. Logs are pushed in real time.
You need to set up two machines:
Parsing server (runs NginxPulse)
- Start nginxpulse (ensure the backend on `:8089` is reachable).
- Enabling access keys is recommended: `ACCESS_KEYS` (or `system.accessKeys`).
- Get the `websiteID` by calling `GET /api/websites`.
- If you need a custom format for the agent, add a `type=agent` source for the parse override:
```json
{
  "name": "Main Site",
  "sources": [
    {
      "id": "agent-main",
      "type": "agent",
      "parse": {
        "logFormat": "$remote_addr - $remote_user [$time_local] \"$request\" $status $body_bytes_sent \"$http_referer\" \"$http_user_agent\""
      }
    }
  ]
}
```

Log server (stores logs)
- Prepare the agent (build or use prebuilt).
Build:
```shell
go build -o bin/nginxpulse-agent ./cmd/nginxpulse-agent
```

Prebuilt binaries:
- `prebuilt/nginxpulse-agent-darwin-arm64`
- `prebuilt/nginxpulse-agent-linux-amd64`
- Create the agent config on the log server (fill in the parsing server address and `websiteID`).
- Fetch `websiteID` from the parsing server:

```shell
curl http://<nginxpulse-server>:8089/api/websites
```

The `id` field is the `websiteID`.

Example agent config:
```json
{
  "server": "http://<nginxpulse-server>:8089",
  "accessKey": "your-key",
  "websiteID": "abcd",
  "sourceID": "agent-main",
  "paths": ["/var/log/nginx/access.log"],
  "pollInterval": "1s",
  "batchSize": 200,
  "flushInterval": "2s"
}
```

- Run the agent:
```shell
./bin/nginxpulse-agent -config configs/nginxpulse_agent.json
```

Notes:
- The log server must be able to reach `http://<nginxpulse-server>:8089/api/ingest/logs`.
- To override parsing, add a `type=agent` source with `id=sourceID` and fill in `parse`.
- The agent skips `.gz` files; if a log file shrinks (rotation), it restarts from the beginning.
Notes
- If a full reparse happens on restart, make sure no stale process is still running.
- Globs may match more files than expected.
- Gzip logs are parsed as full files based on metadata.