Tracing & Logging

Overview

The system implements distributed tracing and structured logging to provide end-to-end visibility across async operations. Every request generates a unique trace that flows through the API, orchestration, and platform-specific publish functions.

Implementation:

Logging: apps/functions/utils/logger.js
Trace context: Passed through function calls and Cloud Tasks

Distributed Tracing

Core Concepts

Trace: A complete request flow from API entry to all platform publishes

Identified by unique traceId (UUID)
May span multiple Cloud Functions executions
Crosses async boundaries via Cloud Tasks

Span: A single operation within a trace

Identified by unique spanId (UUID)
Has a parent span (except root span)
Tracks operation start/end and status
Contains operation-specific metadata

Flow: A business process that may generate multiple traces

Identified by flowId (often postSubmissionId)
Groups related operations across time
Useful for multi-step workflows

Trace Hierarchy

traceId: "550e8400-e29b-41d4-a716-446655440000"
├─ spanId: "root" (depth: 0)
│  └─ API Request: POST /v1/posts
│     ├─ spanId: "orchestrator-span" (depth: 1)
│     │  └─ Orchestration: Dispatch tasks
│     │     ├─ spanId: "instagram-span" (depth: 2)
│     │     │  └─ Publish to Instagram
│     │     ├─ spanId: "threads-span" (depth: 2)
│     │     │  └─ Publish to Threads
│     │     └─ spanId: "x-span" (depth: 2)
│     │        └─ Publish to X

Trace Context Propagation

1. API Layer (apps/functions/api.js)

// Generate root trace context
const traceId = uuidv4();
const spanId = 'root';
const depth = 0;

req.traceContext = { traceId, spanId, depth };

2. Orchestrator (apps/functions/orchestrators/publishing-service.js)

// Create child span for orchestration
const orchestratorSpanId = uuidv4();
const orchestratorContext = {
  traceId: req.traceContext.traceId,
  spanId: orchestratorSpanId,
  parentSpanId: req.traceContext.spanId,
  depth: req.traceContext.depth + 1
};

3. Platform Publish Tasks

// Each task receives trace context in payload
const taskPayload = {
  userId,
  postSubmissionId,
  accountId,
  traceContext: {
    traceId: orchestratorContext.traceId,
    spanId: uuidv4(), // New span for this platform
    parentSpanId: orchestratorContext.spanId,
    depth: orchestratorContext.depth + 1
  }
};

Trace Context Fields

Field	Type	Description	Example
`traceId`	UUID	Unique trace identifier	`"550e8400-e29b-41d4-a716-446655440000"`
`spanId`	UUID	Unique span identifier	`"7c9e6679-7425-40de-944b-e07fc1f90ae7"`
`parentSpanId`	UUID	Parent span (for hierarchy)	`"root"`
`depth`	number	Nesting level in trace tree	`0` (root), `1` (orchestrator), `2` (platform)
`flowId`	string	Business flow identifier	`postSubmissionId`

Structured Logging

Log Collection

All logs are written to the logs Firestore collection for queryability and persistence.

Collection: logs/{eventId}

Document structure:

{
  // Identity
  eventId: string,           // Unique log entry ID (UUID)
  timestamp: Timestamp,      // When this log was created

  // Trace context
  traceId: string | null,    // Distributed trace ID
  spanId: string | null,     // Current span ID
  parentSpanId: string | null, // Parent span ID (for hierarchy)
  depth: number,             // Span nesting depth
  flowId: string | null,     // Business flow ID (e.g., postSubmissionId)

  // Log metadata
  level: string,             // "debug" | "info" | "warn" | "error"
  message: string,           // Human-readable message
  spanStatus: string,        // "ok" | "error"

  // Time-series optimization
  bucketHour: string,        // Format: "YYYY-MM-DD-HH" (e.g., "2025-01-15-10")

  // Context
  userId: string | null,     // User who initiated the operation
  functionName: string,      // Cloud Function name

  // Additional data
  ...metadata               // Operation-specific fields (platform, accountId, etc.)
}

Log Levels

Level	Usage	Example
`debug`	Detailed troubleshooting info	"Normalized video URL: https://..."
`info`	Normal operation events	"Successfully dispatched to Instagram"
`warn`	Potentially problematic events	"Using default account, multiple exist"
`error`	Errors and failures	"Failed to publish to X: Token expired"

Logger API

import logger from './utils/logger.js';

// Log with trace context
logger.info({
  message: 'Orchestration started',
  traceContext: { traceId, spanId, depth },
  flowId: postSubmissionId,
  userId: req.user.uid,
  metadata: {
    platforms: ['instagram', 'threads'],
    accountIds: { instagram: 'acc_123', threads: 'acc_456' }
  }
});

// Log error with span status
logger.error({
  message: 'Failed to publish to Instagram',
  traceContext: { traceId, spanId, depth },
  flowId: postSubmissionId,
  spanStatus: 'error',
  error: err.message,
  metadata: {
    platform: 'instagram',
    accountId: 'acc_123',
    errorCode: 'invalid_token'
  }
});

Time-Series Bucketing

To enable efficient time-series queries, all logs include a bucketHour field:

// Format: "YYYY-MM-DD-HH"
bucketHour: "2025-01-15-10"

Benefits:

Query logs by hour without full table scan
Composite index: (bucketHour ASC, timestamp ASC)
Enables fast time-range queries
Simplifies retention policies (delete old buckets)

Querying Logs

By Trace ID (Full Request Flow)

// Get all operations for a specific request
const logs = await db.collection('logs')
  .where('traceId', '==', traceId)
  .orderBy('timestamp', 'asc')
  .get();

// Logs will include:
// - API request received
// - Orchestration started
// - Tasks dispatched
// - Each platform publish attempt

By Flow ID (Business Process)

// Get all operations for a post submission
const logs = await db.collection('logs')
  .where('flowId', '==', postSubmissionId)
  .orderBy('timestamp', 'asc')
  .get();

// Useful for debugging a specific post across retries

By Time Range (Time-Series)

// Get all logs for a specific hour
const logs = await db.collection('logs')
  .where('bucketHour', '==', '2025-01-15-10')
  .orderBy('timestamp', 'asc')
  .get();

// Get logs for multiple hours
const logs = await db.collection('logs')
  .where('bucketHour', '>=', '2025-01-15-10')
  .where('bucketHour', '<=', '2025-01-15-12')
  .orderBy('bucketHour', 'asc')
  .orderBy('timestamp', 'asc')
  .get();

By User (User-Specific Logs)

// Get all operations initiated by a user
const logs = await db.collection('logs')
  .where('userId', '==', userId)
  .where('bucketHour', '>=', startBucket)
  .orderBy('bucketHour', 'asc')
  .orderBy('timestamp', 'asc')
  .limit(100)
  .get();

Failed Operations

// Find all errors in the last hour
const errors = await db.collection('logs')
  .where('bucketHour', '==', currentBucket)
  .where('level', '==', 'error')
  .orderBy('timestamp', 'desc')
  .get();

// Find failed spans
const failedSpans = await db.collection('logs')
  .where('bucketHour', '==', currentBucket)
  .where('spanStatus', '==', 'error')
  .orderBy('timestamp', 'desc')
  .get();

Composite Indexes

Required Firestore indexes for efficient log queries:

{
  "collectionGroup": "logs",
  "queryScope": "COLLECTION",
  "fields": [
    { "fieldPath": "traceId", "order": "ASCENDING" },
    { "fieldPath": "timestamp", "order": "ASCENDING" }
  ]
},
{
  "collectionGroup": "logs",
  "queryScope": "COLLECTION",
  "fields": [
    { "fieldPath": "flowId", "order": "ASCENDING" },
    { "fieldPath": "timestamp", "order": "ASCENDING" }
  ]
},
{
  "collectionGroup": "logs",
  "queryScope": "COLLECTION",
  "fields": [
    { "fieldPath": "bucketHour", "order": "ASCENDING" },
    { "fieldPath": "timestamp", "order": "ASCENDING" }
  ]
},
{
  "collectionGroup": "logs",
  "queryScope": "COLLECTION",
  "fields": [
    { "fieldPath": "userId", "order": "ASCENDING" },
    { "fieldPath": "bucketHour", "order": "ASCENDING" },
    { "fieldPath": "timestamp", "order": "ASCENDING" }
  ]
},
{
  "collectionGroup": "logs",
  "queryScope": "COLLECTION",
  "fields": [
    { "fieldPath": "bucketHour", "order": "ASCENDING" },
    { "fieldPath": "level", "order": "ASCENDING" },
    { "fieldPath": "timestamp", "order": "DESCENDING" }
  ]
}

See: firestore.indexes.json

Error Logs

High-severity errors are also written to a dedicated errorLogs collection for faster querying and alerting.

Collection: errorLogs/{id}

Document structure:

{
  id: string,                // Auto-generated
  timestamp: Timestamp,      // When error occurred
  level: 'error',           // Always "error"
  message: string,          // Error message
  code: string,             // Error code (e.g., "invalid_token")

  // Context
  userId: string | null,
  functionName: string,
  endpoint: string | null,  // API endpoint if applicable

  // Trace context
  traceId: string | null,
  spanId: string | null,
  flowId: string | null,

  // Error details
  stack: string | null,     // Stack trace
  metadata: object          // Additional error context
}

Usage:

// Automatically logged by error handling middleware
// Also manually logged for critical errors
logger.error({
  message: 'Failed to publish to Instagram',
  code: 'instagram_api_error',
  traceContext,
  flowId: postSubmissionId,
  error: err,
  metadata: { platform: 'instagram', statusCode: 400 }
});

Observability Dashboard Queries

Request Success Rate

// Calculate success rate for last hour
const bucket = getCurrentBucket(); // e.g., "2025-01-15-10"

const total = await db.collection('logs')
  .where('bucketHour', '==', bucket)
  .where('functionName', '==', 'api')
  .count()
  .get();

const errors = await db.collection('logs')
  .where('bucketHour', '==', bucket)
  .where('functionName', '==', 'api')
  .where('level', '==', 'error')
  .count()
  .get();

const successRate = ((total - errors) / total) * 100;

Platform Performance

// Average success rate per platform
const platforms = ['instagram', 'threads', 'x', 'tiktok', 'youtube', 'linkedin'];

for (const platform of platforms) {
  const logs = await db.collection('logs')
    .where('bucketHour', '==', bucket)
    .where('metadata.platform', '==', platform)
    .get();

  const total = logs.size;
  const failures = logs.docs.filter(doc => doc.data().level === 'error').length;
  const successRate = ((total - failures) / total) * 100;

  console.log(`${platform}: ${successRate.toFixed(2)}% success`);
}

Slowest Operations

// Find operations with high duration (stored in metadata)
const slowOps = await db.collection('logs')
  .where('bucketHour', '==', bucket)
  .where('metadata.durationMs', '>', 5000) // >5 seconds
  .orderBy('metadata.durationMs', 'desc')
  .limit(10)
  .get();

Error Frequency by Code

// Group errors by error code
const errors = await db.collection('errorLogs')
  .where('timestamp', '>=', hourAgo)
  .get();

const errorCounts = {};
errors.forEach(doc => {
  const code = doc.data().code || 'unknown';
  errorCounts[code] = (errorCounts[code] || 0) + 1;
});

// Output: { "invalid_token": 5, "rate_limit_exceeded": 12, ... }

Log Retention

Strategy

Logs are retained based on time and importance:

Collection	Retention	Reason
`logs`	30 days	Debugging recent issues, normal operations
`errorLogs`	90 days	Longer retention for error pattern analysis
`users/{uid}/apiRequests`	30 days	API usage tracking per user

Cleanup Process

Option 1: Scheduled Cloud Function

// Run daily to delete old logs
export const cleanupLogs = onSchedule('0 2 * * *', async () => {
  const cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - 30);

  const oldLogs = await db.collection('logs')
    .where('timestamp', '<', cutoff)
    .limit(500)
    .get();

  const batch = db.batch();
  oldLogs.forEach(doc => batch.delete(doc.ref));
  await batch.commit();
});

Option 2: TTL Policy (Firestore TTL)

// Set TTL field on log creation
{
  ...logData,
  expireAt: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000) // 30 days
}

// Enable Firestore TTL policy for automatic deletion
// gcloud firestore fields ttls update expireAt \
//   --collection-group=logs \
//   --enable-ttl

Best Practices

1. Always Include Trace Context

// ✅ Good: Include trace context
logger.info({
  message: 'Dispatching task',
  traceContext: { traceId, spanId, depth },
  flowId: postSubmissionId
});

// ❌ Bad: Missing trace context
logger.info({ message: 'Dispatching task' });

2. Use Appropriate Log Levels

// ✅ Good: Error for failures
logger.error({ message: 'API call failed', error: err });

// ❌ Bad: Info for errors
logger.info({ message: 'Something went wrong' });

3. Structure Metadata

// ✅ Good: Structured metadata
logger.info({
  message: 'Post published',
  metadata: {
    platform: 'instagram',
    accountId: 'acc_123',
    postId: 'post_456'
  }
});

// ❌ Bad: Unstructured string
logger.info({ message: 'Post published to instagram acc_123 as post_456' });

4. Set Span Status

// ✅ Good: Mark span as error on failure
logger.error({
  message: 'Publish failed',
  spanStatus: 'error',
  traceContext
});

// Mark span as ok on success
logger.info({
  message: 'Publish succeeded',
  spanStatus: 'ok',
  traceContext
});

Visualization

Trace Timeline Example

Time: 0ms ──────────────────────────────────────> 5000ms

[API Request] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  traceId: 550e8400, spanId: root, depth: 0

           [Orchestrator] ━━━━━━━━━━━━━━━━━━━━━━━
             traceId: 550e8400, spanId: abc123, depth: 1

                      [Instagram] ━━━━━━━━━━━━━
                      spanId: def456, depth: 2

                      [Threads] ━━━━━━━━━━━
                      spanId: ghi789, depth: 2

                      [X] ━━━━━━━━━━━━━━━━━━━━━
                      spanId: jkl012, depth: 2

Orchestration - How tracing flows through orchestration
Public API - API endpoints that generate traces
Rate Limiting - Rate limit tracking logs

Overview​

Distributed Tracing​

Core Concepts​

Trace Hierarchy​

Trace Context Propagation​

Trace Context Fields​

Structured Logging​

Log Collection​

Log Levels​

Logger API​

Time-Series Bucketing​

Querying Logs​

By Trace ID (Full Request Flow)​

By Flow ID (Business Process)​

By Time Range (Time-Series)​

By User (User-Specific Logs)​

Failed Operations​

Composite Indexes​

Error Logs​

Observability Dashboard Queries​

Request Success Rate​

Platform Performance​

Slowest Operations​

Error Frequency by Code​

Log Retention​

Strategy​

Cleanup Process​

Best Practices​

1. Always Include Trace Context​

2. Use Appropriate Log Levels​

3. Structure Metadata​

4. Set Span Status​

Visualization​

Trace Timeline Example​

Related Documentation​