Composite Spatial Indexes — PostGIS + Python

Q: When should I use a true composite GiST index instead of a covering INCLUDE index?

Use a true composite GiST index (with btree_gist) when your WHERE clause consistently filters on both the geometry column and the scalar columns — the planner can prune branches in the GiST tree using all indexed predicates. Use an INCLUDE covering index when scalar filters are applied post-spatially and you only want to eliminate heap fetches for projection columns, not tighten the index scan itself.

Q: Why does the planner ignore my composite spatial index and choose a sequential scan?

The most common causes are stale planner statistics (run ANALYZE), an SRID mismatch that forces a cast, or a search radius so large that the planner estimates more than ~5% of rows will match. Lower random_page_cost to 1.1 on SSDs and run ANALYZE after bulk loads to keep statistics fresh.

Composite spatial indexes sit at the intersection of PostGIS geometry filtering and relational scalar predicates — a technique covered in depth under Advanced GiST Indexing & Optimization. They address a concrete production bottleneck: queries that narrow by a spatial bounding box and a categorical or temporal column. When designed correctly, a single index satisfies both predicates, the planner bypasses the heap for projection columns, and query latency drops by an order of magnitude. This page walks through the full workflow: extension setup, column-order analysis, index construction, EXPLAIN plan validation, Python integration, and ongoing maintenance.

Prerequisites and Infrastructure Validation

Before building a composite spatial index, confirm the required extensions are present and your Python toolchain is wired correctly.

Required PostgreSQL extensions:

sql

-- Verify extensions are installed
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('postgis', 'btree_gist');

-- Install if missing
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- mandatory for scalar columns in GiST

Required Python packages:

psycopg2-binary>=2.9      # or psycopg[binary]>=3.1
sqlalchemy>=2.0
geoalchemy2>=0.14
shapely>=2.0

Verify GiST support and SRID consistency on your target table:

sql

-- Check geometry column SRID and type
SELECT f_table_name, f_geometry_column, type, srid
FROM geometry_columns
WHERE f_table_name = 'sensor_events';

-- Confirm no existing conflicting indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'sensor_events'
  AND indexdef ILIKE '%gist%';

An SRID of 0 in geometry_columns means the column accepts mixed SRIDs — a common source of planner failures. Enforce SRID at the column level before building the composite index:

sql

ALTER TABLE sensor_events
  ALTER COLUMN geom TYPE geometry(Point, 4326)
  USING ST_SetSRID(geom::geometry, 4326);

Core Execution Workflow

Step 1 — Analyse Query Patterns and Determine Column Order

Composite indexes are highly sensitive to column ordering. PostgreSQL’s GiST tree traversal evaluates columns left-to-right: the leading column drives branch pruning. For spatial workloads, the geometry column belongs in position one when spatial predicates appear in the majority of your queries. Secondary columns should be high-cardinality filters that consistently accompany the spatial bound.

Examine your most expensive queries via pg_stat_statements:

sql

SELECT query, calls, total_exec_time / calls AS avg_ms, rows
FROM pg_stat_statements
WHERE query ILIKE '%sensor_events%'
  AND query ILIKE '%ST_%'
ORDER BY total_exec_time DESC
LIMIT 10;

A typical multi-predicate telemetry query that benefits from a composite index:

sql

SELECT id, payload, recorded_at
FROM sensor_events
WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,
        5000          -- metres
      )
  AND event_type = 'temperature'
  AND status = 'active';

The natural column order here is (geom, event_type, status). The spatial bounding box eliminates the most candidates first; event_type then further narrows the result set before status applies.

If ingestion time frequently accompanies the spatial filter, see Composite Indexes for Geometry and Timestamp Columns for how tstzrange operators interact with GiST tree traversal and why naively appending a timestamptz column to a composite GiST index can backfire.

If the scalar filter (status, for example) is highly selective and often queried without a spatial bound, a partial GiST index on the geometry column with a WHERE status = 'active' predicate is a better fit — it keeps the index small and avoids planner confusion on scalar-only queries.

Step 2 — Construct the Composite Index

btree_gist provides GiST-compatible operator classes for standard PostgreSQL types (text, integer, timestamptz, enums, etc.), enabling them to participate in the same GiST tree as PostGIS geometry. Without btree_gist, only the geometry column can live in a GiST index.

True composite GiST index (geometry + scalar predicates):

sql

-- Requires btree_gist extension (installed above)
CREATE INDEX CONCURRENTLY idx_sensor_events_geom_type_status
ON sensor_events
USING GIST (geom, event_type, status);

The CONCURRENTLY option builds the index without locking writes — essential on live tables. It takes longer and requires a unique index name, but is the only safe choice in production.

Covering index with INCLUDE (PostgreSQL 12+):

sql

-- Filters on geom only; event_type and status are stored in leaf pages for projection
CREATE INDEX CONCURRENTLY idx_sensor_events_geom_cover
ON sensor_events
USING GIST (geom)
INCLUDE (event_type, status);

The INCLUDE columns do not participate in GiST tree pruning; they exist solely to eliminate heap fetches for frequently selected columns. For index-only scan eligibility details, see Index-Only Scan Strategies.

Decision matrix:

Scenario	Recommended approach
WHERE filters on geom and scalar columns	True composite GiST (`btree_gist`)
WHERE filters on geom only; scalar columns only projected	Covering `INCLUDE` index
Scalar column is queried independently of spatial bounds	Separate B-tree or partial GiST
Temporal range queries alongside spatial	See geometry + timestamp index guide

Step 3 — Validate Planner Behaviour with EXPLAIN ANALYZE

Creating the index is necessary but not sufficient. Confirm the planner selects it and that I/O drops measurably. Always run ANALYZE before inspecting plans on freshly indexed tables:

sql

ANALYZE sensor_events;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, payload
FROM sensor_events
WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,
        5000
      )
  AND event_type = 'temperature';

Expected output for a composite index hit:

Index Scan using idx_sensor_events_geom_type_status on sensor_events
  (cost=0.28..18.43 rows=12 width=96)
  (actual time=0.412..1.087 rows=9 loops=1)
  Index Cond: ((geom && ...) AND (event_type = 'temperature'::text))
  Buffers: shared hit=14 read=2
Planning Time: 0.318 ms
Execution Time: 1.124 ms

Key signals to read in the output — covered in depth at Reading EXPLAIN ANALYZE Output for Spatial Joins:

Index Scan using idx_sensor_events_geom_type_status — the composite index is selected. If you see Seq Scan instead, check for SRID mismatches (a cast forces the planner off the index) and stale statistics.
Rows Removed by Filter — high values mean the index passes too many candidates to the recheck. The geometry column or scalar columns may need higher cardinality, or the search radius is too wide relative to the data distribution.
Buffers: shared read — should be low after the first execution (cache warming). Persistently high read values indicate index bloat; run REINDEX CONCURRENTLY.
Heap Fetches: 0 — achievable with the INCLUDE covering index pattern when all projected columns are included. A true composite GiST index will still fetch heap pages for non-indexed projection columns.

Step 4 — Integrate with Python Query Workflows

Parameterised queries are non-negotiable: they prevent SQL injection, allow the planner to cache execution plans, and ensure geometry is constructed server-side rather than serialised through Python string formatting.

psycopg2 implementation:

python

import psycopg2
from psycopg2.extras import RealDictCursor

DSN = "host=localhost dbname=spatial_db user=app_user password=secret"

def fetch_nearby_events(lon: float, lat: float, radius_m: float, event_type: str) -> list[dict]:
    with psycopg2.connect(DSN) as conn:
        with conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute(
                """
                SELECT id, payload, recorded_at
                FROM sensor_events
                WHERE ST_DWithin(
                        geom::geography,
                        ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                        %s
                      )
                  AND event_type = %s
                  AND status = 'active'
                ORDER BY geom <-> ST_SetSRID(ST_MakePoint(%s, %s), 4326)
                LIMIT 100;
                """,
                (lon, lat, radius_m, event_type, lon, lat),
            )
            return cur.fetchall()

The <-> KNN operator in the ORDER BY clause leverages the same GiST index for distance ordering — see KNN nearest-neighbour queries for the interaction between ST_DWithin pre-filtering and <-> ordering on the same index. Avoid constructing WKT or WKB geometry strings in Python; pass raw coordinates and let PostGIS handle type construction.

SQLAlchemy 2.0 + GeoAlchemy2 implementation:

python

from sqlalchemy import create_engine, select, text
from sqlalchemy.orm import Session, DeclarativeBase, Mapped, mapped_column
from geoalchemy2 import Geometry
from geoalchemy2.functions import ST_DWithin, ST_SetSRID, ST_MakePoint, ST_GeogFromWKB

ENGINE = create_engine(
    "postgresql+psycopg2://app_user:secret@localhost/spatial_db",
    pool_pre_ping=True,
    pool_size=10,
)

class Base(DeclarativeBase):
    pass

class SensorEvent(Base):
    __tablename__ = "sensor_events"
    id: Mapped[int]
    payload: Mapped[str]
    recorded_at: Mapped[str]
    event_type: Mapped[str]
    status: Mapped[str]
    geom: Mapped[Geometry] = mapped_column(Geometry("POINT", srid=4326))

def fetch_events_orm(lon: float, lat: float, radius_m: float, event_type: str) -> list:
    origin = ST_SetSRID(ST_MakePoint(lon, lat), 4326)
    with Session(ENGINE) as session:
        stmt = (
            select(SensorEvent.id, SensorEvent.payload, SensorEvent.recorded_at)
            .where(
                ST_DWithin(
                    SensorEvent.geom.cast(type_=Geometry("POINT", srid=4326)),
                    origin,
                    radius_m,
                    True,   # use_spheroid=True — operates on geography for metre-accurate distance
                ),
                SensorEvent.event_type == event_type,
                SensorEvent.status == "active",
            )
            .limit(100)
        )
        return session.execute(stmt).fetchall()

For GeoAlchemy2 column mapping patterns including hybrid properties and WKB deserialisation, see the SQLAlchemy & GeoAlchemy2 integration workflows section. The use_spheroid=True flag in GeoAlchemy2’s ST_DWithin wrapper routes through the geography cast, which is the same path that triggers your composite index when the column is stored as geometry(Point, 4326).

Inspect the generated SQL during development to confirm the parameterised geometry is correct:

python

from sqlalchemy.dialects import postgresql
print(stmt.compile(dialect=postgresql.dialect(), compile_kwargs={"literal_binds": True}))

Step 5 — Handle Result Sets and Avoid N+1 Patterns

For high-volume spatial queries, avoid fetchall() on unbounded result sets. Use a server-side cursor to stream rows:

python

def stream_nearby_events(lon: float, lat: float, radius_m: float) -> None:
    with psycopg2.connect(DSN) as conn:
        with conn.cursor(name="spatial_stream", cursor_factory=RealDictCursor) as cur:
            cur.itersize = 500   # fetch 500 rows per network round-trip
            cur.execute(
                """
                SELECT id, payload, geom
                FROM sensor_events
                WHERE ST_DWithin(
                        geom::geography,
                        ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                        %s
                      )
                  AND status = 'active';
                """,
                (lon, lat, radius_m),
            )
            for row in cur:
                process(row)   # handle each batch without loading into memory

Named cursors in psycopg2 are server-side; itersize controls the FETCH batch size. This pattern is essential when ST_DWithin with a large radius can match tens of thousands of rows.

Performance Considerations

EXPLAIN Plan Walkthrough

A well-tuned composite spatial query should show this execution shape:

Index Scan using idx_sensor_events_geom_type_status on sensor_events
  (cost=0.28..22.5 rows=15 width=104)
  (actual time=0.38..1.21 rows=11 loops=1)
  Index Cond: ((geom && '...'::geometry) AND (event_type = 'temperature'))
  Filter: (status = 'active')
  Rows Removed by Filter: 2
  Buffers: shared hit=18

The Filter line after Index Cond shows status applied as a post-index recheck rather than an index predicate. This is expected when status has low cardinality (e.g. two values). The recheck is cheap because the row is already in the buffer. If Rows Removed by Filter is high, add status to the index column list or create a partial GiST index filtered to WHERE status = 'active'.

Relevant GUC Settings

sql

-- On SSD storage, lower random_page_cost to match sequential cost
ALTER SYSTEM SET random_page_cost = 1.1;

-- Increase effective_cache_size to reflect OS page cache
ALTER SYSTEM SET effective_cache_size = '12GB';   -- ~75% of RAM

-- During index builds, allow more memory per operation
ALTER SYSTEM SET maintenance_work_mem = '512MB';

SELECT pg_reload_conf();

After adjusting GUCs, re-run EXPLAIN (ANALYZE, BUFFERS) to confirm the planner now picks the index on a fresh connection (planner picks up GUC changes per connection).

Statistics Freshness

Spatial data is rarely uniformly distributed. The planner’s selectivity estimates depend on pg_statistic histograms computed by ANALYZE. After bulk loads or high-velocity inserts, stale statistics cause the planner to over- or under-estimate match counts:

sql

-- Force statistics refresh on the spatial table
ANALYZE sensor_events;

-- Check when statistics were last collected
SELECT schemaname, relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'sensor_events';

Lower the autovacuum scale factor for high-write spatial tables to keep statistics current:

sql

ALTER TABLE sensor_events SET (
  autovacuum_analyze_scale_factor = 0.01,   -- trigger at 1% changed rows (default 20%)
  autovacuum_vacuum_scale_factor  = 0.02
);

Common Failure Modes and Fixes

SRID Mismatch Forces a Cast

Symptom: Planner selects Seq Scan even with the composite index present.

Diagnosis:

sql

EXPLAIN SELECT id FROM sensor_events
WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326), 5000);
-- Check: does EXPLAIN show "SRID mismatch" or a function-wrapped column?

Fix: Ensure query geometry SRID matches the indexed column SRID exactly. If the column is geometry(Point, 4326), the query parameter must be ST_SetSRID(..., 4326) — not 0 or a different CRS. Any implicit ST_Transform call in the WHERE clause wraps the column in a function and makes the index inaccessible.

Missing `btree_gist` Extension

Symptom: ERROR: data type text has no default operator class for access method "gist" when creating the composite index.

Fix:

sql

CREATE EXTENSION IF NOT EXISTS btree_gist;
-- Then re-run the CREATE INDEX statement

Planner Chooses Sequential Scan on Small Tables

Symptom: On tables with fewer than ~5,000 rows, the planner may prefer a sequential scan even with fresh statistics. This is correct behaviour — the sequential scan is genuinely faster on small tables.

Workaround for forced testing:

sql

SET enable_seqscan = off;
EXPLAIN (ANALYZE) SELECT ... ;
SET enable_seqscan = on;   -- always restore

Do not leave enable_seqscan = off in production. Use it only to confirm index correctness during development.

OOM on Large Unbounded Result Sets

Symptom: Python process runs out of memory on wide-radius ST_DWithin queries returning hundreds of thousands of rows.

Fix: Switch from cursor.fetchall() to a named server-side cursor with itersize as shown in Step 5. Always pair ST_DWithin with a LIMIT clause in read APIs. Consider bounding-box pre-filtering with the && operator as a cheaper upstream filter before the exact ST_DWithin call.

Index Bloat from High-Velocity Inserts

Symptom: pg_relation_size(indexrelid) grows faster than the table; query performance degrades over days of ingestion.

Diagnosis:

sql

SELECT
    indexrelname,
    pg_relation_size(indexrelid)                         AS index_bytes,
    pg_relation_size(indrelid)                           AS table_bytes,
    round(100.0 * pg_relation_size(indexrelid)
          / nullif(pg_relation_size(indrelid), 0), 1)   AS index_pct
FROM pg_stat_user_indexes
WHERE indexrelname LIKE 'idx_sensor_events_%';

When index_pct exceeds 60–80%, schedule a rebuild:

sql

REINDEX INDEX CONCURRENTLY idx_sensor_events_geom_type_status;

Verification

After completing the above steps, confirm the implementation is working correctly:

sql

-- 1. Confirm index exists and is valid
SELECT indexname, indexdef, indisvalid
FROM pg_indexes
JOIN pg_index ON pg_indexes.indexname = (
    SELECT relname FROM pg_class WHERE oid = pg_index.indexrelid
)
WHERE tablename = 'sensor_events'
  AND indexdef ILIKE '%gist%';

-- 2. Confirm the index is being used (scan count > 0 after workload)
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'sensor_events';

-- 3. Spot-check row count matches expectations
SELECT count(*)
FROM sensor_events
WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,
        5000
      )
  AND event_type = 'temperature'
  AND status = 'active';

-- 4. Time the query (should be <10 ms on a properly indexed table of 1M rows)
\timing
SELECT id FROM sensor_events
WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,
        5000
      )
  AND event_type = 'temperature'
  AND status = 'active'
LIMIT 50;

A healthy result shows idx_scan incrementing, EXPLAIN reporting Index Scan (not Seq Scan), and execution time in low single-digit milliseconds on SSD-backed storage.

FAQ

When should I use a true composite GiST index instead of a covering INCLUDE index?

Use a true composite GiST index (with btree_gist) when your WHERE clause consistently filters on both the geometry column and the scalar columns — the planner can prune branches in the GiST tree using all indexed predicates. Use an INCLUDE covering index when scalar filters are applied post-spatially and you only want to eliminate heap fetches for projection columns, not tighten the index scan itself.

Why does the planner ignore my composite spatial index and choose a sequential scan?

The most common causes are stale planner statistics (run ANALYZE), an SRID mismatch that forces a cast, or a search radius so large that the planner estimates more than ~5% of rows will match. Lower random_page_cost to 1.1 on SSDs and run ANALYZE after bulk loads to keep statistics fresh.

Does column order matter in a composite GiST index?

Yes. PostgreSQL evaluates GiST composite index columns left-to-right during tree traversal. Place the geometry column first when spatial predicates are the primary filter, then high-cardinality scalar columns. If scalar filters alone often drive queries without a spatial bound, consider a separate partial GiST index rather than relying on a composite.

Advanced GiST Indexing & Optimization — parent section covering GiST fundamentals, operator classes, and the full indexing strategy
Composite Indexes for Geometry and Timestamp Columns — deep dive on temporal range predicates in composite GiST indexes
Partial GiST Indexes — when a filtered single-column index outperforms a composite
Index-Only Scan Strategies — eliminating heap fetches with INCLUDE and visibility maps
Query Plan Analysis with EXPLAIN — interpreting EXPLAIN ANALYZE output for spatial joins and index decisions