
I have a large BigQuery table, big_table, around 5 TB in size. It is partitioned by the column partition_date, which has about 2000 distinct values.

I also have a smaller table, small_table, which contains only two distinct partition_date values and lists specific (partition_date, product_id) combinations that should be deleted from big_table.

Example:

```
-- big_table (partitioned by partition_date), includes further columns
partition_date | product_id
2023-01-01     | A123
2023-01-01     | B456
2023-01-02     | A123

-- small_table
partition_date | product_id
2023-01-01     | A123
2023-01-01     | B456
```
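
For completeness, here is roughly how the two tables are defined (a sketch with placeholder project and dataset names; big_table has additional columns that I have omitted):

```sql
-- Sketch of the setup; names are placeholders, further columns omitted
CREATE TABLE `project.dataset.big_table` (
  partition_date DATE,
  product_id STRING
  -- ... further columns ...
)
PARTITION BY partition_date;

CREATE TABLE `project.dataset.small_table` (
  partition_date DATE,
  product_id STRING
);
```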

My delete query looks like this:

```sql
DELETE FROM big_table
WHERE (partition_date, product_id) IN (
  SELECT (partition_date, product_id) FROM small_table
);
```

However, this query is much more expensive than expected — it scans a large portion of the entire table, even though small_table only contains two partition dates.
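
For reference, the bytes scanned by such statements can be checked after the fact in the INFORMATION_SCHEMA jobs view, roughly like this (the `region-us` qualifier is only an example; use your own region):

```sql
-- Inspect recent DML jobs and how many bytes each one processed
SELECT creation_time, job_id, statement_type, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND statement_type IN ('DELETE', 'MERGE')
ORDER BY creation_time DESC;
```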

I also tried these variants (a correlated EXISTS and a MERGE ... WHEN MATCHED THEN DELETE), but the cost remains almost the same:

```sql
DELETE FROM big_table b
WHERE EXISTS (
  SELECT 1
  FROM small_table s
  WHERE s.partition_date = b.partition_date
    AND s.product_id = b.product_id
);
```

and


```sql
MERGE big_table b
USING small_table s
ON b.partition_date = s.partition_date
  AND b.product_id = s.product_id
WHEN MATCHED THEN DELETE;
```

If I simply delete or recreate those two partitions directly, it’s much cheaper.
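
By that I mean something along these lines, with the two dates from the example hard-coded as literals so that the partition filter is a constant:

```sql
-- Constant partition filter: BigQuery can prune down to the two partitions
DELETE FROM big_table
WHERE partition_date IN (DATE '2023-01-01', DATE '2023-01-02');
```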

So my questions are:

1. Why doesn't BigQuery automatically prune partitions when the partition column is only constrained through a JOIN, EXISTS, or IN subquery?
2. Is there a way to make BigQuery recognize partition filters in such delete queries?
3. What's the most cost-efficient strategy for deleting specific row combinations from only a few partitions?

I understand that one possible workaround is to precompute all affected partitions in a separate step, for example:

```sql
-- Step 1: Collect affected partitions into an array
DECLARE affected_partitions ARRAY<DATE>;

SET affected_partitions = (
  SELECT ARRAY_AGG(DISTINCT partition_date)
  FROM `project.dataset.small_table`
);

-- Step 2: Use that array in the delete query
DELETE FROM `project.dataset.big_table`
WHERE partition_date IN UNNEST(affected_partitions)
  AND (partition_date, product_id) IN (
    SELECT (partition_date, product_id)
    FROM `project.dataset.small_table`
  );
```

This approach probably allows BigQuery to prune partitions efficiently, but it’s quite inconvenient, since it requires rewriting the query and managing an additional pre-step manually.
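
A variant of the same idea would be to render the affected dates as literals and build the DELETE dynamically, so the partition filter is a constant in the executed statement. This is only an untested sketch with placeholder table names, and it still requires the extra scripting step:

```sql
DECLARE affected_partitions ARRAY<DATE>;
DECLARE date_literals STRING;

SET affected_partitions = (
  SELECT ARRAY_AGG(DISTINCT partition_date)
  FROM `project.dataset.small_table`
);

-- Render the dates as SQL literals: DATE '2023-01-01', DATE '2023-01-02', ...
SET date_literals = (
  SELECT STRING_AGG(FORMAT("DATE '%t'", d), ', ')
  FROM UNNEST(affected_partitions) AS d
);

-- Build and run the DELETE with the literal partition filter inlined
EXECUTE IMMEDIATE FORMAT("""
  DELETE FROM `project.dataset.big_table`
  WHERE partition_date IN (%s)
    AND (partition_date, product_id) IN (
      SELECT (partition_date, product_id)
      FROM `project.dataset.small_table`
    )
""", date_literals);
```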

- Hello, I found an explanation stating that BigQuery partition pruning works at query planning time, not at query execution time. The optimizer must determine which partitions to scan before executing subqueries or joins. Could this be it? (commented Nov 2 at 18:58)

