Why does a DELETE with a JOIN on partitioned columns in BigQuery cost more than dropping specific partitions?

Ask Question

Asked 29 days ago

Modified 28 days ago

Viewed 47 times

Part of Google Cloud Collective

I have a large BigQuery table, big_table, around 5 TB in size. It is partitioned by the column partition_date, which has about 2000 distinct values.

I also have a smaller table, small_table, which contains only two distinct partition_date values and lists specific (partition_date, product_id) combinations that should be deleted from big_table.

Example:

-- big_table (partitioned by partition_date) incudes further columns

partition_date	product_id
2023-01-01	A123
2023-01-01	B456
2023-01-02	A123

-- small_table

partition_date	product_id
2023-01-01	A123
2023-01-01	B456

My delete query looks like this:

DELETE FROM big_table
WHERE (partition_date, product_id) IN (
  SELECT partition_date, product_id FROM small_table
);

However, this query is much more expensive than expected — it scans a large portion of the entire table, even though small_table only contains two partition dates.

I also tried these variants, but the cost remains almost the same:

`
DELETE FROM big_table b
WHERE EXISTS (
  SELECT 1
  FROM small_table s
  WHERE s.partition_date = b.partition_date
    AND s.product_id = b.product_id
);`

and


`DELETE FROM big_table b
USING small_table s
WHERE b.partition_date = s.partition_date
  AND b.product_id = s.product_id;
`

If I simply delete or recreate those two partitions directly, it’s much cheaper.

So my questions are:

Why doesn’t BigQuery automatically prune partitions when using a JOIN or EXISTS with the partition column?

Is there a way to make BigQuery recognize partition filters in such delete queries?

What’s the most cost-efficient strategy for deleting specific row combinations from only a few partitions?

I understand that one possible workaround is to precompute all affected partitions in a separate step, for example:

`-- Step 1: Collect affected partitions into an array
DECLARE affected_partitions ARRAY<DATE>;

SET affected_partitions = (
  SELECT ARRAY_AGG(DISTINCT partition_date)
  FROM `project.dataset.small_table`
);

-- Step 2: Use that array in the delete query
DELETE FROM `project.dataset.big_table`
WHERE partition_date IN UNNEST(affected_partitions)
  AND (partition_date, product_id) IN (
    SELECT partition_date, product_id
    FROM `project.dataset.small_table`
  );

This approach probably allows BigQuery to prune partitions efficiently, but it’s quite inconvenient, since it requires rewriting the query and managing an additional pre-step manually.

edited Nov 3 at 6:31

Alparslan ŞEN

6881 gold badge9 silver badges28 bronze badges

asked Nov 2 at 18:48

Alex Lipkin

211 bronze badge

Hello, I found an explanation stating that BigQuery partition pruning works at query planning time, not at query execution time. The optimizer must determine which partitions to scan before executing subqueries or joins. Could this be it?

nseaprotector
– nseaprotector

2025-11-02 18:58:25 +00:00
Commented Nov 2 at 18:58

Add a comment |

0 Your Answer

Sign up or log in

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Collectives™ on Stack Overflow

Why does a DELETE with a JOIN on partitioned columns in BigQuery cost more than dropping specific partitions?

0

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest