I have a large BigQuery table, big_table, around 5 TB in size. It is partitioned by the column partition_date, which has about 2000 distinct values.
I also have a smaller table, small_table, which contains only two distinct partition_date values and lists specific (partition_date, product_id) combinations that should be deleted from big_table.
Example:
-- big_table (partitioned by partition_date); includes further columns
| partition_date | product_id |
|---|---|
| 2023-01-01 | A123 |
| 2023-01-01 | B456 |
| 2023-01-02 | A123 |
-- small_table
| partition_date | product_id |
|---|---|
| 2023-01-01 | A123 |
| 2023-01-01 | B456 |
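For context, the tables were created roughly like this (schemas simplified to the two relevant columns; project/dataset names are placeholders):

```sql
CREATE TABLE `project.dataset.big_table` (
  partition_date DATE,
  product_id STRING
  -- ... further columns
)
PARTITION BY partition_date;

CREATE TABLE `project.dataset.small_table` (
  partition_date DATE,
  product_id STRING
);
```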
My delete query looks like this:
```sql
DELETE FROM big_table
WHERE (partition_date, product_id) IN (
  SELECT partition_date, product_id FROM small_table
);
```
However, this query is far more expensive than expected: it scans a large portion of the table, even though small_table only references two partition dates.
I also tried these variants, but the cost remains almost the same:
```sql
DELETE FROM big_table b
WHERE EXISTS (
  SELECT 1
  FROM small_table s
  WHERE s.partition_date = b.partition_date
    AND s.product_id = b.product_id
);
```
and
a MERGE-based delete (since BigQuery has no `DELETE ... USING`):

```sql
MERGE big_table b
USING small_table s
ON b.partition_date = s.partition_date
  AND b.product_id = s.product_id
WHEN MATCHED THEN DELETE;
```
If I simply delete or recreate those two partitions directly, it’s much cheaper.
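For clarity, by "deleting those partitions directly" I mean a constant filter on the partition column, which BigQuery can prune (this deletes all rows in the two partitions, not just the matching combinations):

```sql
-- Constant filter on the partition column: only two partitions are scanned
DELETE FROM `project.dataset.big_table`
WHERE partition_date IN ('2023-01-01', '2023-01-02');
```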
So my questions are:

1. Why doesn't BigQuery automatically prune partitions when the partition column is filtered via a JOIN or EXISTS?
2. Is there a way to make BigQuery recognize partition filters in such delete queries?
3. What is the most cost-efficient strategy for deleting specific row combinations from only a few partitions?
I understand that one possible workaround is to precompute all affected partitions in a separate step, for example:
```sql
-- Step 1: Collect the affected partitions into an array
DECLARE affected_partitions ARRAY<DATE>;
SET affected_partitions = (
  SELECT ARRAY_AGG(DISTINCT partition_date)
  FROM `project.dataset.small_table`
);

-- Step 2: Use that array in the delete query
DELETE FROM `project.dataset.big_table`
WHERE partition_date IN UNNEST(affected_partitions)
  AND (partition_date, product_id) IN (
    SELECT partition_date, product_id
    FROM `project.dataset.small_table`
  );
```
This approach does seem to let BigQuery prune partitions, but it is inconvenient: every such delete has to be rewritten as a multi-statement script with a manually managed pre-step.
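For comparison, the only single-statement variant I know of that reliably prunes is hardcoding the affected dates as literals, which I would like to avoid since small_table changes between runs (the two dates here are just the ones from my example):

```sql
-- Literal partition filter enables pruning; the correlated EXISTS then
-- restricts the delete to the listed (partition_date, product_id) pairs
DELETE FROM `project.dataset.big_table` b
WHERE b.partition_date IN ('2023-01-01', '2023-01-02')
  AND EXISTS (
    SELECT 1
    FROM `project.dataset.small_table` s
    WHERE s.partition_date = b.partition_date
      AND s.product_id = b.product_id
  );
```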