I have a JOBS table in Oracle 19c and a method that fetches pending jobs, sets their status to RUNNING, and then releases the row locks by committing the transaction. Basically, I have something like this:
```java
public interface JobsRepository extends JpaRepository<Job, UUID> {

    @Query(value = """
            SELECT * FROM JOBS j
            WHERE j.ID IN (
                SELECT j2.ID
                FROM JOBS j2
                WHERE j2.STATUS = 'PENDING'
                  AND j2.DUE_DATE < SYSTIMESTAMP
                ORDER BY j2.PRIORITY DESC
                FETCH FIRST :batchSize ROWS ONLY
            )
            FOR UPDATE SKIP LOCKED
            """, nativeQuery = true)
    List<Job> fetchJobsAndLock(@Param("batchSize") int batchSize);
}
```
```java
@Service
public class JobService {

    private final JobsRepository jobRepository;

    public JobService(JobsRepository jobRepository) {
        this.jobRepository = jobRepository;
    }

    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public List<JobDTO> fetchJobs(int batchSize) {
        List<Job> jobs = jobRepository.fetchJobsAndLock(batchSize);
        for (Job job : jobs) {
            job.setStatus(Job.Status.RUNNING);
        }
        jobRepository.saveAll(jobs);
        return jobs.stream().map(this::mapToDTO).toList();
    }
}
```
```java
@Service
public class RunnerService {

    private final JobService jobService;
    private final Executor executor;
    private final TaskScheduler taskScheduler;

    // constructor injection of the three dependencies omitted for brevity

    public void pollJobs() {
        List<JobDTO> jobs = jobService.fetchJobs(10);
        for (JobDTO job : jobs) {
            executor.execute(() -> handleJob(job));
        }
    }

    @PostConstruct
    void scheduleTasks() {
        taskScheduler.scheduleAtFixedRate(this::pollJobs, 5000);
    }
}
```
This code represents my basic job fetching and locking mechanism. The problem is that sometimes the same job is processed twice at the same time.
I run two application instances, and the Oracle DB server also has two primary instances. When I look at the logs, app-1 and app-2 fetched the same job within 3 ms of each other.
I don't understand how this is possible. As far as I know, the row lock is held for the lifetime of the transaction, so when app-1 locks a row (for example, JobId = 1), the other instance should skip that row.
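To make the expectation concrete, here is a minimal in-memory model of how I understand `FOR UPDATE SKIP LOCKED` to behave (purely illustrative Java, not a fix or the real database mechanism: the `lockedIds` set stands in for the row locks held inside the database):

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical model: a row one session has locked is silently
// skipped by any other session that selects with SKIP LOCKED.
public class SkipLockedModel {

    // stands in for the row locks the database holds until commit
    private final Set<Integer> lockedIds = ConcurrentHashMap.newKeySet();

    // models "SELECT ... FOR UPDATE SKIP LOCKED" with a batch limit:
    // claim rows that are free, silently skip rows already locked
    public List<Integer> fetchAndLock(List<Integer> pendingIds, int batchSize) {
        return pendingIds.stream()
                .filter(lockedIds::add) // add() returns false for locked ids
                .limit(batchSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SkipLockedModel db = new SkipLockedModel();
        List<Integer> pending = List.of(1, 2, 3, 4);
        List<Integer> app1 = db.fetchAndLock(pending, 2); // claims [1, 2]
        List<Integer> app2 = db.fetchAndLock(pending, 2); // claims [3, 4]
        System.out.println(app1 + " " + app2); // no overlap between the two
    }
}
```

Under this model, two concurrent fetches can never return the same id, which is exactly what I expected from the real query but am not seeing in production.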
What am I missing here?