Hi, I have a dataframe with a user_id, the months in which the user is active, and the lead_month (the next month in which they're active). I need to compute the "active_months" column shown in the image below, which counts how many consecutive months the user has been active. When it takes more than 1 month for the user to become active again, the count resets and starts again at 1.

I can't use groupBy on my data; I need to do this with a window function, because I have other operations to perform at the user_id level.

Can anyone help me?

[Image: table showing the expected "active_months" column]

I tried a window function with Window().partitionBy(['account_id']).orderBy('reference_month').rowsBetween(Window.unboundedPreceding, Window.currentRow), but it doesn't reset the count to 1.

Please don't post your data as images; format it as a table instead. Please also provide a small data table to play with (see minimal reproducible example, also How to Ask). Commented Feb 1, 2024 at 17:28

1 Answer

I hope my test dataframe covers all the bases:

import pyspark.sql.functions as f
from pyspark.sql.types import DateType
from pyspark.sql.window import Window

df = spark.createDataFrame([
  (1, '2021-12-01', '2022-01-01'),
  (1, '2022-01-01', '2022-02-01'),
  (1, '2022-02-01', '2022-03-01'),
  (2, '2023-01-01', '2023-03-01'),
  (2, '2023-03-01', '2023-04-01'),
  (2, '2023-04-01', '2023-07-01'),
], ['id', 'reference_month', 'lead_month'])

df = (
  df
  .select('id', f.col('reference_month').cast(DateType()), f.col('lead_month').cast(DateType()))
  .withColumn('delta_lead_months', f.months_between(f.col('lead_month'), f.col('reference_month')))
  .withColumn('active_months', f.count(f.col('reference_month')).over(Window.partitionBy('id').orderBy('reference_month').rowsBetween(Window.unboundedPreceding, Window.currentRow)))
)

df.show(truncate=False)

and the output:

+---+---------------+----------+-----------------+-------------+                
|id |reference_month|lead_month|delta_lead_months|active_months|
+---+---------------+----------+-----------------+-------------+
|1  |2021-12-01     |2022-01-01|1.0              |1            |
|1  |2022-01-01     |2022-02-01|1.0              |2            |
|1  |2022-02-01     |2022-03-01|1.0              |3            |
|2  |2023-01-01     |2023-03-01|2.0              |1            |
|2  |2023-03-01     |2023-04-01|1.0              |2            |
|2  |2023-04-01     |2023-07-01|3.0              |3            |
+---+---------------+----------+-----------------+-------------+
Comments

Hi, thanks for your response, but let's consider the case where there is only 1 account. If I change the id so that there is only 1, it doesn't work.
So what's the factor to partition the data by? Is it the year value of the lead_month?
