Is there a way to select random samples based on the distribution of a column using Spark SQL? For example, for the dataframe below, I'd like to select a total of 6 rows: about 2 rows with `prod_name = A`, 2 rows with `prod_name = B`, and 2 rows with `prod_name = C`, because each product accounts for 1/3 of the data. Note that each product doesn't always account for exactly 1/3; that's just an example. Many thanks for your help.

```
prod_name | value
-----------------
A         | 100
A         | 200
A         | 300
A         | 400
B         | 500
B         | 600
B         | 650
B         | 700
C         | 500
C         | 600
C         | 650
C         | 700
```
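
For reference, the example dataframe can be rebuilt with something like this (a minimal sketch, assuming an active `SparkSession` named `spark`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [('A', 100), ('A', 200), ('A', 300), ('A', 400),
        ('B', 500), ('B', 600), ('B', 650), ('B', 700),
        ('C', 500), ('C', 600), ('C', 650), ('C', 700)]
df = spark.createDataFrame(data, ['prod_name', 'value'])
```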

## Answer

Using `sampleBy` should do it. The required fraction for each `prod_name` can be calculated by dividing the expected number of rows by the actual number of rows. Here that is 6/12 = 0.5 for every product, so each of a product's 4 rows is kept with probability 0.5, yielding about 2 rows per product:

```python
df = ...
prods = df.select('prod_name').distinct().collect()
cnt = df.count()
expected_rows = 6  # change this number to adjust the number of rows in the result

# Using the same fraction for every prod_name preserves the original distribution.
fractions = {r['prod_name']: expected_rows / cnt for r in prods}
df.stat.sampleBy('prod_name', fractions).show()
```

Output:

```
+---------+-----+
|prod_name|value|
+---------+-----+
|        A|  200|
|        A|  400|
|        B|  500|
|        B|  700|
|        C|  500|
|        C|  700|
+---------+-----+
```

The size of the result might not exactly match `expected_rows`, because `sampleBy` decides independently for each row whether to keep it, so the row count will vary a bit from run to run.
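
Two follow-ups worth noting: `sampleBy` accepts an optional `seed` argument (e.g. `df.stat.sampleBy('prod_name', fractions, seed=42)`) if the draw needs to be reproducible, and if an exact number of rows per product is required, a window-based sketch like the one below can be used instead (`rows_per_prod` is a hypothetical target, not part of the answer above):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

rows_per_prod = 2  # hypothetical per-product target

# Order the rows of each prod_name randomly, then keep the first n per stratum.
w = Window.partitionBy('prod_name').orderBy(F.rand())
exact_sample = (df.withColumn('rn', F.row_number().over(w))
                  .filter(F.col('rn') <= rows_per_prod)
                  .drop('rn'))
exact_sample.show()
```

Unlike `sampleBy`, this always returns exactly `rows_per_prod` rows per product, provided each product has at least that many rows.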