pyspark.sql.functions.string_agg_distinct

pyspark.sql.functions.string_agg_distinct(col, delimiter=None)

Aggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter.

An alias of listagg_distinct().

New in version 4.0.0.
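
For reference, a minimal sketch (not part of the upstream examples) of the alias relationship; the output aliases s and l are illustrative, and the expected value follows from Example 2 further down:

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.string_agg_distinct('strings', ', ').alias('s'),
...           sf.listagg_distinct('strings', ', ').alias('l')).first()
Row(s='a, b, c', l='a, b, c')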

Parameters
col : Column or column name

target column to compute on.

delimiter : Column, literal string or bytes, optional

the delimiter to separate the values. The default value is None.

Returns
Column

the column for computed results.

Examples

Example 1: Using the string_agg_distinct function

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.string_agg_distinct('strings')).show()
+----------------------------------+
|string_agg(DISTINCT strings, NULL)|
+----------------------------------+
|                               abc|
+----------------------------------+

Example 2: Using the string_agg_distinct function with a delimiter

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.string_agg_distinct('strings', ', ')).show()
+--------------------------------+
|string_agg(DISTINCT strings, , )|
+--------------------------------+
|                         a, b, c|
+--------------------------------+

Example 3: Using the string_agg_distinct function with a binary column and delimiter

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(b'\x01',), (b'\x02',), (None,), (b'\x03',), (b'\x02',)],
...                            ['bytes'])
>>> df.select(sf.string_agg_distinct('bytes', b'\x42')).show()
+---------------------------------+
|string_agg(DISTINCT bytes, X'42')|
+---------------------------------+
|                 [01 42 02 42 03]|
+---------------------------------+

Example 4: Using the string_agg_distinct function on a column with all None values

>>> from pyspark.sql import functions as sf
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("strings", StringType(), True)])
>>> df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema)
>>> df.select(sf.string_agg_distinct('strings')).show()
+----------------------------------+
|string_agg(DISTINCT strings, NULL)|
+----------------------------------+
|                              NULL|
+----------------------------------+
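
Beyond the examples above, a common pattern is per-group concatenation with groupBy. The following is a supplementary sketch, not part of the upstream examples; the column names k and v and the alias vals are illustrative, and no output is shown because the order of values within each concatenation is not guaranteed.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [('x', 'a'), ('x', 'b'), ('x', 'a'), ('y', 'c')], ['k', 'v'])
>>> grouped = df.groupBy('k').agg(
...     sf.string_agg_distinct('v', '-').alias('vals'))

Here group 'x' yields the distinct values 'a' and 'b' joined by '-' (in some order), and group 'y' yields just 'c'.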