pyspark.sql.functions.string_agg_distinct

pyspark.sql.functions.string_agg_distinct(col, delimiter=None)

Aggregate function: returns the concatenation of distinct non-null input values, separated by the delimiter.

An alias of listagg_distinct().

New in version 4.0.0.
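
For reference, a minimal sketch (not part of the upstream examples) of the alias relationship; the output aliases s and l are illustrative, and the expected value follows from Example 2 further down:

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.string_agg_distinct('strings', ', ').alias('s'),
...           sf.listagg_distinct('strings', ', ').alias('l')).first()
Row(s='a, b, c', l='a, b, c')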

Parameters
col : Column or column name

target column to compute on.

delimiter : Column, literal string or bytes, optional

the delimiter to separate the values. The default value is None.

Returns
Column

the column for computed results.

Examples

Example 1: Using the string_agg_distinct function

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.string_agg_distinct('strings')).show()
+----------------------------------+
|string_agg(DISTINCT strings, NULL)|
+----------------------------------+
|                               abc|
+----------------------------------+

Example 2: Using the string_agg_distinct function with a delimiter

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([('a',), ('b',), (None,), ('c',), ('b',)], ['strings'])
>>> df.select(sf.string_agg_distinct('strings', ', ')).show()
+--------------------------------+
|string_agg(DISTINCT strings, , )|
+--------------------------------+
|                         a, b, c|
+--------------------------------+

Example 3: Using the string_agg_distinct function with a binary column and delimiter

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(b'\x01',), (b'\x02',), (None,), (b'\x03',), (b'\x02',)],
...                            ['bytes'])
>>> df.select(sf.string_agg_distinct('bytes', b'\x42')).show()
+---------------------------------+
|string_agg(DISTINCT bytes, X'42')|
+---------------------------------+
|                 [01 42 02 42 03]|
+---------------------------------+

Example 4: Using the string_agg_distinct function on a column with all None values

>>> from pyspark.sql import functions as sf
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("strings", StringType(), True)])
>>> df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema)
>>> df.select(sf.string_agg_distinct('strings')).show()
+----------------------------------+
|string_agg(DISTINCT strings, NULL)|
+----------------------------------+
|                              NULL|
+----------------------------------+
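
Beyond the examples above, a common pattern is per-group concatenation with groupBy. The following is a supplementary sketch, not part of the upstream examples; the column names k and v and the alias vals are illustrative, and no output is shown because the order of values within each concatenation is not guaranteed.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [('x', 'a'), ('x', 'b'), ('x', 'a'), ('y', 'c')], ['k', 'v'])
>>> grouped = df.groupBy('k').agg(
...     sf.string_agg_distinct('v', '-').alias('vals'))

Here group 'x' yields the distinct values 'a' and 'b' joined by '-' (in some order), and group 'y' yields just 'c'.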