Removes duplicate values from the array.
Syntax
from pyspark.sql import functions as sf
sf.array_distinct(col)
Parameters
| Parameter | Type | Description |
|---|---|---|
| `col` | `pyspark.sql.Column` or `str` | Name of column or expression |
Returns
pyspark.sql.Column: A new column that is an array of unique values from the input column.
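As a mental model only, the results shown in the examples on this page match a plain-Python dedupe that keeps the first occurrence of each value. `array_distinct_model` is a hypothetical helper, not part of PySpark, and it says nothing about how Spark implements the function or what ordering it guarantees.

```python
def array_distinct_model(values):
    """Hypothetical pure-Python model of the outputs shown in this page's examples."""
    result = []
    for value in values:
        # Keep a value only the first time it is seen.
        if value not in result:
            result.append(value)
    return result


print(array_distinct_model([1, 2, 3, 2]))
print(array_distinct_model([4, 5, 5, 4]))
print(array_distinct_model([]))
```

The list membership check keeps the sketch valid for unhashable elements such as nested lists, at the cost of quadratic time on long inputs.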
Examples
Example 1: Removing duplicate values from a simple array
from pyspark.sql import functions as sf
df = spark.createDataFrame([([1, 2, 3, 2],)], ['data'])
df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
+--------------------+
Example 2: Removing duplicate values from multiple arrays
from pyspark.sql import functions as sf
df = spark.createDataFrame([([1, 2, 3, 2],), ([4, 5, 5, 4],)], ['data'])
df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
|              [4, 5]|
+--------------------+
Example 3: Removing duplicate values from an array with all identical values
from pyspark.sql import functions as sf
df = spark.createDataFrame([([1, 1, 1],)], ['data'])
df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|                 [1]|
+--------------------+
Example 4: Removing duplicate values from an array with no duplicate values
from pyspark.sql import functions as sf
df = spark.createDataFrame([([1, 2, 3],)], ['data'])
df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|           [1, 2, 3]|
+--------------------+
Example 5: Removing duplicate values from an empty array
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField
schema = StructType([
    StructField("data", ArrayType(IntegerType()), True)
])
df = spark.createDataFrame([([],)], schema)
df.select(sf.array_distinct(df.data)).show()
+--------------------+
|array_distinct(data)|
+--------------------+
|                  []|
+--------------------+