Similar to relational database tables, a DataFrame in Spark is a dataset organized into named columns; it consists of columns and rows. When you work with multiple data sources, you may receive data with unwanted values, such as junk characters, in your Spark DataFrames. In this article, we will check methods to replace such junk values in a Spark DataFrame column.
Replace Spark DataFrame Column Value
It is a very common requirement to cleanse the source data before storing it in final Spark SQL tables. You can perform the cleansing operations either on the DataFrame or with built-in Spark SQL functions to remove junk values. The methods are the same in both PySpark and Spark Scala.
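For example, the same cleansing can be done with a built-in function directly in a Spark SQL query. The following is a minimal PySpark sketch that applies regexp_replace to an illustrative literal value rather than a real table:
# regexp_replace is also available as a built-in Spark SQL function
spark.sql("SELECT regexp_replace('aab', 'aab', '0') AS dpt_id").show()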
Following are some methods that you can use to change a DataFrame column value in Spark using Scala or PySpark.
- regexp_replace Function to Replace Spark DataFrame Column Values
- translate Function to Replace Spark DataFrame Column Values
Now, let us check these methods with an example.
Demo Data
Following is the test DataFrame that we will be using in all our examples.
PySpark DataFrame
testDF = spark.createDataFrame([(1,"aab"), (2,"aab"), (3,"222"), (4,"222"), (5,"222"), (6,"aab"), (7,"333"), (8,"444")], ["id", "dpt_id"])
+---+------+
| id|dpt_id|
+---+------+
|  1|   aab|
|  2|   aab|
|  3|   222|
|  4|   222|
|  5|   222|
|  6|   aab|
|  7|   333|
|  8|   444|
+---+------+
Spark DataFrame
val testDF = Seq((1,"aab"), (2,"aab"), (3,"222"), (4,"222"), (5,"222"), (6,"aab"), (7,"333"), (8,"444")).toDF("id", "dpt_id")
+---+------+
| id|dpt_id|
+---+------+
|  1|   aab|
|  2|   aab|
|  3|   222|
|  4|   222|
|  5|   222|
|  6|   aab|
|  7|   333|
|  8|   444|
+---+------+
You may have noticed that the test DataFrame contains an invalid value (“aab”) in the dpt_id column.
Replace Spark DataFrame Column Value using regexp_replace
One of the easiest methods that you can use to replace DataFrame column values is the regexp_replace function.
PySpark Example:
Consider the following PySpark example, which replaces “aab” with zero.
from pyspark.sql.functions import regexp_replace
newDF = testDF.withColumn('dpt_id', regexp_replace('dpt_id', 'aab', '0'))
newDF.show()
+---+------+
| id|dpt_id|
+---+------+
|  1|     0|
|  2|     0|
|  3|   222|
|  4|   222|
|  5|   222|
|  6|     0|
|  7|   333|
|  8|   444|
+---+------+
Spark Example:
Consider the following Spark example, which replaces “aab” with zero.
import org.apache.spark.sql.functions._
val newDF = testDF.withColumn("dpt_id", regexp_replace(col("dpt_id"), "aab", "0"))
newDF.show()
+---+------+
| id|dpt_id|
+---+------+
|  1|     0|
|  2|     0|
|  3|   222|
|  4|   222|
|  5|   222|
|  6|     0|
|  7|   333|
|  8|   444|
+---+------+
You may have noticed withColumn used along with the regexp_replace function. The regexp_replace function generates a new column by replacing all occurrences of “aab” with zero in our DataFrame. The withColumn function replaces the column if the column name already exists in the DataFrame; otherwise, it adds a new column.
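To illustrate that behavior, here is a minimal PySpark sketch; the is_valid column name is only an illustrative name, not part of the original example.
from pyspark.sql.functions import col, regexp_replace
# Reusing the existing column name replaces the column in place
replacedDF = testDF.withColumn('dpt_id', regexp_replace('dpt_id', 'aab', '0'))
# A new column name adds a column instead of replacing one
flaggedDF = testDF.withColumn('is_valid', col('dpt_id') != 'aab')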
Replace Spark DataFrame Column Value using Translate Function
The translate function is recommended if you are replacing individual characters within a given DataFrame column. Unlike regexp_replace, it maps characters one-to-one rather than replacing whole substrings.
PySpark Example:
Consider the following PySpark example, which maps the character “a” to “0” and removes “b”.
from pyspark.sql.functions import translate
newDF = testDF.withColumn('dpt_id', translate('dpt_id', 'aab', '0'))
newDF.show()
+---+------+
| id|dpt_id|
+---+------+
|  1|    00|
|  2|    00|
|  3|   222|
|  4|   222|
|  5|   222|
|  6|    00|
|  7|   333|
|  8|   444|
+---+------+
Spark Example:
Consider the following Spark example, which maps the character “a” to “0” and removes “b”.
import org.apache.spark.sql.functions._
val newDF = testDF.withColumn("dpt_id", translate(col("dpt_id"), "aab", "0"))
newDF.show()
+---+------+
| id|dpt_id|
+---+------+
|  1|    00|
|  2|    00|
|  3|   222|
|  4|   222|
|  5|   222|
|  6|    00|
|  7|   333|
|  8|   444|
+---+------+
Similar to the first method, we have used withColumn along with the translate function. Note, however, that translate works on individual characters rather than on whole strings: each “a” is mapped to “0”, and “b”, which has no counterpart in the replacement string, is removed, so “aab” becomes “00”. As before, withColumn replaces the column because the column name already exists in the DataFrame.
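To see the character-by-character mapping in isolation, consider this minimal PySpark sketch; the single-row DataFrame and column names are only for illustration.
from pyspark.sql.functions import translate
demoDF = spark.createDataFrame([("aab",)], ["val"])
# 'a' maps to '0'; 'b' has no counterpart in the replacement string, so it is removed
demoDF.select(translate('val', 'ab', '0').alias('cleaned')).show()  # displays "00"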
Hope this helps 🙂