How to Find String in Spark DataFrame? – Scala and PySpark

  • Post last modified:February 17, 2022
  • Post category:Apache Spark

As a data engineer, you work with many different datasets and databases. A common requirement is to clean the input data by filtering out unwanted records, or to search for a specific string within a dataset or, if you are working on Apache Spark, within a Spark DataFrame. For example, you may need to identify unwanted or junk strings in a dataset. In this article, we will look at the different methods you can use to find a string in a Spark DataFrame, using both Scala and PySpark.

How to Find a String in Spark DataFrame?

Apache Spark is an open-source unified analytics engine for large-scale data processing. Because it is an open-source project, many contributors keep adding new features to the Spark framework. Apache Spark provides several built-in API methods that you can use to search for a specific string in a Spark DataFrame.

The following sections cover some of the commonly used methods to search strings in a Spark DataFrame: contains(), like and rlike.

Demo Data

Following is the demo dataframe that we are going to use in all our examples.

Scala Demo DataFrame
import spark.implicits._  // needed for toDF; spark-shell imports it automatically
val testDF = Seq((1,"Smith Jhon"), (2,"Michael M"), (3,"Williamson"), (4,"Jack Rose"), (5,"Bob Williamson"), (6,"Rob Williamson")
).toDF("ID", "Name")
testDF.show()

+---+--------------+
| ID|          Name|
+---+--------------+
|  1|    Smith Jhon|
|  2|     Michael M|
|  3|    Williamson|
|  4|     Jack Rose|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark Demo DataFrame
testDF = spark.createDataFrame([(1,"Smith Jhon"), (2,"Michael M"), (3,"Williamson"), (4,"Jack Rose"), (5,"Bob Williamson"), (6,"Rob Williamson")], ["ID", "Name"])
testDF.show()

+---+--------------+
| ID|          Name|
+---+--------------+
|  1|    Smith Jhon|
|  2|     Michael M|
|  3|    Williamson|
|  4|     Jack Rose|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

Apache Spark Contains() Function to Find Strings in DataFrame

Similar to a substring check with the Python in operator, the Spark built-in contains() function is one of the most useful functions for searching a string in your Spark DataFrame.

You can use the contains() function in Spark and PySpark to check whether the values of a DataFrame column contain a literal string. The match is case-sensitive.

Spark Contains() Function

The following Scala example uses the contains() function to find a string.

import org.apache.spark.sql.functions.col
testDF.filter(col("Name").contains("Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark Contains() Function

The following PySpark example uses the contains() function to find a string.

from pyspark.sql.functions import col
testDF.filter(col("Name").contains("Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

Filter Spark DataFrame using like Function

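Similar to the SQL LIKE operator, the like function in Spark and PySpark matches DataFrame column values against a SQL pattern, where % matches any sequence of characters and _ matches exactly one character. The pattern must match the whole value, so "%son" selects the names that end with "son".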

Spark like Function to Find a String in DataFrame

The following Scala example uses the like function to search for a string.

import org.apache.spark.sql.functions.col
testDF.filter(col("Name").like("%son")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark like Function to Search String in DataFrame

The following PySpark example uses the like function to search for a string.

from pyspark.sql.functions import col
testDF.filter(col("Name").like("%son")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+

Filter Spark DataFrame using rlike Function

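Similar to the SQL RLIKE operator, the rlike method in Spark and PySpark searches a column using a regular expression. The expression can match anywhere in the value, so "Bob|Rob" selects the names that contain either "Bob" or "Rob". This is one of the most commonly used methods to search for a string in a Spark DataFrame.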

Spark rlike Function to Search String in DataFrame

The following Scala example uses the rlike function to search for a string.

import org.apache.spark.sql.functions.col
testDF.filter(col("Name").rlike("Bob|Rob")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark rlike Function to Search String in DataFrame

The following PySpark example uses the rlike function to search for a string.

from pyspark.sql.functions import col
testDF.filter(col("Name").rlike("Bob|Rob")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+


Hope this helps 🙂