As a data engineer, you work with many different datasets and databases. A common requirement is to clean the input data by filtering out unwanted records, or to search for a specific string within the data, or within a Spark DataFrame if you are working on Apache Spark. For example, you may need to identify unwanted or junk strings within a dataset. In this article, we will look at different methods to find a string in a Spark DataFrame, with examples in both PySpark and Scala.
How to Find a String in Spark DataFrame?
Apache Spark is an open-source unified analytics engine for large-scale data processing. Because it is an open-source project, many contributors keep adding new features to the Spark framework. Apache Spark provides several built-in API methods that you can use to search for a specific string in a Spark DataFrame.
Following are some of the commonly used methods to search strings in a Spark DataFrame.
- Apache Spark Contains() Function
- Filter Spark DataFrame using like Function
- Filter Spark DataFrame using rlike Function
Demo Data
Following is the demo DataFrame that we are going to use in all of our examples.
Scala Demo DataFrame
import spark.implicits._  // required for toDF() outside spark-shell
val testDF = Seq((1, "Smith Jhon"), (2, "Michael M"), (3, "Williamson"), (4, "Jack Rose"), (5, "Bob Williamson"), (6, "Rob Williamson")
).toDF("ID", "Name")
testDF.show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  1|    Smith Jhon|
|  2|     Michael M|
|  3|    Williamson|
|  4|     Jack Rose|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark Demo DataFrame
testDF = spark.createDataFrame([(1, "Smith Jhon"), (2, "Michael M"), (3, "Williamson"), (4, "Jack Rose"), (5, "Bob Williamson"), (6, "Rob Williamson")], ["ID", "Name"])
testDF.show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  1|    Smith Jhon|
|  2|     Michael M|
|  3|    Williamson|
|  4|     Jack Rose|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
Apache Spark Contains() Function to Find Strings in DataFrame
Similar to the Python in operator or the Java String.contains() method, the Spark built-in contains() function is one of the most useful functions for searching a string in your Spark DataFrame. You can use the contains() function in Spark and PySpark to keep the rows whose column value contains a literal string. The match is case-sensitive.
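To make the matching rule concrete, here is a minimal plain-Python sketch of what a contains() filter does to the demo names. It is only an analogue and does not call Spark: the helper name filter_contains is hypothetical, and the sketch assumes, as in Spark, that the match is a case-sensitive substring test.

```python
# Plain-Python analogue of filter(col("Name").contains(...)); no Spark involved.
rows = [(1, "Smith Jhon"), (2, "Michael M"), (3, "Williamson"),
        (4, "Jack Rose"), (5, "Bob Williamson"), (6, "Rob Williamson")]

def filter_contains(rows, literal):
    """Keep rows whose name holds `literal` as a case-sensitive substring."""
    return [(row_id, name) for row_id, name in rows if literal in name]

matches = filter_contains(rows, "Williamson")
# A different letter case is a different string, so this finds nothing.
no_matches = filter_contains(rows, "williamson")
```

The first call keeps rows 3, 5 and 6, while the lowercase search returns an empty list. If you need a case-insensitive search, one common approach is to lowercase both sides before comparing.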
Spark Contains() Function
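Following is a Spark contains() function example that finds a string. Note that col("name") resolves to the Name column because Spark column names are case-insensitive by default.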
import org.apache.spark.sql.functions.col
testDF.filter(col("name").contains("Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark Contains() Function
Following is a PySpark contains() function example that finds a string.
from pyspark.sql.functions import col
testDF.filter(col("name").contains("Williamson")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
Filter Spark DataFrame using like Function
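Similar to the SQL LIKE operator, the like function in Spark and PySpark matches the DataFrame column values against a pattern rather than a plain literal. In the pattern, % matches any sequence of characters and _ matches exactly one character, so %son selects the values that end with son.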
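SQL LIKE patterns must match the whole value, with % standing for any run of characters and _ for exactly one character. The plain-Python sketch below illustrates those rules by translating a pattern into a regular expression. It is not how Spark implements like: the helper names like_to_regex and like are hypothetical, and escape characters are deliberately left out.

```python
import re

names = ["Smith Jhon", "Michael M", "Williamson",
         "Jack Rose", "Bob Williamson", "Rob Williamson"]

def like_to_regex(pattern):
    """Translate a LIKE pattern: % -> any run of characters, _ -> one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character is a literal
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    # LIKE has to match the whole value, hence fullmatch rather than search.
    return like_to_regex(pattern).fullmatch(value) is not None

ends_with_son = [n for n in names if like(n, "%son")]
second_letter_o = [n for n in names if like(n, "_o%")]
```

Because the whole value must match, a pattern without wildcards behaves like an equality test: like("Bob Williamson", "Williamson") is False here, whereas a contains-style search for the same literal would succeed.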
Spark like Function to Find Strings in DataFrame
Following is a Spark like function example that searches for a string.
import org.apache.spark.sql.functions.col
testDF.filter(col("name").like("%son")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark like Function to Search String in DataFrame
Following is a PySpark like function example that searches for a string.
from pyspark.sql.functions import col
testDF.filter(col("name").like("%son")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  3|    Williamson|
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
Filter Spark DataFrame using rlike Function
Similar to the SQL RLIKE operator, the Spark and PySpark rlike method lets you search a string using a regular expression. The pattern uses Java regular expression syntax and may match anywhere in the value. This is one of the commonly used methods to search for a string in a Spark DataFrame.
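An rlike pattern does not have to cover the whole value: it succeeds if the regular expression matches anywhere in the string, unless you anchor it with ^ or $. The plain-Python sketch below illustrates that behaviour with the standard re module. The helper name rlike is hypothetical, and Python's regex dialect differs from Java's in some details, so treat it as an approximation rather than a drop-in equivalent.

```python
import re

names = ["Smith Jhon", "Michael M", "Williamson",
         "Jack Rose", "Bob Williamson", "Rob Williamson"]

def rlike(value, pattern):
    """True when the pattern matches anywhere in the value (unanchored search)."""
    return re.search(pattern, value) is not None

bob_or_rob = [n for n in names if rlike(n, "Bob|Rob")]   # alternation
anywhere = [n for n in names if rlike(n, "Will")]        # match at any position
anchored = [n for n in names if rlike(n, "^Will")]       # only at the start
```

The unanchored pattern Will keeps all three Williamson rows, while ^Will keeps only the value that starts with it.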
Spark rlike Function to Search String in DataFrame
Following is a Spark rlike function example that searches for a string.
import org.apache.spark.sql.functions.col
testDF.filter(col("name").rlike("Bob|Rob")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
PySpark rlike Function to Search String in DataFrame
Following is a PySpark rlike function example that searches for a string.
from pyspark.sql.functions import col
testDF.filter(col("name").rlike("Bob|Rob")).show()
+---+--------------+
| ID|          Name|
+---+--------------+
|  5|Bob Williamson|
|  6|Rob Williamson|
+---+--------------+
Hope this helps 🙂