[Spark] PySpark - Slicing strings with substring


Using PySpark, I needed to take date data stored in Spark as strings in YYYYmmddHH format and extract only the part up to the day.

To output or extract only YYYYmmdd from YYYYmmddHH, let's use the substring() function.

SELECT substring(hour, 1, 8) AS day
FROM table_name

Using the substring() function

pyspark.sql.functions.substring(str, pos, len)

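When str is a string type, it returns the substring that starts at pos and has length len. When str is a binary type, it returns the slice of the byte array that starts at byte position pos and has length len.

Note: the position (pos) is 1-based, not 0-based.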
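The 1-based position is easy to get wrong when you are used to Python slicing, which is 0-based. The sketch below is a simplified pure-Python emulation for illustration only: substring_1_based is a hypothetical helper, not part of PySpark, and it only covers pos >= 1 on ordinary strings.

```python
def substring_1_based(s: str, pos: int, length: int) -> str:
    """Hypothetical helper: mimic a 1-based (pos, len) substring on a Python str.

    Simplified on purpose: only pos >= 1 and length >= 0 are handled.
    """
    if pos < 1 or length < 0:
        raise ValueError("this sketch only handles pos >= 1 and length >= 0")
    start = pos - 1  # convert the 1-based position to a 0-based index
    return s[start:start + length]


hour = "2023071309"                  # YYYYmmddHH
day = substring_1_based(hour, 1, 8)  # same span as the Python slice hour[0:8]
print(day)  # 20230713
```

In other words, substring(col, 1, 8) covers the same characters as the Python slice [0:8].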

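source

Note that the snippet below is the PySpark source of the related function substring_index(), which cuts a string at a delimiter, rather than substring() itself.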

def substring_index(str, delim, count):
    """
    Returns the substring from string str before count occurrences of the delimiter delim.
    If count is positive, everything the left of the final delimiter (counting from left) is
    returned. If count is negative, every to the right of the final delimiter (counting from the
    right) is returned. substring_index performs a case-sensitive match when searching for delim.

    .. versionadded:: 1.5.0

    Examples
    --------
    >>> df = spark.createDataFrame([('a.b.c.d',)], ['s'])
    >>> df.select(substring_index(df.s, '.', 2).alias('s')).collect()
    [Row(s='a.b')]
    >>> df.select(substring_index(df.s, '.', -3).alias('s')).collect()
    [Row(s='b.c.d')]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.substring_index(_to_java_column(str), delim, count))

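Example

from pyspark.sql.functions import substring

# Take 2 characters starting at position 1 (positions are 1-based)
df = spark.createDataFrame([('abcd',)], ['s',])
df.select(substring(df.s, 1, 2).alias('s')).collect()

# Result: [Row(s='ab')]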

Reference

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.substring.html

License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)


tech-you
