How to Summarize Text with Python and Machine Learning

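Transformers and Bart Large CNN

Transformers is a Python framework from Hugging Face that has made advanced natural language processing use cases, such as text summarization, practical to implement.

Before Transformers and neural networks, a few options existed, but none of them were really satisfying.

Over the last few years, many good pre-trained natural language processing models based on Transformers have been created for various use cases. Bart Large CNN, released by Facebook, gives excellent results for text summarization.

Here is how to use Bart Large CNN in your Python code.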

Text Summarization in Python

The simplest way to use Bart Large CNN is to download it from the Hugging Face repository and use the text summarization pipeline from the Transformers library:

from transformers import pipeline

# Downloads the model on first use, then loads it from the local cache.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18."""

# The pipeline returns a list with one dict per input text.
summary = summarizer(article, max_length=130, min_length=30)
print(summary[0]["summary_text"])

Output:

Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.

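As you can see, this takes only a few lines of Python code, and the quality of the summary is very good! You may have noticed, however, that the model is large, so the first download takes a while.

The min_length and max_length parameters set the minimum and maximum size of the summary. They are expressed in tokens, not words. A token can be a word, but it can also be a punctuation mark or a subword. As a rule of thumb, 100 tokens correspond to roughly 75 words.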
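To make the rule of thumb concrete, here is a minimal sketch that converts between words and tokens. The helper names are hypothetical, and the 100-to-75 ratio is only the approximation quoted above; the real count depends on the tokenizer and on the text.

```python
# Rule of thumb from the text: 100 tokens are roughly 75 words.
# This is an approximation, not the tokenizer's actual count.
TOKENS = 100
WORDS = 75


def tokens_to_words(token_count: int) -> int:
    """Estimate how many words fit in a token budget (rounded down)."""
    return token_count * WORDS // TOKENS


def words_to_tokens(word_count: int) -> int:
    """Estimate how many tokens a number of words needs (rounded up)."""
    return (word_count * TOKENS + WORDS - 1) // WORDS


# The example above used max_length=130 and min_length=30.
print(tokens_to_words(130))  # 97
print(tokens_to_words(30))   # 22
```

Under this estimate, the settings used earlier ask for a summary of roughly 22 to 97 words.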

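Important note: because of an internal limit of this model, the input text cannot be longer than 1024 tokens (roughly 800 words). To summarize a longer text, a good strategy is to summarize several parts of the text independently and then reassemble the results. You can even summarize the summaries!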
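The chunking strategy can be sketched as follows. This is only an illustration under stated assumptions: the function names and the 700-word chunk size (a safety margin below the approximate 800-word limit) are hypothetical, and splitting on whitespace ignores sentence and paragraph boundaries, which a real implementation would probably want to respect. The summarizer is passed in as a plain function so the logic does not depend on any particular model.

```python
from typing import Callable, List

# Assumption: a margin below the ~800-word limit, since the words-to-tokens
# ratio is only approximate.
MAX_WORDS_PER_CHUNK = 700


def split_into_chunks(text: str, max_words: int = MAX_WORDS_PER_CHUNK) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long_text(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = MAX_WORDS_PER_CHUNK,
) -> str:
    """Summarize each chunk independently, then reassemble the results."""
    partial_summaries = [summarize(chunk) for chunk in split_into_chunks(text, max_words)]
    combined = " ".join(partial_summaries)
    # Optional second pass (a summary of the summaries), done only when there
    # were several chunks and the combined text fits in a single chunk.
    if len(partial_summaries) > 1 and len(combined.split()) <= max_words:
        return summarize(combined)
    return combined
```

With the pipeline from the earlier example, the summarize argument could be a small wrapper such as lambda t: summarizer(t, max_length=130, min_length=30)[0]["summary_text"].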

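Performance Considerations

The Bart Large CNN model has two main problems, though.

First, like most deep learning models, it requires a significant amount of disk space and RAM (around 1.5 GB!). And it can still be considered a small model compared to huge deep learning models such as GPT-3, GPT-J, or T5 11B.

More importantly, it is quite slow. The model actually performs text generation under the hood, and text generation is inherently slow. Summarizing an 800-word text takes around 20 seconds on a good CPU...

The solution is to deploy Bart Large CNN on a GPU. For example, on an NVIDIA Tesla T4 you can expect a 10x speedup, so an 800-word text is summarized in around 2 seconds.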
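A minimal sketch of loading the pipeline on a GPU when one is available, assuming PyTorch is installed as the backend. The pipeline function accepts a device argument where 0 selects the first CUDA GPU and -1 selects the CPU; the two helper names below are hypothetical.

```python
def pick_device(cuda_available: bool) -> int:
    """Return the pipeline device index: 0 = first CUDA GPU, -1 = CPU."""
    return 0 if cuda_available else -1


def build_summarizer():
    """Load Bart Large CNN on the GPU if there is one, else on the CPU."""
    # Imported lazily so this module can be loaded without the heavy
    # dependencies; calling this function requires torch and transformers.
    import torch
    from transformers import pipeline

    device = pick_device(torch.cuda.is_available())
    return pipeline("summarization", model="facebook/bart-large-cnn", device=device)
```

The object returned by build_summarizer() is used exactly like the summarizer in the earlier example.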

Of course, GPUs are very expensive, so it is up to you to do the math and decide whether the investment is worth it!
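As a rough illustration of that math, the sketch below compares cost per 1,000 summaries using the latencies quoted above (20 seconds on CPU, 2 seconds on GPU). The hourly prices are purely hypothetical placeholders, and the calculation assumes one request at a time on a fully utilized machine; substitute your own figures.

```python
def summaries_per_hour(seconds_per_summary: float) -> float:
    """Number of sequential summaries one machine completes in an hour."""
    return 3600 / seconds_per_summary


def cost_per_thousand(hourly_price: float, seconds_per_summary: float) -> float:
    """Cost of 1,000 summaries at full utilization."""
    return hourly_price / summaries_per_hour(seconds_per_summary) * 1000


# Latencies quoted in the article for an 800-word text.
CPU_SECONDS = 20.0
GPU_SECONDS = 2.0

# Hypothetical hourly prices, for illustration only.
CPU_HOURLY_PRICE = 0.10
GPU_HOURLY_PRICE = 0.50

print(summaries_per_hour(CPU_SECONDS))  # 180.0
print(summaries_per_hour(GPU_SECONDS))  # 1800.0
print(cost_per_thousand(CPU_HOURLY_PRICE, CPU_SECONDS))
print(cost_per_thousand(GPU_HOURLY_PRICE, GPU_SECONDS))
```

With a 10x speedup, the GPU comes out cheaper per summary as long as it costs less than 10 times the CPU machine per hour and there is enough traffic to keep it busy.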

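Leveraging an External API for Production

Text summarization with Bart Large CNN is very easy to use in a simple script, but what if you want to use it in production for a large volume of requests?

As mentioned above, a first solution is to provision your own GPU hardware and work on some production optimizations to make summarization faster.

A second solution is to delegate this task to a dedicated service such as NLP Cloud, which serves the Bart Large CNN model through an API. You can test the summarization API endpoint there!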
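Calling a hosted summarization endpoint could look like the sketch below, which uses only the standard library. The URL, the "Token" authorization scheme, and the JSON payload shape are assumptions here, not details taken from this article; check them against the provider's API reference before relying on them. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: endpoint URL to be verified against the provider's documentation.
API_URL = "https://api.nlpcloud.io/v1/bart-large-cnn/summarization"


def build_summarization_request(text: str, api_token: str) -> urllib.request.Request:
    """Build (without sending) a POST request carrying the text to summarize."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            # Assumption: token-based authorization header.
            "Authorization": f"Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize_remotely(text: str, api_token: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_summarization_request(text, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the first function easy to inspect and test without network access or a real API token.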

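Conclusion

In 2022, thanks to Transformers and Bart Large CNN, it is possible to perform advanced text summarization in Python with very little effort.

Text summarization is a very useful task that more and more companies are automating in their applications. As you have seen, the complexity comes from the performance side. Some techniques exist to speed up text summarization with Bart Large CNN, but that is a topic for another article!

I hope this article helps you save time on your next project! Feel free to try text summarization on NLP Cloud!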

François
Full-stack engineer at NLP Cloud

April 6, 2022
