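Summarization is a very common task that many developers would like to automate. For example, wouldn't it be nice to automatically create a summary of every blog article you write? Or to automatically summarize documents for your employees? Plenty of good applications exist.
Transformer-based models like Bart Large CNN make it easy to summarize text in Python. These machine learning models are easy to use but hard to scale. Let's see how to use Bart Large CNN and how to optimize its performance.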
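Transformers is a Python library from Hugging Face that recently made it possible to implement very advanced natural language processing use cases, such as text summarization.
Before Transformers and neural networks, several options existed, but none of them was really satisfying.
Over the last few years, many good pre-trained natural language processing models based on Transformers have been created for various use cases. Bart Large CNN was released by Facebook and gives excellent results for text summarization.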
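Here is how to use Bart Large CNN in your Python code.
The simplest way to use Bart Large CNN is to download it from the Hugging Face repository and use the text summarization pipeline from the Transformers library: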
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18."""
summary = summarizer(article, max_length=130, min_length=30)
Output:
Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.
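As you can see, this is only 4 lines of Python code, and the quality of the summary is very good! But you may have noticed that the model is big, so downloading it the first time takes a while.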
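The summarization pipeline returns a list with one dictionary per input text, and the summary itself sits under the `summary_text` key. Here is a minimal sketch of pulling the text out. The `extract_summaries` helper name is ours (it is not part of Transformers), and the hard-coded result is only a stand-in for a real model call:

```python
from typing import Dict, List


def extract_summaries(results: List[Dict[str, str]]) -> List[str]:
    """Return the plain summary strings from a summarization pipeline result."""
    return [item["summary_text"] for item in results]


# Stand-in for the value returned by summarizer(article, ...); not real model output.
example_result = [{"summary_text": "A short placeholder summary."}]
print(extract_summaries(example_result)[0])
```

With the code above, `extract_summaries(summary)[0]` would give you the summary as a plain string.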
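The min_length and max_length parameters set the minimum and maximum size of the summary. They are expressed as a number of tokens, not words. Basically, a token can be a word, but also a punctuation mark or a subword. In general, you can consider that 100 tokens roughly equal 75 words.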
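To get a feel for what these values mean in words, the rule of thumb can be turned into two small helpers. This is only a rough sketch: the function names are ours, and the 100-to-75 ratio is an approximation for English text, not an exact property of the model.

```python
def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a number of words takes (100 tokens ~ 75 words), rounded up."""
    return (word_count * 100 + 74) // 75


def estimate_words(token_count: int) -> int:
    """Estimate how many words fit in a number of tokens (100 tokens ~ 75 words), rounded down."""
    return token_count * 75 // 100


# The max_length and min_length values used in the example, expressed in words.
print(estimate_words(130), estimate_words(30))
```

If you need an exact count rather than an estimate, one option is to load the model's own tokenizer with `AutoTokenizer.from_pretrained("facebook/bart-large-cnn")` and count the `input_ids` it produces for your text.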
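Important note: your input text cannot be bigger than 1024 tokens (more or less 800 words), because this is an internal limit of the model. If you want to summarize bigger texts, a good strategy is to summarize several parts of the text independently and then recombine the results. You can even do summaries of summaries!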
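The chunking strategy could look something like the sketch below. It is one possible approach under a few assumptions, not a definitive implementation: the function names are ours, the 700-word budget is an arbitrary safety margin under the 1024-token limit, sentences are split with a naive regular expression, and `summarize` is any function you supply that turns one chunk of text into a summary string.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 700) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own oversized chunk.
    """
    # Naive split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long_text(
    text: str, summarize: Callable[[str], str], max_words: int = 700
) -> str:
    """Summarize each chunk independently, then recombine the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]
    if len(partial) == 1:
        return partial[0]
    combined = " ".join(partial)
    # Summary of summaries, only when the combined text fits in one call.
    if len(combined.split()) <= max_words:
        return summarize(combined)
    return combined
```

With the pipeline from the example above, `summarize` could be something like `lambda chunk: summarizer(chunk, max_length=130, min_length=30)[0]["summary_text"]`. Counting words instead of tokens keeps the sketch free of dependencies, at the cost of being only approximate.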
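However, this Bart Large CNN model has 2 main problems.
First, like many deep learning models, it requires a lot of disk space and RAM (around 1.5GB!). And it can still be considered a small deep learning model compared to huge ones like GPT-3, GPT-J, or T5 11B.
More importantly, it is quite slow. Under the hood, this model actually performs text generation, and text generation is inherently slow. If you try to summarize a piece of text made up of 800 words, it will take around 20 seconds on a good CPU...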
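Timings like this depend heavily on your hardware, so it is worth measuring on your own machine. Below is a small, generic timing helper as a sketch. The `timed` name is ours, and the call to `sum` is only a stand-in for a call to the summarizer.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn with the given arguments and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; replace with your own summarization call.
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f} s")
```

For the summarizer, the call would be along the lines of `timed(summarizer, article, max_length=130, min_length=30)`. Running it a few times and ignoring the first run gives a fairer picture, since the first call may include warm-up costs.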
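The solution is to deploy Bart Large CNN on a GPU. For example, on an NVIDIA Tesla T4, you can expect a x10 speedup, and your 800-word text will be summarized in around 2 seconds.
Of course, GPUs are very expensive, so it is up to you to do the math and decide whether the investment is worth it!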
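If you do have a GPU at hand, the Transformers pipeline accepts a `device` argument: `0` selects the first GPU and `-1` selects the CPU. The sketch below assumes the PyTorch backend with CUDA installed. The `pick_device` and `build_summarizer` helper names are ours.

```python
def pick_device(cuda_available: bool) -> int:
    """Map GPU availability to the pipeline's device argument (0 = first GPU, -1 = CPU)."""
    return 0 if cuda_available else -1


def build_summarizer():
    """Create the summarization pipeline, on the GPU when one is available."""
    # Imported here so the heavy libraries load only when the pipeline is built.
    import torch
    from transformers import pipeline

    device = pick_device(torch.cuda.is_available())
    return pipeline("summarization", model="facebook/bart-large-cnn", device=device)
```

Calling `build_summarizer()` gives you an object you can use exactly like `summarizer` in the example above, and it falls back to the CPU when no GPU is detected.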
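Text summarization with Bart Large CNN is very easy to use in a simple script, but what if you want to use it in production for a large volume of requests?
As mentioned above, the first solution is to provision your own hardware with a GPU, and to work on some production optimizations to make summarization faster.
The second solution is to delegate this task to a dedicated service like NLP Cloud, which will serve the Bart Large CNN model for you through an API. Test our summarization API endpoint here!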
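In 2022, thanks to Transformers and Bart Large CNN, advanced text summarization can be done in Python with very little effort.
Text summarization is a very useful task that more and more companies now automate in their applications. As you have seen, the complexity comes from the performance side. Some techniques exist to speed up text summarization with Bart Large CNN, but that will be the topic of another article!
I hope this article will help you save time on your next project! Feel free to try text summarization on NLP Cloud!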
Julien Salinas
CTO at NLP Cloud