How to Summarize Text with Python and Machine Learning

Transformers and Bart Large CNN

Transformers is an advanced Python library that has recently made very advanced natural language processing use cases possible, such as text summarization.

Before Transformers and neural networks, several options were available, but none of them was really satisfying.

In recent years, many good pre-trained natural language processing models based on Transformers have been created for various use cases. Bart Large CNN was released by Facebook, and it gives excellent results for text summarization.

Here is how to use Bart Large CNN in your Python code.

Text Summarization in Python

The simplest way to use Bart Large CNN is to download it from the Hugging Face repository and use the text summarization pipeline from the Transformers library:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18."""

summary = summarizer(article, max_length=130, min_length=30)

Output (the summary_text field of the first item in the returned list):

Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.

As you can see, this is only 4 lines of Python code, and the summary quality is very good! But you may have noticed that the model is large, so downloading it the first time takes a while.

The min_length and max_length parameters set the minimum and maximum size of the summary. They are expressed in tokens, not words. Basically, a token can be a word, but also a punctuation mark or a subword. In general, you can consider that 100 tokens are roughly equal to 75 words.
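
To make the rule of thumb concrete, here is a minimal sketch that converts between words and tokens using the 100-to-75 ratio. The function names are made up for this illustration, and the result is only an estimate: the exact count depends on the model's tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of `text` from its word count."""
    word_count = len(text.split())
    # Rule of thumb from the article: 100 tokens for about 75 words.
    return math.ceil(word_count * 100 / 75)


def words_for_tokens(token_budget: int) -> int:
    """Roughly estimate how many words fit into `token_budget` tokens."""
    return token_budget * 75 // 100
```

For example, words_for_tokens(130) suggests that max_length=130 allows a summary of a little under 100 words.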

Important note: your input text cannot be longer than 1024 tokens (more or less equal to 800 words), since this is an internal limit of this model. If you want to summarize larger pieces of text, a good strategy is to summarize several parts of the text independently and then summarize the results again. You can even make summaries of summaries!
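
The strategy above could be sketched as follows. This is only one possible approach under stated assumptions: the helper names are hypothetical, chunks are cut on a plain word count (700 words here, chosen to stay below the approximate 800-word limit) rather than on sentence boundaries, and the summarize argument is any function you provide that turns a text into a shorter text.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 700) -> List[str]:
    """Split `text` into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long_text(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 700,
) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Second pass: a summary of the summaries.
    return summarize(" ".join(partial_summaries))
```

With the pipeline from the example above, summarize could be a small wrapper such as lambda t: summarizer(t, max_length=130, min_length=30)[0]["summary_text"]. For very long documents the joined partial summaries may themselves exceed the limit, in which case another pass would be needed.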

Performance Considerations

However, there are two main problems with this Bart Large CNN model.

First, like many deep learning models, it requires a significant amount of disk space and RAM (around 1.5 GB!). And this can still be considered a small deep learning model compared to huge ones like GPT-3, GPT-J, T5 11B, etc.

Even more importantly, it is quite slow. This model actually performs text generation, and text generation is very slow. If you try to summarize a text of 800 words on a good CPU, it will take around 20 seconds...
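
Timings like this depend heavily on your hardware, so it can be worth measuring them yourself. A minimal sketch, using a small helper (timed is a name chosen for this illustration) that wraps any call and reports the elapsed wall-clock time:

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call `func` and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With the earlier example you could write summary, seconds = timed(summarizer, article, max_length=130, min_length=30) and print seconds.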

The solution is to deploy Bart Large CNN on a GPU. For example, on an NVIDIA Tesla T4, you can expect a 10x speedup, and your 800-word piece of text will be summarized in around 2 seconds.
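
In the Transformers library, the pipeline can be placed on a GPU through its device argument (0 for the first GPU, -1 for the CPU). The sketch below assumes a PyTorch installation with CUDA support; pick_device and build_summarizer are hypothetical helper names used only for this illustration.

```python
def pick_device(cuda_available: bool) -> int:
    """Return the `device` index for pipeline(): 0 = first GPU, -1 = CPU."""
    return 0 if cuda_available else -1


def build_summarizer():
    """Load Bart Large CNN on the first GPU when one is available."""
    # Imported lazily so pick_device() can be used without these packages.
    import torch
    from transformers import pipeline

    device = pick_device(torch.cuda.is_available())
    return pipeline("summarization", model="facebook/bart-large-cnn", device=device)
```

The object returned by build_summarizer() is called exactly like the summarizer in the first example.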

GPUs are of course very expensive, so it is up to you to decide whether the investment is worth it!

Using an External API for Production

Text summarization with Bart Large CNN is very easy to use in a simple script, but what if you want to use it in production to handle a large volume of requests?

As mentioned above, the first solution would be to provision your own hardware with GPUs and work on some production optimizations to make summarization faster.

The second solution would be to delegate this task to a dedicated service like NLP Cloud, which will serve the Bart Large CNN model for you through an API. Try our summarization API endpoint!
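
Calling a hosted summarization service usually comes down to one authenticated HTTP request. The sketch below only builds such a request with the standard library. The endpoint URL, the "Token" authorization scheme, and the {"text": ...} body are assumptions made for illustration, so check your provider's API documentation for the real values.

```python
import json
import urllib.request


def build_summarization_request(api_url: str, token: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the text to summarize."""
    body = json.dumps({"text": text}).encode("utf-8")  # assumed request body shape
    return urllib.request.Request(
        api_url,
        data=body,
        headers={
            "Authorization": f"Token {token}",  # assumed authorization scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(request) and the response parsed as JSON. The shape of that response depends on the provider.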

Conclusion

In 2022, thanks to Transformers and Bart Large CNN, it is possible to perform advanced text summarization in Python with very little effort.

Text summarization is a very useful task that more and more companies now automate in their applications. As you can see, the complexity comes from the performance side. There are some techniques for speeding up text summarization with Bart Large CNN, but that will be a topic for another article!

I hope this article will help you save time on your next project! Try text summarization on NLP Cloud!

Julien Salinas
CTO at NLP Cloud