How to prepare a text-based dataset for your NLP project with Texthero

Natural Language Processing (NLP) is one of the most important fields of study and research in today's world. It has many applications in the business sector such as chatbots, sentiment analysis, and document classification.

Preprocessing and representing text is one of the trickiest and most annoying parts of working on an NLP project. Text-based datasets can be incredibly thorny and difficult to preprocess. But fortunately, a recent Python package called Texthero can help you solve these challenges.

What is Texthero?

Texthero is a simple Python toolkit that helps you work with text-based datasets. It provides quick and easy functionality to preprocess, represent (map into vectors), and visualize text data in just a couple of lines of code.

Texthero is designed to be used on top of pandas, making it easier to preprocess and analyze text-based Pandas Series or DataFrames.
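
Because every Texthero function takes a Pandas Series and returns a Pandas Series, you can chain the functions with pandas' own .pipe(). Here is a minimal illustrative sketch (the dataframe and column names are made up for this example):

import pandas as pd
import texthero as hero

# a tiny made-up dataframe with one text column
df = pd.DataFrame({"text": ["<b>Hello World!</b>", "Visit https://example.com today"]})

# chain Texthero preprocessing functions with .pipe()
df["clean"] = (df["text"]
               .pipe(hero.remove_html_tags)
               .pipe(hero.remove_urls)
               .pipe(hero.preprocessing.lowercase))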

If you are working on an NLP project, Texthero can help you get things done faster than before and gives you more time to focus on important tasks.

NOTE: The Texthero library is still in beta. You might encounter some bugs, and pipelines might change. A faster and better version will be released, and it will bring some major changes.

Texthero Overview

Texthero has four useful modules, each handling different functionality that you can apply to your text-based dataset.

  1. Preprocessing

    This module allows for the efficient preprocessing of text-based Pandas Series or DataFrames. It has different methods to clean your text dataset, such as lowercase(), remove_html_tags(), and remove_urls().

  2. NLP

    This module has a few NLP tasks such as named_entities, noun_chunks, and so on (see the sketch after this list).

  3. Representation

    This module has different algorithms to map words into vectors, such as TF-IDF, GloVe, and term_frequency, plus Principal Component Analysis (PCA) for dimensionality reduction.

  4. Visualization

    The last module has three different methods to visualize the insights and statistics of a text-based Pandas DataFrame, such as scatter plots and word clouds.
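
The NLP module is not covered further in this article, but as a rough sketch, calling named_entities on a Pandas Series looks like the following. The example sentence is made up, the exact output format may change while Texthero is in beta, and the call relies on the spaCy English model that Texthero sets up:

import pandas as pd
import texthero as hero

text = pd.Series("Barack Obama visited Tanzania in 2013")

# each row becomes a list of named-entity tuples detected by spaCy
print(hero.named_entities(text))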

Install Texthero

Texthero is free, open-source, and well documented. To install it, open a terminal and execute the following command:

pip install texthero

The package uses a lot of other libraries on the back-end such as Gensim, SpaCy, scikit-learn, and NLTK. You don't need to install them all separately; pip will take care of that.

How to use Texthero

In this article I will use a news dataset to show you how you can use the different methods provided by Texthero's modules in your own NLP project.

We will start by importing important Python packages that we are going to use.

# import important packages
import texthero as hero
import pandas as pd

Then we'll load a dataset from the data directory. The dataset for this article focuses on news in the Swahili language.

# load dataset
data = pd.read_csv("data/swahili_news_dataset.csv")

Let's look at the top 5 rows of the dataset:

# show top 5 rows
data.head()

As you can see, in our dataset we have three columns (id, content, and category). For this article we will focus on the content feature.

# select news content only and show top 5 rows
news_content = data[["content"]]
news_content.head()

We have created a new dataframe focused on the content only, and then shown its top 5 rows.

Preprocessing with Texthero

We can use the clean() method to pre-process a text-based Pandas Series.

# clean the news content by using the clean method from the hero package
news_content['clean_content'] = hero.clean(news_content['content'])

The clean() method runs seven functions when you pass a pandas Series. These seven functions are:

  • lowercase(s): Lowercases all text.
  • remove_diacritics(): Removes all accents from strings.
  • remove_stopwords(): Removes all stop words.
  • remove_digits(): Removes all blocks of digits.
  • remove_punctuation(): Removes all string.punctuation characters (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~).
  • fillna(s): Replaces unassigned values with empty strings.
  • remove_whitespace(): Removes all extra white space between words.
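
If you want to confirm exactly which functions (and in what order) your installed version of Texthero applies, the preprocessing module exposes the default pipeline. Here is a minimal sketch, assuming get_default_pipeline() is available in your version of the beta:

from texthero import preprocessing

# print the name of each function in clean()'s default pipeline
# (the set and order of functions may change between beta releases)
for func in preprocessing.get_default_pipeline():
    print(func.__name__)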

Now we can see the cleaned news content.

# show unclean and clean news content
news_content.head()

Custom Cleaning

If the default pipeline from the clean() method does not fit your needs, you can create a custom pipeline with the list of functions that you want to apply to your dataset.

As an example, I created a custom pipeline with only 5 functions to clean my dataset.

# create custom pipeline
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_urls]

Now I can use the custom_pipeline to clean my dataset.

# alternative: clean with the custom pipeline
news_content['clean_custom_content'] = news_content['content'].pipe(hero.clean, custom_pipeline)

You can see the clean dataset we have created by using the custom pipeline.

# show output of custom pipeline
news_content.clean_custom_content.head()

Useful preprocessing methods

Here are some other useful functions from the preprocessing module that you can try to clean your text-based dataset.

Remove digits

You can use the remove_digits() function to remove digits in your text-based datasets.

text = pd.Series("Hi my phone number is +255 711 111 111 call me at 09:00 am")
clean_text = hero.preprocessing.remove_digits(text)
print(clean_text)

output: Hi my phone number is +        call me at  :  am dtype: object

Remove stopwords

You can use the remove_stopwords() function to remove stopwords in your text-based datasets.

text = pd.Series("you need to know NLP to develop the chatbot that you desire")
clean_text = hero.remove_stopwords(text)
print(clean_text)

output:    need  know NLP  develop  chatbot   desire dtype: object

Remove URLs

You can use the remove_urls() function to remove links in your text-based datasets.

text = pd.Series("Go to https://www.freecodecamp.org/news/ to read more articles you like")
clean_text = hero.remove_urls(text)
print(clean_text)

output:   Go to    to read more articles you like dtype: object

Tokenize

The tokenize() method tokenizes each row of the given Pandas Series and returns a Pandas Series where each row contains a list of tokens.

text = pd.Series(["You can think of Texthero as a tool to help you understand and work with text-based dataset. "])
clean_text = hero.tokenize(text)
print(clean_text)

output:   [You, can, think, of, Texthero, as, a, tool, to, help, you, understand, and, work, with, text, based, dataset] dtype: object

Remove HTML tags

You can remove HTML tags from a given Pandas Series by using the remove_html_tags() method.

text = pd.Series("<html><body><h2>hello world</h2></body></html>")
clean_text = hero.remove_html_tags(text)
print(clean_text)

output:   hello world dtype: object

Useful visualization methods

Texthero contains different methods to visualize insights and statistics of a text-based Pandas DataFrame.

Top words

If you want to know the top words in your text-based dataset, you can use the top_words() method from the visualization module. This method is useful if you want to see additional words that you can add to the stop word lists.

This method does not return a bar graph, so I will use matplotlib to visualize the top words in a bar graph.

import matplotlib.pyplot as plt

NUM_TOP_WORDS = 20
top_20 = hero.visualization.top_words(news_content['clean_content']).head(NUM_TOP_WORDS)

# draw the bar chart
top_20.plot.bar(rot=90, title="Top 20 words")
plt.show(block=True)

In the graph above we can visualize the top 20 words from our news dataset.

Wordclouds

The wordcloud() method from the visualization module plots an image using WordCloud from the word_cloud package.

# plot wordcloud image using the wordcloud method
hero.wordcloud(news_content.clean_content, max_words=100)

We passed the dataframe series and the maximum number of words (for this example, 100 words) to the wordcloud() method.

Useful representation methods

Texthero contains different methods from the representation module that help you map words into vectors using different algorithms such as TF-IDF, word2vec or GloVe. In this section I will show you how you can use these methods.

TF-IDF

You can represent a text-based Pandas Series using TF-IDF. I created a new pandas Series with two pieces of news content and represented them as TF-IDF features by using the tfidf() method.

# create a new text-based Pandas Series
news = pd.Series(["mkuu wa mkoa wa tabora aggrey mwanri amesitisha likizo za viongozi wote mkoani humo kutekeleza maazimio ya jukwaa la fursa za biashara la mkoa huo", "serikali imetoa miezi sita kwa taasisi zote za umma ambazo hazitumii mfumo wa gepg katika ukusanyaji wa fedha kufanya hivyo na baada ya hapo itafanya ukaguzi na kuwawajibisha"])

# convert into tf-idf features
hero.tfidf(news)

output: [0.187132760851739, 0.0, 0.187132760851739, 0....               [0.0, 0.18557550845969953, 0.0, 0.185575508459... dtype: object

NOTE: TF-IDF stands for term frequency-inverse document frequency.
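
In the most common formulation, each term t in a document d is scored as tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often t occurs in d, N is the total number of documents, and df(t) is the number of documents that contain t. Exact smoothing and normalization details vary between implementations, so Texthero's values may differ slightly from a hand computation.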

Term Frequency

You can represent a text-based Pandas Series using the term_frequency() method. Term frequency (TF) is used to show how frequently an expression (term or word) occurs in a document or text content.

news = pd.Series(["mkuu wa mkoa wa tabora aggrey mwanri amesitisha likizo za viongozi wote mkoani humo kutekeleza maazimio ya jukwaa la fursa za biashara la mkoa huo", "serikali imetoa miezi sita kwa taasisi zote za umma ambazo hazitumii mfumo wa gepg katika ukusanyaji wa fedha kufanya hivyo na baada ya hapo itafanya ukaguzi na kuwawajibisha"])

# represent a text-based Pandas Series using term_frequency
hero.term_frequency(news)

output: [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, ...              [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, ... dtype: object

K-means

Texthero can perform the K-means clustering algorithm by using the kmeans() method. If you have an unlabeled text-based dataset, you can use this method to group content according to its similarity.

In this example, I will create a new pandas dataframe called news with the following columns: content, tfidf, and kmeans_labels.

column_names = ["content", "tfidf", "kmeans_labels"]
news = pd.DataFrame(columns=column_names)

We will use only the first 30 pieces of cleaned content from our news_content dataframe and cluster them into groups by using the kmeans() method.

# collect 30 pieces of clean content
news["content"] = news_content.clean_content[:30]

# convert them into tf-idf features
news['tfidf'] = (
    news['content'].pipe(hero.tfidf)
)

# perform the clustering algorithm by using kmeans()
news['kmeans_labels'] = (
    news['tfidf'].pipe(hero.kmeans, n_clusters=5).astype(str)
)

In the above source code, we passed the number of clusters, 5, to the kmeans() pipeline. This means we will group the content into 5 clusters.
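
As a quick sanity check after clustering, you can count how many pieces of content landed in each cluster. This is plain pandas rather than a Texthero method:

# count how many documents fall into each cluster
print(news["kmeans_labels"].value_counts())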

Now the selected news content has been labeled into five groups.

# show content and their labels
news[["content","kmeans_labels"]].head()

PCA

You can also use the pca() method to perform principal component analysis on the given Pandas Series. Principal component analysis (PCA) is a technique for reducing the dimensionality of your datasets. This increases interpretability but at the same time minimizes information loss.

In this example we take the tfidf features from the news dataframe and reduce them to two components by using the pca() method. Finally, we will show a scatter plot by using the scatterplot() method.

# perform pca
news['pca'] = news['tfidf'].pipe(hero.pca)

# show scatterplot
hero.scatterplot(news, 'pca', color='kmeans_labels', title="news")

Wrap up

In this article, you've learned the basics of how to use the Texthero Python package in your NLP project. You can learn more about the available methods in the documentation.

You can download the dataset and notebook used in this article here: https://github.com/Davisy/Texthero-Python-Toolkit.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! You can also reach me on Twitter @Davis_McDavid.

Translated from: https://www.freecodecamp.org/news/how-to-work-and-understand-text-based-dataset-with-texthero/
