This article shows how to compute document similarity with Spark. The content is kept concise and easy to follow; hopefully you will find it useful.
1. TF-IDF: converting documents to vectors

Take the following three sentences as an example:

羅湖發(fā)布大梧桐新興產(chǎn)業(yè)帶整體規(guī)劃
深化伙伴關系,增強發(fā)展動力
為世界經(jīng)濟發(fā)展貢獻中國智慧

After word segmentation they become:

[羅湖, 發(fā)布, 大梧桐, 新興產(chǎn)業(yè), 帶, 整體, 規(guī)劃]
[深化, 伙伴, 關系, 增強, 發(fā)展, 動力]
[為, 世界, 經(jīng)濟發(fā)展, 貢獻, 中國, 智慧]

After term frequency (TF) computation, where TF = the number of times a term occurs in a document:

(262144,[10607,18037,52497,53469,105320,122761,220591],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])
(262144,[8684,20809,154835,191088,208112,213540],[1.0,1.0,1.0,1.0,1.0,1.0])
(262144,[21159,30073,53529,60542,148594,197957],[1.0,1.0,1.0,1.0,1.0,1.0])

262144 (2^18, HashingTF's default) is the total number of feature buckets; the larger this value, the lower the probability that two different terms hash to the same index, and the more accurate the result.
[10607,18037,52497,53469,105320,122761,220591] are the hashed feature indices of 羅湖, 發(fā)布, 大梧桐, 新興產(chǎn)業(yè), 帶, 整體 and 規(guī)劃 respectively.
[1.0,1.0,1.0,1.0,1.0,1.0,1.0] are the numbers of times 羅湖, 發(fā)布, 大梧桐, 新興產(chǎn)業(yè), 帶, 整體 and 規(guī)劃 occur in the sentence.
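As a minimal sketch of this TF step (assuming Spark 2.0's Java API, a local master, and the pre-segmented sentences above joined with spaces; the class name TfSketch is hypothetical):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TfSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TfSketch").master("local[2]").getOrCreate();
        // Each row holds one pre-segmented sentence, tokens separated by spaces.
        List<Row> data = Arrays.asList(
                RowFactory.create("羅湖 發(fā)布 大梧桐 新興產(chǎn)業(yè) 帶 整體 規(guī)劃"),
                RowFactory.create("深化 伙伴 關系 增強 發(fā)展 動力"),
                RowFactory.create("為 世界 經(jīng)濟發(fā)展 貢獻 中國 智慧"));
        StructType schema = new StructType(new StructField[]{
                new StructField("segment", DataTypes.StringType, false, Metadata.empty())});
        Dataset<Row> sentences = spark.createDataFrame(data, schema);
        // Tokenizer splits on whitespace; HashingTF maps each token into one of 262144 buckets.
        Dataset<Row> words = new Tokenizer()
                .setInputCol("segment").setOutputCol("words").transform(sentences);
        Dataset<Row> tf = new HashingTF().setNumFeatures(262144)
                .setInputCol("words").setOutputCol("rawFeatures").transform(words);
        tf.select("rawFeatures").show(false);
        spark.stop();
    }
}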
After inverse document frequency (IDF) weighting, where IDF = log(total number of documents / number of documents containing the term) (Spark's IDF implementation actually uses the smoothed variant log((N + 1) / (df + 1))), the vectors become:

[6.062092444847088,7.766840537085513,7.073693356525568,5.201891179623976,7.073693356525568,5.3689452642871425,6.514077568590145]
[3.8750202389748862,5.464255444091467,6.062092444847088,7.3613754289773485,6.668228248417403,5.975081067857458]
[6.2627631403092385,4.822401557919072,6.2627631403092385,6.2627631403092385,3.547332831909406,4.065538562973019]

Here [6.062092444847088,7.766840537085513,7.073693356525568,5.201891179623976,7.073693356525568,5.3689452642871425,6.514077568590145] are the IDF values of 羅湖, 發(fā)布, 大梧桐, 新興產(chǎn)業(yè), 帶, 整體 and 規(guī)劃 respectively.
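For intuition, with hypothetical counts: in a corpus of 1000 documents, a term that appears in 5 of them gets IDF = ln(1000/5) = ln(200) ≈ 5.30 (or ln(1001/6) ≈ 5.12 with the smoothed variant), so rare terms are weighted heavily. The final TF-IDF value in each bucket is simply TF × IDF.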
2. Similarity measures

From earlier study of the clustering algorithms in the book Mahout in Action, we know several similarity measures:

Euclidean distance
Given two points in the plane, this is the distance a ruler would measure between them: d(x, y) = √(Σᵢ (xᵢ − yᵢ)²).

Squared Euclidean distance
The value of this measure is the square of the Euclidean distance.

Manhattan distance
The distance between two points is the sum of the absolute differences of their coordinates: Σᵢ |xᵢ − yᵢ|.

Cosine distance
The cosine measure treats the points as vectors pointing from the origin to them, and these vectors form an angle. When the angle is small, the vectors point in roughly the same direction, so the points are close together. As the angle approaches zero, its cosine approaches 1; the cosine decreases as the angle grows.

The cosine of the angle between two n-dimensional vectors A and B is:
cos θ = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))
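A minimal plain-Java sketch of this formula (a standalone helper for illustration, not part of the Spark program below):

// Cosine similarity of two equal-length dense vectors: dot(a, b) / (‖a‖ * ‖b‖).
public static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}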
Tanimoto distance
The cosine measure ignores vector magnitude, which works for some datasets but can lead to poor clustering results in others; the Tanimoto distance reflects both the angle between points and their relative distance.

Weighted distance
Allows individual dimensions to be weighted, increasing or decreasing their influence on the distance value.
3. Implementation

Spark ML ships with a TF-IDF implementation, Spark SQL makes it easy to read and sort the results, and the ml.linalg utilities provide the dot products and norms needed for cosine values. This article uses cosine similarity to measure document similarity, computed as:
similarity(A, B) = (A · B) / (‖A‖ ‖B‖)
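With Spark's ml.linalg API this boils down to one dot product and two L2 norms; a minimal sketch mirroring the BLAS.dot and Vectors.norm calls in SimilarityTest below (the helper name cosine is hypothetical):

import org.apache.spark.ml.linalg.BLAS;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

// Cosine similarity of two TF-IDF feature vectors taken from the "features" column.
static double cosine(Vector v1, Vector v2) {
    double dot = BLAS.dot(v1.toSparse(), v2.toSparse());
    return dot / (Vectors.norm(v1, 2.0) * Vectors.norm(v2, 2.0));
}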
The test data was crawled from the web between December 7 and December 12; the sample contains 16,632 records.
Each line of the penngo_07_12.txt data file follows the format Id@==@publish_time@==@title@==@content@==@source.
The first news item was a hot topic during that period. The example computes the similarity of every article against this first one, sorts the results from most to least similar, and saves the final ranking to a text file.
Create the Spark project with Maven.
pom.xml configuration:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.spark.penngo</groupId>
    <artifactId>spark_test</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>spark_test</name>
    <url>http://maven.apache.org</url>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.lionsoul</groupId>
            <artifactId>jcseg-core</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
        <!--
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver</artifactId>
            <version>3.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.1</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.21</version>
        </dependency>
        -->
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
SimilarityTest.java
package com.spark.penngo.tfidf;

import com.spark.test.tfidf.util.SimilartyData;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.linalg.BLAS;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.*;
import org.lionsoul.jcseg.tokenizer.core.*;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.util.*;

/**
 * Computing document similarity, https://my.oschina.net/penngo/blog
 */
public class SimilarityTest {
    private static SparkSession spark = null;
    private static String splitTag = "@==@";

    // Turn the segmented text into TF-IDF feature vectors.
    public static Dataset<Row> tfidf(Dataset<Row> dataset) {
        Tokenizer tokenizer = new Tokenizer().setInputCol("segment").setOutputCol("words");
        Dataset<Row> wordsData = tokenizer.transform(dataset);
        HashingTF hashingTF = new HashingTF()
                .setInputCol("words")
                .setOutputCol("rawFeatures");
        Dataset<Row> featurizedData = hashingTF.transform(wordsData);
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
        IDFModel idfModel = idf.fit(featurizedData);
        Dataset<Row> rescaledData = idfModel.transform(featurizedData);
        return rescaledData;
    }

    // Read the crawled news file and segment each article's content with Jcseg.
    public static Dataset<Row> readTxt(String dataPath) {
        JavaRDD<TfIdfData> newsInfoRDD = spark.read().textFile(dataPath).javaRDD().map(new Function<String, TfIdfData>() {
            private ISegment seg = null;

            private void initSegment() throws Exception {
                if (seg == null) {
                    JcsegTaskConfig config = new JcsegTaskConfig();
                    config.setLoadCJKPos(true);
                    String path = new File("").getAbsolutePath() + "/data/lexicon";
                    System.out.println(new File("").getAbsolutePath());
                    ADictionary dic = DictionaryFactory.createDefaultDictionary(config);
                    dic.loadDirectory(path);
                    seg = SegmentFactory.createJcseg(JcsegTaskConfig.COMPLEX_MODE, config, dic);
                }
            }

            public TfIdfData call(String line) throws Exception {
                initSegment();
                TfIdfData newsInfo = new TfIdfData();
                String[] lines = line.split(splitTag);
                if (lines.length < 5) {
                    System.out.println("error==" + lines[0] + " " + lines[1]);
                }
                String id = lines[0];
                String publish_timestamp = lines[1];
                String title = lines[2];
                String content = lines[3];
                String source = lines.length > 4 ? lines[4] : "";
                // Segment the content and join the tokens with spaces for the Tokenizer.
                seg.reset(new StringReader(content));
                StringBuffer sff = new StringBuffer();
                IWord word = seg.next();
                while (word != null) {
                    sff.append(word.getValue()).append(" ");
                    word = seg.next();
                }
                newsInfo.setId(id);
                newsInfo.setTitle(title);
                newsInfo.setSegment(sff.toString());
                return newsInfo;
            }
        });
        Dataset<Row> dataset = spark.createDataFrame(newsInfoRDD, TfIdfData.class);
        return dataset;
    }

    public static SparkSession initSpark() {
        if (spark == null) {
            spark = SparkSession
                    .builder()
                    .appName("SimilarityPenngoTest").master("local[3]")
                    .getOrCreate();
        }
        return spark;
    }

    // Compute the cosine similarity of every article against the article with the given id
    // and write the results, sorted from most to least similar, to datePath.
    public static void similarDataset(String id, Dataset<Row> dataSet, String datePath) throws Exception {
        Row firstRow = dataSet.select("id", "title", "features").where("id ='" + id + "'").first();
        Vector firstFeatures = firstRow.getAs(2);
        Dataset<SimilartyData> similarDataset = dataSet.select("id", "title", "features").map(new MapFunction<Row, SimilartyData>() {
            public SimilartyData call(Row row) {
                String id = row.getString(0);
                String title = row.getString(1);
                Vector features = row.getAs(2);
                // Cosine similarity = dot(a, b) / (‖a‖ * ‖b‖).
                double dot = BLAS.dot(firstFeatures.toSparse(), features.toSparse());
                double v1 = Vectors.norm(firstFeatures.toSparse(), 2.0);
                double v2 = Vectors.norm(features.toSparse(), 2.0);
                double similarty = dot / (v1 * v2);
                SimilartyData similartyData = new SimilartyData();
                similartyData.setId(id);
                similartyData.setTitle(title);
                similartyData.setSimilarty(similarty);
                return similartyData;
            }
        }, Encoders.bean(SimilartyData.class));
        Dataset<Row> similarDataset2 = spark.createDataFrame(similarDataset.toJavaRDD(), SimilartyData.class);
        FileOutputStream out = new FileOutputStream(datePath);
        OutputStreamWriter osw = new OutputStreamWriter(out, "UTF-8");
        similarDataset2.select("id", "title", "similarty").sort(functions.desc("similarty")).collectAsList().forEach(row -> {
            try {
                StringBuffer sff = new StringBuffer();
                String sid = row.getAs(0);
                String title = row.getAs(1);
                double similarty = row.getAs(2);
                sff.append(sid).append(" ").append(similarty).append(" ").append(title).append("\n");
                osw.write(sff.toString());
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        osw.close();
        out.close();
    }

    public static void run() throws Exception {
        initSpark();
        String dataPath = new File("").getAbsolutePath() + "/data/penngo_07_12.txt";
        Dataset<Row> dataSet = readTxt(dataPath);
        dataSet.show();
        Dataset<Row> tfidfDataSet = tfidf(dataSet);
        String id = "58528946cc9434e17d8b4593";
        String similarFile = new File("").getAbsolutePath() + "/data/penngo_07_12_similar.txt";
        similarDataset(id, tfidfDataSet, similarFile);
    }

    public static void main(String[] args) throws Exception {
        // When running on Windows:
        //System.setProperty("hadoop.home.dir", "D:/penngo/hadoop-2.6.4");
        //System.setProperty("HADOOP_USER_NAME", "root");
        run();
    }
}
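The TfIdfData and SimilartyData bean classes are not shown in the original. A minimal sketch, with the fields inferred from the getter/setter calls above (the real classes may carry more fields, such as the publish time):

// File: com/spark/penngo/tfidf/TfIdfData.java
// Same package as SimilarityTest, hence no import needed there.
package com.spark.penngo.tfidf;

import java.io.Serializable;

public class TfIdfData implements Serializable {
    private String id;
    private String title;
    private String segment;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getSegment() { return segment; }
    public void setSegment(String segment) { this.segment = segment; }
}

// File: com/spark/test/tfidf/util/SimilartyData.java
// Spark's createDataFrame and Encoders.bean rely on these getters/setters.
package com.spark.test.tfidf.util;

import java.io.Serializable;

public class SimilartyData implements Serializable {
    private String id;
    private String title;
    private double similarty;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public double getSimilarty() { return similarty; }
    public void setSimilarty(double similarty) { this.similarty = similarty; }
}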
In the results, the more similar a news item is, the nearer the top it ranks; on the sample data the output basically meets expectations. The ranked list is written to penngo_07_12_similar.txt.
The above shows how to compute document similarity with Spark; hopefully it has added something to your knowledge or skills.