版权说明 操作指南
首页 > 成果 > 详情

VSQ: Enabling efficient and verifiable similarity queries in blockchain databases

认领
导出
Link by DOI
反馈
分享
QQ微信 微博
成果类型:
期刊论文
作者:
Bo Yin*;Yihu Liu;Binyao Xu
通讯作者:
Bo Yin
作者机构:
[Bo Yin; Binyao Xu] School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China
School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China
[Yihu Liu] School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China<&wdkj&>School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China
通讯机构:
[Bo Yin] S
School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China
语种:
英文
关键词:
In recent years;with the rise of web 3.0 (Garcia et al.;Hendler;2009) applications such as meta-universe and digital collections;blockchain as the underlying technology has gained extensive attention from researchers. Blockchain is a distributed database in which nodes do not trust each other. It ensures data security through the use of a hash chain;consensus mechanism;and other technologies. The hash chain prevents data tampering;while the consensus mechanism ensures that all nodes store identical copies. Additionally;blockchain is characterized by decentralization and traceability. Due to these features;blockchain technology has found wide applications;particularly in fields like finance (Treleaven;Brown;& Yang;2017) and healthcare (Revathi and Manikandan;Wang et al.;where decentralized networks store increasing amounts of data. Consequently;the issue of data querying has become a focus of research. In a blockchain system;there are two types of nodes: full nodes and light nodes (as shown in Fig. 1). Full nodes store all data from the blockchain;including block headers and bodies;and they verify transactions and validate blocks. This results in increased storage demands on them. As of April 2024;the amount of data in the full nodes of Bitcoin is 564.04 GB;and the amount of data in the full nodes of Ethereum is 994 GB. On the other hand;light nodes only store block headers;which significantly reduces their storage pressure since the size of a block header is only 80 bytes. When users join the blockchain as a full node;they can easily query the data stored on the chain. However;this brings huge storage and computation pressure to the user. To alleviate this situation;users can join the blockchain as light nodes. In this case;a query request needs to be sent to the service provider (SP) that runs a full node. Since nodes do not trust each other;the query results returned by the full node may be tampered with or the complete query results may not be returned. Therefore;the user needs to verify the soundness and completeness of the query results. To address this issue;the miner constructs the ADS for the data in each block during the block-packing process. The root hash of the ADS is stored in the block header. The full node generates a verification object (VO) along with the results by querying the ADS. Both the VO and query results are then returned to the user. The user verifies the integrity of the query result through the VO and the root digest obtained from the block header. Fig. 1 shows an example where the user sends a query request Q to the SP. The SP queries ADS and returns VO and query result R to the user. In this paper;we investigate verifiable similarity queries within the context of blockchain networks. Our focus will be on two specific types of queries: similarity range queries and similarity top- k queries. In practical scenarios;users often seek to retrieve data objects that exhibit similarity to a given set of keywords of interest. For example;a user may initiate a similarity range query such as “Querying data objects with the similarity to {male;USA;California;programming enthusiasts;enjoy traveling} greater than 0.6”. Meanwhile;a similarity top- k query could involve a request for “Obtaining 5 data objects with the highest similarity to {female;USA;New York;yoga enthusiasts;enjoying reading}”. In the context of such similarity queries;it becomes imperative for the blockchain system to calculate the similarity between the user-provided keyword set and each data object stored within the blockchain. The efficient execution of similarity queries;while upholding the integrity and completeness of the query outcomes;holds significant importance for both the broader adoption of blockchain technology and the enhancement of user experience. Two examples of similarity queries are as follows. Example 1 Consider a platform for accessing literature that utilizes blockchain technology to ensure copyright protection and track the origin of the literature. On this platform;papers are published and can be retrieved and cited by other researchers. To help researchers find relevant literature quickly;a keyword query like Q s =“{ machine learning;smart grid;blockchain };number=50 ” can be used to return the 50 most similar literature items. Consider a platform for accessing literature that utilizes blockchain technology to ensure copyright protection and track the origin of the literature. On this platform;papers are published and can be retrieved and cited by other researchers. To help researchers find relevant literature quickly;a keyword query like Q s =“{ machine learning;smart grid;blockchain };number=50 ” can be used to return the 50 most similar literature items. Example 2 A blockchain-based smart healthcare system stores patients’ medical records and health insurance information. When seeking advice on medical treatment;patients may want to know the treatment options received by other patients with similar symptoms. For this;the patient can issue a query like Q s = “{ diabetes;hypertension;elderly patients };threshold=0.7 ” to find treatment plans of patients with a similarity greater than 0.7. The similarity threshold of 0.7 can be adjusted according to user needs. A blockchain-based smart healthcare system stores patients’ medical records and health insurance information. When seeking advice on medical treatment;patients may want to know the treatment options received by other patients with similar symptoms. For this;the patient can issue a query like Q s = “{ diabetes;hypertension;elderly patients };threshold=0.7 ” to find treatment plans of patients with a similarity greater than 0.7. The similarity threshold of 0.7 can be adjusted according to user needs. The similarity query can also be utilized in e-commerce-based blockchain platforms. For example;the non-fungible token (NFT) market is a popular blockchain application that can recommend digital collections based on user interest tags;such as profile pictures and games. In other e-commerce platforms;users can discover products with similar styles through similarity queries that rely on their purchase history or preferences. For the above example;an intuitive solution is to construct a Merkel hash tree (MHT) based on data objects in the block. However;the MHT does not support efficient data queries because it lacks a search key. Consequently;many existing query schemes combine the MHT with various database indexes;such as B+-trees;R-trees;and prefix trees;to enhance search efficiency. The keywords associated with data objects stored in the leaf nodes can serve as the search key. Despite this;the approach still suffers from low query efficiency;as it cannot directly identify the leaf nodes containing the relevant query results. During query processing at a node;it is necessary to examine every subset of the search key. Even if a subset that meets the query conditions is found;it may not correspond to an actual set residing within a descendant node. This limitation hinders effective node pruning during the query process. Therefore;our goal is to reduce the query time of ADS. The challenge is to improve query efficiency. Existing research on blockchain verifiable queries focuses on range queries (Wang;Xu;et al.;Xu et al.;Zhang et al.;keyword queries (Xu et al.;Zhang et al.;and existence queries (Dai et al.;2020). There is no relevant research providing solutions for verifiable similarity queries. In this paper;we propose a scheme VSQ to support verifiable similarity queries in blockchain databases. We focus on similarity range queries and similarity top- k queries. The goal is to enable efficient query processing and support the verifiability of the soundness and completeness of the query results. Specifically;we first propose a baseline solution that uses MHT as the base ADS and adds pointers to the leaf nodes to realize the query. Next;we improve the query efficiency and reduce the storage overhead using minhash instead of the original data in MHT as an upgrade solution (named baseline+) to the baseline solution. Finally;we propose VSQ;which improves query efficiency by introducing locality-sensitive hashing (LSH) and using merkle bucket tree (MBT) as the base ADS. The verifiability of query results is realized using MBT to store data. The main contributions of this paper are as follows: • To the best of our knowledge;this is the first work that addresses the verifiable similarity query in blockchain. We propose a verifiable query scheme based on the Merkle bucket tree;aiming at avoiding traversing all the sets and reducing the query time. • We propose the VSQ scheme for verifiable similarity queries;which combines minhash;LSH;and the Merkle bucket tree to improve query efficiency. Each leaf node of a Merkle bucket tree represents a hash bucket;which achieves the verifiability of the soundness and completeness of the query results. • We present the similarity query algorithm and the query result verification algorithm. The performance and security of the scheme are analyzed in detail. • We have conducted extensive experimental tests on the proposed method. The experimental results of the synthetic dataset show that for similarity range queries;the query time and VO size of our proposed VSQ are reduced by two orders of magnitude compared to the Baseline. For similarity top- k queries;the query time of VSQ is reduced by two orders of magnitude;and the VO size is reduced by three orders of magnitude compared to the Baseline. Experimental results of real datasets show that the query time and VO size for both queries are reduced by two orders of magnitude compared to the Baseline approach. To the best of our knowledge;this is the first work that addresses the verifiable similarity query in blockchain. We propose a verifiable query scheme based on the Merkle bucket tree;aiming at avoiding traversing all the sets and reducing the query time. We propose the VSQ scheme for verifiable similarity queries;which combines minhash;LSH;and the Merkle bucket tree to improve query efficiency. Each leaf node of a Merkle bucket tree represents a hash bucket;which achieves the verifiability of the soundness and completeness of the query results. We present the similarity query algorithm and the query result verification algorithm. The performance and security of the scheme are analyzed in detail. We have conducted extensive experimental tests on the proposed method. The experimental results of the synthetic dataset show that for similarity range queries;the query time and VO size of our proposed VSQ are reduced by two orders of magnitude compared to the Baseline. For similarity top- k queries;the query time of VSQ is reduced by two orders of magnitude;and the VO size is reduced by three orders of magnitude compared to the Baseline. Experimental results of real datasets show that the query time and VO size for both queries are reduced by two orders of magnitude compared to the Baseline approach. This paper is organized as follows. Section 2 introduces the related work on blockchain technology and verifiable queries. Section 3 introduces the preparatory knowledge and problem definition. Section 4 provides an in-depth discussion of the ADS based on the Merkle bucket tree and its design principles. Section 5 shows the experimental results to demonstrate the effectiveness and superior performance of the proposed method in this paper. Finally;the full paper is summarized in Section 6;and future research directions are explored. We conducted a comparison between the proposed VSQ;the Baseline scheme;and the Baseline+ scheme. Synthetic datasets. To generate a set of 5000 keywords;we utilized Python’s nltk library. Subsequently;we randomly assigned keywords from the set to each data object. The dataset size is in [128;16384]. The dimensions of the keyword set are 1500;and 3500. The default value is indicated by bolding. Real datasets. We also used three real-world
期刊:
Expert Systems with Applications
ISSN:
0957-4174
年:
2025
页码:
127815
机构署名:
本校为第一且通讯机构
院系归属:
计算机与通信工程学院
摘要:
Blockchain serves as the basis for secure and distributed applications like decentralized finance (DeFi). Light nodes can help reduce the storage and computation burden of full nodes in the blockchain by storing only block headers instead of full blocks. The resource constraint user runs a light node and sends the query request to the full node to find the answer. However, this approach raises security concerns, as query results from the full node may be altered or incomplete. In this paper, we explore verifiable similarity queries, commonly used in text mining, within the context of blockchai...

反馈

验证码:
看不清楚,换一个
确定
取消

成果认领

标题:
用户 作者 通讯作者
请选择
请选择
确定
取消

提示

该栏目需要登录且有访问权限才可以访问

如果您有访问权限,请直接 登录访问

如果您没有访问权限,请联系管理员申请开通

管理员联系邮箱:yun@hnwdkj.com