VSQ: Enabling efficient and verifiable similarity queries in blockchain databases

首页 > 成果 > 详情

认领

导出

Link by DOI

反馈

作者信息关键词期刊信息基础信息归属信息摘要

成果类型：

期刊论文

作者：

Bo Yin*;Yihu Liu;Binyao Xu

通讯作者：

Bo Yin

作者机构：

[Bo Yin; Binyao Xu] School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China

School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China

[Yihu Liu] School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China<&wdkj&>School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China

通讯机构：

[Bo Yin] S

School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China

语种：

英文

关键词：

In recent years;with the rise of web 3.0 (Garcia et al.;Hendler;2009) applications such as meta-universe and digital collections;blockchain as the underlying technology has gained extensive attention from researchers. Blockchain is a distributed database in which nodes do not trust each other. It ensures data security through the use of a hash chain;consensus mechanism;and other technologies. The hash chain prevents data tampering;while the consensus mechanism ensures that all nodes store identical copies. Additionally;blockchain is characterized by decentralization and traceability. Due to these features;blockchain technology has found wide applications;particularly in fields like finance (Treleaven;Brown;& Yang;2017) and healthcare (Revathi and Manikandan;Wang et al.;where decentralized networks store increasing amounts of data. Consequently;the issue of data querying has become a focus of research. In a blockchain system;there are two types of nodes: full nodes and light nodes (as shown in Fig. 1). Full nodes store all data from the blockchain;including block headers and bodies;and they verify transactions and validate blocks. This results in increased storage demands on them. As of April 2024;the amount of data in the full nodes of Bitcoin is 564.04 GB;and the amount of data in the full nodes of Ethereum is 994 GB. On the other hand;light nodes only store block headers;which significantly reduces their storage pressure since the size of a block header is only 80 bytes. When users join the blockchain as a full node;they can easily query the data stored on the chain. However;this brings huge storage and computation pressure to the user. To alleviate this situation;users can join the blockchain as light nodes. In this case;a query request needs to be sent to the service provider (SP) that runs a full node. Since nodes do not trust each other;the query results returned by the full node may be tampered with or the complete query results may not be returned. Therefore;the user needs to verify the soundness and completeness of the query results. To address this issue;the miner constructs the ADS for the data in each block during the block-packing process. The root hash of the ADS is stored in the block header. The full node generates a verification object (VO) along with the results by querying the ADS. Both the VO and query results are then returned to the user. The user verifies the integrity of the query result through the VO and the root digest obtained from the block header. Fig. 1 shows an example where the user sends a query request Q to the SP. The SP queries ADS and returns VO and query result R to the user. In this paper;we investigate verifiable similarity queries within the context of blockchain networks. Our focus will be on two specific types of queries: similarity range queries and similarity top- k queries. In practical scenarios;users often seek to retrieve data objects that exhibit similarity to a given set of keywords of interest. For example;a user may initiate a similarity range query such as “Querying data objects with the similarity to {male;USA;California;programming enthusiasts;enjoy traveling} greater than 0.6”. Meanwhile;a similarity top- k query could involve a request for “Obtaining 5 data objects with the highest similarity to {female;USA;New York;yoga enthusiasts;enjoying reading}”. In the context of such similarity queries;it becomes imperative for the blockchain system to calculate the similarity between the user-provided keyword set and each data object stored within the blockchain. The efficient execution of similarity queries;while upholding the integrity and completeness of the query outcomes;holds significant importance for both the broader adoption of blockchain technology and the enhancement of user experience. Two examples of similarity queries are as follows. Example 1 Consider a platform for accessing literature that utilizes blockchain technology to ensure copyright protection and track the origin of the literature. On this platform;papers are published and can be retrieved and cited by other researchers. To help researchers find relevant literature quickly;a keyword query like Q s =“{ machine learning;smart grid;blockchain };number=50 ” can be used to return the 50 most similar literature items. Consider a platform for accessing literature that utilizes blockchain technology to ensure copyright protection and track the origin of the literature. On this platform;papers are published and can be retrieved and cited by other researchers. To help researchers find relevant literature quickly;a keyword query like Q s =“{ machine learning;smart grid;blockchain };number=50 ” can be used to return the 50 most similar literature items. Example 2 A blockchain-based smart healthcare system stores patients’ medical records and health insurance information. When seeking advice on medical treatment;patients may want to know the treatment options received by other patients with similar symptoms. For this;the patient can issue a query like Q s = “{ diabetes;hypertension;elderly patients };threshold=0.7 ” to find treatment plans of patients with a similarity greater than 0.7. The similarity threshold of 0.7 can be adjusted according to user needs. A blockchain-based smart healthcare system stores patients’ medical records and health insurance information. When seeking advice on medical treatment;patients may want to know the treatment options received by other patients with similar symptoms. For this;the patient can issue a query like Q s = “{ diabetes;hypertension;elderly patients };threshold=0.7 ” to find treatment plans of patients with a similarity greater than 0.7. The similarity threshold of 0.7 can be adjusted according to user needs. The similarity query can also be utilized in e-commerce-based blockchain platforms. For example;the non-fungible token (NFT) market is a popular blockchain application that can recommend digital collections based on user interest tags;such as profile pictures and games. In other e-commerce platforms;users can discover products with similar styles through similarity queries that rely on their purchase history or preferences. For the above example;an intuitive solution is to construct a Merkel hash tree (MHT) based on data objects in the block. However;the MHT does not support efficient data queries because it lacks a search key. Consequently;many existing query schemes combine the MHT with various database indexes;such as B+-trees;R-trees;and prefix trees;to enhance search efficiency. The keywords associated with data objects stored in the leaf nodes can serve as the search key. Despite this;the approach still suffers from low query efficiency;as it cannot directly identify the leaf nodes containing the relevant query results. During query processing at a node;it is necessary to examine every subset of the search key. Even if a subset that meets the query conditions is found;it may not correspond to an actual set residing within a descendant node. This limitation hinders effective node pruning during the query process. Therefore;our goal is to reduce the query time of ADS. The challenge is to improve query efficiency. Existing research on blockchain verifiable queries focuses on range queries (Wang;Xu;et al.;Xu et al.;Zhang et al.;keyword queries (Xu et al.;Zhang et al.;and existence queries (Dai et al.;2020). There is no relevant research providing solutions for verifiable similarity queries. In this paper;we propose a scheme VSQ to support verifiable similarity queries in blockchain databases. We focus on similarity range queries and similarity top- k queries. The goal is to enable efficient query processing and support the verifiability of the soundness and completeness of the query results. Specifically;we first propose a baseline solution that uses MHT as the base ADS and adds pointers to the leaf nodes to realize the query. Next;we improve the query efficiency and reduce the storage overhead using minhash instead of the original data in MHT as an upgrade solution (named baseline+) to the baseline solution. Finally;we propose VSQ;which improves query efficiency by introducing locality-sensitive hashing (LSH) and using merkle bucket tree (MBT) as the base ADS. The verifiability of query results is realized using MBT to store data. The main contributions of this paper are as follows: • To the best of our knowledge;this is the first work that addresses the verifiable similarity query in blockchain. We propose a verifiable query scheme based on the Merkle bucket tree;aiming at avoiding traversing all the sets and reducing the query time. • We propose the VSQ scheme for verifiable similarity queries;which combines minhash;LSH;and the Merkle bucket tree to improve query efficiency. Each leaf node of a Merkle bucket tree represents a hash bucket;which achieves the verifiability of the soundness and completeness of the query results. • We present the similarity query algorithm and the query result verification algorithm. The performance and security of the scheme are analyzed in detail. • We have conducted extensive experimental tests on the proposed method. The experimental results of the synthetic dataset show that for similarity range queries;the query time and VO size of our proposed VSQ are reduced by two orders of magnitude compared to the Baseline. For similarity top- k queries;the query time of VSQ is reduced by two orders of magnitude;and the VO size is reduced by three orders of magnitude compared to the Baseline. Experimental results of real datasets show that the query time and VO size for both queries are reduced by two orders of magnitude compared to the Baseline approach. To the best of our knowledge;this is the first work that addresses the verifiable similarity query in blockchain. We propose a verifiable query scheme based on the Merkle bucket tree;aiming at avoiding traversing all the sets and reducing the query time. We propose the VSQ scheme for verifiable similarity queries;which combines minhash;LSH;and the Merkle bucket tree to improve query efficiency. Each leaf node of a Merkle bucket tree represents a hash bucket;which achieves the verifiability of the soundness and completeness of the query results. We present the similarity query algorithm and the query result verification algorithm. The performance and security of the scheme are analyzed in detail. We have conducted extensive experimental tests on the proposed method. The experimental results of the synthetic dataset show that for similarity range queries;the query time and VO size of our proposed VSQ are reduced by two orders of magnitude compared to the Baseline. For similarity top- k queries;the query time of VSQ is reduced by two orders of magnitude;and the VO size is reduced by three orders of magnitude compared to the Baseline. Experimental results of real datasets show that the query time and VO size for both queries are reduced by two orders of magnitude compared to the Baseline approach. This paper is organized as follows. Section 2 introduces the related work on blockchain technology and verifiable queries. Section 3 introduces the preparatory knowledge and problem definition. Section 4 provides an in-depth discussion of the ADS based on the Merkle bucket tree and its design principles. Section 5 shows the experimental results to demonstrate the effectiveness and superior performance of the proposed method in this paper. Finally;the full paper is summarized in Section 6;and future research directions are explored. We conducted a comparison between the proposed VSQ;the Baseline scheme;and the Baseline+ scheme. Synthetic datasets. To generate a set of 5000 keywords;we utilized Python’s nltk library. Subsequently;we randomly assigned keywords from the set to each data object. The dataset size is in [128;16384]. The dimensions of the keyword set are 1500;and 3500. The default value is indicated by bolding. Real datasets. We also used three real-world

期刊：

Expert Systems with Applications

ISSN：

0957-4174

年：

2025

页码：

127815

DOI：

10.1016/j.eswa.2025.127815

机构署名：

本校为第一且通讯机构

院系归属：

计算机与通信工程学院

摘要：

Blockchain serves as the basis for secure and distributed applications like decentralized finance (DeFi). Light nodes can help reduce the storage and computation burden of full nodes in the blockchain by storing only block headers instead of full blocks. The resource constraint user runs a light node and sends the query request to the full node to find the answer. However, this approach raises security concerns, as query results from the full node may be altered or incomplete. In this paper, we explore verifiable similarity queries, commonly used in text mining, within the context of blockchai...

反馈

产权有误：本人成果被他人认领

数据有误：数据基本信息有误

归属有误：成果的院系归属、机构署名归属有误

其他原因：

验证码：

看不清楚，换一个

确定

取消

成果认领

标题：

用户	作者	通讯作者	--
	请选择	请选择	--

确定

取消

VSQ: Enabling efficient and verifiable similarity queries in blockchain databases

反馈

成果认领

提示

该栏目需要登录且有访问权限才可以访问