Extracting a collection from Sedna XML DB

This post is actually based on a kind of epic-fail story. Initially the task was simply to rename a collection in Sedna XML DB, and the solution is as primitive as the RENAME COLLECTION statement of the Sedna Data Definition Language. But I'm probably too enthusiastic about writing Bash scripts in Linux, so I missed the single-statement solution and wrote a bunch of scripts to perform the same task via an extract-and-load procedure. The scripts can still be quite valuable for more complex tasks, like moving a collection between XML DB installations (e.g. from the Production to the Test environment) or merging collections. So my solution follows below.
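
For reference, the missed single-statement solution would look something like the line below. This is a sketch from memory, so verify the exact RENAME COLLECTION syntax against the Sedna DDL reference for your version; the collection names are the ones used by the scripts in this post.

# Single-statement rename via Sedna DDL (assumed syntax, verify first)
/appl/sedna/bin/se_term -query "RENAME COLLECTION 'products' INTO 'ontoML/products'" mydatabase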

Extracting a single file
It's always wise to modularize code and divide a task into smaller parts. First, we need a script for extracting a single file. It is parameterized with a file name and a collection name.

I also address another essential problem here: the safety of file names. It's not a common problem, but we do have files whose names in the XML DB contain characters that are illegal in real file names (e.g. the forward slash '/'); Sedna XML DB stores such names without complaint. The solution is to save each extracted file under a name derived from the current timestamp and to record the mapping to the original file name in a separate LOAD_FILE. This mapping will be used later for loading the files back into the XML DB.
extract_file.sh:
#!/bin/sh
# External parameters
FILE_NAME=$1
COLLECTION_NAME=$2

TARGET_PATH=./products
LOAD_FILE=load_file.list
SEDNA_TERM=/appl/sedna/bin/se_term
XQUERY_STRING="doc(\"$FILE_NAME\",\"$COLLECTION_NAME\")"

# Creates the target path directories if needed
mkdir -p "$TARGET_PATH"
# Safe file name built from a timestamp to avoid illegal characters and duplicates
SAFE_FILE_NAME=$(date +%Y%m%d%H%M%S%N).xml

# Extracts a single file from the XML DB
$SEDNA_TERM -query "$XQUERY_STRING" -output "$TARGET_PATH/$SAFE_FILE_NAME" mydatabase
# Saves the mapping from the extracted file path to the real file name
echo "'$TARGET_PATH/$SAFE_FILE_NAME' '$FILE_NAME'" >> "$LOAD_FILE"

Extracting a collection
The second part is a script for extracting a complete collection. It simply iterates over the files stored in the XML DB collection and executes the above script on each file. I use the FILENAMES_XQUERY query to retrieve the list of file names in the collection.
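
Run on its own, FILENAMES_XQUERY simply prints one document name per line; for our data the output starts like this (the same names appear later in the sample load_file.list):

74HC4538D.xml
74HC4538D/S206.xml
74HC4538D/S242.xml
...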

I also track the operation progress using the same counter variable and print_status function approach that I initially used in this post.
extract_files.sh:
#!/bin/bash

# This function writes a status message to both stdout and $OUTPUT_FILE
function print_status {
  echo ">>> Extracted $counter files, time: `date`" | tee -a $OUTPUT_FILE
}

OUTPUT_FILE=extract_files.log
COLLECTION_NAME=products
SEDNA_TERM=/appl/sedna/bin/se_term
FILENAMES_XQUERY="for \$i in collection(\"$COLLECTION_NAME\") return document-uri(\$i)"

echo "" > $OUTPUT_FILE

counter=0
print_status

for file in `$SEDNA_TERM -query "$FILENAMES_XQUERY" mydatabase`
do
  # Extracts a single file from the collection
  ./extract_file.sh "$file" $COLLECTION_NAME

  let "counter = $counter + 1"
  if (( $counter % 100 == 0 )); then
    print_status
  fi
done

print_status
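
One caveat: the for loop above relies on shell word splitting, so it would silently mangle document names containing spaces. Our names only contain characters like '/', which don't split, but a more defensive sketch would read the query output line by line. Process substitution, a Bash feature, keeps the counter in the parent shell, unlike a plain pipe:

# Defensive variant: one document name per line, safe for names with spaces
while read -r file
do
  ./extract_file.sh "$file" $COLLECTION_NAME
done < <($SEDNA_TERM -query "$FILENAMES_XQUERY" mydatabase)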

Loading a collection
The third part is loading the files back into the XML DB. It has a similar iterative structure, with the same progress-tracking technique as above. The difference is that we iterate over the lines of the LOAD_FILE. For each line, file_path_and_name, we execute a LOAD statement to upload the file into the collection (pretty much the same code as in this post). If I needed to merge collections, I'd consider deleting the existing file first (see the sketch after the script) or running an XQuery check against the existing file.
load_files.sh:
#!/bin/bash

# This function writes a status message to both stdout and $OUTPUT_FILE
function print_status {
  echo ">>> Loaded $counter files, time: `date`" | tee -a $OUTPUT_FILE
}

OUTPUT_FILE=load_files.log
COLLECTION_NAME=ontoML/products
SEDNA_TERM=/appl/sedna/bin/se_term
LOAD_FILE=load_file.list

echo "" > $OUTPUT_FILE

counter=0
print_status

# Creates a collection (or shows error message if it already exists)
$SEDNA_TERM -query "CREATE COLLECTION '$COLLECTION_NAME'" mydatabase >> $OUTPUT_FILE

while read -r file_path_and_name
do
  # Uploads a single file into the collection
  $SEDNA_TERM -query "LOAD $file_path_and_name '$COLLECTION_NAME'" mydatabase >> $OUTPUT_FILE

  let "counter = $counter + 1"
  if (( $counter % 100 == 0 )); then
    print_status
  fi
done <$LOAD_FILE

print_status
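
For the merge scenario mentioned above, a minimal 'delete first' sketch could go inside the while loop, just before the LOAD statement. The DROP DOCUMENT ... IN COLLECTION syntax is written from memory, so verify it against the Sedna DDL reference; it reports an error for documents that don't exist yet, which is harmless here:

# Merge sketch (assumed DDL syntax, verify against the Sedna docs).
# eval undoes the single quotes in the line read from LOAD_FILE,
# leaving the saved file path in $1 and the original document name in $2.
eval set -- $file_path_and_name
$SEDNA_TERM -query "DROP DOCUMENT '$2' IN COLLECTION '$COLLECTION_NAME'" mydatabase >> $OUTPUT_FILE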

Results
Finally, I used a simple script to combine everything:
run.sh:
#!/bin/sh

echo "### Started extraction at `date`"
./extract_files.sh
echo "### Finished extraction, started upload at `date`"
./load_files.sh
echo "### Finished upload at `date`"
Here are snippets of the generated files:
extract_files.log:
>>> Extracted 0 files, time: Fri Oct 26 16:34:11 CEST 2012
>>> Extracted 100 files, time: Fri Oct 26 16:34:18 CEST 2012
>>> Extracted 200 files, time: Fri Oct 26 16:34:25 CEST 2012
...
>>> Extracted 7777 files, time: Fri Oct 26 16:43:36 CEST 2012
load_file.list:
'./products/20121026163412406104000.xml' '74HC4538D.xml'
'./products/20121026163412463021000.xml' '74HC4538D/S206.xml'
'./products/20121026163412555510000.xml' '74HC4538D/S242.xml'
...
load_files.log:
>>> Loaded 0 files, time: Fri Oct 26 16:43:36 CEST 2012
UPDATE is executed successfully
Bulk load succeeded
...
Bulk load succeeded
>>> Loaded 7777 files, time: Fri Oct 26 17:11:18 CEST 2012
That's it. At least I got a nice article out of the time I wasted by not using the RENAME COLLECTION statement :)
