eLink Nomad Offline Configuration

Introduction


The eLink system provides for a 'nomad' configuration, which comprises a secondary, child database server that is periodically synchronised with the primary, parent database. This child database may be running on a mobile computer with intermittent connectivity to the parent database. The intention of this design is to give the user of the child database continuous access to a local, secondary (nomad) copy of the eLink database, whilst ensuring that the two databases remain synchronised.

The eLink documentation regarding the setup of a nomad system is quite confusing, particularly as it is split between two documents:
  1. the eLink Nomad Configuration document describes the principles of the system
  2. the eLink Service Applications document includes a description of the offline replication service, which is helpful in automating the replication process.

In this document, we assume that a suitable child database has been created as described in the eLink Nomad Configuration document, and that this database has been successfully copied to the child eLink server. It is further assumed that this child server is operating correctly, as is the parent eLink server.

At this point, the main issue for administrators is to define the pathways by which data is passed between the two databases/servers. The eLink offline replication service provides for three main possibilities:
  1. via an FTP server
  2. via an email account
  3. via a shared folder

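All of these solutions have their merits and their drawbacks. Having tried the offline replication service with all three approaches, it seems clear that the FTP server approach is the best supported. However, due to limitations of the eLink offline replication tool, it will be necessary to run the tool in network drive mode and to handle the FTP transfers separately, as described in the Offline Replication Architecture section of this document.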

Tools required to set up offline replication:
  1. eLink replication console - this is a standalone console which can be used to instigate the eLink part of the replication process manually. This process can be run on both the primary and secondary servers.
  2. eLink offline replication tool - this is a tool which needs to be run on both the parent and child servers, where it accepts a data packet from the other (parent or child) eLink server, processes it locally through the eLink replication process and then sends the resultant data packet back to the other eLink server.
  3. an FTP client, such as Filezilla
It is recommended that the replication is set up in a phased manner, as follows:
  1. Creation of Child Database.
  2. Initial, manual online replication between parent and child databases on parent elink server.
  3. Installation of eLink etc on child server.
  4. Copying of Child database to child server.
  5. Creation of replication infrastructure.
  6. Initial, manual offline replication
  7. Development of automated replication
  8. Using Final system
These phases are described in the sections below. Before that, we describe the principles on which the system depends, followed by a description of the tools used to build the offline replication system, together with a description of the role they play in the system. Then we discuss the architecture of the offline replication system being described here.

Offline Replication Principles

It is necessary to establish a two-way communication pathway to allow the synchronisation data to flow. The offline replication data comprises two packet types:
  1. the GLB packets, which flow from the parent server to the child server.
  2. the LOC packets, which flow from the child server to the parent server.
These data packets essentially list the changes that have occurred to the database since the previous packet, so it is important that the packets are processed in the correct order. For this reason, every data packet is time-stamped. It is also important for these two streams to be synchronised, so the synchronisation process runs as follows:

  1. The parent database generates a GLB packet
  2. The GLB packet is transferred to the child database
  3. The child database processes the GLB packet
  4. The child database generates a LOC packet
  5. The LOC packet is transferred to the parent database
  6. The parent database processes the LOC packet
  7. The parent database generates another GLB packet
  8. The process then repeats endlessly from Step 2.
So, in principle, there is only ever one data packet live in the system. The location of that packet determines which step of the synchronisation process is being performed. From the perspective of the two servers, they act like tennis players who receive a ball from their opponent and then hit it back again.
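The exchange described above can be sketched as a small simulation. This is only an illustrative Python model, not part of the eLink tooling: the function name and the simple packet numbers are assumptions made for the example (the real packets are time-stamped rather than numbered).

```python
def simulate_exchange(cycles):
    """Return the ordered list of (server, action, packet) steps.

    A packet is modelled as (kind, number); the number is a hypothetical
    stand-in for the time-stamp carried by a real packet.
    """
    events = []
    # Step 1: the parent serves the first GLB packet (performed once).
    packet = ("GLB", 1)
    events.append(("parent", "generate", packet))
    for _ in range(cycles):
        # Steps 2-4: the child processes the GLB packet and answers with a LOC packet.
        events.append(("child", "process", packet))
        packet = ("LOC", packet[1])
        events.append(("child", "generate", packet))
        # Steps 5-7: the parent processes the LOC packet and answers with the next GLB packet.
        events.append(("parent", "process", packet))
        packet = ("GLB", packet[1] + 1)
        events.append(("parent", "generate", packet))
    return events


if __name__ == "__main__":
    for server, action, (kind, number) in simulate_exchange(2):
        print(f"{server:6} {action:8} {kind} #{number}")
```

Because the single `packet` variable is overwritten at every step, the model makes the key property visible: at any moment there is exactly one live packet, and whichever server holds it decides what happens next.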

There are two issues that arise from the above system. Firstly, how to start the offline replication. Secondly, what happens if a data packet goes astray.

Initialisation

It should be noted that in the above process, the first step is only ever performed once. What distinguishes it from step 7, where the next GLB packet is generated, is that step 7 is performed in response to step 6 (the receipt and processing of the previous LOC packet), whereas in step 1 there is no previous LOC packet to process. In our tennis analogy, step 1 corresponds to the serve, where the ball is put into play by one of the players. Step 1 is a manually instigated process. Once step 1 has been performed, the remainder of the process can be automated.

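For step 1 to work, we have to tell the parent database server to generate a GLB packet without waiting for a previous LOC packet. The eLink documentation calls the normal operation of the parent database server, in which it waits for a LOC packet before generating the next GLB packet, 'forced synchronisation', and there is a flag which can be set to tell the system whether or not to run in 'forced synchronisation' mode. For step 1, this flag is cleared; it is set again before the first LOC packet arrives back from the child.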

Fault Recovery

If the offline replication process fails for some reason, it is likely that replication will simply cease. There are three possible reasons why this could happen:
  1. A data packet is lost in the ether
  2. A data packet becomes corrupted
  3. A data packet is processed out of sequence
When the process fails, it is tempting to try to locate the erroneous data packet and reissue it so that the process can continue. Indeed, the offline replication system retains a history of the data packets sent and received by each of the servers. However, it can be difficult to retrieve the correct packet. The main problem when attempting to restart the process is getting the time-stamps of the new data packets back in sequence. Any gaps in the time-stamps will be recognised by the servers, causing those packets to be rejected. So, it is not simply a case of repeating step 1. We actually need to wind the synchronisation time back a bit. The eLink offline replication system includes facilities to back-date the replication and the corresponding time-stamps.

There is one other important consideration when attempting to recover from a fault condition. It is important to clear out any data packets in the system before attempting to restart it.

So, summarising the above, in the event of a fault developing in the offline replication process, the following steps should be undertaken:

  1. Delete any data packets remaining in the system, both on the parent and child database servers or at any point in between.
  2. Set the synchronisation time back a bit (typically one day), on the parent server
  3. Take the parent server out of forced synchronisation mode
  4. Cause the parent server to generate the first GLB packet
  5. Put the parent server back into forced synchronisation mode, before the first LOC packet arrives back from the child database
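The first recovery step, clearing out stray packets, is easy to get wrong by hand. A minimal Python sketch of it is given below, assuming that packets follow the `*_GLB.pak` / `*_LOC.pak` naming used by the scripts later in this document. The function name is hypothetical, the folders to clean must be supplied by the administrator, and the copy on the FTP server still has to be removed separately.

```python
from pathlib import Path


def clear_stale_packets(folders, patterns=("*_GLB.pak", "*_LOC.pak")):
    """Delete leftover packet files from each folder and return the paths removed.

    Only files directly inside each folder are considered, so any
    sub-folders holding packet history are left untouched.
    """
    removed = []
    for folder in folders:
        for pattern in patterns:
            for packet in sorted(Path(folder).glob(pattern)):
                packet.unlink()
                removed.append(packet)
    return removed
```

It would be called with the local and remote replication folders of the server in question, and the returned list gives a record of what was actually deleted before the restart is attempted.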

Offline Replication Tools

This section gives a brief description of the tools needed to build and maintain an offline replication system.

eLink Replication Console

This is a stand-alone tool that causes manual replication to take place. On the parent database server, this tool can be used to:
  1. Control forced synchronisation mode
  2. Control the time-stamps of the synchronisation packets
  3. Accept a LOC packet
  4. Run the replication on the parent database
  5. Generate a GLB packet
On the child database server, this tool can be used to:
  1. Accept a GLB packet
  2. Run the replication on the child database
  3. Generate a LOC packet
If the offline replication process fails, it is necessary to use the replication console on the parent database server to restart it. One other important aspect of the replication console is that it gives a full log of the replication, indicating the nature of any problems that arise. This can be useful for debugging and diagnostic purposes. For this reason, it is recommended that this tool is used during the development of an offline replication system, only replacing it with the offline replication tool when the architecture has been made operational.

eLink Offline Replication Tool

This tool needs to be run on both the parent and child database servers. On either server, it:
  1. detects the arrival of a data packet from the other server
  2. submits that data packet to the replication system
  3. runs the replication system
  4. posts the outgoing data packet back in the direction of the other server
So, this tool is one of the building blocks of the automated offline replication system. However, its diagnostics are more limited than those of the replication console, so it can be more difficult to determine why a failure has occurred.

The offline replication tool can be configured to work with several communication media. Unfortunately, we found none of them to be satisfactory for our purposes. However, the use of a 'network drive' does provide a mechanism for operation. This will be explained later in the Offline Replication Architecture section.

DOS FTP Client

We use the DOS FTP client, built into the Windows command line interface, to transfer the data packets between the two servers. This client is suitable for automation, but is not very user-friendly.

Graphical FTP Client

A graphical FTP client, such as Filezilla, can be helpful on both the parent and child database servers as a means of manually transferring data between the two servers. It is helpful during the development of the automated replication system because it is much more user-friendly, but it is not suitable for automation as it does not offer the full range of commands that automation requires.

Miscellaneous Tools

In addition to the above, some DOS scripting is required, along with knowledge of how to create scheduled tasks under Windows. We also make use of curl, a command-line data transfer tool, on our parent database server. This was necessary because we could not get the DOS FTP client to operate correctly on that server.

Offline Replication Architecture

The following diagram illustrates the generic architecture of the proposed system:

[Figure: Generic Offline Replication Architecture]
In the above diagram, the offline replication system is shown as comprising three servers: the parent database server, the child database server and an FTP server. On the parent database server, we find the parent eLink database, together with a local replication folder and a remote replication folder. These three data stores are connected by a replication process, a file copy process and an FTP client. A similar architecture is found on the child database server.

The above diagram is intended to be fairly generic, in terms of the actual processes used to make the connections between the data stores. We will show further diagrams that will be more specific later.

Perhaps the surprising thing in this diagram is the need for both a local and a remote replication folder on both servers. Why have two folders? Surely one would suffice. The reason for two folders is that the eLink Offline replication tool is not fully functional, and cannot cope with all possible network configurations. Actually, if the eLink Offline replication tool could be made to work as it is intended to work, it would be able to fulfill the needs of all the processes on the system, and the architecture would reduce to the following:

[Figure: Simple Replication Architecture]

In this simplified architecture, the eLink Offline Replication tool provides all the connectivity necessary to make the system operational. Indeed, this is the intended architecture for use with the eLink Offline Replication tool in FTP mode, and a similar architecture holds for email and network drive modes. However, experience has shown that the eLink Offline Replication tool may not work for all configurations. So, we have had to fall back to the generic solution shown in the first diagram.

When building a particular configuration, it is worth building it in phases, starting with a fully manual system, and evolving it towards a fully automated system. The fully manual system can be used to
  1. validate the network infrastructure,
  2. check that the interprocess communications are working correctly,
  3. ensure that the parent and child databases can be synchronised through offline replication,
  4. avoid race conditions and other timing difficulties that can arise in a fully automated system.
Having built and tested the manual system, it is possible to start the process of automation by deploying the eLink Offline replication tool to fulfill part of the process. Once that can be shown to be working, the next part of the system can be developed, until the entire system is automated.

Creation of Child Database

This topic is covered in the eLink Nomad Configuration document.

Initial, manual online replication between parent and child databases on parent elink server

This topic is covered in the eLink Nomad Configuration document.

Installation of eLink etc on child server

This topic is covered in the eLink Nomad Configuration document.

Copying of Child database to child server

This topic is covered in the eLink Nomad Configuration document.

Creation of replication infrastructure

The following elements need to be created:
  1. An FTP server site and account. This could be part of an existing website. It is important to note the access and login details.
  2. An FTP folder on the FTP server site. It is recommended that this folder should be created specifically for this purpose, rather than using some other folder on the FTP server as this simplifies identification of the offline replication data packets.
  3. A local replication folder on the parent server.
  4. A remote replication folder on the parent server.
  5. A local replication folder on the child server.
  6. A remote replication folder on the child server.
Having created the above elements, it is necessary to configure the elink replication console and FTP client ready for operation.

On the parent server, start the eLink offline replication administrator tool, log in to the parent database and then specify a network drive for the data transfer, giving the path to the remote replication folder on the parent server as the server directory. Then, under replication, specify the local replication folder. DO NOT ATTEMPT TO START REPLICATION AT THIS TIME.

On the child server, start the eLink offline replication administrator tool, log in to the child database, then repeat the above setup for the child database, specifying the remote replication folder and local replication folder on the child server. DO NOT ATTEMPT TO START REPLICATION AT THIS TIME.

On the parent server, start the replication console, log in to the parent database, and choose list view. Then select the child database to be replicated and:
  1. select offline replication
  2. clear forced synchronisation
DO NOT ATTEMPT REPLICATION AT THIS TIME.

On the child server, start the replication console and log in to the child database. DO NOT ATTEMPT REPLICATION AT THIS TIME.

Note that the eLink replication administrator will create 'Inbox' and 'Sent' folders underneath the local replication folder, both on the parent and child servers. These folders are used to hold data packets that were previously transferred.

Start the graphical FTP tool on the parent database server, and connect to the FTP site. In the FTP tool, open the remote replication folder on the parent server and open the FTP folder on the FTP site.

Start the graphical FTP tool on the child database server, and connect to the FTP site. In the FTP tool, open the remote replication folder on the child server and open the FTP folder on the FTP site.

Everything should now be ready for a manual, offline synchronisation.

Initial, manual offline replication

It is helpful and instructive to run through the replication process in a fully manual manner for a few cycles to ensure that all parts of the process work correctly, and that all data pathways are properly established. The process is as follows:
  1. On the parent database server, go to the replication console->run view, and execute the replication. This should create a GLB packet in the local replication folder of the parent database server.
  2. In the parent database server's replication console, select list view and activate forced synchronisation, in anticipation of receipt of the first LOC packet.
  3. Using Windows Explorer, move the newly created GLB file from the local replication folder to the remote replication folder, making sure that the file is moved rather than copied so that the original is removed. In fact, it is important to move the files at every stage of this process rather than copying them: a data packet file should only exist in one location at any given time. This becomes important as the replication process is automated, because the presence of old data will confuse the replication processes.
  4. The GLB file should now be visible from the FTP client running on the parent database server. Use this FTP client to copy the file to the FTP folder on the FTP server. Then delete the GLB file from the remote directory on the parent database server.
  5. The GLB file should now be visible from the FTP client running on the child database server. Use this FTP client to copy the file to the child database server's remote replication folder. Then delete the file in the FTP folder using the FTP client.
  6. Using Windows Explorer on the child database server, move the GLB file from the remote replication folder to the local replication folder.
  7. On the child database server, go to the replication console->run view and execute the replication. This should remove the GLB packet from the local replication folder, replacing it with a LOC packet. At this point, we reverse the direction of data transfer, moving the LOC packet back in the direction of the parent database server, as follows.
  8. Move the LOC packet to the remote replication folder on the child server.
  9. FTP the LOC packet to the FTP folder, deleting the one in the remote replication folder.
  10. On the parent server, FTP the LOC packet from the FTP folder on the FTP site to the remote replication folder on the parent server. Delete the LOC packet on the FTP server.
  11. On the parent server, move the LOC packet to the local replication folder on the parent server.
  12. On the parent database server, go to the replication console->run view, and execute the replication. This should process the LOC packet and replace it with another GLB packet in the local replication folder of the parent database server.
  13. The process can now be repeated from step 3 onwards.
It is worth running through the complete cycle several times to ensure both replications are working correctly.
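The 'move, never copy' rule from the walk-through can be expressed as a small helper. The following is a sketch in Python under our own assumptions: the function name is hypothetical, and refusing to overwrite an existing file at the destination is a choice made for the example rather than documented eLink behaviour.

```python
import shutil
from pathlib import Path


def move_packet(packet_name, source_folder, target_folder):
    """Move one packet file so that it only ever exists in one location."""
    source = Path(source_folder) / packet_name
    target = Path(target_folder) / packet_name
    if target.exists():
        # An older packet is still waiting here; moving on top of it would
        # hide the fact that the previous packet was never processed.
        raise FileExistsError(f"{target} is already present; clear it before moving")
    shutil.move(str(source), str(target))
    return target
```

Used for the folder-to-folder steps of the cycle (3, 6, 8 and 11), a helper of this kind guarantees that the source copy is gone once the call returns, which is exactly the property the manual procedure relies on.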

Development of automated replication

Having set up the basic system, and shown it to be operational by a manual walk-through, we can now start to automate the system.

Using the Replication Administrator

It is important to stop the above manual sequence at step 10, such that a LOC packet is sitting in the remote replication folder of the parent database server.

At this point, try starting the automatic replication on the parent database server by pressing the run button in the replication administrator on the parent database server. It should effectively execute steps 11, 12 and 3, replacing the LOC packet with a new GLB packet.

Now the manual process can be followed to transfer the packet to the child server (steps 4 and 5).

Activating the automatic replication on the child database server (by pressing the run button on the replication administrator running on the child database server) will cause steps 6-8 to occur automatically, replacing the GLB packet with a LOC packet. Executing steps 9 and 10 manually will bring the LOC packet back to the parent database server.

At this point, the replication will automatically repeat steps 11, 12 and 3. And so the process will continue.

Child Database Server FTP Client Script


We are now going to prepare a script that will perform steps 5 and 9 of the data transfer process. Note that we only need to write a script that runs through the process once. We will later use the Windows scheduled task tool to run this script at periodic intervals.

For this, we need to use a scriptable FTP client. A graphical FTP client is not suitable for this purpose as it cannot be automated. So, we are going to use the basic FTP client provided by the Windows command line interpreter. This FTP client accepts commands from the keyboard, but we can also script them in advance in a file. A typical FTP script would look like:

open ftphost.com
username
password
binary
recv file
quit

Actually, we are going to be a bit more clever than that, by having our DOS script write the FTP script for us:

> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo send %temploc% %loc%
>> script.ftp echo quit

We can then execute the script by the command:

ftp -s:script.ftp > nul

The benefit of this approach is that we can define the site name, user name, password etc, once in the DOS script. Then we can use these for all the FTP commands to be executed.

The basic variables that define the setup are found in the first few lines of the script:

set ftpsite=www.ftphost.com
set user=username
set password=password
set folder=public_html/mbw2/elinknomad
set glb=MANAGER000268_GLB.pak
set loc=MANAGER000268_LOC.pak

rem we use temporary file names to minimise risk of race conditions
set temploc=temploc.pak
set tempglb=tempglb.pak

We assume that the script is stored in the remote folder of the child database server, and will be executed from there. So, we do not have to specify that folder in the script. The folder given in the fourth variable is the folder on the FTP server.

The glb and loc variables define the names of the two packets we will be passing through the system. The temploc and tempglb variables define the names of temporary files used during the data transfer. We use these temporary files to minimise the risk of race conditions occurring during the data transfer. Before discussing the potential race conditions, let's look at the first data transfer, where we move a GLB file from the FTP site to the remote replication folder...

rem ----------- GLB file handling --------------
rem attempt to copy GLB file from FTP site
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo recv %glb% %tempglb%
>> script.ftp echo quit
ftp -s:script.ftp > nul

rem if tempglb exists, rename it and delete FTP site copy
if not exist %tempglb% goto noglbfile
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo delete %glb%
>> script.ftp echo quit
ftp -s:script.ftp > nul
ren %tempglb% %glb%

:noglbfile

This script attempts to use the FTP client to download the GLB file from the FTP site. If it succeeds, it deletes the GLB file from the FTP site (again using the FTP client). We test for success or failure by looking for the downloaded file in the remote replication folder after attempting the download. This is the point where a race condition would exist if the file were downloaded under its final name. Remember that the Offline Replication Service is running, and could grab the GLB file from this folder at any time. So, it could do so between the moment we complete the download and the moment we execute the file presence test. If this were to happen, the file presence test would fail, and we would deduce that no GLB file had been downloaded. Consequently, we would fail to delete the GLB file from the FTP site. So, on the next run of this script, we would download the same GLB file again, causing the replication to fail because the packets would be out of sequence.

In order to avoid this race condition, we download the GLB file as a temporary file, a file that will not be grabbed by the replication service. We can then test for the presence of this file after the download attempt has finished. If successful, we can attempt to delete the GLB file on the FTP site, and rename the temporary file to the correct file name.
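The same download-then-rename pattern can be shown independently of the FTP client. The sketch below is in Python and makes several assumptions: `download` and `delete_remote` are hypothetical callables standing in for the two FTP operations, and the final rename uses `os.replace`, which (unlike the DOS `ren` command) overwrites an existing file of the same name.

```python
import os


def fetch_packet(download, delete_remote, packet_name, temp_name, folder="."):
    """Fetch a packet under a temporary name, then publish it under its real name.

    download(packet_name, temp_path) should leave no file behind when there
    is nothing to fetch; delete_remote(packet_name) removes the server copy.
    Returns True if a packet was transferred, False otherwise.
    """
    temp_path = os.path.join(folder, temp_name)
    final_path = os.path.join(folder, packet_name)
    download(packet_name, temp_path)
    # The replication service ignores the temporary name, so this presence
    # test cannot be invalidated by the service taking the file away.
    if not os.path.exists(temp_path):
        return False
    delete_remote(packet_name)
    # Only now does the packet appear under the name the service watches for.
    os.replace(temp_path, final_path)
    return True
```

The ordering matters: the remote copy is deleted before the rename, so by the time the replication service can see the packet there is no second copy left that could be downloaded again on the next run.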

The next part of the script is responsible for uploading a LOC file to the FTP site, deleting the original LOC file on successful completion. The process is similar to that described, and can be seen in the final script given below.

There are two more things worth adding to the script: logging and exception reporting. Once the script is complete, it is desirable that it runs silently unless a problem occurs. If a problem occurs, we need to be able to track down where it originated, so that any faults can be remedied. For this reason, some logging is desirable. This script creates a log file called ftplog.txt. The completed script is as follows:

@echo off
set ftpsite=www.ftphost.com
set user=username
set password=password
set folder=public_html/mbw2/elinknomad
set glb=MANAGER000268_GLB.pak
set loc=MANAGER000268_LOC.pak
rem we use temporary file names to minimise risk of race conditions
set temploc=temploc.pak
set tempglb=tempglb.pak
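rem count the lines in the existing log (find /v /c "" counts every line)
type ftplog.txt | find "" /v /c > ftpcount.txt
set /a cnt=0
for /f %%a in ('type ftpcount.txt') do set /a cnt=%%a
rem echo log has %cnt% lines
rem once the log exceeds 100 lines, back it up and start a new one
if %cnt% LEQ 100 goto done
rem ren fails if the target already exists, so remove any older backup first
if exist ftplogbak.txt del ftplogbak.txt
ren ftplog.txt ftplogbak.txt
> ftplog.txt echo Started new log %DATE% %TIME%
:done
rem set the old log aside so that this run's entries are written first
ren ftplog.txt ftplog1.txt
> ftplog.txt echo Started at %DATE% %TIME%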
rem ----------- GLB file handling --------------
rem attempt to copy GLB file from FTP site
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo recv %glb% %tempglb%
>> script.ftp echo quit
ftp -s:script.ftp > nul
rem if tempglb exists, rename it and delete FTP site copy
if not exist %tempglb% goto noglbfile
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo delete %glb%
>> script.ftp echo quit
ftp -s:script.ftp > nul
ren %tempglb% %glb%
>> ftplog.txt echo Copied GLB file from FTP server to nomad at %DATE% %TIME%
> good.txt echo Copied GLB file from FTP server to nomad at %DATE% %TIME%
:noglbfile
rem ------------ LOC file handling -------------
rem rename LOC file (if any) to temp LOC file
if not exist %loc% goto nolocfile
rem copy the LOC file to the ftp site
ren %loc% %temploc%
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo send %temploc% %loc%
>> script.ftp echo quit
ftp -s:script.ftp > nul
del %temploc%
>> ftplog.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
> good.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
:nolocfile
>>ftplog.txt type ftplog1.txt
del ftplog1.txt

The logging records four things:
  1. The time of each execution of the script
  2. When a GLB packet is handled
  3. When a LOC packet is handled.
  4. A record of the last successful run, when a data transfer occurred (stored in good.txt)
At present, there is no exception reporting in the script. This will be discussed later. The log file is constructed so that the most recent entries are put at the start of the file. Its size is also restricted: once it exceeds 100 lines, the log file is backed up and restarted so that it does not grow too large.
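The log handling is perhaps the least obvious part of the script, so here is the same idea as a short Python sketch. It is an illustration only: the function name is our own, the default file names simply mirror those used in the batch script, and the limit is counted in lines as the script's `find /v /c` test does.

```python
import os
from datetime import datetime


def prepend_log(entries, log_path="ftplog.txt", backup_path="ftplogbak.txt", max_lines=100):
    """Write this run's entries at the top of the log, rotating it when too long."""
    old_lines = []
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as log:
            old_lines = log.read().splitlines()
    if len(old_lines) > max_lines:
        # Keep the full old log as a backup (replacing any earlier backup)
        # and start again from a single marker line.
        os.replace(log_path, backup_path)
        old_lines = [f"Started new log {datetime.now():%Y-%m-%d %H:%M:%S}"]
    with open(log_path, "w", encoding="utf-8") as log:
        log.write("\n".join(list(entries) + old_lines) + "\n")
```

Writing the newest entries first means the most recent activity is visible as soon as the file is opened, at the cost of rewriting the whole file on every run, which is acceptable while the file is kept this small.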

Once the script has been written, it is best to test it directly from a DOS command line window. It should run correctly with its progress shown in the window. It should also be possible to repeat its execution any number of times, independent of what data packets are flowing through the system. After each run, check that the data packet(s), if any, have been handled correctly and that the logs are sensible.

Parent Database Server FTP Client Script


This script is very similar to the child database server FTP client script, except that the direction of data packet flow is reversed. In our particular case, we found difficulties in making the DOS FTP client work correctly on the parent server, so instead we used curl, a command-line data transfer tool, to perform the FTP operations. The function is the same, only the mechanism is different. Again, we have included logging facilities in the script.

set ftpsite=ftp://www.ftphost.com
set user=username:password
set folder=public_html/mbw2/elinknomad
set glb=MANAGER000268_GLB.pak
set loc=MANAGER000268_LOC.pak
rem we use temporary file names to minimise risk of race conditions
set temploc=temploc.pak
set tempglb=tempglb.pak

type ftplog.txt | find "" /v /c > ftpcount.txt
set /a cnt=0
for /f %%a in ('type ftpcount.txt') do set /a cnt=%%a
rem echo log has %cnt% lines
if %cnt% LEQ 100 goto done
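rem ren fails if the target already exists, so remove any older backup first
if exist ftplogbak.txt del ftplogbak.txt
ren ftplog.txt ftplogbak.txt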
> ftplog.txt echo Started new log %DATE% %TIME%
:done
ren ftplog.txt ftplog1.txt
> ftplog.txt echo Started at %DATE% %TIME%
rem ----------- LOC file handling --------------
rem attempt to copy loc file from FTP site
curl %ftpsite%/%folder%/%loc% -o %temploc% --user %user% > nul
rem if temploc exists, rename it and delete FTP site copy
if not exist %temploc% goto nolocfile
curl %ftpsite% -X "DELE %folder%/%loc%" --user %user% > nul
ren %temploc% %loc%
>> ftplog.txt echo Copied LOC file from FTP server to parent at %DATE% %TIME%
> good.txt echo Copied LOC file from FTP server to parent at %DATE% %TIME%
:nolocfile
rem ------------ GLB file handling -------------
rem rename GLB file (if any) to temp glb file
if not exist %glb% goto noglbfile
ren %glb% %tempglb%
curl -T %tempglb% %ftpsite%/%folder%/%glb% --user %user% > nul
del %tempglb%
>> ftplog.txt echo Copied GLB file from parent to FTP server at %DATE% %TIME%
> good.txt echo Copied GLB file from parent to FTP server at %DATE% %TIME%
:noglbfile
>>ftplog.txt type ftplog1.txt
del ftplog1.txt

The same debugging technique can be used for this script as for the child database FTP script.

Script Task Scheduling

Once the scripts are running correctly from the command line, they can be attached to Windows scheduled tasks, which run at regular intervals. When they are initially set up like this, they will produce a black window on screen while they run. This is helpful during testing as a visual indication that all is well, but it is obviously not desirable in the final system. The simple answer is to run the scripts as a different user. In our application, the Linkserver user would seem appropriate and sensible. Having set up the scheduled tasks, monitor their continued operation for a time to ensure there are no permission issues or other difficulties.

Offline Replication Services Setup

Once the system is running with the scripts automated, it is desirable to run the offline replication tool as a service rather than from the administrator's console. The eLink manuals state that this can be done by the command:

"Offline Replication Service.exe" /install

This command installs the service, but does not necessarily start it. Use the Windows Services tool to check the status of the service once installed, and to start it if necessary.

Data Logging

The FTP scripts log their activity as described earlier. The eLink offline replication tools write their logs to the logs folder under the eLink installation folder. The offline replication service log is especially useful, but it may not appear straight away: the file seems to be created only once enough log output has accumulated.
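
The logging scheme used by the FTP scripts (newest run first, with the log restarted once it passes 100 lines) can be sketched as a pure function. This is a minimal Python illustration rather than part of the eLink tooling; `prepend_run`, `max_lines` and the list-based interface are hypothetical names chosen for the example.

```python
from typing import List, Optional, Tuple


def prepend_run(log: List[str], run: List[str],
                max_lines: int = 100) -> Tuple[List[str], Optional[List[str]]]:
    """Return (new_log, backup) with the newest run's lines placed first.

    backup holds the old log when it was rotated, otherwise None.
    """
    backup = None
    if len(log) > max_lines:
        # Same test as "if %cnt% LEQ 100 goto done": rotate only above the limit.
        backup = log
        log = ["Started new log"]
    # The batch scripts get the same effect by renaming the old log,
    # writing the new entries, then appending the old log after them.
    return run + log, backup
```

Keeping the newest entries at the top means the most recent activity is visible as soon as the log is opened.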

Fault Tolerance

Some attempts have been made to improve the fault tolerance of the system. In particular, the child server FTP script has been enhanced to capture the last good packet sent in either direction, with a view that this packet could be reissued if the system stops operating. This script has also been enhanced to detect loss of operation, so that fault conditions can be detected. The enhanced script looks like this:

set ftpsite=www.xxxxx.com
set user=username
set password=password
set folder=public_html/mbw2/elinknomad
set glb=MANAGER000268_GLB.pak
set loc=MANAGER000268_LOC.pak
rem we use temporary file names to minimise risk of race conditions
set temploc=temploc.pak
set tempglb=tempglb.pak
rem we retain a copy of the last packet passing through this part of the system here
set lastloc=lastloc.pak
set lastglb=lastglb.pak
type ftplog.txt | find "" /v /c > ftpcount.txt
set /a cnt=0
for /f %%a in ('type ftpcount.txt') do set /a cnt=%%a
rem echo log has %cnt% lines
if %cnt% LEQ 100 goto done
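rem ren fails if the target already exists, so remove any previous backup first
if exist ftplogbak.txt del ftplogbak.txt
ren ftplog.txt ftplogbak.txt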
> ftplog.txt echo Started new log %DATE% %TIME%
:done
rem the file lastrun.txt holds a count of how many executions have occurred since the last successful run
rem this count is reset to zero when a data transfer takes place
set /a last_run=0
for /f %%a in ('type lastrun.txt') do set /a last_run=%%a
set /a last_run = last_run + 1
> lastrun.txt echo %last_run%
ren ftplog.txt ftplog1.txt
> ftplog.txt echo Started at %DATE% %TIME%
rem ----------- GLB file handling --------------
rem attempt to copy GLB file from FTP site
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo recv %glb% %tempglb%
>> script.ftp echo quit
ftp -s:script.ftp > nul
rem if tempglb exists, rename it and delete FTP site copy
if not exist %tempglb% goto noglbfile
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo delete %glb%
>> script.ftp echo quit
ftp -s:script.ftp > nul
copy %tempglb% %lastglb%
del %lastloc%
ren %tempglb% %glb%
>> ftplog.txt echo Copied GLB file from FTP server to nomad at %DATE% %TIME%
> good.txt echo Copied GLB file from FTP server to nomad at %DATE% %TIME%
>lastrun.txt echo 0
set /a last_run=0
:noglbfile
rem ------------ LOC file handling -------------
rem rename LOC file (if any) to temp LOC file
if not exist %loc% goto nolocfile
rem copy the LOC file to the ftp site
ren %loc% %temploc%
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo send %temploc% %loc%
>> script.ftp echo quit
ftp -s:script.ftp > nul
copy %temploc% %lastloc%
del %lastglb%
del %temploc%
>> ftplog.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
> good.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
>lastrun.txt echo 0
set /a last_run=0
:nolocfile

>>ftplog.txt type ftplog1.txt
del ftplog1.txt
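
The run counter kept in lastrun.txt can be sketched in isolation. The Python below is an illustration of the idea, not a replacement for the batch script: the counter is incremented on every run and reset to zero when a transfer takes place. The function names and the `limit` threshold are assumptions made for the example; the script above only maintains the count and does not itself define an alert level.

```python
def bump_run_counter(path: str, transferred: bool) -> int:
    """Update the runs-since-last-transfer counter stored in path and return it."""
    try:
        with open(path) as f:
            count = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        count = 0  # missing or unreadable counter: start again from zero
    # Reset on a successful transfer, otherwise count one more idle run.
    count = 0 if transferred else count + 1
    with open(path, "w") as f:
        f.write(f"{count}\n")
    return count


def looks_stalled(count: int, limit: int = 12) -> bool:
    """Hypothetical check: treat 'limit' idle runs in a row as a possible fault."""
    return count >= limit
```

A monitoring task could call `looks_stalled` on the stored value and raise an alert; a suitable limit depends on how often the scheduled task runs and how often packets are normally exchanged.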

Having done this, some experiments were conducted to see the effect of reissuing the last good packet, with somewhat mixed results. Repeating the last good GLB packet had a favourable effect - a new LOC packet was generated in response, and appeared to be accepted by all parts of the system.

However, when the last good LOC packet was repeated, the parent server became confused by the presence of the previous GLB packet and refused to generate another GLB packet. The earlier GLB packet then confused the child server because it was out of sequence. It is not exactly clear how this situation arose, but it could have been due to the first test leaving two GLB packets in different parts of the system: when the later one caught up with the earlier one, the system stopped.

As the above experiments show, it is difficult to correct a replication fault by injecting the last good packet back into the system. However, the fault described above may have been caused by injecting a packet into a working system, rather than into one that already had a fault. Other experiments have shown that there are occasions when injecting the packet does correct the replication fault. Even so, the only sure-fire way of restarting a faulty replication is to follow the procedure outlined earlier. This is, by nature, a manual process. Further consideration will be given to automating it.

Loss of Communications

With mobile broadband, loss of communications between the child database server and the internet can be quite common. It is important that the replication system tolerates this condition and recovers from it easily. If communications are lost in this manner, the child database server is unable to receive GLB packets from the FTP site, and is also unable to send LOC packets to the FTP site. The first of these conditions is not a problem for the above script, but the script fails to take the second into account. After attempting to send the LOC packet to the FTP server, it assumes the attempt was successful, keeps a copy as the last good packet and deletes the temporary file. Consequently, if the transfer fails because of loss of internet access, the packet is lost and the script will not resend it when communications are re-established. The simple answer to this problem is that we need to know whether the FTP send succeeded or not.

Looking at the output from this FTP send script, when things are working, we get:
ftp> Connected to xxxx.com.
open www.xxxxxx.com
220---------- Welcome to Pure-FTPd [TLS] ----------
220-You are user number 1 of 50 allowed.
220-Local time is now 07:28. Server port: 21.
220 You will be disconnected after 15 minutes of inactivity.
User (castlegatetech.com:(none)):
331 User xxxxx OK. Password required
230-User xxxx has group access to: yyyy
230 OK. Current restricted directory is /
ftp> BINARY
200 TYPE is now 8-bit binary
ftp> cd public_html/mbw2/elinknomad
250 OK. Current directory is /public_html/mbw2/elinknomad
ftp> send temploc.pak MANAGER000268_LOC.pak
200 PORT command successful
150 Connecting to port 49261
226-File successfully transferred
226 0.783 seconds (measured here), 39.96 Kbytes per second
ftp: 32044 bytes sent in Seconds Kbytes/sec.
ftp> 0.0032044000.00quit
221-Goodbye. You uploaded 32 and downloaded 0 kbytes.
221 Logout.

But if the script fails due to internet problems, we get:

ftp> Unknown host xxxx.
ftp> open xxxx
Invalid command.
ftp>xxxx
Invalid command.
ftp>password
Not connected.
ftp> BINARY
Not connected.
ftp> cd public_html/mbw2/elinknomad
Not connected.
ftp> send temploc.pak MANAGER000268_LOC.pak
quit

So, what we need to do is capture the results from the script, and then look for something that says the transfer worked. The obvious line is:

226-File successfully transferred

It seems sensible to look for a success indicator rather than a fault indicator, as it is possible there are many failure modes which could cause the transfer to fail. Looking for a positive outcome would be more fault tolerant, especially as the internet connection could fail at any time, including during the actual transfer.
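
The success test can be sketched as a small function over the captured session text. This Python version is illustrative only, and it makes one deliberate change: it matches on the standard FTP reply code 226 (transfer complete) rather than on the exact Pure-FTPd wording, so it should not depend on one server's message text. It assumes the captured session contains a single transfer, as the send script here does. The name `transfer_succeeded` is hypothetical.

```python
def transfer_succeeded(ftp_output: str) -> bool:
    """Return True if the captured FTP session reports a completed transfer."""
    for line in ftp_output.splitlines():
        # Reply code 226 marks a successful data transfer. Multi-line replies
        # begin with "226-" and the final line of a reply begins with "226 ".
        if line[:4] in ("226-", "226 "):
            return True
    # No success reply seen: treat every other outcome as a failure.
    return False
```

Because the function only returns True on a positive indication, an unknown host, a dropped connection part-way through the transfer, or an empty log are all treated as failures without being listed individually.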

Here is an extract of a script that processes the captured output of the FTP command, which is assumed to have been redirected to ftpresult.log:

set ftpworked=0
for /f "delims=}" %%a in ('type ftpresult.log') do (
rem echo %%a
if "%%a"=="226-File successfully transferred" (
copy %temploc% %lastloc%
del %lastglb%
del %temploc%
>> ftplog.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
> good.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
> %child_LOC_status% echo %DATE% %TIME%
>lastrun.txt echo 0
set /a last_run=0
set ftpworked=1
)
)
rem if the ftp command failed, we need to make the loc file available again for a retry
if %ftpworked%==0 ren %temploc% %loc%

So the revised script now becomes:


set ftpsite=www.xxxxx.com
set user=username
set password=password
set folder=public_html/mbw2/elinknomad
set glb=MANAGER000268_GLB.pak
set loc=MANAGER000268_LOC.pak
rem we use temporary file names to minimise risk of race conditions
set temploc=temploc.pak
set tempglb=tempglb.pak
rem we retain a copy of the last packet passing through this part of the system here
set lastloc=lastloc.pak
set lastglb=lastglb.pak
rem NOTE: child_LOC_status is used in the LOC handling below but is not set in this listing
rem set it to the path of a status file before running this script
type ftplog.txt | find "" /v /c > ftpcount.txt
set /a cnt=0
for /f %%a in ('type ftpcount.txt') do set /a cnt=%%a
rem echo log has %cnt% lines
if %cnt% LEQ 100 goto done
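rem ren fails if the target already exists, so remove any previous backup first
if exist ftplogbak.txt del ftplogbak.txt
ren ftplog.txt ftplogbak.txt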
> ftplog.txt echo Started new log %DATE% %TIME%
:done
rem the file lastrun.txt holds a count of how many executions have occurred since the last successful run
rem this count is reset to zero when a data transfer takes place
set /a last_run=0
for /f %%a in ('type lastrun.txt') do set /a last_run=%%a
set /a last_run = last_run + 1
> lastrun.txt echo %last_run%
ren ftplog.txt ftplog1.txt
> ftplog.txt echo Started at %DATE% %TIME%
rem ----------- GLB file handling --------------
rem attempt to copy GLB file from FTP site
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo recv %glb% %tempglb%
>> script.ftp echo quit
ftp -s:script.ftp > nul
rem if tempglb exists, rename it and delete FTP site copy
if not exist %tempglb% goto noglbfile
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo delete %glb%
>> script.ftp echo quit
ftp -s:script.ftp > nul
copy %tempglb% %lastglb%
del %lastloc%
ren %tempglb% %glb%
>> ftplog.txt echo Copied GLB file from FTP server to nomad at %DATE% %TIME%
> good.txt echo Copied GLB file from FTP server to nomad at %DATE% %TIME%
>lastrun.txt echo 0
set /a last_run=0
:noglbfile
rem ------------ LOC file handling -------------
rem rename LOC file (if any) to temp LOC file
if not exist %loc% goto nolocfile
rem copy the LOC file to the ftp site
ren %loc% %temploc%
> script.ftp echo open %ftpsite%
>> script.ftp echo %user%
>> script.ftp echo %password%
>> script.ftp echo BINARY
>> script.ftp echo cd %folder%
>> script.ftp echo send %temploc% %loc%
>> script.ftp echo quit
ftp -s:script.ftp > ftpresult.log
set ftpworked=0
for /f "delims=}" %%a in ('type ftpresult.log') do (
rem echo %%a
if "%%a"=="226-File successfully transferred" (
copy %temploc% %lastloc%
del %lastglb%
del %temploc%
>> ftplog.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
> good.txt echo Copied LOC file from nomad to FTP server at %DATE% %TIME%
> %child_LOC_status% echo %DATE% %TIME%
>lastrun.txt echo 0
set /a last_run=0
set ftpworked=1
)
)
rem if the ftp command failed, we need to make the loc file available again for a retry
if %ftpworked%==0 ren %temploc% %loc%
:nolocfile

>>ftplog.txt type ftplog1.txt
del ftplog1.txt
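
The key change in the revised script is that the LOC packet is restored under its original name when the upload fails, so the next run retries it. A minimal Python sketch of that sequence follows. The `upload` callable stands in for the FTP send plus the success check and is an assumption of the example, as are the function and parameter names; logging and the removal of the last GLB copy are omitted for brevity.

```python
import os
import shutil
from typing import Callable


def send_loc(loc: str, temp: str, last: str,
             upload: Callable[[str], bool]) -> bool:
    """Upload the LOC packet; keep it available for a retry if the upload fails."""
    if not os.path.exists(loc):
        return False                 # nothing to send on this run
    os.replace(loc, temp)            # claim the packet under a temporary name
    if upload(temp):
        shutil.copyfile(temp, last)  # retain a copy of the last good packet
        os.remove(temp)
        return True
    # Upload failed: put the packet back so the next run sends it again.
    # Assumes no new LOC packet was written while the upload was in progress.
    os.replace(temp, loc)
    return False
```

Passing the upload step in as a function keeps the rename and restore logic separate from the transport, so the failure path can be exercised without a network connection.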

Enhancements


Whilst this system is essentially a simple communications channel, getting two computers to communicate reliably is never simple. There are actually many faults that could interrupt communications. In this system, we are actually using three computers! Examples of faults include:
  1. Faults on Parent Server
  2. Faults on Child Server
  3. Faults on the FTP Server
  4. Faults in the communications channels
On each of the servers, there are multiple potential causes of problems. The simple answer to some of these faults may be to reboot the appropriate server. But if the fault has actually caused a serious interruption to the replication system, additional intervention may be required.

Actual recovery from failure conditions can only be achieved with an understanding of which part of the system has failed. For that to happen, the user has to explore the log files and processes on both the parent and child database servers. It would also be helpful to alert the user in the event of replication ceasing. These enhancements are described in Enhanced Replication Monitoring.














Latest page update: made by mbw, Jan 13 2010, 10:00 AM EST